Concept Drift Detection for High-Dimensional Big Data
Abstract
Learning algorithms are commonly used in predictive analytics of Big Data. These algorithms
assume that the distribution of the underlying data is static over time. However, this assumption
may not hold for Big Data, that is, data characterized by high volume, velocity, and variety. Concept drift detection
algorithms can be used to detect such changes in distribution. To date, there have been no studies of the
dimensional scalability of concept drift detection algorithms. This thesis presents the results of a
performance study of the dimensional scalability of commonly used supervised and unsupervised
concept drift detection algorithms. The results indicate that Friedman and Rafsky’s algorithm, an
unsupervised algorithm, was unable to scale to 100,000 dimensions. This thesis also presents a
MapReduce version of Friedman and Rafsky’s algorithm that scales to an arbitrary number of
dimensions on a cluster.