Published on Fri Oct 24 2014

Dimensionality Reduction for k-Means Clustering and Low Rank Approximation

Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, Madalina Persu

We show how to approximate a data matrix $A$ with a much smaller sketch $\tilde{A}$ that can be used to solve a general class of constrained $k$-rank approximation problems to within $(1+ϵ)$ error. Importantly, this class of problems includes $k$-means clustering and unconstrained low rank approximation (i.e. principal component analysis). By reducing data points to just $O(k)$ dimensions, our methods generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For $k$-means dimensionality reduction, we provide $(1+ϵ)$ relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only 'cover' a good subspace for $A$, but can be used directly to compute this subspace. Finally, for $k$-means clustering, we show how to achieve a $(9+ϵ)$ approximation by Johnson-Lindenstrauss projecting data points to just $O(\log k/ϵ^2)$ dimensions. This gives the first result that leverages the specific structure of $k$-means to achieve dimension independent of input size and sublinear in $k$.
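The final result above reduces $k$-means inputs to $O(\log k/ϵ^2)$ dimensions via a Johnson-Lindenstrauss projection before clustering. A minimal sketch of that pipeline, assuming a dense Gaussian JL map and NumPy (the function name `jl_sketch` and the choice of target dimension are illustrative, not from the paper):

```python
import numpy as np

def jl_sketch(A, d, seed=0):
    """Project the n rows of A (points in R^m) down to d dimensions
    with a random Gaussian Johnson-Lindenstrauss map. The paper shows
    d = O(log k / eps^2) suffices for a (9+eps) k-means approximation;
    this is only an illustrative sketch of the projection step."""
    rng = np.random.default_rng(seed)
    m = A.shape[1]
    # Gaussian entries scaled by 1/sqrt(d) so squared distances are
    # preserved in expectation.
    S = rng.normal(0.0, 1.0 / np.sqrt(d), size=(m, d))
    return A @ S

# Usage: cluster the sketch instead of the original high-dimensional data.
A = np.random.default_rng(1).normal(size=(1000, 500))  # 1000 points in R^500
A_sketch = jl_sketch(A, d=20)                          # now 1000 points in R^20
```

Any off-the-shelf $k$-means solver can then be run on `A_sketch`, with its cost translated back to a guarantee on the original points.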
