RandNLA for LS: A Brief Introduction – Part I

Dimensionality reduction is a topic that has governed our (the 2017 BGSE Data Science cohort) last three months. At the heart of topics such as penalized likelihood estimation (Lasso, Ridge, Elastic Net, etc.), principal component analysis and best subset selection lies the fundamental trade-off between complexity, generalizability and computational feasibility.

David Rossell taught us that even if we have found a methodology to compare across models, there is still the problem of enumerating all models to be compared. At this point the bias-variance trade-off turns into a question of computational power. Evaluating every single model using Bayes Factors or other performance measures just becomes way too expensive if the number of potential predictors exceeds 30 ({2^p} models). Instead of going the brute-force way, one usually makes use of computational heuristics, which come with their own shortcomings. But what if we could sketch each model's performance measure up to a certain degree of accuracy?

Randomized Numerical Linear Algebra (RandNLA) is a fairly new and hot research field which intends to speed up fundamental linear algebra operations by making use of either random sampling or random projections. The ultimate aim is to compress dimensionality in a sensible manner while obtaining a ({1 \pm \epsilon})-approximation of the objective. The top names in this field are Drineas, Kannan and Mahoney (2006 a,b,c). All of them have done pioneering work in the field of randomized algorithms for large matrices. Their work covers a whole branch of computational approximations such as randomized versions of matrix-matrix multiplication, least squares (LS), fast leverage score approximations and sparse matrix completion. The following two-part series is mainly based on the lecture notes of Mahoney (2016) and intends to introduce the audience to an alternative, randomized perspective on solving the standard LS problem. This article reviews two simple (and fairly “slow”) but powerful ideas which simplify linear systems of equations with a large number of constraints (the classical {n >> d} setting of tall data). The next article is going to improve upon these two algorithms by widening their computational bottlenecks. Thereby we are able to reduce the computational complexity from {O(nd^2)} to {O(nd\log(r))}. So let’s get started!

The Translation of the LS Problem

The standard LS problem amounts to minimizing the residual sum of squares (RSS):

\displaystyle RSS_{min} = \min_{\beta \in \mathbb{R}^d}||y - X\beta||_2^2

where {y \in \mathbb{R}^n}, {X \in \mathbb{R}^{n \times d}} and {\beta \in \mathbb{R}^d}. After “doing the math” (multiplying out, taking the gradient, yada yada), LS boils down to solving the following normal equations

\displaystyle (X'X) \beta_{LS} = X'y

Usually, this system is solved using Cholesky, QR decomposition or the SVD. The complexity of all methods is {O(nd^2)} (different constant factors – depending on how well-conditioned {X} is).
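As a point of reference, here is a minimal R sketch (the data is simulated and the variable names are my own) that solves the full problem via the QR decomposition:

# Baseline: solve the full LS problem via the QR decomposition, O(nd^2).
set.seed(42)
n <- 10000; d <- 10
X <- matrix(rnorm(n * d), n, d)
beta_true <- rnorm(d)
y <- X %*% beta_true + rnorm(n)

qr_X <- qr(X)                         # QR decomposition of X
beta_ls <- qr.coef(qr_X, y)           # solves R beta = Q'y by back-substitution
rss_min <- sum((y - X %*% beta_ls)^2) # minimized objective value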

Randomized algorithms intend to work on the tall {n}-side of things (although it is also easily possible to switch sides). Instead of crunching the full {n \times d} matrix {X} and the n-dimensional response vector {y}, we want to do something smart: construct a sketch of {X} and {y}, {\tilde{X} \in \mathbb{R}^{r \times d}} and {\tilde{y} \in \mathbb{R}^{r}}, which drastically reduces the number of rows to {r << n}. Let’s write the transformed objective function as

\displaystyle \widetilde{RSS}_{min} = \min_{\beta \in \mathbb{R}^d}||\Pi (y - X\beta)||_2^2

where {\Pi \in \mathbb{R}^{r \times n}} denotes some {(r \times n)} transformation matrix. Afterwards, we solve this reduced system and obtain the vector {\tilde{\beta}_{LS}}. There are two objectives which we intend to achieve:

\displaystyle \textit{Solution Certificate:} \ \tilde{\beta}_{LS} \approx \beta_{LS}

\displaystyle \textit{Objective Function:} \ RSS_{min} \approx \widetilde{RSS}_{min}

Clearly, the second objective is more relaxed than the first: any {\tilde{\beta}_{LS}} that is close to {\beta_{LS}} also attains an objective value close to {RSS_{min}}, so the solution space that fulfills the stricter first condition is a subset of the solution space for the second. The second objective usually only becomes interesting when the solution certificate is hard to achieve in the first place.
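Both objectives are easy to measure once a candidate {\tilde{\beta}} is available. A small R helper (the function name is my own; {X}, {y} and the exact solution beta_ls from the snippet above are assumed) could look like this:

# Compare a sketched solution against the exact one on both objectives.
compare_ls_solutions <- function(X, y, beta_ls, beta_tilde) {
  rss <- function(b) sum((y - X %*% b)^2)
  c(solution_error  = sqrt(sum((beta_tilde - beta_ls)^2)) / sqrt(sum(beta_ls^2)),
    objective_ratio = rss(beta_tilde) / rss(beta_ls))
}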

The natural question that pops up is: what are reasonable structural requirements for an approximation scheme? Mahoney (2016, 57) states the following two: Let {X = U_X \Sigma_X V_X^T} denote the singular value decomposition of {X}. Furthermore, let {y^\perp = y - U_X U_X^T y} denote the residual vector, i.e. the part of {y} that cannot be “reached” by {span(X)} (note that {RSS_{min} = ||y^\perp||_2^2}).

\displaystyle \textit{Rotation/Isometry Requirement:} \ \sigma_{min} (\Pi U_X) \geq \frac{1}{\sqrt{2}}

The first structural condition states that we require a lower bound on the singular values of {\Pi U_X}. This basically means that we want {\Pi} to act as an approximate isometry/rotation on {span(X)} after reducing the dimensionality to {r} (Mahoney (2016, 57)). Hence, the transform is not allowed to distort the original problem, and therefore the fit, too much.

\displaystyle \textit{Subspace Embedding:} \ ||U_X^T \Pi^T \Pi y^\perp||_2^2 \leq \frac{\epsilon}{2} RSS_{min}

Standard LS minimizes the RSS by orthogonally projecting {y} onto the column space of {X}. This orthogonal projection is described by the linear combination defined by {\beta}. The second structural condition therefore requires that this orthogonality approximately “survives” the transformation. Hence, {\Pi y^\perp} has to remain approximately orthogonal to the columns of {\Pi U_X}. All of the following algorithms are going to fulfill these conditions (for proofs see Mahoney (2016)). One can already tell that the number of dimensions {r} we reduce to has to be related to the quality of our approximation. All proofs essentially work out this {\epsilon}-{r} trade-off by making use of matrix concentration inequalities.
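To make the two conditions concrete, here is a rough R helper (the function name is my own; {X}, {y}, an r x n sketching matrix Pi and a tolerance eps are assumed to be given) that checks them numerically for a particular sketch:

# Numerically check the two structural conditions for a given r x n sketch Pi.
check_sketch_conditions <- function(X, y, Pi, eps) {
  sv  <- svd(X)
  U_X <- sv$u                                  # n x d orthonormal basis of span(X)
  y_perp <- y - U_X %*% crossprod(U_X, y)      # part of y outside span(X)
  rss_min <- sum(y_perp^2)

  # Condition 1: Pi acts as an approximate isometry on span(X).
  cond1 <- min(svd(Pi %*% U_X)$d) >= 1 / sqrt(2)

  # Condition 2: approximate orthogonality of Pi y_perp and Pi U_X.
  lhs <- sum((crossprod(U_X, crossprod(Pi, Pi %*% y_perp)))^2)
  cond2 <- lhs <= (eps / 2) * rss_min

  c(rotation = cond1, embedding = cond2)
}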

A First Slow Random Sampling Algorithm

{\Pi} can be viewed in two different ways: either as a sampling matrix or as a projection matrix. Let us first consider the former case. This construction of {\Pi} cannot be data-agnostic: we have to do computations with {X} to obtain some sampling reference. Each row of {\Pi} contains one non-zero entry, which represents a rescaled sampled row (note: we sample with replacement). Intuitively, we want to sample observations that highly influence the original LS fit more often. A natural measure of the influence of each data point on its own in-sample fit is the statistical leverage. It tells us how much a specific data point pulls the linear model fit towards itself. A data point with leverage 1 “occupies” one complete parameter/degree of freedom. Hence, we are interested in computing the diagonal elements of the hat matrix and sampling according to these leverage scores. It can be shown that {H_{ii} = ||U_{(i)}||_2^2 = ||Q_{(i)}||_2^2}, and since the leverage scores sum to {d}, dividing by {d} turns them into a proper sampling distribution. Hence we have a first potential randomized algorithm:

Mahoney (2016, 65) – “Slow” Random Sampling Algorithm for LS

Input: LS problem with {X \in \mathbb{R}^{n \times d}}, {y \in \mathbb{R}^{n}} and an {\epsilon}-level of approximation 
Output: Approximate LS solution, {\tilde{\beta}} 
--------------------------------------------------------------------------
Compute the importance sampling distribution {p_i = \frac{1}{d} ||U_{(i)}||_2^2 = \frac{1}{d} ||Q_{(i)}||_2^2, \ \forall i=1,...,n}, 
where {U} and {Q} are the orthogonal matrices from either the SVD or the QR. 

Randomly sample {r = O(\frac{d \log(d)}{\epsilon})} rows of {X} and {y}. Rescale them by {\frac{1}{\sqrt{r p_{i}}}} and form {\tilde{X} \in \mathbb{R}^{r \times d}, \tilde{y} \in \mathbb{R}^{r}}. 

Solve {(\tilde{X}'\tilde{X}) \tilde{\beta} = \tilde{X}'\tilde{y}} using any of the above mentioned methods. 

Return {\tilde{\beta}}, an {\epsilon}-approximation of {\beta_{LS}}.
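A direct R transcription of this recipe might look as follows (a sketch only; the helper name is my own, and the rows are rescaled by {\frac{1}{\sqrt{r p_i}}} as in step two):

# "Slow" random sampling algorithm for LS: sample rows by exact leverage scores.
slow_sampling_ls <- function(X, y, r) {
  n <- nrow(X); d <- ncol(X)
  Q <- qr.Q(qr(X))                       # orthonormal basis, O(nd^2) -- the bottleneck
  lev <- rowSums(Q^2)                    # exact leverage scores H_ii
  p <- lev / d                           # importance sampling distribution (sums to 1)

  idx <- sample.int(n, size = r, replace = TRUE, prob = p)
  w <- 1 / sqrt(r * p[idx])              # rescaling keeps the sketch unbiased
  X_tilde <- X[idx, , drop = FALSE] * w  # rescale each sampled row
  y_tilde <- y[idx] * w
  qr.coef(qr(X_tilde), y_tilde)          # solve the small r x d problem
}

# e.g. beta_tilde <- slow_sampling_ls(X, y, r = 200)   # r chosen arbitrarily here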

As described above, we basically construct a sampling distribution by computing the exact leverage scores. The problem with this algorithm is that computing the SVD of {X} is already almost as expensive as solving the full problem in the first place. Hence, we need a fast way to compute an approximate leverage score. Before addressing this issue in the following article, let us take a look at a different point of view: dimensionality reduction through random projections.

A First Slow Random Projection Algorithm

Random projections are based on one key result (which exists in many different versions), the Johnson-Lindenstrauss Lemma, and the transformation that results from it.

Lemma – Soft Gaussian-version of Johnson-Lindenstrauss Lemma (Mahoney (2016, 45))

Given n points {\{u_i\}_{i=1}^n}, each of which is in {\mathbb{R}^d}, let {P \in \mathbb{R}^{d \times k}} be such that 
{P_{ij} = \frac{1}{\sqrt{k}} N(0,1)} and let {\{v_i\}_{i=1}^n} be points in {\mathbb{R}^k} defined as {v_i = u_i P}.  
Then, if {k \geq 9 \frac{\log(n)}{\epsilon^2 - \epsilon^3}} for some {\epsilon \in (0,0.5)}, with probability at least 0.5 
all point-wise distances are preserved: {\forall \ i, i'} we have 

\displaystyle (1-\epsilon) ||u_i - u_{i'}||_2^2 \leq ||v_i - v_{i'}||_2^2 \leq (1+\epsilon) ||u_i - u_{i'}||_2^2

Pause for thought – this is an amazing result! Instead of having to precompute any metric before being able to do anything, there is an easy way to obtain a lower-dimensional representation of our original input that preserves the original lengths and point-wise distances. The proof is based on a combination of Chernoff bounds and Boole’s inequality and can again be found in Mahoney (2016, 46 et seq.).
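A quick empirical sanity check of the lemma in R (the dimensions and {\epsilon} below are chosen arbitrarily):

# Project 100 points from a high-dimensional space down to k dimensions and
# compare all pairwise squared distances before and after the projection.
set.seed(1)
n_pts <- 100; d_dim <- 5000; eps <- 0.3
k <- ceiling(9 * log(n_pts) / (eps^2 - eps^3))

U <- matrix(rnorm(n_pts * d_dim), n_pts, d_dim)             # n points as rows
P <- matrix(rnorm(d_dim * k, sd = 1 / sqrt(k)), d_dim, k)   # scaled Gaussian projection
V <- U %*% P                                                # projected points in R^k

ratios <- as.numeric(dist(V))^2 / as.numeric(dist(U))^2
range(ratios)   # should (with probability >= 0.5) lie within [1 - eps, 1 + eps]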

Definition – Johnson-Lindenstrauss Transform (Mahoney (2016, 68))

Given an {\epsilon > 0} and {n} points {\{x_i\}_{i=1}^n}, where {x_i \in \mathbb{R}^d}, an {\epsilon}-JLT 
({\epsilon}-Johnson-Lindenstrauss Transform), denoted {\Pi \in \mathbb{R}^{r \times d}}, is a projection of the 
points into {\mathbb{R}^r} such that

 \displaystyle (1-\epsilon) ||x_i||^2_2 \leq ||\Pi x_i||^2_2 \leq (1+\epsilon) ||x_i||^2_2 \ \ \ \ \ (1)

The Johnson-Lindenstrauss Transform maps the {n} points from {\mathbb{R}^d} into an r-dimensional subspace, while preserving their lengths and point-wise distances up to an {\epsilon}-proportion!

Most constructions of {\Pi} are either based on Gaussian {N(0,1)} random samples or on binary coin tosses (Achlioptas, 2003). In terms of computational efficiency, binary coin tosses are a lot faster! This is due to the fact that simulating Gaussian samples (e.g. via the Box-Muller transformation of uniform samples) is usually a lot more expensive than just tossing a fair coin. The general version of this algorithm looks like this:

Mahoney (2016, 65) – “Slow” Random Projection Algorithm for LS

Input: LS problem with {X \in \mathbb{R}^{n \times d}}, {y \in \mathbb{R}^{n}} and an {\epsilon}-level of approximation 
Output: Approximate LS solution, {\tilde{\beta}} 
--------------------------------------------------------------------------
Construct a random projection matrix {\Pi \in \mathbb{R}^{r \times n}} of Gaussians or binary {\pm 1} coin flips. 

Project {X} and {y} down to {r = O(\frac{d \log(d)}{\epsilon})} dimensions and form {\tilde{X} = \Pi X \in \mathbb{R}^{r \times d}, \tilde{y} = \Pi y \in \mathbb{R}^{r}}. 

Solve {(\tilde{X}'\tilde{X}) \tilde{\beta} = \tilde{X}'\tilde{y}} using any of the above mentioned methods.

Return {\tilde{\beta}}, an {\epsilon}-approximation of {\beta_{LS}}.
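In R, the projection variant might look like this (again only a sketch; the helper name is my own, and the binary {\pm 1} entries are rescaled by {\frac{1}{\sqrt{r}}} so that {\Pi} acts as an approximate isometry):

# "Slow" random projection algorithm for LS: sketch with a +/-1 coin-flip matrix.
slow_projection_ls <- function(X, y, r) {
  n <- nrow(X)
  Pi <- matrix(sample(c(-1, 1), r * n, replace = TRUE), r, n) / sqrt(r)
  X_tilde <- Pi %*% X                    # O(ndr) matrix-matrix product -- the bottleneck
  y_tilde <- Pi %*% y
  qr.coef(qr(X_tilde), y_tilde)          # solve the small r x d problem
}

# e.g. beta_tilde <- slow_projection_ls(X, y, r = 200)   # r chosen arbitrarily here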

Sadly, this algorithm also has a computational downside: forming the matrix-matrix product of {\Pi} and {X} already has a complexity of {O(ndr)}.

Computational Bottlenecks and an Outlook

The presented randomized algorithms provide a ({1 \pm \epsilon})-approximation to the LS objective function value and an {\epsilon}-approximation to the solution vector. But the ultimate goal of all our effort is to speed up computation time. Sadly, neither of the proposed algorithms improves on the {O(nd^2)} complexity of standard LS algorithms.

In the random sampling algorithm, computing the SVD (or QR) to obtain exact leverage scores is almost as expensive as solving the whole LS problem in the first place. Computing the matrix-matrix product {\Pi X} in the projection scenario already has a complexity of {O(ndr)}, which is usually larger than {O(nd^2)}. Widening these bottlenecks is going to be part of the next article (hint: in both cases we are going to make use of Fast Fourier or Fast Johnson-Lindenstrauss Transforms (Ailon and Chazelle, 2006)).

The process is going to rely heavily on smart data preprocessing and the use of fast projection methods. This will enable us either to construct fast approximate leverage scores and sample according to this importance sampling distribution (Fast Random Sampling), or to approximately uniformize the leverage scores by projection, which allows us to sample uniformly at random (Fast Random Projection). If you are interested in an R implementation of the algorithms in this blog post, please check out my GitHub repo (here).

References

Achlioptas, Dimitris. “Database-friendly random projections: Johnson-Lindenstrauss with binary coins.” Journal of computer and System Sciences 66.4 (2003): 671-687.

Ailon, Nir, and Bernard Chazelle. “Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform.” Proceedings of the thirty-eighth annual ACM symposium on Theory of computing. ACM, 2006.

Drineas, Petros, Ravi Kannan, and Michael W. Mahoney. “Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication.” SIAM Journal on Computing 36.1 (2006 a): 132-157.

Drineas, Petros, Ravi Kannan, and Michael W. Mahoney. “Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix.” SIAM Journal on Computing 36.1 (2006 b): 158-183.

Drineas, Petros, Ravi Kannan, and Michael W. Mahoney. “Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition.” SIAM Journal on Computing 36.1 (2006 c): 184-206.

Mahoney, Michael W. “Lecture Notes on Randomized Linear Algebra.” arXiv preprint arXiv:1608.04481 (2016).

About the Author

I (Robert Lange) graduated last year from the University of Cologne with a BSc in Economics. Reading Milton Friedman’s “Essays in positive economics” and a passion for computational statistics led me to pursue a more empirically driven path. I quickly realized that there is only one Master’s program in this field that is suited to fulfill the needs of the 21st century: Data Science at the Barcelona GSE. Ultimately, I want to combine data-analytic knowledge with problems that change the way we pursue our lives (e.g. Bioinformatics and Neuroscience). If you are interested, check out my homepage.
