RandNLA for LS: A Brief Introduction – Part II

In the previous article we took a first look at the pioneering work of Michael Mahoney and Petros Drineas in Randomized Numerical Linear Algebra (RandNLA). Our ultimate goal was to speed up the computation of the ordinary least-squares solution vector, {\beta_{LS}}, from {O(nd^2)} time to {o(nd^2)} time. We reviewed two simple, fairly “slow”, but powerful algorithms that simplified linear systems of equations with a large number of constraints (the classical {n \gg d} setting of tall data). The main idea was to preprocess the input matrix {X \in {\mathbb R}^{n \times d}} in such a way that we obtained a “representative” sketch in a lower-dimensional space ({\tilde{X} \in {\mathbb R}^{r \times d}}). The term “representative” could be defined in two different ways, always with respect to the ultimate goal: a good approximation to the least-squares (LS) solution. First, a good representation in a lower-dimensional space could be interpreted as a subspace embedding that preserves pairwise distances up to an {\epsilon}-degree (via Johnson-Lindenstrauss transforms). This idea was the foundation of a random projection algorithm. Second, “representative” could be defined in terms of “leverage” on the LS fit. A natural measure was the so-called statistical leverage score, which could be extracted from the diagonal elements of the hat matrix or, equivalently, from the squared Euclidean row norms of an orthonormal basis {U} for {range(X)} computed from the QR decomposition or the SVD.

The main difference between the two methods was that the projection approach was data-agnostic: we did not have to compute any statistics of the input in the first place. Sampling, on the other hand, involved some serious preprocessing in the form of establishing an importance sampling distribution. Both methods have advantages and disadvantages. Computing the exact leverage scores (random sampling) required a full singular value decomposition, and computing the SVD is usually (disregarding sparse variants) already as expensive as solving the complete LS problem in the first place. Furthermore, it is unclear whether leverage is a “robust” measure of being a representative sample from the underlying data-generating process: eliminating one high-leverage observation might dramatically change the complete sampling distribution. On the other hand, computing an ordinary Johnson-Lindenstrauss transform in the random projection variant using a dense matrix {\Pi \in {\mathbb R}^{r \times n}} would take at least {O(ndr)} time (plain matrix-matrix multiplication). Since usually {r > d}, we would not obtain an improvement.
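As a quick refresher from Part I, here is a minimal NumPy sketch of the exact leverage score computation via a thin QR decomposition (my own illustration, not a reference implementation; the helper name exact_leverage_scores is hypothetical). This is precisely the {O(nd^2)} bottleneck the fast methods below try to avoid:

```python
import numpy as np

def exact_leverage_scores(X):
    # Leverage scores are the squared Euclidean row norms of an orthonormal
    # basis Q for range(X), i.e. the diagonal of the hat matrix.
    Q, _ = np.linalg.qr(X, mode="reduced")   # thin QR: Q is n x d
    return np.sum(Q**2, axis=1)              # l_i = ||Q_(i)||_2^2

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))           # tall data: n >> d
print(exact_leverage_scores(X).sum())        # ~ 5.0, the scores sum to d
```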

In today's article we are going to introduce the Fast Johnson-Lindenstrauss Transform (FJLT). This result is the foundation of two very important concepts that speed up the computation of an {\epsilon}-approximation to the LS objective function and the target vector: First, the FJLT can be used to quickly project the original data matrix into an {r}-dimensional subspace in which the leverage scores are uniformized. In this way we are able to sample rows uniformly at random. Second, it is also going to allow us to approximate the actual leverage scores in a fast way. Both concepts rely on fast Fourier (or Hadamard-based) transforms. Ultimately, this is going to reduce the computational complexity from {O(nd^2)} to {O(ndlog(r))}.

The text is structured as follows: First, I introduce the concept of a fast Johnson-Lindenstrauss subspace embedding. Afterwards, I outline a fast random sampling algorithm for LS. In the end, we analyze the computational complexity and discuss open research questions as well as possible extensions and improvements. Again, the main reference is Mahoney (2016).

1. The Fast Johnson Lindenstrauss Transform

Remember that the goal of all our efforts is to approximate the solution {\beta_{LS} \in \mathbb{R}^d} of a system of linear equations by a vector {\tilde{\beta}}, and to do so in a reasonably fast way (faster than {O(nd^2)}). The main ingredient is going to be the fast sibling of the regular Johnson-Lindenstrauss transform: the fast subspace Johnson-Lindenstrauss transform. Mahoney (68, 2016) defines it in the following way:

Definition 1 (Fast Johnson-Lindenstrauss Transform – FJLT) Given an {\epsilon > 0} and an orthogonal matrix {U \in {\mathbb R}^{n \times d}}, viewed as {d} vectors in {{\mathbb R}^n}, an FJLT projects vectors from {{\mathbb R}^n} to {{\mathbb R}^r} such that the orthogonality of {U} is preserved, and it does so quickly. That is, {\Pi \in {\mathbb R}^{r \times n}} is an {\epsilon}-FJLT if

  • Orthogonality preservation: {||I_d - U^T \Pi^T \Pi U||_2 \leq \epsilon }
  • Fast: {\forall X \in {\mathbb R}^{n \times d}} we can compute {\Pi X} in {O(ndlog(r))} time.

The first condition requires us to maintain the orthogonality of the orthonormal basis {U} even after processing it with {\Pi}. So how can we construct such a matrix? Ailon and Chazelle (2006) proposed the following Hadamard-based construction, which relies on fast Fourier/Hadamard preprocessing of the input matrix:

\displaystyle \Pi_{FJLT} = PHD

  • {P \in {\mathbb R}^{r \times n}} – Sparse JL matrix/uniform sampling matrix (Achlioptas (2003), Frankl and Maehara (1988)), with sparsity parameter {q = \min\{1, \Theta (\frac{log^2 (n)}{d} )\}}:
    \displaystyle P_{ij} = \begin{cases} N(0, q^{-1}) & \text{with probability } q \\ 0 & \text{with probability } 1-q \end{cases}
  • {H \in {\mathbb R}^{n \times n}} – Normalized Hadamard transform matrix (a real-valued analogue of the DFT matrix): structured so that fast Fourier-type methods can be used to apply it quickly. Setting {\tilde{H}_1 = 1}, it is defined recursively for all {n} that are a power of two: \displaystyle \tilde{H}_{2n} = \begin{bmatrix} \tilde{H}_{n} & \tilde{H}_{n} \\[0.3em] \tilde{H}_{n} & -\tilde{H}_{n} \end{bmatrix} \text{ and } H_n = \frac{\tilde{H}_n}{\sqrt{n}}
  • {D \in {\mathbb R}^{n \times n}} – Diagonal matrix whose entries take values {\pm 1} with probability {\frac{1}{2}} each: it preprocesses “bad” (spiky) inputs by randomization.

The matrix product {HD} fulfills two very important tasks (Drineas et al, 3449, 2012). First, it essentially flattens spiky vectors (it “spreads out their energy”, Drineas et al (3449, 2012)). Mathematically speaking, this means that we can bound the sup-norm of the transformed vectors, so that no single row of {HDU} carries too much mass. This ultimately uniformizes the leverage scores (each becomes approximately {d/n}, their average value). If {r} is large enough, this smoothing in combination with the sampling matrix {P} allows us to construct an FJLT with high probability. Furthermore, computing the {r} rows of the FJLT-transformed matrix can be done in {O(ndlog(r))} time, as desired. Hence, {\Pi_{FJLT} = PHD} can be viewed as a subsampled randomized Hadamard transform (Drineas et al, 3449, 2012), or as an FJLT with high probability.
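To make the construction concrete, here is a minimal NumPy sketch of an SRHT-style sketch {\tilde{X} = PHDX} (my own illustration; the helper names fwht and srht_sketch are hypothetical, and {P} is taken to be a uniform sampling-and-rescaling matrix rather than the sparse Gaussian matrix defined above, which works analogously). It assumes that {n} is a power of two so that the recursive Hadamard transform applies directly:

```python
import numpy as np

def fwht(A):
    # Fast Walsh-Hadamard transform applied along the rows of A: computes
    # H_tilde @ A in O(n log n) operations per column (n a power of two).
    A = A.copy()
    n = A.shape[0]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            top = A[i:i + h].copy()
            bottom = A[i + h:i + 2 * h].copy()
            A[i:i + h] = top + bottom
            A[i + h:i + 2 * h] = top - bottom
        h *= 2
    return A

def srht_sketch(X, r, rng):
    # Pi X = P H D X: D flips signs, H is the normalized Hadamard transform,
    # and P uniformly samples r rows with rescaling (a common SRHT variant).
    n = X.shape[0]                            # assumed to be a power of two
    D = rng.choice([-1.0, 1.0], size=n)
    HDX = fwht(D[:, None] * X) / np.sqrt(n)   # normalized: H = H_tilde / sqrt(n)
    idx = rng.integers(0, n, size=r)
    return np.sqrt(n / r) * HDX[idx]          # rescale so E[X' Pi' Pi X] = X' X

rng = np.random.default_rng(1)
X = rng.standard_normal((1024, 5))
print(srht_sketch(X, r=200, rng=rng).shape)   # (200, 5)
```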

In this way, we obtain two potential algorithms for solving the above LS problem (Mahoney, 71, 2016):

  • (i) Random Projection Approach II: Uniformize leverage scores and use sparse projection matrix (or sample uniformly)
  • (ii) Random Sampling Approach: Use FJLT to compute approximate leverage scores and sample accordingly

In the next section we will focus on the second approach. There is no particular reason for this choice, except that it lets us introduce a new idea: approximating, in a fast way, the quantity that drives the random sampling approximation of our objective.

2. Doing things faster

Before we can outline the full random sampling algorithm, we need a fast way to compute the leverage scores. We saw that the exact computation requires an SVD (or QR), which does not speed things up. We can accelerate this step by applying the previous trick a second time: instead of calculating the exact leverage scores slowly, we calculate an approximation (with {\epsilon}-slack) quickly. Drineas et al (2012) first proposed the following algorithm to approximate the leverage scores:

Drineas et al (3451, 2012) – FJLT approximation for leverage scores

Input: {X \in \mathbb{R}^{n \times d}} with SVD {X = U^{(X)} \Sigma V^T} and an {\epsilon}-level
Output: Approximate leverage scores, {\tilde{l}_i, i=1,...,n} 
-----------------------------------------------------------------------------------
Let {\Pi_1 \in {\mathbb R}^{r_1 \times n}} be an {\epsilon}-FJLT for {U^{(X)}} with {r_1 = \Omega \left( \frac{dlog(n)}{\epsilon^2} log\left( \frac{dlog(n)}{\epsilon^2}\right) \right)}. 

Compute {\Pi_1 X} and its SVD/QR where {R^{(\Pi_1 X)}=\Sigma^{(\Pi_1 X)} V^{(\Pi_1 X)T}}. 

View the rows of {XR^{-1} \in {\mathbb R}^{n \times d}} as n vectors in {{\mathbb R}^d}. Let {\Pi_2 \in {\mathbb R}^{d \times r_2}} be an {\epsilon}-JLT for {n^2} vectors, with {r_2 = O(\frac{log(n)}{\epsilon^2})}. 

Return {\tilde{l}_i = ||(XR^{-1}\Pi_2)_i||_2^2}, an {\epsilon}-approximation of {l_i}.

The algorithm relies on first applying an FJLT to {X}. Afterwards, one computes the SVD (or QR) of the sketched matrix before applying another, regular JLT to {XR^{-1}}. The complexity of the algorithm can be decomposed in the following way:

  • {\Pi_1 X} takes {O(ndlog(r_1))} since {\Pi_1} is an {\epsilon}-FJLT
  • SVD of {\Pi_1 X} takes {O(r_1 d^2)}
  • {R^{-1} \Pi_2} takes {O(r_2d^2)} since {\Pi_2} is an {\epsilon}-JLT
  • Premultiplying by {X} takes {O(ndr_2)} (plain matrix-matrix multiplication)
  • {\implies} Overall: {O(ndlog(r_1) + r_1 d^2 + r_2 d^2 + ndr_2)}

By choosing the parameters appropriately ({r_1 = O(\epsilon^{-2}dlog(n)(log(\epsilon^{-2}dlog(n))))}, {r_2 = O(\epsilon^{-2} log(n))}, {\epsilon} constant, {\delta = 0.1}, {d \leq n \leq e^d}), it follows that the algorithm has complexity {O(ndlog(d/\epsilon) + nd\epsilon^{-2}log(n) + d^3\epsilon^{-2}log(n)log(d\epsilon^{-1}))}.
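To illustrate the structure of the algorithm, here is a minimal NumPy sketch (my own illustration; approx_leverage_scores is a hypothetical helper name). For brevity both {\Pi_1} and {\Pi_2} are dense Gaussian sketches, so this version does not attain the {O(ndlog(r_1))} term of the Hadamard-based construction, but it shows how {\tilde{l}_i = ||(XR^{-1}\Pi_2)_i||_2^2} is formed:

```python
import numpy as np

def approx_leverage_scores(X, r1, r2, rng):
    # Mirrors the three steps above. Dense Gaussian sketches stand in for the
    # FJLT/JLT, so the O(nd log(r1)) running time is not attained here.
    n, d = X.shape
    Pi1_X = rng.standard_normal((r1, n)) @ X / np.sqrt(r1)   # Pi_1 X
    _, R = np.linalg.qr(Pi1_X, mode="reduced")               # R is d x d
    Pi2 = rng.standard_normal((d, r2)) / np.sqrt(r2)         # plain JLT
    sketch = X @ np.linalg.solve(R, Pi2)                     # X R^{-1} Pi_2
    return np.sum(sketch**2, axis=1)                         # ~ leverage scores

rng = np.random.default_rng(2)
X = rng.standard_normal((2000, 5))
l_tilde = approx_leverage_scores(X, r1=200, r2=50, rng=rng)
Q, _ = np.linalg.qr(X, mode="reduced")
l_exact = np.sum(Q**2, axis=1)
print(np.max(np.abs(l_tilde - l_exact) / l_exact))   # error shrinks as r1, r2 grow
```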

Using these fast leverage score approximations, we can construct an importance sampling distribution and apply the following “algorithmic leveraging” procedure, first introduced by Drineas et al (2011):

Drineas et al (2011) – “Fast” Random Sampling Algorithm for LS

Input: LS problem with {X \in \mathbb{R}^{n \times d}}, {y \in \mathbb{R}^{n}} and an {\epsilon}-level of approximation 
Output: Approximate LS solution, {\tilde{\beta}} 
---------------------------------------------------------------------------------
Compute the importance sampling distribution {p_i = \frac{1}{d} ||U_{(i)}||_2^2 = \frac{1}{d} ||Q_{(i)}||_2^2, \ \forall i=1,...,n}, where {U} and {Q} are the orthogonal matrices from either the SVD or the QR decomposition (or replace the exact scores by the fast approximations {\tilde{l}_i} computed above). 

Randomly sample {r = O(\frac{d log(d)}{\epsilon})} rows of {X} and {y} according to {p_i}. Rescale them by {\frac{1}{\sqrt{r p_{i}}}} and form {\tilde{X} \in {\mathbb R}^{r \times d}, \tilde{y} \in {\mathbb R}^{r}}.  

Solve {(\tilde{X}'\tilde{X}) \tilde{\beta} = \tilde{X}'\tilde{y}} using any of the previously mentioned direct solvers. 

Return {\tilde{\beta}}, an {\epsilon}-approximation of {\beta_{LS}}.
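A minimal NumPy sketch of this sampling scheme (my own illustration; leverage_sampled_ls is a hypothetical helper name). For clarity it computes exact leverage scores via a thin QR, but the approximate scores from the previous algorithm could be passed in via the scores argument:

```python
import numpy as np

def leverage_sampled_ls(X, y, r, rng, scores=None):
    # Algorithmic leveraging: sample r rows with probabilities proportional
    # to the leverage scores, rescale by 1/sqrt(r * p_i), solve the small LS.
    n, d = X.shape
    if scores is None:
        Q, _ = np.linalg.qr(X, mode="reduced")   # exact scores for clarity;
        scores = np.sum(Q**2, axis=1)            # pass approximate ones instead
    p = scores / scores.sum()                    # p_i = l_i / d
    idx = rng.choice(n, size=r, replace=True, p=p)
    w = 1.0 / np.sqrt(r * p[idx])                # sampling-and-rescaling weights
    X_tilde, y_tilde = w[:, None] * X[idx], w * y[idx]
    beta_tilde, *_ = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)
    return beta_tilde

rng = np.random.default_rng(3)
X = rng.standard_normal((100_000, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(100_000)
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.norm(leverage_sampled_ls(X, y, r=2000, rng=rng) - beta_ls))
```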

So what makes this algorithm work in practice? Given that {r} is large enough, the two crucial structural conditions on {\Pi} (see below) are going to be fulfilled by the subsampled randomized Hadamard transform:

\displaystyle \textbf{Rotation/Isometry Requirement:} \ \ \sigma_{min}^2 (\Pi U^{(X)}) \geq \frac{1}{\sqrt{2}} \ \ \ \ \ (1)

\displaystyle \textbf{Subspace Embedding:} \ \ ||U^{(X)T} \Pi^T \Pi y^\perp||_2^2 \leq \frac{\epsilon}{2} RSS_{min} \ \ \ \ \ (2)

Both conditions are needed to prove the following quality-of-approximation result from Drineas et al (2011). They essentially allow us to find explicit bounds on {||U^{(X)T} \Pi^T \Pi y^\perp||_2^2}, which in turn can be used to bound the randomized residual sum of squares (see Drineas et al (2011) for the complete proof). This finally bounds the deviation of the randomized estimator from the least-squares solution:
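As a sanity check, the rotation/isometry condition (1) can be verified empirically; the following sketch (my own illustration) reuses the hypothetical srht_sketch helper from the snippet above:

```python
import numpy as np

# Empirical check of condition (1) for an SRHT-style sketch.
rng = np.random.default_rng(4)
X = rng.standard_normal((1024, 5))
U, _ = np.linalg.qr(X, mode="reduced")              # orthonormal basis U^(X)
Pi_U = srht_sketch(U, r=200, rng=rng)               # Pi U^(X)
sigma_min = np.linalg.svd(Pi_U, compute_uv=False).min()
print(sigma_min**2 >= 1 / np.sqrt(2))               # True with high probability
```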

Theorem 1 (Least Squares Quality of Approximation) The vector {\tilde{\beta}} returned by the above sampling (or projection) algorithms is such that, with probability at least 0.8,

\displaystyle \textbf{Solution Certificate:} \ \ ||\tilde{\beta} - \beta_{LS}||_2 \leq \sqrt{\epsilon} \kappa(X) \sqrt{\gamma^{-2} - 1} ||\beta_{LS}||_2 \ \ \ \ \ (3)

\displaystyle \textbf{Objective Function:} \ \ ||X\tilde{\beta} - y||_2 \leq (1 + \epsilon) ||X \beta_{LS} - y||_2 \ \ \ \ \ (4)

where {\kappa(X)} is the condition number of {X} and {\gamma} denotes the fraction of the norm of {y} that lies in the column space of {X}:

\displaystyle \gamma = \frac{||U^{(X)}U^{(X)T} y||_2}{||y||_2}

In order to assess the statistical properties of this estimator, I ran Monte Carlo experiments, displayed in the figure below. The left panels plot the ratio between the Euclidean-norm approximation error of the random sampling estimator (measured against the true data-generating vector) and the right-hand side of the quality-of-approximation result for the solution certificate. In 1000 simulation runs I draw samples of independently normally distributed data ({X \in {\mathbb R}^{1000 \times 5}}), multiply with a five-dimensional vector {\beta_0} and add {N(0,1)} noise to obtain {y}. For each of the 1000 samples I compute one random sampling estimator {\tilde{\beta}_{RS}} and take the Euclidean norm of the difference; the figure plots the histogram and a corresponding kernel density estimate. The right panels, on the other hand, plot the same ratio, but with the approximation error measured against a simple Cholesky-decomposition-based least-squares estimator, {\beta_{LS}^{CHOL}}. The estimators in the first row sample randomly according to the leverage scores.

We can directly see that the above theorem holds: more than 0.8 of the probability mass lies below 1, hence in at least 80 percent of the cases {||\tilde{\beta} - \beta_{LS}||_2 \leq \sqrt{\epsilon} \kappa(X) \sqrt{\gamma^{-2} - 1} ||\beta_{LS}||_2} holds. Furthermore, it becomes apparent that the random sampling estimator does a far better job at approximating the LS estimator than at approximating the true data-generating vector {\beta_0}. Intuitively, this makes a lot of sense: one of LS's well-known weaknesses is its sensitivity to outliers. Outliers often have large leverage scores, which translate into higher weights in the sampling procedure underlying the randomized estimator, so this weakness of LS is passed on to the randomized estimator. We can conclude that the random sampling estimator does a very good job at what it is intended to do: provide an approximation to the least-squares estimator.

[Figure: histograms and kernel density estimates of the approximation-error ratios over 1000 simulation runs; left column: error measured against the true {\beta_0}, right column: error measured against {\beta_{LS}^{CHOL}}; first row: leverage-score sampling, second row: influence-score sampling]
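For reference, a condensed sketch of one such experiment (my own illustration, reusing the hypothetical leverage_sampled_ls helper from above; the slack level {\epsilon} and the sample size {r} are illustrative choices, not necessarily those behind the figure):

```python
import numpy as np

# One Monte Carlo experiment: compare the sampled estimator against beta_LS
# and against the bound on the right-hand side of (3).
rng = np.random.default_rng(5)
d, n, eps, r = 5, 1000, 0.5, 200
beta0 = rng.standard_normal(d)

def replication():
    X = rng.standard_normal((n, d))
    y = X @ beta0 + rng.standard_normal(n)
    beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
    beta_rs = leverage_sampled_ls(X, y, r=r, rng=rng)
    U, _ = np.linalg.qr(X, mode="reduced")
    gamma = np.linalg.norm(U @ (U.T @ y)) / np.linalg.norm(y)
    bound = (np.sqrt(eps) * np.linalg.cond(X)
             * np.sqrt(gamma**-2 - 1) * np.linalg.norm(beta_ls))
    return np.linalg.norm(beta_rs - beta_ls) / bound

ratios = np.array([replication() for _ in range(1000)])
print(np.mean(ratios <= 1.0))   # fraction satisfying (3); at least ~0.8
```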

In order to obtain a more robust estimator, one may use influence scores instead of leverage scores. Influence scores are meant to be less sensitive to single observations and are defined in the following way:

\displaystyle I_i = ||U^{(X,y)}_{(i)}||_2^2 = L_i + \frac{\hat{e}_i^2}{\sum_{j=1}^n \hat{e}_j^2}

While the leverage scores completely ignore any information in the response vector {y}, influence scores incorporate it. We can easily compute the influence scores by appending {y} to the design matrix and taking the diagonal elements of the hat matrix of this augmented matrix in {{\mathbb R}^{n \times (d+1)}}. This seems only natural, since we are constructing sketches of not only {X} but also {y}. The second row of the figure displays the empirical distribution of the quality-of-approximation ratios for the estimator based on the influence scores. This estimator clearly performs better in both situations!
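A minimal sketch of this computation (my own illustration; influence_scores is a hypothetical helper name), based on a thin QR of the augmented matrix {[X \ y]}:

```python
import numpy as np

def influence_scores(X, y):
    # Influence scores = leverage scores of the augmented matrix [X y],
    # i.e. the diagonal of its hat matrix: I_i = L_i + e_i^2 / sum_j e_j^2.
    Q, _ = np.linalg.qr(np.column_stack([X, y]), mode="reduced")
    return np.sum(Q**2, axis=1)

rng = np.random.default_rng(6)
X = rng.standard_normal((1000, 5))
y = X @ rng.standard_normal(5) + rng.standard_normal(1000)
print(influence_scores(X, y).sum())   # ~ 6.0 = d + 1
```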

3. Open Research Questions and a Bunch of Potential Applications

As we have seen, RandNLA is a powerful framework that allows us to speed up solutions to fundamental linear algebra problems. It can be applied to essentially any ultra-high-dimensional matrix-matrix multiplication or linear-system problem you can imagine, so it has huge potential. My master's thesis, for example, is going to deal with an application to Generalized Linear Models (GLMs) and the family of exponentially linked models. Furthermore, one could think of applications in deep learning. Many architectures already make use of randomization in the form of stochastic gradient descent and dropout, mainly because it helps to avoid getting stuck in local optima. Maybe one could obtain similar results by working with random sketches and exact gradients. If you are interested and want to chat about it, feel free to contact me! Again, the code can be found on my GitHub.

References

Achlioptas, Dimitris. “Database-friendly random projections: Johnson-Lindenstrauss with binary coins.” Journal of Computer and System Sciences 66.4 (2003): 671-687.

Ailon, Nir, and Bernard Chazelle. “Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform.” Proceedings of the thirty-eighth annual ACM symposium on Theory of computing. ACM, 2006.

Drineas, Petros, et al. “Fast approximation of matrix coherence and statistical leverage.” Journal of Machine Learning Research 13.Dec (2012): 3475-3506.

Drineas, Petros, et al. “Faster least squares approximation.” Numerische Mathematik 117.2 (2011): 219-249.

Frankl, Peter, and Hiroshi Maehara. “The Johnson-Lindenstrauss lemma and the sphericity of some graphs.” Journal of Combinatorial Theory, Series B 44.3 (1988): 355-362.

Mahoney, Michael W. “Lecture Notes on Randomized Linear Algebra.” arXiv preprint arXiv:1608.04481 (2016).
