RandNLA for LS: A Brief Introduction – Part II

In the previous article we took a first look at the pioneering work of Michael Mahoney and Petros Drineas in Randomized Numerical Linear Algebra. Our ultimate goal was to speed up the computation of the ordinary least-squares solution vector, {\beta_{LS}}, from {O(nd^2)} time to at least {o(nd^2)} time. We reviewed two simple and fairly “slow” but powerful algorithms which simplified linear systems of equations with a large number of constraints (the classical {n>>p} setting of tall data). The main idea was to preprocess the input matrix {X \in {\mathbb R}^{n \times d}} in such a way that we obtained a “representative” sketch in a lower dimensional space ({\tilde{X} \in {\mathbb R}^{r \times d}}). The term “representative” could be defined in two different ways, always with respect to the ultimate goal: a good approximation to the least squares (LS) solution. First, a good representation in a lower dimensional space could be interpreted as a subspace embedding that maintains pairwise distances up to an {\epsilon}-degree (via Johnson-Lindenstrauss transforms). This idea was the foundation of the random projection algorithm. Second, “representative” could be defined in terms of “leverage” on the LS fit. A natural measure was the so-called statistical leverage score, which could be extracted from the diagonal elements of the hat matrix or, equivalently, from the Euclidean row norms of the orthonormal basis {U} for {range(X)} computed from the QR decomposition or SVD.

The main difference between the two methods was that the projection approach was data-agnostic, meaning that we did not have to compute any statistics of the data in the first place. Sampling, on the other hand, involved some serious preprocessing in the form of establishing an importance sampling distribution. Both methods have advantages and disadvantages: Computing the exact leverage scores (random sampling) required a full singular value decomposition. Computing the SVD is usually (disregarding sparse variants) already as expensive as solving the complete LS problem in the first place. Furthermore, it is unclear whether or not leverage is a “robust” measure of being a representative sample from the underlying data-generating process: eliminating one high-leverage observation might dramatically affect the complete sampling distribution. On the other hand, computing the ordinary Johnson-Lindenstrauss transform in the random projection variant using a dense matrix {\Pi \in {\mathbb R}^{r \times n}} would take at least {O(ndr)} time (matrix-matrix multiplication). Since usually {r > d}, we would not obtain an improvement.

In today's article we are going to introduce the Fast Johnson-Lindenstrauss Transform (FJLT). This result is the foundation of two very important concepts which speed up the computation of an {\epsilon}-approximation to the LS objective function and the target vector: First, the FJLT can be used to quickly project the original data matrix into an {r}-dimensional subspace in which the leverage scores are uniformized. In this way we are able to sample rows uniformly at random. Second, it is also going to allow us to approximate the actual leverage scores in a fast way. Both concepts rely on fast Fourier (or Hadamard-based) transforms. Ultimately, this is going to speed up computational complexity from {O(nd^2)} to {O(ndlog(r))}.

This article is structured as follows: First, I introduce the concept of a fast Johnson-Lindenstrauss subspace embedding. Afterwards, I outline a fast random sampling algorithm for LS. In the end, we analyze complexity and discuss further research questions as well as possible extensions and improvements. Again, the main reference is Mahoney (2016).

1. The Fast Johnson Lindenstrauss Transform

Remember that the goal of all our efforts is to approximate the solution of a system of linear equations {\beta_{LS} \in \mathbb{R}^d} by a vector {\tilde{\beta}}, and to do this in a reasonably fast way (faster than {O(nd^2)}). The main ingredient for this is going to be the fast brother of the regular Johnson-Lindenstrauss transform: the fast subspace Johnson-Lindenstrauss transform. Mahoney (2016, 68) defines it in the following way:

Definition 1 (Fast Johnson-Lindenstrauss Transform – FJLT) Given an {\epsilon >0 } and an orthogonal matrix {U \in {\mathbb R}^{n \times d}} viewed as d vectors in {{\mathbb R}^n}. An FJLT projects vectors from {{\mathbb R}^n} to {{\mathbb R}^r} such that the orthogonality of U is preserved, and it does so quickly. I.e., {\Pi \in {\mathbb R}^{r \times n}} is an {\epsilon}-FJLT if

  • Orthogonality preservation: {||I_d - U^T \Pi^T \Pi U||_2 \leq \epsilon }
  • Fast: {\forall X \in {\mathbb R}^{n \times d}} we can compute {\Pi X} in {O(ndlog(r))} time.

The first condition requires us to maintain the orthogonality of the orthonormal basis U even after processing it with {\Pi}. So how can we define such a matrix? Ailon and Chazelle (2006) proposed the following Hadamard-based construction, which relies on fast Fourier/Hadamard preprocessing of the input matrix:

\displaystyle \Pi_{FJLT} = PHD

  • {P \in {\mathbb R}^{r \times n}} – Sparse JL matrix/uniform sampling matrix (Achlioptas (2003), Frankl and Maehara (1988)), where {q = \min\{1, \Theta (\frac{log^2 (n)}{d} )\}}
    \displaystyle P_{ij} = \begin{cases} N(0, q^{-1}) & \text{with probability } q \\ 0 & \text{with probability } 1-q \end{cases}
  • {H \in {\mathbb R}^{n \times n}} – DFT or normalized Hadamard transform matrix: structured so that fast Fourier methods can be applied to compute it quickly. Setting {\tilde{H}_1 = 1}, it is defined recursively for all {n} that are a power of two: \displaystyle \tilde{H}_{2n} = \begin{bmatrix} \tilde{H}_{n} & \tilde{H}_{n} \\[0.3em] \tilde{H}_{n} & -\tilde{H}_{n} \end{bmatrix} \text{ and } H_n = \frac{\tilde{H}_n}{\sqrt{n}}
  • {D \in{\mathbb R}^{n \times n}} – Diagonal matrix whose entries take values {\pm 1} with probability {\frac{1}{2}} each: preprocesses bad (spiky) cases by randomization.

The matrix product {HD} fulfills two very important tasks (Drineas et al., 2012, 3449): First, it essentially flattens spiky vectors (“spreads out its energy”). Mathematically speaking, this means that we can bound the squared Euclidean norm of every row of the transformed orthonormal basis by a quantity of order {dlog(n)/n}, which uniformizes the leverage scores (each becomes roughly {d/n}). If {r} is large enough, this smoothing in combination with the sampling matrix {P} allows us to construct an FJLT with high probability. Furthermore, computing the {r} vectors of the FJLT-transformed matrix can be done in {O(ndlog(r))}, as desired. Hence, {\Pi_{FJLT} = PHD} can be viewed as a subsampled randomized Hadamard transform (Drineas et al., 2012, 3449), i.e. as an FJLT with high probability.
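
To make the construction concrete, here is a minimal R sketch of the subsampled randomized Hadamard transform {\Pi = PHD}. It assumes {n} is a power of two, forms the Hadamard matrix explicitly (a real implementation would apply the fast Walsh-Hadamard transform instead of building {H}), and takes {P} to be a uniform sampling-and-rescaling matrix; the function name srht and the chosen dimensions are illustrative, not prescribed by the references.

# Minimal sketch: subsampled randomized Hadamard transform, Pi = P H D.
# Assumes n = nrow(X) is a power of two.
srht <- function(X, r) {
  n <- nrow(X)
  # D: random sign flip of each row
  D <- sample(c(-1, 1), n, replace = TRUE)
  Xd <- X * D                                  # multiplies row i by D[i]
  # H: normalized Hadamard matrix, built recursively (explicit but slow;
  # the fast Walsh-Hadamard transform avoids forming H)
  H <- matrix(1)
  while (nrow(H) < n) H <- rbind(cbind(H, H), cbind(H, -H))
  Xhd <- (H / sqrt(n)) %*% Xd
  # P: sample r rows uniformly with replacement, rescale by 1/sqrt(r * 1/n) = sqrt(n/r)
  idx <- sample(n, r, replace = TRUE)
  sqrt(n / r) * Xhd[idx, , drop = FALSE]
}

# Example: sketch a 1024 x 5 matrix down to 200 rows
X <- matrix(rnorm(1024 * 5), 1024, 5)
X_sketch <- srht(X, r = 200)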

In this way, we obtain two potential algorithms for solving the above LS problem (Mahoney, 2016, 71):

  • (i) Random Projection Approach II: Uniformize leverage scores and use sparse projection matrix (or sample uniformly)
  • (ii) Random Sampling Approach: Use FJLT to compute approximate leverage scores and sample accordingly

In the next section we will focus on the second approach. There is no particular reason for this choice, except that it allows us to introduce a new idea: a fast approximation of the leverage scores, the quantity that drives the random sampling approach.

2. Doing things faster

Before we are able to outline the full random sampling algorithm, we need to find a fast way to compute the leverage scores. We saw that the exact computation was done via the SVD, but this did not speed things up. We can accelerate this step by applying the previous trick a second time: instead of calculating the exact leverage scores slowly, we calculate an approximation (with {\epsilon}-slack) in a fast way. Drineas et al. (2012) first proposed the following algorithm to approximate the leverage scores:

Drineas et al. (2012, 3451) – FJLT approximation for leverage scores

Input: {X \in \mathbb{R}^{n \times d}} with SVD {X = U^{(X)} \Sigma V^T} and an {\epsilon}-level
Output: Approximate leverage scores, {\tilde{l}_i, i=1,...,n} 
-----------------------------------------------------------------------------------
Let {\Pi_1 \in {\mathbb R}^{r_1 \times n}} be an {\epsilon}-FJLT for {U^{(X)}} with {r_1 = \Omega \left( \frac{dlog(n)}{\epsilon^2} log\left( \frac{dlog(n)}{\epsilon^2}\right) \right)}. 

Compute {\Pi_1 X} and its SVD/QR where {R^{(\Pi_1 X)}=\Sigma^{(\Pi_1 X)} V^{(\Pi_1 X)T}}. 

View the rows of {XR^{-1} \in {\mathbb R}^{n \times d}} as n vectors in {{\mathbb R}^d}. Let {\Pi_2 \in {\mathbb R}^{d \times r_2}} be an {\epsilon}-JLT for {n^2} vectors, with {r_2 = O(\frac{log(n)}{\epsilon^2})}. 

Return {\tilde{l}_i = ||(XR^{-1}\Pi_2)_i||_2^2}, an {\epsilon}-approximation of {l_i}.

It relies on first applying an FJLT to {X}. Afterwards one has to compute the SVD (or QR) of the transformed matrix, before applying another regular JLT to {XR^{-1}}. The complexity of the algorithm can be decomposed in the following way:

  • {\Pi_1 X} takes {O(ndlog(r_1))} since {\Pi_1} is an {\epsilon}-FJLT
  • SVD of {\Pi_1 X} takes {O(r_1 d^2)}
  • {R^{-1} \Pi_2} takes {O(r_2d^2)} since {\Pi_2} is an {\epsilon}-JLT
  • Premultiplying by X takes {O(ndr_2)} (plain matrix-matrix multiplication)
  • {\implies} Overall: {O(ndlog(r_1) + r_1 d^2 + r_2 d^2 + ndr_2)}

By choosing the parameters appropriately ({r_1 = O(\epsilon^{-2}dlog(n)(log(\epsilon^{-2}dlog(n))))}, {r_2 = O(\epsilon^{-2} log(n))}, {\epsilon} constant, {\delta = 0.1}, {d \leq n \leq e^d}), it follows that the algorithm has complexity {O(ndlog(d/\epsilon) + nd\epsilon^{-2}log(n) + d^3\epsilon^{-2}log(n)log(d\epsilon^{-1}))}.
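
A minimal R sketch of this approximation, reusing the srht function from above as {\Pi_1} and a plain Gaussian matrix as {\Pi_2}; the function name approx_leverage and the concrete choices of {r_1} and {r_2} are illustrative rather than the theoretically prescribed values.

# Minimal sketch: approximate leverage scores in the spirit of Drineas et al. (2012).
approx_leverage <- function(X, r1, r2) {
  d   <- ncol(X)
  SX  <- srht(X, r1)                               # Pi_1 X (FJLT sketch)
  R   <- qr.R(qr(SX))                              # R factor of the QR of Pi_1 X
  Pi2 <- matrix(rnorm(d * r2), d, r2) / sqrt(r2)   # epsilon-JLT
  B   <- X %*% solve(R, Pi2)                       # X R^{-1} Pi_2, an n x r2 matrix
  rowSums(B^2)                                     # approximate leverage scores
}

lev_approx <- approx_leverage(X, r1 = 200, r2 = 50)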

Using these fast leverage score approximations, we can construct an importance sampling distribution and apply the following algorithmic-leveraging procedure, first introduced by Drineas et al. (2011):

Drineas et al (2011) – “Fast” Random Sampling Algorithm for LS

Input: LS problem with {X \in \mathbb{R}^{n \times d}}, {y \in \mathbb{R}^{n}} and an {\epsilon}-level of approximation 
Output: Approximate LS solution, {\tilde{\beta}} 
---------------------------------------------------------------------------------
Compute the importance sampling distribution {p_i = \frac{1}{d} ||U_{(i)}||_2^2 = \frac{1}{d} ||Q_{(i)}||_2^2, \ \forall i=1,...,n}, where {U} and {Q} are the orthogonal matrices from either the SVD or the QR. 

Randomly sample {r = O(\frac{d log(d)}{\epsilon})} rows of {X} and {y}. Rescale them by {\frac{1}{\sqrt{rp_{i}}}} and form {\tilde{X} \in R^{r \times d}, \tilde{y} \in R^{r}}.

Solve {(\tilde{X}'\tilde{X}) \tilde{\beta} = \tilde{X}'\tilde{y}} using any standard direct method (Cholesky, QR or SVD).

Return {\tilde{\beta}}, an {\epsilon}-approximation of {\beta_{LS}}.
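
A minimal R sketch of the sample-rescale-solve step, given a vector of (exact or approximate) leverage scores such as lev_approx from above; the function name leverage_sample_ls is illustrative. The sampled rows are rescaled by {1/\sqrt{rp_i}} so that the sketched normal equations carry the {1/(rp_i)} importance weights.

# Minimal sketch: algorithmic leveraging for least squares.
leverage_sample_ls <- function(X, y, scores, r) {
  n <- nrow(X)
  p <- scores / sum(scores)                  # importance sampling distribution
  idx <- sample(n, r, replace = TRUE, prob = p)
  w <- 1 / sqrt(r * p[idx])                  # row rescaling
  X_tilde <- w * X[idx, , drop = FALSE]
  y_tilde <- w * y[idx]
  # solve the reduced normal equations
  solve(crossprod(X_tilde), crossprod(X_tilde, y_tilde))
}

beta0 <- rnorm(5)
y <- as.vector(X %*% beta0 + rnorm(nrow(X)))
beta_tilde <- leverage_sample_ls(X, y, lev_approx, r = 200)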

So what makes this algorithm work in practice? Given that {r} is large enough, the two crucial structural conditions on {\Pi} (see below) are going to be fulfilled by the subsampled randomized Hadamard transform:

\displaystyle \textbf{Rotation/Isometry Requirement:} \ \ \sigma_{min}^2 (\Pi U^{(X)}) \geq \frac{1}{\sqrt{2}} \ \ \ \ \ (1)

\displaystyle \textbf{Subspace Embedding:} \ \ ||U^{(X)T} \Pi^T \Pi y^\perp||_2^2 \leq \frac{\epsilon}{2} RSS_{min} \ \ \ \ \ (2)

Both of them are needed to obtain a proof for the following quality-of-approximation result such as in Drineas et al (2011). They essentially allow us to find explicit bounds on {||U^{(X)T} \Pi^T \Pi y^\perp||_2^2}, which in turn can be used to bound the randomized residual sum of squares (see Drineas et al (2011) for complete proof). This again bounds the possible deviation the randomized estimator can take from the least-squares solution:

Theorem 1 (Least Squares Quality of Approximation) The vector {\tilde{\beta}} returned by the above random sampling (and random projection) algorithms is such that with probability at least 0.8

\displaystyle \textbf{Solution Certificate:} \ \ ||\tilde{\beta} - \beta_{LS}||_2 \leq \sqrt{\epsilon} \kappa(X) \sqrt{\gamma^{-2} - 1} ||\beta_{LS}||_2 \ \ \ \ \ (3)

\displaystyle \textbf{Objective Function:} \ \ ||X\tilde{\beta} - y||_2 \leq (1 + \epsilon) ||X \beta_{LS} - y||_2 \ \ \ \ \ (4)

where {\kappa(X)} is the condition number of {X} and {\gamma} denotes the fraction of the norm of {y} that lies in the column space of {X}:

\displaystyle \gamma = \frac{||U^{(X)}U^{(X)T} y||_2}{||y||_2}

In order to assess the statistical properties of this estimator, I construct Monte Carlo experiments, which are displayed in the figure below. The left panels plot the ratio between the Euclidean-norm approximation error of the random sampling estimator (measured against the true data-generating vector) and the right-hand side of the quality-of-approximation result for the solution certificate. In 1000 simulation runs I draw samples of independently normally distributed data ({X \in R^{1000 \times 5}}), multiply them with a five-dimensional vector {\beta_0} and add {N(0,1)} noise to obtain {y}. For each of the 1000 samples I compute one random sampling estimator {\tilde{\beta}_{RS}} and take the Euclidean norm of the difference. The resulting figure plots the histogram and a corresponding kernel density estimate. The right panels, on the other hand, plot the same comparison against a simple Cholesky-decomposition-based least-squares estimator, {\beta_{LS}^{CHOL}}. The estimators in the first row sample randomly according to the leverage scores. We can directly see that the above theorem holds: more than 0.8 of the probability mass lies below 1. Hence, in at least 80 percent of the cases {||\tilde{\beta} - \beta_{LS}||_2 \leq \sqrt{\epsilon} \kappa(X) \sqrt{\gamma^{-2} - 1} ||\beta_{LS}||_2 } holds. Furthermore, it becomes apparent that the random sampling estimator does a far better job at approximating the least-squares estimator than the true data-generating vector {\beta_0}. Intuitively, this makes a lot of sense: one of LS's well-known weaknesses is its sensitivity to outliers. Outliers often have large leverage scores, which translate into higher weights in the sampling procedure that underlies the randomized estimator. Hence, this weakness of LS carries over to the randomized estimator. We can conclude that the random sampling estimator does a very good job at what it is intended to do: provide an approximation to the least-squares estimator.

[Figure: Monte Carlo comparison of the quality-of-approximation ratios for the leverage-score-based (first row) and influence-score-based (second row) random sampling estimators, measured against the true {\beta_0} (left) and the Cholesky-based LS estimator (right).]
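
For reference, a minimal R sketch of one such simulation run, reusing leverage_sample_ls from above; the constants and the use of exact leverage scores here are illustrative choices rather than the exact settings behind the figure.

# Minimal sketch: one Monte Carlo replication of the comparison above.
set.seed(1)
n <- 1000; d <- 5; r <- 200
X <- matrix(rnorm(n * d), n, d)
beta0 <- rnorm(d)
y <- as.vector(X %*% beta0 + rnorm(n))
beta_ls <- qr.solve(X, y)                      # full least-squares solution
lev <- rowSums(qr.Q(qr(X))^2)                  # exact leverage scores
beta_rs <- leverage_sample_ls(X, y, lev, r)    # randomized estimator
c(error_vs_ls    = sqrt(sum((beta_rs - beta_ls)^2)),
  error_vs_beta0 = sqrt(sum((beta_rs - beta0)^2)))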

In order to obtain a more robust estimator one may use influence scores instead of leverage scores. Influence scores are supposed to be a less sensitive measure and are defined in the following way:

\displaystyle I_i = ||U^{(X,y)}_{(i)}||_2^2 = L_i + \frac{\hat{e}_i^2}{\sum_{j=1}^n \hat{e}_j^2}

While the leverage scores completely ignore any information in the response vector {y}, influence scores incorporate this information. We easily compute the influence scores by appending {y} to the design matrix and computing the diagonal elements of the hat matrix of this augmented matrix in {{\mathbb R}^{n \times (d+1)}}. This seems only natural since we are constructing sketches not only of {X} but also of {y}. The second row of the figure displays the empirical distribution of the quality-of-approximation result for the estimator based on the influence scores. The estimator clearly performs better in both situations!
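
A minimal R sketch of this computation (the function name influence_scores is illustrative):

# Minimal sketch: influence scores via the hat matrix of the augmented matrix [X, y].
influence_scores <- function(X, y) {
  Q <- qr.Q(qr(cbind(X, y)))   # orthonormal basis for range([X, y])
  rowSums(Q^2)                 # diagonal of the corresponding hat matrix
}
infl <- influence_scores(X, y)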

3. Open Research Questions and a Bunch of Potential Applications

As we have seen, RandNLA is a powerful framework that allows us to speed up solutions to fundamental linear algebra problems. It can basically be applied to any ultra-high-dimensional matrix-matrix multiplication or linear-system problem that you can imagine. Therefore, RandNLA has huge potential. My master thesis, for example, is going to deal with an application to Generalized Linear Models (GLMs) and the family of exponentially linked models. Furthermore, one could think of applications in deep learning. Many architectures already make use of randomization in the form of stochastic gradient descent and dropout. The main reason for this is that it helps avoid getting stuck in local optima. Maybe one could also obtain similar results by working with random sketches and exact gradients. If you are interested and want to chat about it, feel free to contact me! Again, the code can be found on my GitHub.

References

Achlioptas, Dimitris. “Database-friendly random projections: Johnson-Lindenstrauss with binary coins.” Journal of computer and System Sciences 66.4 (2003): 671-687.

Ailon, Nir, and Bernard Chazelle. “Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform.” Proceedings of the thirty-eighth annual ACM symposium on Theory of computing. ACM, 2006.

Drineas, Petros, et al. “Fast approximation of matrix coherence and statistical leverage.” Journal of Machine Learning Research 13.Dec (2012): 3475-3506.

Drineas, Petros, et al. “Faster least squares approximation.” Numerische Mathematik 117.2 (2011): 219-249.

Frankl, Peter, and Hiroshi Maehara. “The Johnson-Lindenstrauss lemma and the sphericity of some graphs.” Journal of Combinatorial Theory, Series B 44.3 (1988): 355-362.

Mahoney, Michael W. “Lecture Notes on Randomized Linear Algebra.” arXiv preprint arXiv:1608.04481 (2016).

BGSE Data Talks – 001 – Ioannis Kosmidis

In this series we interview professors and contributors to our fields of passion: computational statistics, data science and machine learning. The first interviewee is Dr. Ioannis Kosmidis. He is a Senior Lecturer at University College London, and during our first term he taught a workshop on GLMs and statistical learning in the n>>p setting. Furthermore, his research interests also focus on data-analytic applications for sport sciences and health.

Nandan: First of all, thank you very much for taking the time to talk to us. We are very eager to learn more from you, so let's get started with our first question: Which papers have fundamentally changed the way you think?

Kosmidis: There are many papers that have deeply affected me, but let me give you my top 3:

The first one is a paper by Bradley Efron from 1975. Statisticians have known that procedures and models that are linear (e.g. canonically-linked exponential family models) have nice properties in learning and inference problems. In his paper Efron provided a framework that describes how much one can deviate from these linear cases before these good properties start deteriorating. It's a beautiful paper, which concentrates elements from 70 years' worth of statistical thinking. Furthermore, it is accompanied by an amazing discussion, including topics on the construction of confidence intervals, second-order efficiency and unbiased estimators. All of these concepts are put under the same umbrella, and quantified by one key quantity: the statistical curvature. This is one of the papers that helped me a lot in my own research and made me think hard about the models I use and the impact they have on what I want to achieve with an analysis.

The second paper is well known to graduate students in Data Science: Leo Breiman's paper about the two cultures in statistics and machine learning – inference and prediction. In that paper, Breiman has a pro-prediction attitude, arguing that the relevant problems are prediction-based. One of my favorite parts of that paper is actually Cox's commentary. Cox tries to balance Breiman's enthusiasm for prediction in a very elegant way. He argues that optimal prediction as a target is often dangerous; in many settings in statistics you are faced with the following problem: data can only be collected under certain conditions, but predictions are only useful under a different set of conditions. For example, think of modeling the spread of a new dangerous virus. Of course you wouldn't have past data on its behavior. So without any idea of the dynamics you are trying to model and by using a black-box prediction algorithm, you might get strong predictive results which are actually useless for the question at hand. One of the things that a statistician ought to be good at is recovering data-generating mechanisms, testing scientific hypotheses, and developing the means to do so.

And finally, there is a wonderful report by Donoho, which is called '50 Years of Data Science'. Starting with the fundamental work by John Tukey, Donoho outlines the history of our craft. He asks fundamental questions about what defines a “Data Scientist” and how Data Science should be taught. He comes up with a list of courses that a Data Science program should include. For example, I fully agree with his proposal to include a course on “Legal, Policy, and Ethical Considerations for Data Scientists”. Furthermore, a course that takes an in-depth look at the connection between causality and experimental design should find a home in every Data Science program. Clever experimental design is key for good inference.

It is really important to be clear about which question you are trying to answer and to stay open whenever a new data analytics setting comes in. You can’t answer everything with a neural network. In some cases it is much better to have a model with only few parameters, which actually explains something. In this way subject matter experts are able to improve their work.  This is much harder with complex black boxes.

Robert: Successful companies such as Google DeepMind, SpaceX and Alphabet Inc. are largely focused on optimizing certain prediction tasks by building complex models (Deep Reinforcement Learning, etc.). Do you think that this hype could pose a potential danger for the state of inference?

Kosmidis: No, not at all. Such developments are hugely important and very inspirational. In order to understand these systems we need complex models. But this is not all. These extremely non-parametric models often give you a baseline, something to try to outperform with a model that encodes common sense and scientific knowledge, and can be used for inference on relevant hypotheses.

Robert: What do you think about p-value manipulation and the fact that so far (almost) only significant effects are being published in economics, social sciences and psychology? Does scientific evolution really apply to those fields?

Kosmidis: In 2011 Jonathan Schooler proposed in a Nature paper to introduce a repository of negative or non-important results. Whenever you get a non-important result, there are two things that might have happened: there is actually no effect, or you have used the wrong methods to try to discover an effect. If we had a repository of all those analyses, one could try and re-attempt them.

A move towards  raising awareness of the issues you are referring to is the ASA’s statement on p-values from 2016. The statement outlines 6 things p-values are not and is signed by well-known scientists.

Nandan: Coming to one of your key research areas – data analytics for health and fitness. How do you imagine our data-driven health system evolving in the next 10 to 20 years?

Kosmidis: I imagine that there is going to be a huge revolution in data-analytic solutions for health and fitness. All of us carry many devices that can now communicate in a very efficient way with each other. They collect data on a regular schedule and can update us on the potential actions we should take to improve, for example, our fitness. In a sense, the algorithms will be able to map the data that we generate to our idea of the future, and, putting together inference and prediction, you as a user will be able to intervene on yourself, online.

Nandan: Jumping to another topic: What are the most valuable, annoying or essential attributes of PhD students that you have supervised or hope to supervise? And is there any final advice you can give to young and aspiring Data Science students?

Kosmidis: I believe that the most important quality for a PhD student is curiosity.

My advice: Keep an open mind!

Robert and Nandan: That wraps it up. Again, thank you very much for your time and thoughts. It was a pleasure to learn from you.

REFERENCES

Breiman, Leo. “Statistical modeling: The two cultures (with comments and a rejoinder by the author).” Statistical Science 16.3 (2001): 199-231.

Donoho, David. “50 years of Data Science.” Princeton NJ, Tukey Centennial Workshop. 2015.

Efron, Bradley. “Defining the curvature of a statistical problem (with applications to second order efficiency).” The Annals of Statistics (1975): 1189-1242.

Schooler, Jonathan. “Unpublished results hide the decline effect.” Nature 470.7335 (2011): 437.

Wasserstein, Ronald L., and Nicole A. Lazar. “The ASA’s statement on p-values: context, process, and purpose.” The American Statistician (2016).

RandNLA for LS: A Brief Introduction – Part I

Dimensionality reduction is a topic that has governed our (the 2017 BGSE Data Science cohort) last three months. At the heart of topics such as penalized likelihood estimation (Lasso, Ridge, Elastic Net, etc.), principal component analysis and best subset selection lies the fundamental trade-off between complexity, generalizability and computational feasibility.

David Rossell taught us that even if we have found a methodology to compare across models, there is still the problem of enumerating all models to be compared. At this point the bias-variance trade-off turns into a question of computational power. Evaluating every single model using Bayes Factors or other performance measures just becomes way too expensive if the number of potential predictors exceeds 30 ({2^p} models). Instead of going the brute-force way, one usually makes use of computational heuristics, which are clearly associated with shortcomings. But what if we could sketch each model's performance measure up to a certain degree of certainty?

Randomized Numerical Linear Algebra (RandNLA) is a fairly new and hot research field which intends to speed up fundamental linear algebra operations by making use of either random sampling or random projections. The ultimate aim is to compress dimensionality in a sensible manner while obtaining a ({1 \pm \epsilon})-approximation of the objective. The top names in this field are Drineas, Kannan and Mahoney (2006 a,b,c). All of them have done pioneering work in the field of randomized algorithms for large matrices. Their work covers a whole branch of computational approximations such as randomized versions of matrix-matrix multiplication, least squares (LS), fast leverage score approximations and sparse matrix completion. The following two-part series is mainly based on the lecture notes of Mahoney (2016) and intends to introduce the audience to an alternative, randomized perspective on solving the standard LS problem. Therefore, this article is going to review two simple (and fairly “slow”) but powerful ideas which simplify linear systems of equations with a large number of constraints (the classical {n>>p} setting of tall data). The next article is going to improve upon these two algorithms by widening the computational bottlenecks. Thereby we are able to speed up computational complexity from {O(nd^2)} to {O(ndlog(r))}. So let's get started!

The Translation of the LS Problem

The standard LS problem can be summarized by the following minimized objective function (the residual sum of squares, RSS):

\displaystyle RSS_{min} = \min_{\beta \in \mathbb{R}^d}||y - X\beta||_2^2

where {y \in \mathbb{R}^n}, {X \in \mathbb{R}^{n \times d}} and {\beta \in \mathbb{R}^d}. After “doing the math” (multiplying out, taking the gradient, yada yada) LS boils down to solving the following normal equations

\displaystyle (X'X) \beta_{LS} = X'y

Usually, this system is solved using Cholesky, QR decomposition or the SVD. The complexity of all methods is {O(nd^2)} (different constant factors – depending on how well-conditioned {X} is).

Randomized algorithms intend to work on the tall {n}-side of things (although it is also easily possible to switch sides). Instead of crunching the full {n \times d} matrix {X} and the n-dimensional response vector {y}, we want to do something smart: construct a sketch of {X} and {y}, {\tilde{X} \in R^{r \times d}} and {\tilde{y} \in R^{r}}, which drastically reduces the number of rows to {r << n}. Let's write the transformed objective function as \displaystyle \widetilde{RSS}_{min} = \min_{\beta \in \mathbb{R}^d}||\Pi (y - X\beta)||_2^2 where {\Pi \in \mathbb{R}^{r \times n}} denotes some {(r \times n)} transformation matrix. Afterwards, we solve this reduced system and obtain the vector {\tilde{\beta}_{LS}}. There are two objectives which we intend to achieve:

\displaystyle \textit{Solution Certificate:} \ \tilde{\beta}_{LS} \approx \beta_{LS}

\displaystyle \textit{Objective Function:} RSS_{min} \approx \widetilde{RSS}_{min}

Clearly, the second objective is more relaxed than the first. The solution space that fulfills the second objective is a subset of the solution space for the more strict first condition. Usually, it only kicks in if the solution certificate is hard to achieve in the first place. The natural question that pops up is: What are reasonable structural requirements to an approximation scheme?

Mahoney (2016, 57) states the following two: Let X = {U_X \Sigma_X V_X^T} denote the singular value decomposition of {X}. Furthermore, let {y^\perp = (I_n - U_X U_X^T) y} denote the residual vector, resulting from the fact that not all of {y} can be “reached” by {span(X)} (note that {RSS_{min} = ||y^\perp||_2^2}).

\displaystyle \textit{Rotation/Isometry Requirement:} \ \sigma_{min}^2 (\Pi U_X) \geq \frac{1}{\sqrt{2}}

The first structural condition states that we require a lower bound on the singular values of {\Pi U_X}. This basically means that we want {\Pi} to be an approximate isometry/rotation after reducing the dimensionality to r (Mahoney, 2016, 57). Hence, the transform is not allowed to distort the original problem, and therefore the fit, too much.

\displaystyle \textit{Subspace Embedding:} ||U_X^T \Pi^T \Pi y^\perp||_2^2 \leq \frac{\epsilon}{2} RSS_{min}

Standard LS minimizes the RSS by orthogonally projecting y onto the column space of X. The orthogonal projection is described by a linear combination defined by {\beta}. The second structural condition therefore requires that this orthogonality approximately “survives” the transformation. Hence, {\Pi y^\perp} has to be approximately orthogonal to {\Pi U_X}. All of the following algorithms are going to fulfill these conditions (for proofs see Mahoney (2016)). One can already tell that the number of dimensions to which we reduce has to be related to the quality of our approximation. All proofs essentially work out this {\epsilon}-{r} trade-off by making use of matrix concentration inequalities.

A First Slow Random Sampling Algorithm

{\Pi} can be viewed in two different ways: either as a sampling matrix or as a projection matrix. Let us first consider the former case: This clearly has to be a non-data-agnostic way of constructing {\Pi} (we have to do computations with X to obtain some sampling reference). Each row of {\Pi} contains one non-zero entry, which represents a rescaled sampled row (note: we sample with replacement). Intuitively we want to sample observations that highly influence the original LS fit more often. A natural measure of the influence of each data point on its own in-sample fit is the statistical leverage. It tells us how much a specific data point pulls the linear model fit towards itself. A data point with leverage 1 “occupies” one complete parameter/degree of freedom. Hence, we are interested in computing the diagonal elements of the hat matrix and sampling according to exactly these leverage scores. It can be shown that {H_{ii} = ||U_{(i)}||_2^2 = ||Q_{(i)}||_2^2}, so that {p_i = \frac{1}{d} H_{ii}} defines a proper importance sampling distribution, and hence we have a first potential randomized algorithm:

Mahoney (2016, 65) – “Slow” Random Sampling Algorithm for LS

Input: LS problem with {X \in \mathbb{R}^{n \times d}}, {y \in \mathbb{R}^{n}} and an {\epsilon}-level of approximation 
Output: Approximate LS solution, {\tilde{\beta}} 
--------------------------------------------------------------------------
Compute the importance sampling distribution {p_i = \frac{1}{d} ||U_{(i)}||_2^2 = \frac{1}{d} ||Q_{(i)}||_2^2, \ \forall i=1,...,n}, 
where {U} and {Q} are the orthogonal matrices from either the SVD or the QR. 

Randomly sample {r = O(\frac{d log(d)}{\epsilon})} rows of {X} and {y}. Rescale them by {\frac{1}{\sqrt{rp_{i}}}} and form {\tilde{X} \in R^{r \times d}, \tilde{y} \in R^{r}}.

Solve {(\tilde{X}'\tilde{X}) \tilde{\beta} = \tilde{X}'\tilde{y}} using any of the above mentioned methods. 

Return {\tilde{\beta}}, an {\epsilon}-approximation of {\beta_{LS}}.
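
A minimal R sketch of the exact importance sampling distribution used above, via the thin QR of {X} (the {U} from the SVD would give the same row norms); the function name exact_leverage_probs is illustrative:

# Minimal sketch: exact leverage scores and the induced sampling distribution.
exact_leverage_probs <- function(X) {
  Q <- qr.Q(qr(X))            # orthonormal basis for range(X)
  lev <- rowSums(Q^2)         # exact leverage scores, summing to d
  lev / ncol(X)               # p_i = ||Q_(i)||_2^2 / d
}
# p <- exact_leverage_probs(X)   # sampling probabilities for the algorithm above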

As described above, we basically construct a sampling distribution by computing the exact leverage scores. The problem with this algorithm is that computing the SVD of {X} is already almost as expensive as solving the full problem in the first place. Hence, we need a fast way to compute an approximate leverage score. Before addressing this issue in the following article, let us take a look at a different point of view: dimensionality reduction through random projections.

A First Slow Random Projection Algorithm

Random projections are based on one key result (which exists in many different versions): the Johnson-Lindenstrauss Lemma and the resulting transformation.

Lemma – Soft Gaussian-version of Johnson-Lindenstrauss Lemma (Mahoney (2016, 45))

Given n points {\{u_i\}_{i=1}^n}, each of which is in {\mathbb{R}^d}, let {P \in \mathbb{R}^{d \times k}} be such that 
{P_{ij} = \frac{1}{\sqrt{k}} N(0,1)} and let {\{v_i\}_{i=1}^n} be points in {\mathbb{R}^k} defined as {v_i = u_i P}.  
Then if {k \geq 9 \frac{log(n)}{\epsilon^2 - \epsilon^3}}, for some {\epsilon \in (0,0.5)}, with probability at least 0.5, 
all point-wise distances are preserved, {\forall \ i, i'} we have 

\displaystyle (1-\epsilon) ||u_i - u_{i'}||_2^2 \leq ||v_i - v_{i'}||_2^2 \leq (1+\epsilon) ||u_i - u_{i'}||_2^2

Pause for thought – this is an amazing result! Instead of having to precompute any metric before being able to do anything, there is an easy way to obtain a lower dimensional representation of our original input that preserves the original lengths and point-wise distances. The proof is based on a combination of Chernoff bounds and Boole's inequality and can again be found in Mahoney (2016, 46 et seq.).
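
A small numerical illustration of the Gaussian construction in R (the dimensions and the choice of {\epsilon} are illustrative):

# Minimal sketch: Gaussian Johnson-Lindenstrauss projection of n points in R^d.
set.seed(1)
n <- 100; d <- 5000; eps <- 0.2
k <- ceiling(9 * log(n) / (eps^2 - eps^3))
U <- matrix(rnorm(n * d), n, d)              # n points (rows) in R^d
P <- matrix(rnorm(d * k), d, k) / sqrt(k)    # random projection matrix
V <- U %*% P                                 # n points in R^k
ratio <- as.vector(dist(V))^2 / as.vector(dist(U))^2
range(ratio)   # with probability >= 0.5, all ratios lie in [1 - eps, 1 + eps]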

Definition – Johnson-Lindenstrauss Transform (Mahoney (2016, 68))

Given an {\epsilon >0 } and {n} points {\{x_i\}_{i=1}^n}, where {x_i \in \mathbb{R}^d}, an {\epsilon}-JLT ({\epsilon}-
Johnson-Lindenstrauss Transform), denoted {\Pi \in \mathbb{R}^{r \times d}} is a projection of the points 
into {\mathbb{R}^r} such that

 \displaystyle (1-\epsilon) ||x_i||^2_2 \leq ||\Pi x_i||^2_2 \leq (1+\epsilon) ||x_i||^2_2 \ \ \ \ \ (1)

The Johnson-Lindenstrauss transform maps the d-dimensional points into an r-dimensional space, while preserving their lengths and their point-wise distances up to an {\epsilon}-proportion!

Most constructions of {\Pi} are based either on Gaussian {N(0,1)} random samples or on binary coin tosses (Achlioptas, 2003). In terms of computational efficiency, binary coin tosses are a lot faster! This is due to the fact that simulating Gaussian samples (e.g. via the Box-Muller transformation of uniform samples) is usually a lot harder than just tossing a fair coin. The general version of this algorithm looks like this:

Mahoney (2016, 65) – “Slow” Random Projection Algorithm for LS

Input: LS problem with {X \in \mathbb{R}^{n \times d}}, {y \in \mathbb{R}^{n}} and an {\epsilon}-level of approximation 
Output: Approximate LS solution, {\tilde{\beta}} 
--------------------------------------------------------------------------
Construct a random projection matrix {\Pi \in \mathbb{R}^{r \times n}} of Gaussians or binary {\pm 1} coin flips. 

Project {X} and {y} onto {r = O(\frac{d log(d)}{\epsilon})} dimensions and form {\tilde{X} \in R^{r \times d}, \tilde{y} \in R^{r}}. 

Solve {(\tilde{X}'\tilde{X}) \tilde{\beta} = \tilde{X}'\tilde{y}} using any of the above mentioned methods.

Return {\tilde{\beta}}, an {\epsilon}-approximation of {\beta_{LS}}.
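
A minimal R sketch of this projection-based variant, using the {\pm 1} construction in the spirit of Achlioptas (2003); the function name project_ls and the synthetic data are illustrative:

# Minimal sketch: sketch-and-solve least squares via a dense random projection.
project_ls <- function(X, y, r) {
  n <- nrow(X)
  Pi <- matrix(sample(c(-1, 1), r * n, replace = TRUE), r, n) / sqrt(r)
  X_tilde <- Pi %*% X                         # the O(ndr) matrix-matrix product
  y_tilde <- as.vector(Pi %*% y)
  solve(crossprod(X_tilde), crossprod(X_tilde, y_tilde))
}

# Example with synthetic data
X <- matrix(rnorm(2000 * 10), 2000, 10)
y <- as.vector(X %*% rnorm(10) + rnorm(2000))
beta_proj <- project_ls(X, y, r = 300)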

Sadly, this algorithm also has computational downsides, associated with the computation of the matrix-matrix product of {\Pi} and {X}, which already has a complexity of {O(ndr)}.

Computational Bottlenecks and an Outlook

The presented randomized algorithms provide a {(1 \pm \epsilon)}-approximation to the LS objective function value and an {\epsilon}-approximation to the solution vector. But the ultimate goal of all our effort is to speed up computation time. Sadly, both of the proposed algorithms do not improve on the {O(nd^2)} complexity of standard LS algorithms.

In the random sampling algorithm, computing the SVD to obtain exact leverage scores is almost as expensive as solving the whole LS problem in the first place. Computing the matrix-matrix product {\Pi X} in the projection scenario already has a complexity of {O(ndr)}, which is usually larger than {O(nd^2)}. Widening these bottlenecks is going to be part of the next article (hint: in both cases we are going to have to make use of Fast Fourier or Fast Johnson-Lindenstrauss transforms (Ailon and Chazelle, 2006)).

The process is going to heavily rely on smart data preprocessing and the use of fast projection methods. This will enable us either to construct fast approximate leverage scores and sample according to this importance sampling distribution (fast random sampling), or to approximately uniformize the leverage scores by projection, which will allow us to sample uniformly at random (fast random projection). If you are interested in an R implementation of the algorithms in this blog post, please check out my GitHub repo.

References

Achlioptas, Dimitris. “Database-friendly random projections: Johnson-Lindenstrauss with binary coins.” Journal of computer and System Sciences 66.4 (2003): 671-687.

Ailon, Nir, and Bernard Chazelle. “Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform.” Proceedings of the thirty-eighth annual ACM symposium on Theory of computing. ACM, 2006.

Drineas, Petros, Ravi Kannan, and Michael W. Mahoney. “Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication.” SIAM Journal on Computing 36.1 (2006 a): 132-157.

Drineas, Petros, Ravi Kannan, and Michael W. Mahoney. “Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix.” SIAM Journal on Computing 36.1 (2006 b): 158-183.

Drineas, Petros, Ravi Kannan, and Michael W. Mahoney. “Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition.” SIAM Journal on Computing 36.1 (2006 c): 184-206.

Mahoney, Michael W. “Lecture Notes on Randomized Linear Algebra.” arXiv preprint arXiv:1608.04481 (2016).

About the Author

I (Robert Lange) graduated last year from the University of Cologne with a BSc in Economics. Reading Milton Friedman's “Essays in Positive Economics” and a passion for computational statistics led me to pursue a more empirically driven path. I quickly realized that there is only one Master's program in this field that is suited to fulfill the needs of the 21st century: Data Science at the Barcelona GSE. Ultimately, I want to combine data-analytic knowledge with problems that change the way we pursue our lives (e.g. bioinformatics and neuroscience). If you are interested, check out my homepage.

Competition Report: Monster Legends Datathon

Just a few days after officially starting our classes for the 1st term in the master program, we received an email about a gaming data hackathon being organized by SocialPoint – a big name in mobile gaming. The company is spread across 7 floors of a 10-storey building in the heart of Barcelona with amazing views of the city. Having just finished the brush-ups, we were not quite aware of the kind of data challenges that lay ahead in such competitions but it was surely going to be a learning experience so we went ahead and registered a team.

We arrived at the venue on the morning of 15 Oct to learn about the challenge at hand. We were provided with gameplay data from the company’s most successful game Monster Legends. For our readers with some time at their disposal (not quite the case with most of us doing the master), you should try this game but I shall warn you it can be addictive! The basic aim of the game is to collect the strongest monsters and build up your team of monsters to win the maximum battles against other players. And this brings me to the problem we had to solve: Given the list of monsters currently owned by a player, predict the monster they don’t already have and are most likely to choose next.

We were given:

  • A training dataset of 70,000 players and all the monsters they owned.
  • A test dataset of 30,000 players with one random monster removed from their list of monsters – This removed item was to be predicted
  • Basic information about each of the 100,000 players (country, sex, platform, days_played, payer_dummy etc)
  • Features of each monster available in the game
  • A baseline Item Based Collaborative Filtering (IBCF) model in R which gave an accuracy of 15.5%

There were 2 tracks on which each team would be judged:

  1. Objective evaluation of the precision of the predictions which was accessible through a live leaderboard – it evaluated the submissions on 20% of the solutions to avoid overfitting.
  2. Subjective evaluation of a Business Insights presentation which was supposed to be delivered the next morning with suggestions to SocialPoint about how to possibly increase revenue from the game, given our analysis on the provided data.

There were cash prizes for both tracks – 1st and 2nd prize of €750 and €500 for the accuracy track and a unique prize of €500 for the business insights track.

We would be presenting our results to – Horacio Martos (CEO, SocialPoint), Sharon Biggar (Head of Analytics, SocialPoint), Manuel Bruscas (Director of Analytics, eDreams) and our program co-director Christian Fons-Rosen.

With all this information at our disposal, the 24-hour countdown timer started and we put on our thinking caps to brainstorm. We began with visualizing as much data as possible to get a general sense of the data provided. Apparently our brains were waiting for some calorie intake, as only after lunch (a fancy courtesy of the organizers) did we come up with a basic idea: model the game as a chronological event and predict the monster for a player based on the player's current level.

Two of us set out to implement this approach while the other two continued tweaking the baseline model in hopes of achieving a better precision. However, we failed to cross the 15% mark after several submissions, coupled with hours of stackoverflowing and reading about various techniques.

Much to our disappointment, our chronological approach also didn’t give very optimistic results and DeepYellow (our team) was still lingering around 17% by the time we left the venue on Saturday evening. Some of us stayed up all night trying to implement different ideas – clustering players on features other than level and other techniques like user-based filtering. I believe the competitiveness of the event with regular submissions by all teams even at 3AM and the idea of trying new models kept us going.

After all our efforts, we were still at 19% at the end of the 24 hours and clearly not the winning team – the winners managed to achieve 37% accuracy. Given our unsuccessful attempts at analyzing the data, we were not really motivated for the presentation and gave a very basic one which was not particularly insightful. However, we learned a lot from the presentations of the other teams.

At the end of the event, we were given a tour of the offices and we got back to our lecture slides albeit not with cash prizes but with ideas on practically facing a data problem and hopefully some acquired knowledge. Hackathons are always a fun way of squeezing out the creative side in you in a limited amount of time coupled with the bonus of meeting many like-minded new people.

Our team consisted of Akhil Lohia, Nandan Rao, Robert Lange and Roman Kopso (an exchange student at UPC who was included in our team to assure all teams had 4 members).

About the author (Akhil Lohia):

I graduated with a bachelor's degree in Economics from the Indian Institute of Technology Kanpur in June 2015. I have been involved in economics research projects in the capacity of a research assistant. My deep fascination with technology and the ever increasing use of data to augment and provide more useful services to users in their context led me to choose the Data Science program at BGSE in a smart city like Barcelona.

Finding an optimal group allocation

The Task

During our first week of class, the BGSE Data Science class of 2017 was asked to form groups of 3 (and one group of 4) for the upcoming group projects of the first term. The question on-hand was how and on what basis to form these groups.

After some discussions, we decided to go with the following approach: Each of the 25 students provides a ranking of their fellow students from 1 (most preferred) to 24 (least preferred), considering their peers' previous academic background, interests, career aspirations, etc. From this input we would then compute the group allocation that is collectively most favorable, i.e. the one which minimizes the sum over the group member links, weighted by their ranking.

Finding the optimum solution to this problem turned out to be non-trivial. In the following I shall discuss a possible solution to that problem.

Representation as Graph
First, note that the problem at hand can be formulated as a complete, undirected graph where each student represents a vertex and the weight of the edge between two vertices A and B is the sum of the rank student A assigned to student B and vice versa. The aggregated rank matrix therefore becomes the adjacency matrix of that graph.

The goal now is to form complete subgraphs of size 3 (and one subgraph with 4 vertices) such that the total weights of used edges gives the minimum.

Brute-Force
A graph with 25 vertices allows for over 2.3\cdot 10^{18} different combinations of group allocations:
{25 \choose 4}{21 \choose 3}{18 \choose 3}{15 \choose 3}{12 \choose 3}{9 \choose 3}{6 \choose 3} = 2308743493056000000

A brute-force approach – i.e. computing all possible combinations and then picking the combination with the minimum total edge weight – can therefore be deemed infeasible.
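
The count above can be checked directly in R:

prod(choose(c(25, 21, 18, 15, 12, 9, 6), c(4, 3, 3, 3, 3, 3, 3)))
# 2.308743e+18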

Literature Review

The next step was to review the literature for similar algorithmic problems and associated efficient solutions.
The stable matching (or marriage) problem considers finding a stable matching between two equally sized sets of elements given an ordering of preferences for each element. It differs from our problem in that each vertex is associated with one of two groups and it only allows for pair-wise matching. Furthermore, the solution to the problem is stable (i.e. no two individuals have an incentive to re-match) but not necessarily optimal from all individuals’ points of view.
In the similar stable roommate problem all elements (roommates) belong to a single pool; however, the problem is still restricted to pair-matching, and the efficient algorithm by Irving (1985) does not generalize to larger groups.

Proposed Algorithm
With no ready-to-use algorithm on hand, we were forced to design a new group-matching algorithm.
The proposed algorithm starts from any valid solution and then iteratively improves it by swapping unstable pairs of students from different groups. The produced solution is therefore guaranteed to be stable for pairs of individuals, but not necessarily for 3 or more individuals trading “under the table”. It follows that the algorithm is likely to find a “good” solution but not necessarily an optimal one.

The pseudo-code of the algorithm is given below, followed by an R sketch of the same procedure:

INPUT: adjacency matrix, group_sizes
OUTPUT: group allocation
--------------------------------------------------------------------------
form groups of given group sizes
initial solution: randomly assign vertices to initial groups
--------------------------------------------------------------------------
WHILE solution has changed
    FOREACH pair of vertices (i, j)
        cur_cost <- sum(edge weights of i)
                  + sum(edge weights of j)
        alt_cost <- sum(edge weights of i if in group(j))
                  + sum(edge weights of j if in group(i))
        IF alt_cost < cur_cost
            swap i and j
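
A minimal R sketch of this heuristic; the function and variable names (allocate_groups, adjacency, groups) are illustrative, and the example rank matrix is randomly generated rather than the actual class rankings.

# Minimal sketch: pairwise-swap heuristic for the group allocation problem.
allocate_groups <- function(adjacency, group_sizes) {
  n <- nrow(adjacency)
  # initial solution: random assignment to groups of the given sizes
  groups <- sample(rep(seq_along(group_sizes), group_sizes))
  # cost of vertex i: sum of edge weights to its current group members
  vertex_cost <- function(i, g) sum(adjacency[i, groups == g & seq_len(n) != i])
  repeat {
    changed <- FALSE
    for (i in 1:(n - 1)) for (j in (i + 1):n) {
      if (groups[i] == groups[j]) next
      cur <- vertex_cost(i, groups[i]) + vertex_cost(j, groups[j])
      # after a swap, i joins j's group (without j) and vice versa
      alt <- sum(adjacency[i, groups == groups[j] & seq_len(n) != j]) +
             sum(adjacency[j, groups == groups[i] & seq_len(n) != i])
      if (alt < cur) { groups[c(i, j)] <- groups[c(j, i)]; changed <- TRUE }
    }
    if (!changed) break
  }
  groups
}

# Example: a random 25-student rank matrix (each student ranks the other 24)
ranks <- matrix(0, 25, 25)
for (i in 1:25) ranks[i, -i] <- sample(24)
adjacency <- ranks + t(ranks)
allocation <- allocate_groups(adjacency, group_sizes = c(4, rep(3, 7)))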

Complexity

The algorithm compares {N \choose 2}\approx\frac{N^2}{2}  pairs of students in each WHILE loop. In the worst case, the algorithm could start from the worst possible solution and in each WHILE loop iteration only move to the next-best solution, thus leading to brute-force-like performance.

Tests with an implemented version of the algorithm (in R), however, showed that for graphs of size 25, the algorithm converges after an average of 20 swaps and no more than 3 WHILE loop iterations. Thus, the algorithm is believed to be efficient in the average case.

Further tests with different initializations for the same rank matrix confirmed that the algorithm only converges to a local minimum in the solution space. The algorithm is believed to produce a good solution when run several times with different initializations, choosing the solution with the lowest total cost; however, it can by no means be guaranteed to find the global optimum.

Final Note
The described group allocation process based on rankings turned out to be not self-enforcing and therefore was not actually used in the end.

References
Irving, Robert W. (1985), “An efficient algorithm for the “stable roommates” problem”, Journal of Algorithms, 6 (4): 577–595, doi:10.1016/0196-6774(85)90033-1

About The Author (Hans-Peter Höllwirth):
I have several years of working experience in the software industry as a programmer, analyst, and project manager, and graduated from the University of Edinburgh with a bachelor's degree in Computer Science and Management Science in 2016. My interests in machine learning, computing, and finance led me to join the BGSE Data Science program.

return("Hola mundo")

Hello and welcome to the 2016/2017 edition of the Data Science Blog!

I know it is already a little late for the first entry but we have big plans for the future: hackathon reports, technical essays, interviews and personal diary entries are only a few of those. Our general aim is to inform you about all the wonderful details of our experience at the Barcelona Graduate School of Economics and the specifics of the Data Science program.

If there are any topics or questions you are particularly interested in, please let Nandan (nandan.rao@barcelonagse.eu) or me (robert.lange@barcelonagse.eu) know.

Next week Hans-Peter Höllwirth is going to write about one of the first practical problems we encountered during our Masters: finding an optimal group allocation for our projects during the first term.

Rob

P.S.: Here is a picture of almost all of us from the Data Science Networking Evening.