Prediction as a Game

Introduction

In this article we provide a general understanding of sequential prediction, with particular attention to adversarial models. The aim is to provide theoretical foundations for the problem and to discuss real-life applications.

As concrete examples consider the following three real-life situations:

1. Every Sunday you can choose either to go on a trip to Costa Brava or to study in the library. To go on the trip you need to buy tickets the day before, and you want to be sure that you will enjoy a sunny day. Basing your decision on a single weather forecast might not be a good idea, so you consider multiple forecasts. How to take the “best” possible action in the presence of experts’ advice is an issue that we address in this article.

2. After getting too much rain at the beach and studying on sunny days, you gave up on trips, but to compensate you decide to go out for dinner every Saturday. Your objective now is to pick the best possible restaurant based on your preferences – or more likely on your girlfriend’s preferences – but you know very few restaurants in Barcelona. We will address this issue by studying a strategy for the so-called multi-armed bandit problem.

3. Every month some friends visit you. When they arrive you always have to recommend a restaurant where they can go on their own – while you are studying in the library. Once they have been to the restaurant you ask for feedback, but you always receive partial or even no feedback from your lazy friends. This problem can be modeled as a partial monitoring multi-armed bandit problem, which we describe in the last section.

For completeness we provide a proof of our statements whenever this does not make the discussion too heavy. The reader may skip the proofs if not interested.

Sequential prediction: General Set Up

Before entering into the discussion we provide a general set up. Consider outcomes y_1, y_2,... \in \mathcal{Y}, where \mathcal{Y} is the outcome space. The forecaster chooses actions I_1, I_2,... \in \mathcal{D}, where \mathcal{D} is the decision space, and at time t she suffers a loss l(I_t, y_t). To be concrete, in Example 1 the loss might be 1 if you went to the beach when it was raining or studied when it was sunny, and 0 otherwise.

In an oblivious game the environment chooses the outcomes regardless of the strategy of the opponent. We define \hat{L}_n = \sum_{t=1}^n l(I_t, y_t) as the cumulative loss of the forecaster. Without loss of generality we can impose that l(I_t,y_t) \in [0,1].

The regret at time t for choosing action I_t instead of action i \in \{1,...,M\} is defined as l(I_t,y_t) - l(i,y_t). A natural objective function is the average maximal regret faced by the forecaster, defined as: \frac{1}{n}\sum_{t=1}^n l(I_t, y_t) - \min_{i=1,...,M}\frac{1}{n}\sum_{t=1}^n l(i,y_t).

Considering again Example 1, if you went to the beach on a rainy day because all the experts told you that it was going to be sunny, your regret would simply be 0: even the best expert would have made the same mistake.

Forecasting strategies that guarantee that the average regret goes to zero almost surely for all possible strategies of the environment are called Hannan consistent.

Prediction with expert advice

We start our discussion by considering the context of regret minimization for prediction with expert advice [3] under a convex loss function. With a slight abuse of notation, we denote by f_{i,1}, f_{i,2},... \in \mathcal{D} the predictions of each expert i \in \{1,...,M\}.

Prediction protocol with convex loss function
For each round t=1,2,...

1. the environment chooses the next outcome y_t \in \mathcal{Y} without revealing it;
2. each expert reveals its prediction to the forecaster;
3. the forecaster chooses \hat{p}_t;
4. the environment reveals y_t;
5. the forecaster suffers the loss l(\hat{p}_t, y_t) and each expert i suffers a loss l(f_{i,t}, y_t).

Consider the following strategy:
\hat{p}_t = \sum_{i=1}^M \frac{w_{i,t-1}}{W_{t-1}} f_{i,t}

with

w_{i,t} = \exp(-\eta \sum_{s=1}^{t} l(f_{i,s}, y_s)) = \exp(-\eta L_{i,t}), \quad w_{i,0} = 1, \quad W_t = \sum_{i=1}^M w_{i,t}.

Then this strategy is consistent at a rate proportional to n^{-1/2}.

This result shows that by assigning exponential weights to the experts you end up with a strategy that is consistent at the optimal rate.

Notice that this result strongly relies on the convexity assumption on the loss function. In the next section we relax this assumption and show that you can obtain similar results.
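Before moving to the proof, here is a minimal sketch of one round of this forecaster in Python (our own illustration, not from the original article; the experts' predictions and cumulative losses are assumed to be supplied by the caller):

import numpy as np

def exp_weighted_prediction(expert_preds, cum_losses, eta):
    # expert_preds: shape (M,), the experts' current predictions f_{i,t}
    # cum_losses:   shape (M,), the cumulative losses L_{i,t-1}
    # eta:          learning rate, e.g. sqrt(8 log(M) / n) as in the proof below
    w = np.exp(-eta * (cum_losses - cum_losses.min()))  # shifting by the min is a standard stability trick
    q = w / w.sum()                                     # normalized weights q_{i,t-1}
    return q @ expert_preds                             # convex combination of the predictions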

Proof.

Consider \log(W_n/W_0) = \log(W_n) - \log(M) = \log(\sum_{i=1}^M \exp(-\eta L_{i,n})) - \log(M) = z. A lower bound follows from keeping only the largest term of the sum:
z \ge \log(\max_{i=1,...,M} \exp(-\eta L_{i,n})) - \log(M) = -\eta \min_{i=1,...,M} L_{i,n} - \log(M)
Notice then that \log(\frac{W_n}{W_0}) = \log(\prod_{t=1}^n W_t/W_{t-1}) = \sum_{t=1}^n \log(W_t/W_{t-1}) = \sum_{t=1}^n \log(\sum_{i=1}^M q_{i,t-1} \exp(-\eta l(f_{i,t}, y_t))), with q_{i,t-1} = \frac{w_{i,t-1}}{W_{t-1}}. Given that the loss is bounded between 0 and 1, by Hoeffding's lemma:
\sum_{t=1}^n \log(\sum_{i=1}^M q_{i,t-1} \exp(-\eta l(f_{i,t}, y_t))) \le -\eta \sum_{t=1}^n \sum_{i=1}^M q_{i,t-1} l(f_{i,t}, y_t) + \frac{n \eta^2}{8} = c
By Jensen's inequality, using the convexity of the loss in its first argument,
c \le -\eta \sum_{t=1}^n l(\sum_{i=1}^M q_{i,t-1} f_{i,t}, y_t) + \frac{n\eta^2}{8} = -\eta \sum_{t=1}^n l(\hat{p}_t, y_t) + \frac{n\eta^2}{8}
Therefore we get -\eta \hat{L}_n + \frac{n\eta^2}{8} \ge -\eta \min_{i=1,...,M} L_{i,n} - \log(M). Rearranging terms we get
\hat{L}_n - \min_{i=1,...,M} L_{i,n} \le \frac{\log(M)}{\eta} + \frac{n \eta}{8}
By the first order conditions on the upper bound we choose \eta = \sqrt{\frac{8\log(M)}{n}} and by substitution we get:
\hat{L}_n - \min_{i=1,...,M} L_{i,n} \le \sqrt{\frac{n}{2} \log(M)}

Sharper bounds can be obtained by imposing some structure on the behavior of the loss function.

New Prediction Protocol

For each round t=1,2,...

1. the environment chooses the next outcome y_t \in \mathcal{Y} without revealing it;
2. each expert reveals its prediction to the forecaster;
3. the forecaster chooses a probability vector \mathbf{p}_t over the set of M actions and draws an action I_t \in \{1,...,M\} with
p_{i,t} = \frac{\exp(-\eta \sum_{s=1}^{t-1} l(i, y_s))}{\sum_{j=1}^M \exp(-\eta \sum_{s=1}^{t-1} l(j, y_s))}
4. the environment reveals y_t;
5. the forecaster suffers the loss l(I_t, y_t) and each expert i suffers a loss l(i, y_t).

In this case the average regret is bounded by \sqrt{\frac{n}{2} \text{log}(\frac{1}{\delta})} + \sqrt{\frac{n}{2} \text{log}(M)} with probability at least 1 - \delta.

Again the strategy is consistent at the optimal rate. The importance of this result lies in the fact that we now have a consistent strategy for deciding which weather forecast to believe, for any possible loss we have in mind! In the next section we go through a technical proof that the reader may prefer to skip if not interested in the details of the theorem.
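Before the proof, a sketch of the randomized step (again our own illustration; the cumulative losses of the actions are assumed given):

import numpy as np

def draw_action(cum_losses, eta, rng):
    # sample I_t from the exponential-weights distribution p_t
    w = np.exp(-eta * (cum_losses - cum_losses.min()))  # shifted for numerical stability
    p = w / w.sum()
    return rng.choice(len(p), p=p)

rng = np.random.default_rng(0)
I_t = draw_action(cum_losses=np.array([3.0, 1.0, 2.0]), eta=0.5, rng=rng)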

Proof.

To prove the result we first make use of the following lemma:

Lemma 1. Let X_1,...,X_n be random variables taking values in [0,1] and adapted to a filtration \{\mathcal{F}_t\}, and write E[X_t | \mathcal{F}_{t-1}] for the conditional expectation of X_t given the past. Then by the Hoeffding-Azuma inequality

P(\sum_{s=1}^n (X_s - E[X_s|\mathcal{F}_{s-1}]) > t) \le \exp(-2t^2/n) \Rightarrow \sum_{s=1}^n (X_s - E[X_s|\mathcal{F}_{s-1}]) \le \sqrt{\frac{n}{2} \text{log}(1/\delta)} \quad \text{w.p.} \ge 1 - \delta

Proof of Lemma 1. Write Z_s = X_s - E[X_s|\mathcal{F}_{s-1}]. By the Chernoff bound P(\sum_{s=1}^n Z_s > t) \le \frac{E[\exp(\lambda \sum_{s=1}^n Z_s)]}{\exp(\lambda t)} for any \lambda > 0. By the law of iterated expectations and by Hoeffding's lemma applied to the innermost conditional expectation,
E[\exp(\lambda \sum_{s=1}^{n-1} Z_s) E[\exp(\lambda Z_n) | \mathcal{F}_{n-1}]] \le E[\exp(\lambda \sum_{s=1}^{n-1} Z_s)] \exp(\lambda^2/8)

Repeating the argument n times gives P(\sum_{s=1}^n Z_s > t) \le \exp(n\lambda^2/8 - \lambda t). Minimizing over \lambda (that is, choosing \lambda = 4t/n) gives the result shown in the previous expression.

We can now move to the actual proof. Define \bar{l}(\mathbf{p}_t, y_t) = \sum_{i=1}^M p_{i,t} l(i, y_t) = E[l(I_t, y_t)|\mathcal{F}_{t-1}], where \mathcal{F}_{t-1} is the filtration at time t-1. The differences l(I_t, y_t) - \bar{l}(\mathbf{p}_t, y_t) thus form a martingale difference sequence with expectation 0. Using Hoeffding-Azuma,
\sum_{t=1}^n [l(I_t, y_t) - \bar{l}(\mathbf{p}_t, y_t)] \le \sqrt{\frac{n}{2}\text{log}(1/\delta)} \quad \text{w.p.} \ge 1 - \delta,
that is, the realized losses are concentrated around their conditional expectations. Notice now that \bar{l}(\mathbf{p}_t, y_t) is convex (linear, in fact) in the first variable. Therefore, by the result of the previous section,
\sum_{t=1}^n \bar{l}(\mathbf{p}_t, y_t) - \min_{i=1,...,M} \sum_{t=1}^n l(i,y_t) \le \sqrt{\frac{n}{2} \text{log}(M)}
By adding and subtracting \sum_{t=1}^n \bar{l}(\mathbf{p}_t, y_t):

\sum_{t=1}^n l(I_t, y_t) - \min_{i=1,...,M} \sum_{t=1}^n l(i,y_t) \le \sqrt{\frac{n}{2} \text{log}(1/\delta)} + \sqrt{\frac{n}{2} \text{log}(M)},
which concludes the proof.

Multi-Armed Bandit Problem
We now move to the problem of choosing the best restaurant without knowing many restaurants and without the help of TripAdvisor!

Consider the following prediction protocol:

Prediction Protocol: Multi-Armed Bandit Problem

For each round t=1,2,...

1. the environment chooses the next outcome y_t \in \mathcal{Y} without revealing it;
2. the forecaster chooses a probability vector \mathbf{p}_t over the set of M actions and draws an action I_t \in \{1,...,M\};
3. the forecaster suffers the loss l(I_t, y_t);
4. only l(I_t, y_t) is revealed to the forecaster; the losses of all other actions remain unknown.

The objective function of the forecaster remains the regret. Clearly the situation is much more challenging, since the loss incurred by each action is no longer revealed at every round. We define the following unbiased estimator:

\tilde{l}(i, y_t) = \frac{l(i, y_t) \mathbf{1}_{I_t = i}}{p_{i,t}}

where p_{i,t} is the probability of choosing action i at time t and \mathbf{1}_x is the indicator variable equal to 1 if x is true, 0 otherwise. Notice that

E_t[\tilde{l}(i,  y_t)] = \sum_{j=1}^M p_{j,t} \frac{l(i, y_t) \mathbf{1}_{i = j}}{p_{i,t}} = l(i,y_t)
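This unbiasedness is easy to check by simulation (a toy example with hypothetical probabilities and losses; in the actual protocol the full loss vector is of course never observed):

import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.1, 0.2, 0.3, 0.4])       # a hypothetical p_t over M = 4 actions
losses = np.array([0.9, 0.5, 0.2, 0.7])  # hypothetical l(i, y_t) for each action

draws = rng.choice(4, size=200_000, p=p)             # actions I_t sampled from p_t
for i in range(4):
    est = np.mean((draws == i) * losses[i] / p[i])   # Monte Carlo average of l-tilde(i, y_t)
    print(i, round(est, 3))                          # close to losses[i] for every arm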

A forecasting strategy in Multi-Armed Bandit Problem

We define the gain g(i, y_t) = 1 - l(i, y_t) and the unbiased gain estimate \tilde{g}(i, y_t) = \frac{g(i, y_t)}{p_{i,t}}\mathbf{1}_{I_t = i}. Notice that g(i, y_t) - \tilde{g}(i, y_t) is at most 1, a property used for a martingale-type bound. Choose \eta, \gamma, \beta > 0. Initialize w_{i,0} = 1 and p_{i,1} = 1/M.

For each round t = 1,2,...

1. Select an action I_t according to the probability distribution \mathbf{p}_t;
2. calculate the estimated gain: g'(i, y_t) = \tilde{g}(i, y_t) + \beta/p_{i,t}
3. update the weights w_{i,t} = w_{i,t-1}\exp(\eta g'(i,y_t));
4. update the probabilities p_{i,t+1} = (1 - \gamma)\frac{w_{i,t}}{W_t} + \frac{\gamma}{M} with exponential weights.

Note that by introducing a parameter \beta we give up the unbiasedness of the estimate to guarantee that the estimated cumulative gains are, with
large probability, not much smaller than the actual cumulative gains.
Under the conditions of Theorem 6.10 in [2] the regret is O(\sqrt{n}) again. Therefore, even without having any clue about what the loss would have been at the restaurants you did not try, the strategy is consistent at the optimal rate!
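The steps above can be condensed into a short simulation sketch (our own, with hypothetical names; bandit_loss stands in for the environment and only ever reveals the loss of the played arm):

import numpy as np

def exp3p_sketch(bandit_loss, n_rounds, M, eta, gamma, beta, seed=0):
    rng = np.random.default_rng(seed)
    log_w = np.zeros(M)                              # log-weights, for numerical stability
    for t in range(n_rounds):
        w = np.exp(log_w - log_w.max())
        p = (1 - gamma) * w / w.sum() + gamma / M    # mix in uniform exploration
        i = rng.choice(M, p=p)
        g = 1.0 - bandit_loss(i, t)                  # observed gain of the played arm only
        g_prime = beta / p                           # bias term beta / p_{j,t} for every arm
        g_prime[i] += g / p[i]                       # importance-weighted gain of the played arm
        log_w += eta * g_prime                       # multiplicative weight update
    return log_w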

Discussion

We want to stress that the main ingredients for an optimal rate of convergence in probability are contained in the exploration-exploitation trade-off. In fact, notice that
p_{i,t} = \frac{\exp(-\eta \sum_{s=1}^{t-1} \tilde{l}(i, y_s))}{\sum_{j=1}^M \exp(-\eta \sum_{s=1}^{t-1} \tilde{l}(j, y_s))}(1 - \gamma) + \frac{\gamma}{M}
where the first term, multiplied by 1 - \gamma, contains information regarding the losses of the actions taken in the past. The second term instead lets the forecaster keep a non-zero probability of exploring new actions. In practice, the strategy gives you a guide for how many times you should explore new restaurants and how many times you should return to good restaurants where you have already been.


Partial Monitoring Multi-Armed Bandit Problem

As a further motivating example of a partial monitoring regret minimization problem consider the following dynamic pricing model.

A vendor sells a product to customers one by one. She can select a different price for each customer, but no bargaining is allowed and no further information can be exchanged between the buyer and the seller. Assume that the willingness to pay of each buyer is y_t \in [0,1], the price offered by the seller is z_t, and the loss incurred by the seller at time t is

l(z_t, y_t) = (y_t - z_t) \mathbf{1}_{z_t \le y_t} + c\mathbf{1}_{z_t > y_t}

with c \in [0,1]. The seller can only observe whether or not the customer buys the product and has no clue about the empirical distribution of y_t. A natural question is whether there exists a randomized strategy for the seller that is Hannan consistent.
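For concreteness, the seller's loss and her feedback are straightforward to code (a trivial sketch with our own names):

def pricing_loss(z, y, c):
    # forgone revenue y - z if the buyer accepts (z <= y), fixed cost c otherwise
    return (y - z) if z <= y else c

def pricing_feedback(z, y):
    # the seller only observes whether the sale happened
    return z <= y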

In a more general setting we define the following prediction protocol:

Prediction Protocol: Partial Monitoring Multi-Armed Bandit Problem

For each round t=1,2,...

1. the environment chooses the next outcome y_t \in \{1,...,M\} without revealing it;
2. the forecaster chooses a probability vector \mathbf{p}_t over the set of N actions and draws an action I_t \in \{1,...,N\};
3. the forecaster suffers the loss l(I_t, y_t);
4. only a feedback h(I_t, y_t) is revealed to the forecaster.

The losses of the forecaster can be summarized in the loss matrix \mathbf{L} = [l(i,j)]_{N \times M}, where N is the number of actions and M the number of outcomes. With no loss of generality l(I_t, y_t) \in [0,1]. At every iteration the forecaster chooses an action I_t and suffers a loss l(I_t, y_t), but she only observes a feedback h(I_t, y_t) parametrized by a given feedback function h that assigns to each action/outcome pair in \{1,...,N\} \times \{1,...,M\} an element of a finite set \mathcal{S} = \{s_1,...,s_m\} of signals. The values are collected in the feedback matrix \mathbf{H} = [h(i,j)]_{N \times M}. Notice that at time t the forecaster has access only to the information (h(I_1, y_1),..., h(I_{t-1}, y_{t-1})). In [1] the following strategy was shown to be Hannan consistent at a sub-optimal rate O(n^{-1/3}).

Assume that l(i,j) = \sum_{l=1}^N k(i,l) h(l,j), that is \mathbf{L} = \mathbf{K} \mathbf{H}, assuming that \mathbf{H} and [\mathbf{H} \quad \mathbf{L}] have the same rank. Define k^* = \max\{1, \max_{i,j}|k(i,j)|\} and the unbiased estimator of the loss:

\tilde{l}(i, y_t) = \frac{k(i, I_t) h(I_t, y_t)}{p_{I_t, t}} \quad \forall i = 1,...,N

with p_{I_t, t} being the probability of having chosen action I_t at time t, and \tilde{L}_{i,t} = \sum_{s=1}^t \tilde{l}(i, y_s). Initialize \tilde{L}_{1,0} = ... = \tilde{L}_{N,0} = 0. For each round t = 1,2,...

1. Let \eta_t = (k^*)^{-2/3}((\ln N)/N)^{2/3} t^{-2/3} and \gamma_t = (k^*)^{2/3}N^{2/3}(\ln N)^{1/3}t^{-1/3};
2. choose an action I_t from the set of actions \{1,...,N\} at random according to the distribution \mathbf{p}_t defined by

p_{i,t} = (1 - \gamma_t) \frac{e^{-\eta_t \tilde{L}_{i,t-1}}}{\sum_{k=1}^N e^{-\eta_t \tilde{L}_{k,t-1}}} + \frac{\gamma_t}{N}

3. let \tilde{L}_{i,t} = \tilde{L}_{i,t-1} + \tilde{l}(i, y_t) for all i = 1,...,N.
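A compact simulation sketch of this strategy (our own rendering, not the authors' code; the outcome sequence is assumed given for simulation purposes, and we cap \gamma_t at 1 in the early rounds where the formula exceeds it):

import numpy as np

def partial_monitoring_sketch(K, H, outcomes, seed=0):
    # K: (N, N) with entries k(i, l);  H: (N, M) feedback matrix, with L = K H
    rng = np.random.default_rng(seed)
    N, M = H.shape
    k_star = max(1.0, np.abs(K).max())
    L_tilde = np.zeros(N)                                 # estimated cumulative losses
    for t, y in enumerate(outcomes, start=1):
        eta = k_star ** (-2/3) * (np.log(N) / N) ** (2/3) * t ** (-2/3)
        gamma = min(1.0, k_star ** (2/3) * N ** (2/3) * np.log(N) ** (1/3) * t ** (-1/3))
        w = np.exp(-eta * (L_tilde - L_tilde.min()))      # shifted for stability
        p = (1 - gamma) * w / w.sum() + gamma / N
        i = rng.choice(N, p=p)
        signal = H[i, y]                                  # the only information revealed
        L_tilde += K[:, i] * signal / p[i]                # update all estimated losses
    return L_tilde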

In [1] the authors showed that under some mild conditions the strategy has a performance bound of magnitude proportional to n^{2/3}(k^* N)^{2/3}(\ln N)^{1/3}, that is, with a convergence rate O(n^{-1/3}). Interestingly, in the scenario of a simple multi-armed bandit problem, whenever \mathbf{H} = \mathbf{L}, the theorem leads to a bound of order (N^2 (\ln N)/n)^{1/3}, much slower than the result obtained in the previous section. Finding the class of problems for which this bound can be improved remains a challenging research question.

Discussion

We have just shown that there exists a consistent strategy even without full information on the loss that you incurred, as in the case of the third example in the introductory section, although at a slower rate.

Covariance matrix estimation, a challenging research topic

Covariance matrix estimation represents a challenging topic in many research fields. The sample covariance matrix might perform poorly in many circumstances, especially when the number of variables is approximately equal to or greater than the number of observations. Moreover, when the precision matrix is the object of interest, the sample covariance matrix might not be positive definite and more robust estimators must be used. In this article I will try to give a brief (and non-comprehensive) overview of some of the topics in this research field. In particular, I will describe Steinian shrinkage, covariance matrix selection through penalized likelihood, and the graphical lasso, complementing the description with some potential extensions of these methodologies.

As shown by Ledoit and Wolf (2004), the sample covariance matrix can be written, up to scaling, as S = X^T(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T)X, where X \in \mathbb{R}^{N \times P} and (I - \frac{1}{N}\mathbf{1}\mathbf{1}^T) \in \mathbb{R}^{N \times N}, with P the number of variables and N the number of observations. Because the rank of a product is bounded by the minimum of the ranks of its factors, whenever P is greater than N the matrix is not full rank.
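This rank deficiency is easy to verify numerically (a small check with arbitrary dimensions):

import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 200                               # fewer observations than variables
X = rng.standard_normal((N, P))
C = np.eye(N) - np.ones((N, N)) / N          # centering matrix I - (1/N) 1 1^T
S = X.T @ C @ X / N                          # sample covariance, P x P
print(np.linalg.matrix_rank(S))              # at most N - 1 = 49, far below P = 200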

When instead P \le N but their ratio is close to one, the problem is very likely to be ill-conditioned and different estimators are needed especially to compute the precision matrix. Intuitively, “S becomes unstable in the sense that small perturbations in measurements can lead to disproportionately large fluctuations in its entries” (Chi and Lange, 2014).

How to find the weighting parameter is a crucial issue. Ledoit and Wolf (2004) minimized the expected quadratic distance between the true and the estimated covariance matrix based on the Frobenius norm, in the context of portfolio risk optimization. The objective function that they proposed is \sum_{i=1}^N \sum_{j=1}^N E(\alpha f_{i,j} + (1-\alpha) s_{i,j} - \sigma_{i,j})^2, where the unknown \sigma_{i,j} is replaced with its sample estimate. The researchers used asymptotic properties to justify their choice. On the other hand, as shown by Cai, Zhang and Zhou (2010), the rate of convergence under the Frobenius norm depends on both the number of observations and the dimension of the problem. Therefore, in high dimensional settings a complementary approach in the context of portfolio risk minimization might be to estimate the weighting constant \alpha using cross validation.

In a project developed by 4 current Data Science students, described in an upcoming article of this blog, the following methodology for portfolio risk minimization was proposed: every four months, they selected a sequence of alphas from 0 to 1, computed the corresponding covariance matrix, and derived the best capital allocation according to the given matrix. They kept the portfolio for the subsequent two months and then computed the variance of the returns under the given portfolio during the two-month period. The process was repeated on daily returns of 800 companies drawn at random from a sample of 8000 companies from 2010 to 2016. They finally selected the alpha leading to the lowest average out-of-sample variance. The best alpha was around 1/2, while for values lower than 0.44 the matrix was singular, as expected. This methodology might be effective when the target loss function depends on some other model specification. On the other hand, this method has many computational limitations and might be infeasible in many circumstances. In this context both missing values and the choice of the subsample might have affected the result in an unknown way.
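A sketch of this cross-validation idea (our own simplification: the students' actual shrinkage target and trading calendar are not reproduced here, and we shrink towards the diagonal of S as one common choice):

import numpy as np

def shrunk_covariance(X, alpha):
    # Steinian-type shrinkage alpha * F + (1 - alpha) * S with a diagonal target F
    S = np.cov(X, rowvar=False)
    F = np.diag(np.diag(S))
    return alpha * F + (1 - alpha) * S

def min_variance_weights(sigma):
    # global minimum-variance portfolio; assumes sigma is invertible (alpha large enough)
    ones = np.ones(sigma.shape[0])
    w = np.linalg.solve(sigma, ones)
    return w / w.sum()

def score_alpha(X_fit, X_test, alpha):
    # fit on one window, score by the realized out-of-sample variance of the portfolio
    w = min_variance_weights(shrunk_covariance(X_fit, alpha))
    return np.var(X_test @ w)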

A different approach for covariance estimation consists in solving a penalized regression problem (Huang et al., 2006). The idea comes from a well-known link between the entries of the precision matrix and a regression problem. To show this, each variable x_{i,n} can be described as x_{i,n} = \sum_{j \neq i} \theta_{i,j}x_{j,n} + u_{i,n}, where u_{i,n} is Gaussian noise centered on zero and the coefficients are estimated under an L1 penalty \lambda \|\theta\|_1. Assuming for the moment that \lambda = 0, we can identify a projection matrix H \in \mathbb{R}^{P \times P} containing the coefficients \theta such that \epsilon = X - \hat{X} = X(I-H), where \epsilon is the matrix of residuals, I is the identity matrix and X the matrix of observed values. Taking the variance of the residuals we know that var(\epsilon) = (I-H)\Sigma(I-H)^T, which leads to (I-H)^T D^{-1}(I-H) = \Sigma^{-1}, where D is a diagonal matrix containing the variances of the residuals. It can be shown that the off-diagonal elements of H are a function of the precision matrix K = \Sigma^{-1}: (\Sigma^{-1})_{i,j} = -H_{i,j} K_{i,i}, hence \theta_{i,j} = -\frac{K_{i,j}}{K_{i,i}}.

These properties have been exploited in many settings. Huang et al. proposed a parametrization of the concentration matrix using the generalized Cholesky decomposition for time series analysis. For time series, the coefficient \theta_{i,t} is the auto-regressive coefficient, where each series is regressed on the previous observations. The generalized Cholesky decomposes the matrix into LDL^T, where L is a triangular matrix with ones on the diagonal and D is a diagonal matrix. The matrix L can be interpreted as the matrix L = I - H, containing the negative regression coefficients in the off-diagonal elements of the upper triangle, ones on the diagonal (each observation is not regressed on itself) and zeros in the remaining entries. The matrix D instead is the positive diagonal matrix containing the residual variances. By imposing \lambda > 0, and therefore by shrinking the regression coefficients, the entire concentration matrix can be shrunk towards a more robust estimator. The objective function proposed is the log-likelihood of the data, depending on \sigma_t^2, the residual variance for each observation at time t, and on the regression coefficients. Interestingly, the best tuning parameter \lambda can be found using the generalized cross validation formula, which provides a closed form approximation to leave-one-out cross validation for any linear smoother (Hastie et al., The Elements of Statistical Learning, ch. 7).
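The link between the regression coefficients and the precision matrix is easy to verify numerically (a quick check with \lambda = 0 and simulated Gaussian data):

import numpy as np

rng = np.random.default_rng(0)
P, N = 5, 100_000
A = rng.standard_normal((P, P))
Sigma = A @ A.T + P * np.eye(P)              # a well-conditioned covariance matrix
K = np.linalg.inv(Sigma)                     # precision matrix

X = rng.multivariate_normal(np.zeros(P), Sigma, size=N)
i = 0
others = [j for j in range(P) if j != i]
theta, *_ = np.linalg.lstsq(X[:, others], X[:, i], rcond=None)  # OLS of x_i on the others
print(np.round(theta, 3))
print(np.round(-K[i, others] / K[i, i], 3))  # matches theta up to sampling noise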

We feel that the two approaches, apparently very different from each other, might be linked, although further investigation is necessary. The LDL^T decomposition provides a map between the precision matrix and the hat matrix, which is the necessary component for finding the tuning parameter through a closed form approximation to cross validation. On the other hand, the MLE approach with shrinkage might be computationally costly in terms of the number of parameters to be estimated. Steinian shrinkage is instead cheap in terms of computing the entries of the two matrices, but the main challenge consists in finding the best weighting parameter. Therefore, if there existed a map between the \alpha of the Steinian shrinkage and the \lambda of the shrinkage of the regression coefficients, the two approaches might overlap and some computational gains could be achieved. One possible way could be to compute a set of matrices corresponding to a sequence of weighting parameters \alpha under the Steinian model, and then compare the performance of each matrix using generalized cross validation by exploiting the generalized Cholesky decomposition of the given Steinian covariance matrix. This has not been proposed in any paper, however, and more investigation is necessary to assess the correctness of this approach.

Many other methodologies for shrinkage of the covariance matrix exist. A further alternative approach is to estimate the covariance by maximizing the log-likelihood \text{log}[\det(\Theta)] - tr(S \Theta), where S is the sample covariance matrix and \Theta = \Sigma^{-1}. Following Friedman, Hastie and Tibshirani (2008), we can extend the problem by imposing an additional term representing a penalty on the sum of the L1 norms of the entries of \Theta. The penalty ensures sparsity whenever the tuning parameter is large enough.
There are many extensions to this problem. The general idea is to maximize the likelihood function under some given constraints. Many alternative approximations have been proposed in recent years – e.g. penalizing the nuclear norm of the matrix – and many other approaches have been studied.
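In practice the graphical lasso is readily available, for instance in scikit-learn; a minimal usage sketch on simulated data:

import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(5), np.eye(5), size=200)

model = GraphicalLassoCV().fit(X)  # tuning parameter chosen by cross-validation
print(model.covariance_)           # regularized covariance estimate
print(model.precision_)            # sparse estimate of the inverse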

References

Cai, T. Tony, Cun-Hui Zhang, and Harrison H. Zhou. “Optimal rates of convergence for covariance matrix estimation.” The Annals of Statistics 38.4 (2010): 2118-2144.

Chi, Eric C., and Kenneth Lange. “Stable estimation of a covariance matrix guided by nuclear norm penalties.” Computational Statistics & Data Analysis 80 (2014): 117-128.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. “Sparse inverse covariance estimation with the graphical lasso.” Biostatistics 9.3 (2008): 432-441.

Hastie, Trevor, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC Press, 2015.

Huang, Jianhua Z., et al. “Covariance matrix selection and estimation via penalised normal likelihood.” Biometrika 93.1 (2006): 85-98.

Ledoit, Olivier, and Michael Wolf. “A well-conditioned estimator for large-dimensional covariance matrices.” Journal of Multivariate Analysis 88.2 (2004): 365-411.

About the author (Davide Viviano)

I graduated in Economics at Luiss University, with a period of study at UBC, Vancouver, and a brief research experience in Econometrics. My interest in both theoretical and applied research in many fields of statistics was the main reason to enroll in the Data Science Program at BGSE.