Covariance matrix estimation, a challenging research topic

Covariance matrix estimation represents a challenging topic for many research fields. Sample covariance matrix might perform poorly in many circumstances, especially when the number of variables is approximately equal or greater then the number of observations. Moreover, when the precision matrix is the object of interest the sample covariance matrix might not be positive definite and more robust estimators must be used. With this article I will try to give a brief (and non-comprehensive) overview of some of the topics in this research field. In particular, I will describe Stenian shrinkage, covariance matrix selection through penalized likelihood and graphical lasso implementing the description with some potential extensions of these methodologies.

As shown by Wolf and Ledoit, 2004, the sample covariance matrix can be written as $S = X^T(I - \frac{1}{n}11^T)X$ where $X \in R^{NXP}$ and $(I - \frac{1}{n} 11^T) \in R^{NXN}$, where P is the number of variables and N the number of observations. Because the rank is bounded by the minimum of the rank of the two matrices, whenever P is greater then N the matrix is not full rank.

When instead $P \le N$ but their ratio is close to one, the problem is very likely to be ill-conditioned and different estimators are needed especially to compute the precision matrix. Intuitively, “S becomes unstable in the sense that small perturbations in measurements can lead to disproportionately large fluctuations in its entries” (Chi and Lange, 2014).

How to find the weighting parameter is a crucial issue. Wolf et al., 2004, minimized the expected quadratic distance between the true and the estimated covariance matrix based on the Froebenius norm in the context of portfolio risk optimization. Therefore, the objective function that they proposed is $\sum_{i=1}^N \sum_{j=1}^N E(\alpha f_{i,j} + (1-\alpha) s_{i,j} - \sigma_{i,j})^2$, where $E[\sigma_{i,j}]$ is replaced with its sample estimation. The researchers used asymptotic properties to justify their choice. On the other hand, as shown by Zhou et al, 2010, the rate of convergence under the Froebenius norm depends on the both the number of observation and the dimension of the problem. Therefore, we felt that in high dimensional settings a complementary approach in the context of portfolio risk minimization might be by estimating the weighting constant $\alpha$ using cross validation. In a project developed by 4 current Data Science students, described in an upcoming article of this blog, they proposed the following methodology for portfolio risk minimization: every four months, they selected a sequence of alpha from 0 to 1, they computed the corresponding covariance matrix and the best capital allocation according to the given matrix. They kept the portfolio for the subsequent two months and then they computed the variance of the returns under the given portfolio during the two-months period. The process was repeated for daily returns of 800 random companies for a sample of 8000 companies from 2010 to 2016. They finally selected the lowest alpha leading to the lowest average out of sample variance. The best alpha was around 1/2, while for values lower then 0.44 the matrix was singular as expected. This methodology might be effective when the target loss function depends on some other model specification. On the other hand, this method has many computational limitations and it might be infeasible in many circumstances. In this context both missing values and the choice of the subsample might have had affected the result in an unknown way.

A different approach for covariance estimation consists in solving a penalized regression problem in order to estimate the covariance matrix (Pourahmadi et al, 2006). The idea comes from a well-know link between the entries of the precision matrix and a regression problem. To show this, each variable $x_{i,n}$ can be described as: $x_{i,n} = \sum_{j \neq i} \theta_{i,j}x_{j,n} + \lambda |\theta|_1 + u_{i,n}$ , where $u_{i,n}$ is gaussian noise centered on zero. By assuming for the moment that $\lambda = 0$, we can identify a projection matrix $H \in R^{NxN}$ containing the coefficients $\theta$ such that $\epsilon = X - \hat{X} = X(I-H)$ and where $\epsilon$ is a vector of residuals, $I$ is the identity matrix and X is a vector of observed values. Taking the variance of the residuals we know that: $var(\epsilon) = (I-H)\Sigma(I-H)' \rightarrow (I-H)'D^{-1}(I-H) = \Sigma^{-1}$ where D is a diagonal matrix containing the variance of the residuals. It can be shown that we can express the off diagonal elements of H as a function of the precision matrix K: $(\Sigma^{-1})_{i,j}= -H_{i,j}*K_{i,i} \rightarrow \theta_{i,j} = -\frac{K_{i,j}}{K_{i,i}}$. This properties have been exploited in many settings. Pourahmadi et al. proposed a parametrization of the concentration matrix using the generalized Cholesky decomposition for time serie analysis. For time series, the coefficient $\theta_{i,t}$ is the auto-regressive coefficient, where each serie is regressed on the previous observations. The generalized Cholesky decomposes the matrix into LDL, where L is a triangular matrix with ones elements on the diagonal and D is a diagonal matrix. The matrix L can be interpreted as the matrix $L = I - H$, containing the negative coefficients on the off-diagonal elements of the upper triangle, ones on the diagonal – each observations is not regressed on itself- and zero in the remaining entries. The matrix D instead is the positive diagonal matrix containing the residuals variance. By imposing $\lambda > 0$ and therefore by shrinking the regression coefficients, the entire concentration matrix can be shrunken towards a more robust estimator. The objective function proposed is the log-likelihood of the data depending on $\sigma_t^2$ , the residual variance for each observation at time t, and the regression coefficients. Interestingly, the best tuning parameter $\lambda$ can be known using the generalized cross validation formula which provides a closed form approximation to leave-one-out cross validation for any linear regressors (The elements of Statistical Learning, ch.7, Tibshirani et al.).

We feel that the two approaches, apparently very different each other, might be linked, although further investigation is necessary in this field. The LDL decomposition provides a map between the precision matrix and the hat matrix, considered the necessary component to find the tuning parameter by using a closed form approximation to cross validation. On the other hand the MLE approach with shrinkage might be computational costly in terms of number of parameters to be estimated. The Stenian shrinkage instead is cheap in terms of the computation of the entries of the two matrices, but the main challenge consists in finding the best weighting parameter. Therefore, if there exist a map between the $\alpha$ of the Stenian shrinkage and the $\lambda$ for a shrinkage of the regression coefficients, then the two approaches might overlap and some computational gains could be achieved. One possible way could be to compute a set of matrices corresponding to a sequence of the weighting parameters $\alpha$, under the Stenian model and then compare the performance of each matrix using generalized cross validation by exploiting the generalized Cholesky on the given Stenian covariance matrix. On the other hand this has not been proposed in any paper and more investigation is necessary to assess the correctness of this approach.

Many other methodologies for shrinkage of the covariance matrix exists. A further alternative approach is by estimating the covariance by maximizing the log-likelihood expressed as $log[det(\Theta)] - tr(S \Theta)$ where S is the sample covariance matrix and $\Theta = \Sigma^{-1}$. According to Tibshirani et al., 2008, we can express the problem by imposing an additional term representing a penalty on the sum of the L1 norms of the entries. The imposition of the penalty ensure sparsity whenever the tuning parameter is large enough.
There are many extensions to this problem. The general idea is to minimize the likelihood function by imposing some given constraints. Many alternative approximations have been proposed in recent years – e.g. penalize the nuclear norm of the matrix – and many other approaches have been studied.

References

Cai, T. Tony, Cun-Hui Zhang, and Harrison H. Zhou. “Optimal rates of convergence for covariance matrix estimation.” The Annals of Statistics 38.4 (2010): 2118-2144.

Chi, Eric C., and Kenneth Lange. “Stable estimation of a covariance matrix guided by nuclear norm penalties.” Computational statistics & data analysis 80 (2014): 117-128.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. “Sparse inverse covariance estimation with the graphical lasso.” Biostatistics 9.3 (2008): 432-441.

Hastie, Trevor, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC Press, 2015.

Huang, Jianhua Z., et al. “Covariance matrix selection and estimation via penalised normal likelihood.” Biometrika 93.1 (2006): 85-98.

Ledoit, Olivier, and Michael Wolf. “A well-conditioned estimator for large-dimensional covariance matrices.” Journal of multivariate analysis 88.2 (2004): 365-411.

I graduated in Economics at Luiss University with an experience of studies at UBC, Vancouver and a very brief experience of research in Econometrics. My interest in both theoretical and applied research in many fields of statistics was the main reason to enroll in the Data Science Program at BGSE.