Proof: Linear Regression

Linear regression is one of the simplest modelling techniques and an interview favourite in quantitative finance. That's because a good understanding of its proof demonstrates a solid grasp of estimator concepts and good computational skills.

1. Context

Despite its simplicity, linear regression is probably one of the most popular models in quantitative finance. Its main assumptions can be easily verified in most scenarios, and the inputs can be transformed to "extend" its abilities. One example of such a transformation is the log regression, where we apply the log function to one of the variables, as in the sketch below.
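As an illustration, here is a minimal sketch of a log regression using NumPy. The data is made up and the variable names are purely illustrative; it simply shows the idea of log-transforming the regressand before fitting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: y grows exponentially with x, with multiplicative noise.
x = np.linspace(1.0, 10.0, 200)
y = np.exp(0.5 * x) * rng.lognormal(sigma=0.1, size=x.size)

# Log-transform the regressand so the relationship becomes linear in x.
X = np.column_stack([np.ones_like(x), x])      # intercept + regressor
beta_hat, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

print(beta_hat)  # intercept ~ 0, slope ~ 0.5
```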

There are two proofs for linear regression: one is purely scalar and the other uses matrix notation. We will review the matrix proof, as it's more general and often considered easier to memorize!
Linear regression usually refers to one form of regression called Ordinary Least Squares (OLS). Note that other forms of regression, such as Lasso and Ridge, have slightly different proofs and assumptions.

2. Assumptions

There are several assumptions that can be made to establish the proof. You will notice that many overlap, and sometimes stronger assumptions are made. Here are the ones the derivation below relies on.

Let \(X\) be the explanatory variable, called the regressor, and \(y\) the observed variable, called the regressand (or dependent variable). We assume that the error term \(\varepsilon\) relating them has zero mean, \(\mathbf{E}\varepsilon = 0\), and that the columns of \(X\) are linearly independent, so that \(X^TX\) is invertible.

3. Result

Under the above assumptions and notations, OLS yields the below result. \[\exists\beta,y=X\beta+\varepsilon\]

And there is an estimator of \(\beta\), which we write \(\hat{\beta}\), such that:\[\hat{\beta}=\left(X^TX\right)^{-1}X^Ty\]

For the scalar version, with \(x\) a scalar regressor and an intercept included, the slope estimate is equivalent to the below.\[\hat{\beta}=\frac{\mathbf{Cov}(x, y)}{\mathbf{Var}x}\]
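As a quick numerical sanity check of both formulas, here is a minimal NumPy sketch on made-up data (all names are illustrative). With an intercept column included, the slope coefficient from the matrix formula should match the Cov/Var expression.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data: one scalar regressor plus an intercept column.
n = 500
x = rng.normal(size=n)
y = 1.5 + 2.0 * x + rng.normal(scale=0.3, size=n)

X = np.column_stack([np.ones(n), x])            # design matrix with intercept

# Matrix formula: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Scalar formula for the slope: Cov(x, y) / Var(x)
slope_scalar = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

print(beta_hat)                                 # ~ [1.5, 2.0]
print(np.isclose(beta_hat[1], slope_scalar))    # True
```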

4. Proof of \(\hat{\beta}\)

4.1. Proof Idea

The main idea behind the proof of OLS can be easily represented in the 2D (Euclidean) plane. We plot points with coordinates \((x_i, y_i)\).
Our objective is to draw a line (with slope \(\beta\), passing through the origin) and to select \(\beta\) such that the distance between the line and each data point is as small as possible.

The measure of the distance between points is the \(\mathcal{L}_2\)-norm, also called the Euclidean norm, written \(||.||_2\).
It is the choice of this norm that distinguishes OLS from other regressions, and we will use it regardless of the dimension of our vectors \(X, y\).

Our objective of minimizing the distance can be written as follows.\[\hat{\beta}=\mathbf{arg}\min\limits_\beta\left|\left|y-X\beta\right|\right|_2\]
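To make this concrete, here is a small NumPy sketch on made-up 2D data: it scans candidate slopes for a line through the origin, keeps the one minimizing the \(\mathcal{L}_2\) distance, and compares it with the closed-form through-origin estimator. Everything here is illustrative, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(7)

# Made-up 2D points roughly on a line through the origin.
x = rng.uniform(-5, 5, size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=x.size)

# Scan candidate slopes and keep the one with the smallest L2 distance.
betas = np.linspace(-2, 2, 4001)
distances = np.array([np.linalg.norm(y - b * x) for b in betas])
beta_grid = betas[np.argmin(distances)]

# Closed-form through-origin estimator for comparison.
beta_closed = (x @ y) / (x @ x)

print(beta_grid, beta_closed)   # both ~ 0.8
```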

4.2. Sum of Squared Residuals

We now need to express the quantity we minimize in a form that lets us find the \(\mathbf{arg}\min\) over \(\beta\). It's also at this point that you can choose to solve using either matrices or scalars.

\[\begin{align} \left|\left|y-X\beta\right|\right|_2 &= \left(\sum\limits_{i=1}^n\left|y_i-x_i^T\beta\right|^2\right)^\frac{1}{2}\\ \left|\left|y-X\beta\right|\right|_2^2 &= \sum\limits_{i=1}^n\left(y_i-x_i^T\beta\right)^2 \\ &= (y-X\beta)^T(y-X\beta) \end{align} \]

NB The expression with the sum of the squared differences between \(X\beta\) and \(y\) is the Sum of Squared Residuals, written SSR. It's always non-negative.\[SSR = \left|\left|y-X\beta\right|\right|_2^2\]

Because the \(\mathcal{L}_2\)-norm is non-negative, we know that minimizing \(\left|\left|y-X\beta\right|\right|_2\) is the same as minimizing \(SSR\).\[\hat{\beta}=\mathbf{arg}\min\limits_\beta SSR\]
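The equivalent expressions of the SSR above can be checked numerically; below is a minimal NumPy sketch on made-up data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data and an arbitrary beta (not yet the OLS estimate).
n, p = 100, 3
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(scale=0.1, size=n)

residuals = y - X @ beta
ssr_norm = np.linalg.norm(residuals) ** 2        # ||y - X beta||_2^2
ssr_sum = np.sum(residuals ** 2)                 # sum of squared components
ssr_quad = residuals @ residuals                 # (y - X beta)^T (y - X beta)

print(np.allclose([ssr_norm, ssr_sum], ssr_quad))  # True
```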

4.3. Matrix Derivation

First, we decompose the SSR by expanding the inner product.\[\begin{align} SSR &= \left|\left|y-X\beta\right|\right|_2^2\\ &= \left|\left|y\right|\right|_2^2 + \left|\left|X\beta\right|\right|_2^2 - 2y^TX\beta\\ &= \left|\left|y\right|\right|_2^2 + (X\beta)^TX\beta - 2y^TX\beta\\ \end{align} \]

Because the SSR is a quadratic (convex) function of \(\beta\), we know it attains its minimum at \(\beta=\hat{\beta}\), where its first derivative is equal to 0. Using the matrix-calculus identities \(\frac{\partial}{\partial\beta}\beta^TA\beta = 2A\beta\) (for symmetric \(A\), here \(A=X^TX\)) and \(\frac{\partial}{\partial\beta}a^T\beta = a\), we get the following.\[\begin{align} \left.\frac{\partial SSR}{\partial \beta}\right|_{\beta=\hat{\beta}}&= \left.\frac{\partial}{\partial \beta}\left(\left|\left|y\right|\right|_2^2 + (X\beta)^TX\beta - 2y^TX\beta\right)\right|_{\beta=\hat{\beta}}\\ &= 0 + 2X^TX\hat{\beta} - 2X^Ty\\ \end{align} \]

Setting this first derivative to 0, we obtain the score equation (also known as the normal equation).\[X^TX\hat{\beta}=X^Ty\]

Our assumptions on the regressor \(X\) allow us to invert \(X^TX\) and we obtain the result.\[\boxed{\hat{\beta}=\left(X^TX\right)^{-1}X^Ty}\]
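Here is a minimal NumPy check of this result on made-up data: it solves the score equation directly, verifies that the gradient of the SSR vanishes at \(\hat{\beta}\), and cross-checks against np.linalg.lstsq, which solves the same least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up data with a known coefficient vector.
n, p = 200, 4
X = rng.normal(size=(n, p))                     # full column rank with prob. 1
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.2, size=n)

# Closed form: beta_hat = (X^T X)^{-1} X^T y (solve rather than invert).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The score equation holds: the gradient of the SSR vanishes at beta_hat.
print(np.allclose(X.T @ (y - X @ beta_hat), 0.0, atol=1e-6))    # True

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))                        # True
```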

We can compute the linear estimator \(\hat{y}\) of \(y\) (the fitted values) using \(\hat{\beta}\).\[ \begin{align}y &= X\hat{\beta} + \hat{\varepsilon}\\\hat{y} &= X\hat{\beta} \end{align}\]Here \(\hat{\varepsilon} = y - X\hat{\beta}\) denotes the residuals, the in-sample counterpart of the error term \(\varepsilon\).

We have made the assumption that \(\mathbf{E}\varepsilon=0\), so we know that \(\mathbf{E}y = \mathbf{E}\hat{y}\).
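As a quick sanity check on the fitted values, here is a small NumPy sketch on made-up data. Note that the sample residuals average to zero only when the design matrix includes an intercept column, which this sketch assumes.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up data with an intercept and one regressor.
n = 300
x = rng.normal(size=n)
y = 0.7 + 1.3 * x + rng.normal(scale=0.4, size=n)

X = np.column_stack([np.ones(n), x])            # intercept column included
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ beta_hat                            # fitted values
residuals = y - y_hat

print(np.isclose(residuals.mean(), 0.0))        # True: residuals average to zero
print(np.isclose(y.mean(), y_hat.mean()))       # True: same sample mean
```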

5. Note on \(R^2\)

The coefficient of determination \(R^2\) is often used to assess the quality of the regression. The closer it is to 1, the better the fit, with 0 meaning the model explains none of the variance.

NB Some Python packages compute an \(R^2\) that can be negative, in which case the fit is worse than simply predicting the mean of \(y\). Towards Data Science has a great post on this one.

The idea behind the coefficient of determination is to measure how much of the sample variance is explained by the linear model. Therefore, we can easily write down the below.\[\begin{align} R^2=\frac{\mathbf{Var}\hat{y}}{\mathbf{Var}y}&=\frac{\sum_i(\hat{y}_i - \mathbf{E}\hat{y})^2}{\sum_i(y_i - \mathbf{E}y)^2}\\ &=\frac{\sum_i(\hat{y}_i - \bar{y})^2}{\sum_i(y_i - \bar{y})^2} \end{align}\]

We introduce the concepts of Explained Sum of Squares (ESS) and Total Sum of Squares (TSS). The decomposition below holds because the residuals are orthogonal to the fitted values and sum to zero when an intercept is included.\[\begin{align} \sum_i(y_i - \bar{y})^2 &= \sum_i(y_i - \hat{y}_i)^2 + \sum_i(\hat{y}_i - \bar{y})^2 \\ TSS &= SSR + ESS \end{align}\]

We can substitute this decomposition into the former equation to obtain another expression of \(R^2\) using SSR.\[\begin{align} R^2 &= \frac{ESS}{TSS} \\ &= \frac{TSS - SSR}{TSS} \end{align}\]This leads to the most common expression of the coefficient of determination.\[\boxed{R^2 = 1 - \frac{SSR}{TSS}}\]
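Putting it all together, here is a minimal NumPy sketch on made-up data (intercept included, so that TSS = SSR + ESS holds) computing \(R^2\) both as ESS/TSS and as 1 - SSR/TSS.

```python
import numpy as np

rng = np.random.default_rng(11)

# Made-up data with an intercept and one regressor.
n = 400
x = rng.normal(size=n)
y = 2.0 + 0.9 * x + rng.normal(scale=0.8, size=n)

X = np.column_stack([np.ones(n), x])            # intercept needed for TSS = SSR + ESS
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

ssr = np.sum((y - y_hat) ** 2)                  # Sum of Squared Residuals
ess = np.sum((y_hat - y.mean()) ** 2)           # Explained Sum of Squares
tss = np.sum((y - y.mean()) ** 2)               # Total Sum of Squares

r2_ess = ess / tss
r2_ssr = 1.0 - ssr / tss

print(np.isclose(tss, ssr + ess))               # True
print(np.isclose(r2_ess, r2_ssr), r2_ess)       # True, value between 0 and 1
```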

If you've found this helpful, check out my other tips here!
