# Overfitting, the Bias-Variance Tradeoff, Regularization, and Double Descent

Preamble: Run the cells below to import the necessary Python packages

These notes are modified and extended from notes by **Volodymyr Kuleshov** at Cornell Tech. They have been adapted from the originals to segue into the physics course and to motivate double descent.

```
## Preamble / required packages
import numpy as np
np.random.seed(0)
## Import local plotting functions and in-notebook display functions
import matplotlib.pyplot as plt
from IPython.display import Image, display
%matplotlib inline
import warnings
## Comment this out to activate warnings
warnings.filterwarnings('ignore')
```

# Overfitting and Underfitting

We have seen a number of supervised learning algorithms.

Next, let's look at why they work (and why sometimes they don't).


# Polynomial Regression

Recall that in linear regression, we fit a linear model to a dataset $X \in \mathbb{R}^{N_\text{data} \times N_\text{features}}$ and a vector of labels $y \in \mathbb{R}^{N_\text{data}}$

$$ \mathbf{y} = \mathbf{X} \boldsymbol{\theta} $$

where $\boldsymbol{\theta} \in \mathbb{R}^{N_\text{features} \times 1}$ is a vector of parameters that we want to learn, which we sometimes call the model weights.

What if we wanted to fit a polynomial instead? We could solve a polynomial regression problem of the form

$$ \mathbf{y} = \Phi(\mathbf{X})\, \boldsymbol{\theta} $$

where $\Phi(\mathbf{X})$ is a matrix of features that are polynomial transformations of the original features, $$ \Phi(\mathbf{X}) = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1d} & x_{11}^2 & x_{12}^2 & \dots & x_{1d}^2 & \dots \\ 1 & x_{21} & x_{22} & \dots & x_{2d} & x_{21}^2 & x_{22}^2 & \dots & x_{2d}^2 & \dots \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \ddots \\ 1 & x_{n1} & x_{n2} & \dots & x_{nd} & x_{n1}^2 & x_{n2}^2 & \dots & x_{nd}^2 & \dots \\ \end{bmatrix} $$

For example, if we have a single feature $x$ ($N_\text{features} = 1$), we could use the lifted featurization $\Phi(x) = [1, x, x^2, x^3, \ldots, x^p]$ to fit a polynomial of degree $p$. Sometimes these feature-wise transformations are called *basis functions* or *kernels*.

This gives a model $$ f_\theta(x) := \theta^\top \phi(x) = \sum_{j=0}^p \theta_j x^j $$ that is linear in $\theta$ but non-linear in $x$, because the features $$\phi(x) = [1\; x\; \ldots\; x^p]$$ are non-linear. Using these features, we can fit any polynomial of degree $p$ by first transforming the features and then fitting a linear model.
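As a minimal sketch of this lift-then-fit recipe (using a made-up 1-D dataset, not the one we build below), we can construct the lifted features with `np.vander` and solve the least-squares problem directly:

```
import numpy as np

## Hypothetical 1-D dataset: a noisy cubic
rng = np.random.default_rng(0)
x_toy = np.linspace(-1, 1, 20)
y_toy = 0.5 * x_toy**3 - x_toy + rng.normal(scale=0.05, size=x_toy.shape)

p = 3                                                # polynomial degree
Phi = np.vander(x_toy, p + 1, increasing=True)       # columns: 1, x, x^2, x^3
theta, *_ = np.linalg.lstsq(Phi, y_toy, rcond=None)  # least-squares estimate of theta

y_hat = Phi @ theta                                  # predictions f_theta(x)
print("fitted coefficients:", theta)
```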

### Create a synthetic dataset

- As a demonstration, we'll make a simple synthetic dataset with a single feature $x$ and a single label $y$.
- We don't know this function in practice. It's the unknown generator of our data.

```
true_fn = lambda X: np.cos(1.5 * np.pi * X)
np.random.seed(0)
n_samples = 30
X = np.sort(np.random.rand(n_samples))
y = true_fn(X) + np.random.randn(n_samples) * 0.1
X_test = np.linspace(0, 1, 100)
plt.plot(X_test, true_fn(X_test), 'k', label="True")
plt.plot(X, y, '.b', markersize=10, label="Sampled")
plt.legend(loc="best")
plt.xlabel("x")
plt.ylabel("y")
```


## The Vandermonde matrix

The Vandermonde matrix $\Phi \in \mathbb{R}^{n \times (p+1)}$ is a matrix whose rows are the features $\phi(x_i)$ for a set of points $x_1,\ldots,x_n$. For example, if we have $n=3$ points $x_1=1$, $x_2=2$, $x_3=3$, and degree $p=2$, then the Vandermonde matrix is $$ \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \end{bmatrix} $$

```
xx = np.array([1, 2, 3])
print(np.vander(xx, 2, increasing=True), end='\n\n')
print(np.vander(xx, 3, increasing=True), end='\n\n')
print(np.vander(xx, 5, increasing=True), end='\n\n')
print(np.vander(xx, 10, increasing=True), end='\n\n')
```

```
[[1 1]
 [1 2]
 [1 3]]

[[1 1 1]
 [1 2 4]
 [1 3 9]]

[[ 1  1  1  1  1]
 [ 1  2  4  8 16]
 [ 1  3  9 27 81]]

[[    1     1     1     1     1     1     1     1     1     1]
 [    1     2     4     8    16    32    64   128   256   512]
 [    1     3     9    27    81   243   729  2187  6561 19683]]
```

### Solving yet another least-squares problem

In principle, we can simply fit a polynomial of degree $p$ to a set of points $(x_1,y_1),\ldots,(x_n,y_n)$ by solving the linear system $\Phi\theta = y$.

$$ \begin{bmatrix} 1 & x_1 & x_1^2 & \ldots & x_1^p \\ 1 & x_2 & x_2^2 & \ldots & x_2^p \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \ldots & x_n^p \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_p \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} $$

Notice how we "linearized" our problem by applying nonlinear functions to our features. This is the basis of many "kernel methods" in machine learning.

### Properties of the Vandermonde matrix

The Vandermonde matrix $\Phi$ has size $n \times (p+1)$, where $n$ is the number of datapoints and $p$ is the degree of the polynomial that we want to fit. In this sense, the "features" of the model are just higher powers of existing features, and so they carry redundant information.

If $p + 1 = n$, then the Vandermonde matrix is square, and as long as the points $x_1,\ldots,x_n$ are all distinct it has full rank, so the linear system has a unique solution via $\Phi^{-1}$. The matrix becomes singular if any of the datapoints are repeated.

If $p + 1 > n$, then the Vandermonde matrix is wide and the linear system is underdetermined: infinitely many polynomials pass through the data. Conversely, if $p + 1 < n$, then the matrix is tall and the system is overdetermined: an exact solution is not guaranteed to exist. In the overdetermined case, we can find the least-squares solution using the Moore-Penrose pseudoinverse $(\Phi^\top\Phi)^{-1}\Phi^\top$.
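As a quick sketch of these three regimes (with made-up points), `np.linalg.solve` handles the square case, and the pseudoinverse handles the other two:

```
import numpy as np

x_pts = np.array([0.0, 0.5, 1.0])                # n = 3 distinct points
y_pts = np.array([1.0, 0.2, -0.5])

## Square case (p + 1 = n): unique interpolating solution
Phi_sq = np.vander(x_pts, 3, increasing=True)
theta_exact = np.linalg.solve(Phi_sq, y_pts)

## Overdetermined case (p + 1 < n): least squares via the pseudoinverse,
## equivalent to (Phi^T Phi)^{-1} Phi^T y
Phi_tall = np.vander(x_pts, 2, increasing=True)  # fit a line to 3 points
theta_ls = np.linalg.pinv(Phi_tall) @ y_pts

## Underdetermined case (p + 1 > n): infinitely many solutions;
## pinv returns the minimum-norm one
Phi_wide = np.vander(x_pts, 5, increasing=True)
theta_min_norm = np.linalg.pinv(Phi_wide) @ y_pts

print(theta_exact, theta_ls, theta_min_norm, sep="\n")
```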

**Important:** as the polynomial degree grows (even at fixed data), the Vandermonde matrix becomes ill-conditioned, which degrades the accuracy of the solution.

```
xx = np.array([1, 2, 3])  # 3 data points
pvals = range(2, 20)
all_condition_numbers = []
for i in pvals:
    all_condition_numbers.append(np.linalg.cond(np.vander(xx, i, increasing=True)))
plt.plot(pvals, all_condition_numbers)
plt.xlabel('Polynomial degree at fixed data quantity')
plt.ylabel('Condition number')
```


## How can we interpret this effect?

Let's try making unit vectors out of the powers of our data vector

```
xx = np.array([1.0, 2.0]) # 2 data points
phi = np.vander(xx, 10, increasing=True)
phi /= np.linalg.norm(phi, axis=0, keepdims=True)
```

```
def plot_vector(vec, **kwargs):
    """Plot a 2D vector as a line segment from the origin."""
    plt.plot([0, vec[0]], [0, vec[1]], **kwargs)

plt.figure(figsize=(6, 6))
plot_vector(phi[:, 0], color='k', label='First column')
plot_vector(phi[:, 1], color='r', label='Second column')
plt.legend()
plt.title("First two columns of the Vandermonde matrix")

plt.figure(figsize=(6, 6))
plot_vector(phi[:, -2], color='k', label='Second-to-last column')
plot_vector(phi[:, -1], color='r', label='Last column')
plt.legend()
plt.title('Last two columns')

## Plot all columns with a color gradient
plt.figure(figsize=(6, 6))
for i in range(10):
    plot_vector(phi[:, i], color=plt.cm.viridis(i / 10))
plt.title('All columns')
```


```
plt.imshow(phi)
plt.xlabel('Power index')
plt.ylabel('Datapoint index')
```


The largest element of our data vector $x$ begins to dominate at higher powers of $x$, causing the column vectors to become more and more similar to each other, and resulting in the least-squares problem becoming ill-conditioned.
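Because the columns of `phi` were normalized to unit length above, their pairwise dot products are cosine similarities; a quick check (reusing `phi` from the cell above) shows adjacent columns approaching perfect alignment at high powers:

```
## Cosine similarity between adjacent unit-norm columns of phi
cos_sim = np.sum(phi[:, :-1] * phi[:, 1:], axis=0)
for j, c in enumerate(cos_sim):
    print(f"cos(col {j}, col {j + 1}) = {c:.6f}")
```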

**Nothing comes for free!** If we don't have sufficient features or measurement channels, we can't always just make up new ones as functions of our existing features. Eventually, redundancy catches up with us. We can't convert an underdetermined problem to an overdetermined problem by adding redundant features.


# Polynomial Regression with scikit-learn

When we switch from linear models to polynomials, we can better fit the data and increase the accuracy of our models.

Instead of directly computing the Vandermonde matrix, we can use the `PolynomialFeatures` class from `sklearn.preprocessing` to compute it for us. Because it's built into scikit-learn, it's optimized for speed and memory usage.

We need to do two operations: featurize the data using the Vandermonde matrix, and then train a linear regression. `scikit-learn` allows us to combine these operations into a single combined model with the `Pipeline` class.

```
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

degrees = [1, 2, 3]
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    polynomial_features = PolynomialFeatures(degree=degrees[i])
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)
    ax.plot(X_test, true_fn(X_test), color='k', label="True function")
    ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    ax.plot(X, y, '.b', markersize=10, label="Sampled")
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title("Polynomial of Degree {}".format(degrees[i]))
```

Although fitting a linear model does not work well, quadratic or cubic polynomials seem to improve the fit.

We can assess whether a given model is underfitting by plotting the residuals, the differences between the true values and the predicted values. If the residuals show systematic structure rather than random scatter, the model is underfitting.

If the true values are $y$ and the predicted values are $\hat{y}$, then the residuals are $y - \hat{y}$. The reason we expect uniform scatter is that our linear regression model can be seen as defining a data-generating process of the form

$$ y = \theta^\top \phi(x) + \epsilon $$

where $\epsilon$ is a random variable with zero mean and constant variance. The residuals are therefore given by

$$ y - \theta^\top \phi(x) = \epsilon $$

```
## Plot residuals for each model
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    polynomial_features = PolynomialFeatures(degree=degrees[i])
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)
    ax.plot(X, y - pipeline.predict(X[:, np.newaxis]), '.b', markersize=10, label="Samples")
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title("Residuals for Degree {}".format(degrees[i]))
```

This suggests one heuristic for determining whether we have chosen a sufficiently expressive model: **uniform scatter in the residuals**.

Conceptually, we view our data-generating process as comprising a deterministic function $f$ and a random noise term $\epsilon$. We want to fit a model that is close to $f$ and that captures everything except the variability in $\epsilon$.

$$ y_i = f(x_i) + \epsilon_i $$

where $f(x_i) = \theta_0 + \theta_1 x_i + ...$ and $\epsilon_i$ is a random variable with mean 0 and variance $\sigma^2$. When we compute our residuals, we are computing the difference between the true values $y_i$ and our fitted model's predictions $\hat{y}_i = \hat{f}(x_i)$: $$ r_i = y_i - \hat{f}(x_i) $$

Thus we want $r_i$ to look like draws from a zero-centered random distribution, consistent with our implicit model of the generating process for this dataset.
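As a quick numerical sanity check (a sketch reusing the degree-3 pipeline from above), the residuals should be roughly zero-mean:

```
## Residuals of the degree-3 fit should look like zero-mean noise
pipeline = Pipeline([
    ("pf", PolynomialFeatures(degree=3)),
    ("lr", LinearRegression()),
])
pipeline.fit(X[:, np.newaxis], y)
residuals = y - pipeline.predict(X[:, np.newaxis])
print(f"residual mean: {residuals.mean():.4f}, residual std: {residuals.std():.4f}")
```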


# Why not go even higher?

- Increasing the degree of our polynomial improved our fit accuracy by producing a model that explained more of the variance in the data.
- Our residuals appeared more uniform as well, suggesting that our model stopped underfitting.

What happens if we further increase the degree of the polynomial?

Runge's phenomenon: we can fit an $N$-point dataset exactly with a polynomial of degree $N-1$, but the resulting interpolant oscillates wildly between the datapoints.

```
degrees = [30]
plt.figure(figsize=(10, 6))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    polynomial_features = PolynomialFeatures(degree=degrees[i])
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)
    X_test = np.linspace(0, 1, 100)
    ax.plot(X_test, true_fn(X_test), color='k', label="True function")
    ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    ax.plot(X, y, '.b', markersize=10, label="Sampled")
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title("Polynomial of Degree {}".format(degrees[i]))
```

#### Let's plot the error on the training points

- Nice.
- This is a more obvious version of the exact same phenomenon we saw with our other models.

```
## Compute train MSE
polynomial_features = PolynomialFeatures(degree=30)
linear_regression = LinearRegression()
pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
pipeline.fit(X[:, np.newaxis], y)
print("Train MSE: {}".format(np.mean((pipeline.predict(X[:, np.newaxis]) - y)**2)))
```

Train MSE: 0.0021376156181802915

### Let's plot the error on a test set

- Uh-oh

```
## Compute test MSE
X_test = np.linspace(0, 1, 100)
print("Test MSE: {}".format(np.mean((pipeline.predict(X_test[:, np.newaxis]) - true_fn(X_test))**2)))
```

Test MSE: 39620.98485755194

### Model complexity

This is a nebulous concept in machine learning. To a first approximation, it's the number of trainable parameters in a model.

However, we've seen before that regularizers and constraints can reduce the effective number of degrees of freedom, and that not all parameter combinations are "reachable" during training.

Nonetheless, we'll treat the number of parameters as an upper bound on the complexity of a model.
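For the polynomial pipeline, this upper bound is easy to read off: it's just the number of fitted coefficients. A rough sketch (note that `PolynomialFeatures` includes a bias column, so this count slightly double-counts the intercept):

```
## Count trainable parameters as a crude proxy for model complexity
for degree in [1, 5, 20]:
    pipeline = Pipeline([
        ("pf", PolynomialFeatures(degree=degree)),
        ("lr", LinearRegression()),
    ])
    pipeline.fit(X[:, np.newaxis], y)
    n_params = pipeline.named_steps["lr"].coef_.size + 1  # weights + intercept
    print(f"degree {degree}: {n_params} parameters")
```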

#### Plotting train/test error vs. model complexity

- This is a key plot used in model selection and hyperparameter tuning.

```
## Plot train and test error versus model size
degrees = range(1, 21)
train_errors = []
test_errors = []
for degree in degrees:
    polynomial_features = PolynomialFeatures(degree=degree)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)
    ## Compute errors on training and testing data
    train_errors.append(np.mean((pipeline.predict(X[:, np.newaxis]) - y)**2))
    test_errors.append(np.mean((pipeline.predict(X_test[:, np.newaxis]) - true_fn(X_test))**2))

plt.figure(figsize=(6, 6))
plt.semilogy(degrees, train_errors, label='Train error')
plt.semilogy(degrees, test_errors, label='Test error')
plt.legend(loc='best')
plt.xlabel('Model complexity (polynomial degree)')
plt.ylabel('Mean Squared Error')
```


### We'd say that the best model is the one that minimizes the test error.
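Reusing the `degrees` and `test_errors` computed in the cell above, that selection rule is one line:

```
## Pick the degree that minimizes test error
best_degree = degrees[int(np.argmin(test_errors))]
print(f"Best degree by test error: {best_degree}")
```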


# The Bias-Variance Tradeoff

### Overfitting

A very expressive model (e.g., a high degree polynomial) fits the training dataset perfectly.

But the model makes highly incorrect predictions outside this dataset, and doesn't generalize.

We would say that the overfit model exhibits **high variance** in its score across different hypothetical testing sets.

We could also say that the model has **low bias**, because it will fit any training data very well.

### Underfitting

A small model (e.g., a straight line) will not fit the training data well.

Therefore, it will also not be accurate on new data.

The model will, however, exhibit **low variance** in its error on random testing sets.

We say that it has **high bias**, due to its strong tendency to do a limited number of things.


### Determining Overfitting vs. Underfitting

We can diagnose overfitting and underfitting by measuring performance on a held-out test dataset (not used for training).

If training performance is high but holdout performance is low, we are overfitting.

If both training and holdout performance are low, we are underfitting.

As we've seen previously, the gap between train and test scores is one measure of overfitting. The larger the gap, the more overfitting.

How do we choose the correct model complexity? For our polynomial example, the degree $p$ is a hyperparameter. We need to tune it on a validation set, or run cross-validation on our training partition, as sketched below.
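Here is a sketch of that tuning loop using scikit-learn's `GridSearchCV` (the grid of candidate degrees is illustrative):

```
from sklearn.model_selection import GridSearchCV

## 5-fold cross-validation over the polynomial degree, on training data only
pipeline = Pipeline([("pf", PolynomialFeatures()), ("lr", LinearRegression())])
param_grid = {"pf__degree": list(range(1, 21))}
search = GridSearchCV(pipeline, param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X[:, np.newaxis], y)
print("Best degree by cross-validation:", search.best_params_["pf__degree"])
```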

```
degrees = [1, 20, 5]
titles = ['Underfitting', 'Overfitting', 'A Good Fit']
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    polynomial_features = PolynomialFeatures(degree=degrees[i])
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)
    ax.plot(X_test, true_fn(X_test), color='k', label="True function")
    ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    ax.plot(X, y, '.b', markersize=10, label="Sampled")
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title("{} (Degree {})".format(titles[i], degrees[i]))
```

## How to Fix Underfitting

Feature engineering: find better features that will make the dataset easier to fit.

Pick a more expressive model class (random forests or neural networks instead of linear models).

Pick an optimization algorithm that can find better parameters (SGD instead of gradient descent).

## How to Fix Overfitting

Use a simpler model class (a linear model instead of a neural net)

Use fewer trainable parameters (a smaller neural net)

Keep the same model, but collect more training data

**Regularization: modify the training process to penalize overly complex models.**


# Regularization

We will try applying L2 or Ridge regularization to our polynomial regression model.

We will also evaluate performance on the training set and the test set, again using the coefficient of determination $R^2$ as our metric.

Why are we using L2 regularization instead of L1 (Lasso)?

### L2 Regularization for Polynomial Regression

Let's consider an application to the polynomial model we have seen so far. Given polynomial features $\phi(x)$, we optimize the following objective:

$$ J(\theta) = \frac{1}{2n} \sum_{i=1}^n \left( y^{(i)} - \theta^\top \phi(x^{(i)}) \right)^2 + \frac{\lambda}{2} \cdot ||\theta||_2^2. $$
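Minimizing the unnormalized version of this objective, $\|y - \Phi\theta\|_2^2 + \lambda \|\theta\|_2^2$, gives a closed-form solution, $\theta = (\Phi^\top\Phi + \lambda I)^{-1}\Phi^\top y$; for the $\frac{1}{2n}$-scaled objective above, $\lambda$ is simply rescaled by $n$. A minimal numpy sketch (which, for simplicity, also penalizes the intercept, unlike scikit-learn's `Ridge` below):

```
def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution: (Phi^T Phi + lam * I)^{-1} Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

## Degree-20 features for our synthetic data, with a small penalty
Phi = np.vander(X, 21, increasing=True)
theta_ridge = ridge_fit(Phi, y, lam=1e-3)
print("coefficient norm:", np.linalg.norm(theta_ridge))
```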

```
from sklearn.linear_model import Ridge

lambda_values = [0.0, 1e-8, 1e-6, 1e-3, 1e-1, 1e0, 1e1]
plt.figure(figsize=(10, 25))
for i, lambda_value in enumerate(lambda_values):
    ax = plt.subplot(len(lambda_values), 1, i + 1)
    polynomial_features = PolynomialFeatures(degree=20)
    linear_regression = Ridge(alpha=lambda_value)
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)
    X_test = np.linspace(0, 1, 100)
    ax.plot(X_test, true_fn(X_test), color='k', label="True function")
    ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    ax.plot(X, y, '.b', markersize=10, label="Sampled")
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ## pipeline.score returns the coefficient of determination R^2, not MSE
    ax.set_title(f"Regularization {lambda_value}, $R^2$: {pipeline.score(X_test[:, None], true_fn(X_test)):.4f}")
```