Optimization in many dimensions¶
Preamble: Run the cells below to import the necessary Python packages
This notebook was created by William Gilpin. Consult the course website for all content and GitHub repository for raw files and runnable online code.
import numpy as np
from IPython.display import Image, display
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext autoreload
%autoreload 2
## Set nicer colors
plt.rcParams['image.cmap'] = 'PuBu'
plt.rcParams['axes.prop_cycle'] = plt.cycler(color=[[1.0, .3882, .2784]])
plt.rcParams['lines.markersize'] = 10
The autoreload extension is already loaded. To reload it, use: %reload_ext autoreload
Optimization in a complex landscape¶
Previously, we considered optimization of functions that depend on a single variable. In this notebook, we will consider optimization of functions that depend on many variables.
Our multivariate landscape is a sum of Gaussian wells, $$ \mathcal{L} = \sum_{p=1}^P \mathcal{L}_p = -\sum_{p=1}^P A_p \exp\left(-\frac{(\mathbf{x} - \boldsymbol{\mu}_{p})^2}{2\sigma^2}\right) $$ where $\mathbf{x} \in \mathbb{R}^D$ is the position in the landscape, $\boldsymbol{\mu}_p \in \mathbb{R}^D$ is the location of the $p^{th}$ well, and $\sigma \in \mathbb{R}$ is the width of the wells (in the implementation below, each well gets its own width $\sigma_p$). The weights $A_p \geq 0$, normalized so that $\sum_p A_p = 1$, are the relative depths of the wells.
Recall that, during optimization, we usually think of our "cost" as the number of calls to the function we are trying to optimize. Here, to implement this landscape, we will make calculating its value at a single point as fast as possible using vectorization. Since the well positions are all fixed, and the landscape is a sum over them, we can vectorize the sum over wells. Additionally, if we want to calculate the landscape value at many points at once, we can use broadcasting, or "batching," because the different evaluations are independent of each other.
class RandomLossLandscapeWithoutGradient:
    """
    Creates a random d-dimensional loss landscape with multiple circular gaussian wells

    Args:
        d (int): number of dimensions for the loss landscape
        n_wells (int): number of gaussian wells
        random_state (int): random seed
    """
    def __init__(self, d=2, n_wells=3, random_state=None):
        # Fix the random seed
        self.random_state = random_state
        np.random.seed(random_state)
        # Select random well weights, locations, and widths
        self.coeffs = np.random.random(n_wells)
        self.coeffs /= np.sum(self.coeffs)
        self.locs = np.random.randn(n_wells, d)
        self.widths = np.random.rand(n_wells)[None, :]

    def _gaussian_well(self, X, width=1):
        """A single gaussian well centered at 0 with width `width`"""
        return -np.exp(-np.sum((X / width) ** 2, axis=1))

    def loss(self, X):
        """
        Compute the loss landscape at points X

        Args:
            X (np.ndarray): points at which to compute the loss landscape. This should
                be of shape (n_batch, n_dim)

        Returns:
            np.ndarray: loss landscape at points X

        Notes:
            The loss landscape is computed as the weighted sum of the individual
            gaussian wells. The argument to np.sum has shape (n_batch, n_wells)
        """
        # Broadcast to shape (n_batch, n_dim, n_wells), evaluate each well, then
        # take the weighted sum over wells
        return np.sum(
            self._gaussian_well(X[..., None] - self.locs.T[None, :], self.widths) * self.coeffs,
            axis=1
        )

    def __call__(self, X):
        return self.loss(X)
loss = RandomLossLandscapeWithoutGradient(random_state=0, n_wells=8)
## We have to make a list of points at which to plot the landscape. NumPy's meshgrid is
## a built-in utility for this purpose.
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
xx, yy = np.meshgrid(x, y)
X = np.array([xx.ravel(), yy.ravel()]).T
print(X.shape)
Z = loss(X) # same as loss.loss(X) because class is callable
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z)
plt.plot(xx.ravel()[Z.argmin()], yy.ravel()[Z.argmin()], '*', markersize=20)
plt.axis('off')
(10000, 2)
(-3.3, 3.3, -3.3, 3.3)
Recall that many optimization routines requre the gradient of the landscape. We can analytically calculate the gradient of this landscape
$$ \mathcal{L} = \sum_{p=1}^P \mathcal{L}_p = -\sum_{p=1}^P \exp\left(-\frac{(\mathbf{x} - \boldsymbol{\mu}_{p})^2}{2\sigma^2}\right) $$
$$ \nabla \mathcal{L} = \sum_{p=1}^P \frac{(\mathbf{x} - \boldsymbol{\mu}_{p})}{\sigma^2} \exp\left(-\frac{(\mathbf{x} - \boldsymbol{\mu}_{p})^2}{2\sigma^2}\right) $$
We can implement both the landscape and its gradient within a single loss object. As we saw in univariate optimization, we will likely find ourselves calling the loss function frequently, and so we will try to make our implementation as efficient as possible. In our implementation, we vectorize the computation of the loss and gradient over the $P$ wells using np.einsum
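The cell defining this combined object does not appear above, but later cells subclass RandomLossLandscape and call loss.grad, so here is a minimal sketch consistent with that usage and with the gradient formula. The einsum subscripts, the factor of 2 (which compensates for the missing 1/2 in the exponent of _gaussian_well), and the re-instantiation of loss are our reconstruction:
class RandomLossLandscape(RandomLossLandscapeWithoutGradient):
    """Adds an analytic gradient to the random gaussian loss landscape"""

    def grad(self, X):
        """
        Compute the gradient of the loss at points X

        Args:
            X (np.ndarray): points of shape (n_batch, n_dim)

        Returns:
            np.ndarray: gradient of shape (n_batch, n_dim)
        """
        diffs = X[..., None] - self.locs.T[None, :]      # (n_batch, n_dim, n_wells)
        wells = self._gaussian_well(diffs, self.widths)  # (n_batch, n_wells)
        # d/du of -exp(-(u / w)^2) is -(2 u / w^2) * well(u); contract over wells
        return np.einsum(
            'bdw,bw,w->bd', -2 * diffs / self.widths ** 2, wells, self.coeffs
        )

loss = RandomLossLandscape(random_state=0, n_wells=8)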
Some considerations when performing multivariate optimization¶
Having a smooth landscape is important for many optimization algorithms to work well, because most local search methods require gradient information.
Even on a smooth landscape, nearly flat regions pose challenges because they provide almost no gradient signal. In our Gaussian landscape, the regions far away from the $P$ centers are flat, and so we expect optimizers initialized in these regions to struggle on this problem.
Computing analytic gradients is hard¶
- Computing analytic gradients is hard, and often impossible. We can use finite differences to approximate the gradient along each dimension:
$$ \frac{\partial f}{\partial x_i} \approx \frac{f(x_1, \ldots, x_i + \epsilon, \ldots, x_D) - f(x_1, \ldots, x_i - \epsilon, \ldots, x_D)}{2\epsilon} $$
- Finite difference gradients are computationally expensive: In $d$ dimensions, we need to evaluate the function $\mathcal{O}(d)$ times to compute the gradient. This is impractical for high-dimensional problems, but it is acceptable for our 2D landscape.
- Symbolic gradients are useful, but prone to errors
- We use a gradcheck function to mitigate the risk of errors in our gradients.
- (Later in the course): Faster gradient estimation for more complex loss functions with automatic differentiation (the symbolic chain rule for arbitrary compositions of primitive functions).
def gradcheck(loss, x, eps=1e-9):
    """
    This function computes the exact gradient and an approximate gradient using
    central finite differences

    Args:
        loss (callable): loss function. Must have a method `grad` that returns the
            analytic gradient
        x (np.ndarray): input to the loss function
        eps (float): step size for finite differences

    Returns:
        grad (np.ndarray): analytic gradient
        grad_num (np.ndarray): numerical gradient
    """
    x = np.array(x)
    grad = np.squeeze(loss.grad(x))
    grad_num = list()
    for i in range(x.shape[0]):
        x1, x2 = x.copy(), x.copy()
        x1[i] += eps
        x2[i] -= eps
        grad_num.append((loss(x1) - loss(x2)) / (2 * eps))
    # Squeeze so that both gradients have the matching shape (n_dim,)
    grad_num = np.squeeze(np.array(grad_num))
    return grad, grad_num
grad, grad_num = gradcheck(loss, np.random.randn(2))
print(f"True gradient: {grad.squeeze()}\nApproximate gradient: {grad_num.squeeze()}")
print(np.allclose(grad, grad_num, atol=1e-1))
True gradient: [ 0.00512768 -0.0059314 ] Approximate gradient: [ 0.00512768 -0.0059314 ] True
Why not use exhaustive global search?¶
Why bother using gradient descent when we can just check every point in our system's domain? That's literally how we generated the plot above.
For a continuous field, we need to pick some minimum length scale over which we don't expect the objective to change rapidly. For discrete optimization, we just try all combinations.
Assume we need $N$ points to sample along any linear dimension. The number of queries to the objective function scales as $N^d$, where $d$ is the dimensionality
- For the example we just used, we sampled $N = 100$ points along each dimension, resulting in $100^2 = 10^4$ function calls.
- Assuming $100$ points per axis, when $d = 3$ we have $10^6$ while for $d = 7$ it takes $10^{14}$, etc
- Let's assume it takes $10$ $\mu s$ per evaluation. The 2D global search takes $0.1$ s, the 3D search takes $10$ s, the $d = 5$ search takes $1.15$ days, and the $d = 7$ search takes $32$ years
What is a typical value of $d$? OpenAI's DallE image generation model contains $12$ billion parameters ($d \approx 10^{10}$).
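As a quick sanity check of the numbers above (a back-of-the-envelope sketch; the 100 points per axis and 10 $\mu s$ per evaluation are the assumptions stated in the list):
# Rough cost of an exhaustive grid search, assuming 100 points per axis
# and 10 microseconds per function evaluation
n_per_axis, t_eval = 100, 1e-5
for d in [2, 3, 5, 7]:
    total_seconds = n_per_axis**d * t_eval
    print(f"d = {d}: {n_per_axis**d:.0e} calls, {total_seconds / 86400:.2g} days")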
x = np.linspace(-3, 3, 10)
y = np.linspace(-3, 3, 10)
xx, yy = np.meshgrid(x, y)
X = np.array([xx.ravel(), yy.ravel()]).T
Z = loss(X) # same as loss.loss(X) because class is callable
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z, s=2000)
plt.plot(xx.ravel()[Z.argmin()], yy.ravel()[Z.argmin()], '*r', markersize=10)
plt.axis('off')
(-3.3, 3.3, -3.3, 3.3)
Optimizing in many dimensions¶
- We will now implement a series of optimization algorithms to find the minima of our Gaussian landscape
- We will first define a base class for our optimizers, which will contain the common functionality of all optimizers. Many optimization algorithms are iterative, meaning that they start out at some (usually randomly-chosen) point on the landscape, and then incrementally update their position in order to reach the minimum.
- The constructor will take a loss function itself, a learning rate that controls the rate of iterative updating, a maximum number of iterations, a tolerance that determines how close the optimization needs to get to a minimum to stop, a random seed, and a flag to store the history of steps taken during optimization
The base class will contain a fit method that performs the optimization, and an update method that will be implemented by each subclass
class BaseOptimizer:
    """A base class for multivariate iterative optimizers

    Args:
        loss (callable): loss function. Must have a method `grad` that returns the
            analytic gradient
        lr (float): learning rate
        max_iter (int): maximum number of iterations
        tol (float): tolerance for stopping criterion
        random_state (int): random seed
        store_history (bool): whether to store the optimization trajectory
    """
    def __init__(self, loss, lr=0.1, max_iter=1000, tol=1e-6, random_state=None, store_history=False):
        self.loss = loss
        self.lr = lr
        self.max_iter = max_iter
        self.tol = tol
        self.random_state = random_state
        np.random.seed(random_state)
        self.store_history = store_history
        if self.store_history:
            self.Xs = []
            self.losses = []

    def fit(self, X):
        """Fit the optimizer to the data

        Args:
            X (np.ndarray): initial guess for the optimizer. Shape (n_dim,) or
                (n_batch, n_dim) if we have multiple initial points

        Returns:
            self
        """
        self.X = X
        if self.store_history:
            self.Xs = [self.X.copy()]
            self.losses = [self.loss(X)]
        for i in range(self.max_iter):
            self.X = self.update(self.X)
            if self.store_history:
                self.Xs.append(self.X.copy())
                self.losses.append(self.loss(self.X))
            # Stop early once the gradient norm, aggregated across all batch
            # elements, falls below the tolerance
            if np.linalg.norm(self.loss.grad(self.X)) < self.tol:
                break
        return self

    def update(self, X):
        raise NotImplementedError("Implement this method in a subclass")
Multivariate gradient descent¶
- A first-order optimization method because it only requires the first partial derivatives of the objective function
- If we assume that our current position is $\mathbf{x}$, then the gradient of the objective function at this point is $\nabla \mathcal{L}(\mathbf{x})$
The gradient descent update rule becomes $$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla \mathcal{L}(\mathbf{x})$$ where $\eta$ is a step size parameter we call the "learning rate" that determines the size of the step we take along the negative gradient direction
class GradientDescent(BaseOptimizer):
    """A Multivariate Gradient Descent Optimizer"""
    def update(self, X):
        return X - self.lr * self.loss.grad(X)
# Initialize optimizer
optimizer = GradientDescent(loss, lr=4.3/10000, max_iter=2000, tol=1e-6, random_state=0, store_history=True)
# Initialize many starting points
X0 = 6 * np.random.random(size=(100, 2)) - 3
# Fit optimizer
optimizer.fit(X0.copy())
<__main__.GradientDescent at 0x17fc7b210>
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
xx, yy = np.meshgrid(x, y)
X = np.array([xx.ravel(), yy.ravel()]).T
Z = loss.loss(X)
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z)
plt.plot(*X0.T, '.k')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Initial Guesses')
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z)
plt.plot(*optimizer.X.T, '.')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Final Guesses')
Text(0.5, 1.0, 'Final Guesses')
We next plot the loss versus time for each of the starting points.
plt.figure()
plt.plot(optimizer.losses, color=(0.7, 0.7, 0.7), lw=1, alpha=0.2)
plt.plot(np.mean(optimizer.losses, axis=1), 'k', lw=3)
plt.xlabel('Iteration')
plt.ylabel('Loss')
Text(0, 0.5, 'Loss')
Xs = np.array(optimizer.Xs)
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z)
plt.plot(Xs[0, :, 0], Xs[0, :, 1], '.');
plt.plot(Xs[:, :, 0], Xs[:, :, 1], '-');
# plt.plot(Xs[-1, :, 0], Xs[-1, :, 1], '.');
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Final Guesses')
Text(0.5, 1.0, 'Final Guesses')
## Make an interactive video
from ipywidgets import interact, interactive, fixed, interact_manual, Layout
import ipywidgets as widgets
Xs = np.array(optimizer.Xs)
def plotter(i):
    plt.figure(figsize=(6, 6))
    plt.scatter(X[:, 0], X[:, 1], c=Z)
    plt.plot(Xs[0, :, 0], Xs[0, :, 1], '.')
    plt.plot(Xs[:i, :, 0], Xs[:i, :, 1])
    plt.plot(Xs[i, :, 0], Xs[i, :, 1], '.b')
    plt.xlim([-3, 3])
    plt.ylim([-3, 3])
    plt.axis('off')
    plt.show()
interact(
plotter,
i=widgets.IntSlider(0, 0, Xs.shape[0] - 1, 1, layout=Layout(width='800px'))
)
interactive(children=(IntSlider(value=0, description='i', layout=Layout(width='800px'), max=2000), Output()), …
<function __main__.plotter(i)>
Hyperparameter tuning: What is the correct learning rate?¶
Hyperparameter tuning is a big deal in machine learning; we will revisit this topic later in the course
For now, we will use intuition to see how the learning rate and other hyperparameters affect the dynamics of the system
Momentum and stochasticity¶
We can think of gradient descent as the dynamics of a first-order overdamped system:
For a particle of mass $m$ in a potential $\mathcal{U}(\mathbf{x})$, the forces acting on the particle are given by Newton's second law: $$ \mathbf{F} = -\nabla \mathcal{U}(\mathbf{x}) $$
$$ \mathbf{a} = \frac{\mathbf{F}}{m} $$
If we assume linear damping with a damping coefficient $\gamma$, then the dynamics of the particle are given by the kinematic equation:
$$ m \ddot{\mathbf{x}} = -\gamma \dot{\mathbf{x}} - \nabla \mathcal{U}(\mathbf{x}) $$
If we assume $\gamma \gg m$, then the overdamped dynamics are given by the first-order equation of motion: $$ \dot{\mathbf{x}} = -\frac{1}{\gamma}\nabla \mathcal{U}(\mathbf{x}) $$
In discrete time, the derivatives $\dot{\mathbf{x}}$ can be approximated using finite differences: $$ \dot{\mathbf{x}} \approx \frac{\mathbf{x}_{t+1} - \mathbf{x}_t}{\Delta t} $$
In discrete time, this equation corresponds to the update rule: $$ \mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla \mathcal{U}(\mathbf{x}_t) $$ where $\eta = \Delta t / \gamma$ is the learning rate.
Momentum¶
- What if we want to take larger steps in the direction of the gradient?
In the overdamped limit, we can think of the gradient descent update rule as a first-order system with a damping coefficient $\gamma$. What if we are not in the overdamped limit?
The kinematic equation remains: $$ m \ddot{\mathbf{x}} = -\gamma \dot{\mathbf{x}} - \nabla \mathcal{U}(\mathbf{x}) $$
In discrete time, the derivatives $\dot{\mathbf{x}}$ and $\ddot{\mathbf{x}}$ can be approximated using finite differences: $$ \dot{\mathbf{x}} \approx \frac{\mathbf{x}_{t+1} - \mathbf{x}_t}{\Delta t} $$ $$ \ddot{\mathbf{x}} \approx \frac{\mathbf{x}_{t+1} - 2\mathbf{x}_t + \mathbf{x}_{t-1}}{\Delta t^2} $$
We can therefore add a momentum term to the update rule: $$ \mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla \mathcal{U}(\mathbf{x}_t) + \beta \left(\mathbf{x}_{t} - \mathbf{x}_{t-1}\right) $$ where $\beta$ is a momentum parameter proportional to mass.
class GradientDescentMomentum(BaseOptimizer):
    """A multivariate gradient descent optimizer with momentum"""
    def __init__(self, loss, momentum=0.9, **kwargs):
        super().__init__(loss, **kwargs)
        self.momentum = momentum
        self.v = None

    def update(self, X):
        if self.v is None:
            self.v = np.zeros_like(X)
        # Accumulate a velocity that mixes the previous step with the new gradient
        self.v = self.momentum * self.v - self.lr * self.loss.grad(X)
        return X + self.v
# Initialize optimizer
optimizer = GradientDescentMomentum(loss, lr=0.1, momentum=0.9*5, max_iter=2000, tol=1e-6,
random_state=0, store_history=True)
# Initialize starting point
X0 = 6 * np.random.random(size=(100, 2)) - 3
# Fit optimizer
optimizer.fit(X0.copy())
<__main__.GradientDescentMomentum at 0x29ef64610>
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
xx, yy = np.meshgrid(x, y)
X = np.array([xx.ravel(), yy.ravel()]).T
Z = loss.loss(X)
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z)
plt.plot(*X0.T, '.k')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Initial Guesses')
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z)
plt.plot(*optimizer.X.T, '.k')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Final Guesses')
Text(0.5, 1.0, 'Final Guesses')
plt.figure()
plt.plot(optimizer.losses, color=(0.7, 0.7, 0.7), lw=1, alpha=0.2)
plt.plot(np.mean(optimizer.losses, axis=1), 'k', lw=3)
plt.xlabel('Iteration')
plt.ylabel('Loss')
Text(0, 0.5, 'Loss')
Xs = np.array(optimizer.Xs)
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z)
plt.plot(Xs[0, :, 0], Xs[0, :, 1]);
plt.plot(Xs[:, :, 0], Xs[:, :, 1]);
plt.plot(Xs[-1, :, 0], Xs[-1, :, 1], '.b');
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Final Guesses')
Text(0.5, 1.0, 'Final Guesses')
## Make an interactive video
from ipywidgets import interact, interactive, fixed, interact_manual, Layout
import ipywidgets as widgets
Xs = np.array(optimizer.Xs)
def plotter(i):
    plt.figure(figsize=(6, 6))
    plt.scatter(X[:, 0], X[:, 1], c=Z)
    plt.plot(Xs[0, :, 0], Xs[0, :, 1], '.k')
    plt.plot(Xs[:i, :, 0], Xs[:i, :, 1], 'r')
    plt.plot(Xs[i, :, 0], Xs[i, :, 1], '.b')
    plt.xlim([-3, 3])
    plt.ylim([-3, 3])
    plt.axis('off')
    plt.show()
interact(
plotter,
i=widgets.IntSlider(0, 0, Xs.shape[0] - 1, 1, layout=Layout(width='800px'))
)
interactive(children=(IntSlider(value=0, description='i', layout=Layout(width='800px'), max=2000), Output()), …
<function __main__.plotter(i)>
Stochastic Gradient Descent¶
We can also add stochasticity to the gradient descent update rule
$$ \mathbf{x} \leftarrow \mathbf{x} - \eta \nabla \mathcal{L}(\mathbf{x}) + \boldsymbol{\epsilon_t} $$ where $\epsilon_t$ is a random noise term
Dynamical interpretation:¶
We can add a time-dependent noise term to our overdamped dynamical system for gradient descent. $$ \dot{\mathbf{x}} = -\frac{1}{\gamma}\nabla \mathcal{U}(\mathbf{x}) + \boldsymbol{\xi_t} $$
Usually we assume that the noise term is a Langevin process, such as the random "kicks" encountered during the Brownian motion of a particle in a fluid. For each component of the vector $\boldsymbol{\xi_t}$, $$ \langle \xi_t \xi_s \rangle = 2 \gamma \delta(t-s) $$ $$ \langle \xi_t \rangle = 0 $$ Over discrete time steps, this corresponds to the noise vector having a Gaussian distribution with zero mean. We also assume that the different dimensions of the noise force are uncorrelated (the covariance matrix is diagonal).
class StochasticGradientDescent(BaseOptimizer):
    """A multivariate gradient descent optimizer with gradient noise"""
    def __init__(self, loss, noise, **kwargs):
        super().__init__(loss, **kwargs)
        self.noise = noise

    def update(self, X):
        grad = self.loss.grad(X)
        # Perturb the gradient with isotropic gaussian noise
        noisy_grad = grad + self.noise * np.random.randn(*grad.shape)
        return X - self.lr * noisy_grad
A Caveat: SGD is different in machine learning¶
- Our stochastic gradient descent update rule adds random noise directly to the gradient at each step
- The stochastic gradient descent rules often encountered in machine learning models randomly sample a minibatch of training data points. Effectively, the loss landscape of ML models is a noisy version of the loss landscape we have been considering so far, and it is approximated by the training data---so randomly subsampling data is equivalent to randomly sampling the true loss landscape
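To make the distinction concrete, here is a minimal sketch of minibatch SGD on a hypothetical least-squares problem (the data and names here are illustrative, not part of the notebook's landscape). The gradient of a mean-over-data loss is estimated from a random minibatch, so the noise enters through subsampling rather than through an explicit noise term:
# Minibatch SGD on a hypothetical least-squares problem: fit y = w * x
rng = np.random.default_rng(0)
x_data = rng.uniform(-1, 1, 500)
y_data = 2.0 * x_data + 0.1 * rng.standard_normal(500)
w = 0.0
for step in range(200):
    idx = rng.choice(500, size=32, replace=False)  # random minibatch
    # Unbiased (but noisy) estimate of the full gradient of the mean squared error
    grad_w = np.mean(2 * (w * x_data[idx] - y_data[idx]) * x_data[idx])
    w -= 0.1 * grad_w
print(w)  # converges near the true slope of 2.0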
# Initialize optimizer
optimizer = StochasticGradientDescent(loss, lr=0.1, noise=0.2, max_iter=2000, tol=1e-6, random_state=0, store_history=True)
# Initialize starting point
X0 = 6 * np.random.random(size=(100, 2)) - 3
# Fit optimizer
optimizer.fit(X0.copy())
<__main__.StochasticGradientDescent at 0x12b4df710>
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
xx, yy = np.meshgrid(x, y)
X = np.array([xx.ravel(), yy.ravel()]).T
Z = loss.loss(X)
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z)
plt.plot(*X0.T, '.k')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Initial Guesses')
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z)
plt.plot(*optimizer.X.T, '.k')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Final Guesses')
Text(0.5, 1.0, 'Final Guesses')
plt.figure()
plt.plot(optimizer.losses, color=(0.7, 0.7, 0.7), lw=1, alpha=0.2)
plt.plot(np.mean(optimizer.losses, axis=1), 'k', lw=3)
plt.xlabel('Iteration')
plt.ylabel('Loss')
Text(0, 0.5, 'Loss')
Xs = np.array(optimizer.Xs)
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z)
plt.plot(Xs[0, :, 0], Xs[0, :, 1], '.');
plt.plot(Xs[:, :, 0], Xs[:, :, 1]);
plt.plot(Xs[-1, :, 0], Xs[-1, :, 1], '.b');
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Final Guesses')
Text(0.5, 1.0, 'Final Guesses')
## Make an interactive video
from ipywidgets import interact, interactive, fixed, interact_manual, Layout
import ipywidgets as widgets
Xs = np.array(optimizer.Xs)
def plotter(i):
    plt.figure(figsize=(6, 6))
    plt.scatter(X[:, 0], X[:, 1], c=Z)
    plt.plot(Xs[0, :, 0], Xs[0, :, 1], '.k')
    plt.plot(Xs[:i, :, 0], Xs[:i, :, 1], 'r')
    plt.plot(Xs[i, :, 0], Xs[i, :, 1], '.b')
    plt.xlim([-3, 3])
    plt.ylim([-3, 3])
    plt.axis('off')
    plt.show()
interact(
plotter,
i=widgets.IntSlider(0, 0, Xs.shape[0] - 1, 1, layout=Layout(width='800px'))
)
interactive(children=(IntSlider(value=0, description='i', layout=Layout(width='800px'), max=2000), Output()), …
<function __main__.plotter(i)>
Constrained optimization with projected gradient descent¶
- So far, the individual components of our vector $\mathbf{x}$ have been updated independently of each other. They only interact in the sense that together they determine the value of the loss function $\mathcal{L}(\mathbf{x})$, from which we compute the gradient.
- What if we want to impose constraints on the components of $\mathbf{x}$?
Example: The squared elements of $\mathbf{x}$ must sum to one. This is a common constraint when we are optimizing over unit directions in a space. (For probability distributions, the analogous constraint is $\sum_i x_i = 1$ with $x_i \geq 0$.) $$ \sum_i x_i^2 = 1 $$
If we denote the constraint set as $\mathcal{S} = \{ \mathbf{x} \in \mathbb{R}^n : \sum_i x_i^2 = 1 \}$, then we can project a vector $\mathbf{x}$ onto the set by solving the following optimization problem: $$ \text{Proj}_{\mathcal{S}}(\mathbf{x}) = \arg \min_{\mathbf{y} \in \mathcal{S}} \|\mathbf{y} - \mathbf{x}\|^2 $$ For the unit-norm set, this projection has the closed form $\text{Proj}_{\mathcal{S}}(\mathbf{x}) = \mathbf{x}/\|\mathbf{x}\|$.
Projected gradient descent¶
We enforce the constraint by projecting the output of the gradient descent update onto the set at each step $$ \mathbf{x} \leftarrow \mathbf{x} - \eta \nabla \mathcal{L}(\mathbf{x}) $$ followed by the projection $$ \mathbf{x} \leftarrow \text{Proj}_{\mathcal{S}}(\mathbf{x}) $$
class ProjectedGradientDescent(BaseOptimizer):
    """A multivariate gradient descent optimizer with projection onto the unit sphere"""
    def __init__(self, loss, **kwargs):
        super().__init__(loss, **kwargs)

    def update(self, X):
        # Project onto the constraint set, take a gradient step at the projected
        # point, then project the result back onto the set
        X = self.project(X)
        grad = self.loss.grad(X)
        return self.project(X - self.lr * grad)

    def project(self, X):
        """Project onto the set of points with unit Euclidean norm"""
        X = X.copy()
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        return X
# Initialize optimizer
optimizer = ProjectedGradientDescent(loss, lr=0.1, max_iter=2000, tol=1e-6, random_state=0, store_history=True)
# Initialize starting point
X0 = 6 * np.random.random(size=(100, 2)) - 3
# Fit optimizer
optimizer.fit(X0.copy())
<__main__.ProjectedGradientDescent at 0x17a48a110>
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
xx, yy = np.meshgrid(x, y)
X = np.array([xx.ravel(), yy.ravel()]).T
Z = loss.loss(X)
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z)
plt.plot(*X0.T, '.k')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Initial Guesses')
plt.figure(figsize=(8, 8))
Xs = np.array(optimizer.Xs)
plt.scatter(X[:, 0], X[:, 1], c=Z, zorder=-2)
plt.plot(Xs[0, :, 0], Xs[0, :, 1], '.k');
plt.plot(Xs[:, :, 0], Xs[:, :, 1], 'r', zorder=-1);
plt.plot(Xs[-1, :, 0], Xs[-1, :, 1], '.b', markersize=10);
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Final Guesses')
Text(0.5, 1.0, 'Final Guesses')
## Make an interactive video
from ipywidgets import interact, interactive, fixed, interact_manual, Layout
import ipywidgets as widgets
Xs = np.array(optimizer.Xs)
def plotter(i):
    plt.figure(figsize=(6, 6))
    plt.scatter(X[:, 0], X[:, 1], c=Z)
    plt.plot(Xs[i, :, 0], Xs[i, :, 1], '.r')
    plt.xlim([-3, 3])
    plt.ylim([-3, 3])
    plt.axis('off')
    plt.show()
interact(
plotter,
i=widgets.IntSlider(0, 0, Xs.shape[0] - 1, 1, layout=Layout(width='800px'))
)
interactive(children=(IntSlider(value=0, description='i', layout=Layout(width='800px'), max=2000), Output()), …
<function __main__.plotter(i)>
What about inequality constraints?¶
- What if we don't strictly require that the individual elements $x_i$ add up to something, but instead only that they are bounded?
- For example, $\sum_i x_i^2 \leq 1$ or $0 \leq x_i \leq 1$?
- Here we are constrained to a subdomain, not a lower-dimensional subspace. This is the setting for interior point methods; for simple bounds, projection still works, as sketched after this list
- See the textbook by Boyd and Vandenberghe for discussion of these methods
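For simple bound constraints, projection remains easy because it reduces to clipping each coordinate. A minimal sketch of one projected step under the hypothetical box constraint $0 \leq x_i \leq 1$ (this is a contrast to interior point methods, not an implementation of them):
# One projected gradient step under box constraints (clipping is the projection)
X_box = np.random.random(size=(100, 2))   # start inside the box
X_step = X_box - 0.1 * loss.grad(X_box)   # plain gradient step
X_proj = np.clip(X_step, 0.0, 1.0)        # project back into [0, 1]^2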
Equality constrained optimization with Lagrange multipliers¶
We can also enforce equality constraints by adding a Lagrange multiplier term to the loss function. For example, for our unit norm constraint, we augment the loss function to the following, $$ \mathcal{L}^{(a)}(\mathbf{x}, \lambda) = \mathcal{L}(\mathbf{x}) + \lambda \left( \mathbf{x}^T \mathbf{x} - 1 \right) $$ where $\lambda$ is a Lagrange multiplier.
We can then solve the following optimization problem: $$ \mathbf{x}^* = \arg \min_{\mathbf{x}} \mathcal{L}^{(a)}(\mathbf{x}, \lambda) $$
- The solution to this problem is given by the following update rule: $$ \mathbf{x} \leftarrow \mathbf{x} - \eta \nabla \mathcal{L}^{(a)}(\mathbf{x}, \lambda) $$ As well as an update rule for the Lagrange multiplier: $$ \lambda \leftarrow \lambda + \eta \nabla_\lambda \mathcal{L}^{(a)}(\mathbf{x}, \lambda) $$ Note that we are doing gradient ascent on the Lagrange multiplier: the point we seek is a saddle of $\mathcal{L}^{(a)}$, minimized over $\mathbf{x}$ but maximized over $\lambda$. This algorithm represents a version of Lagrangian duality.
For our example, we write the gradient of the augmented loss function as $$ \nabla \mathcal{L}^{(a)}(\mathbf{x}, \lambda) = \nabla \mathcal{L}(\mathbf{x}) + 2 \lambda \mathbf{x} $$ and the gradient with respect to the Lagrange multiplier is $$ \nabla_\lambda \mathcal{L}^{(a)}(\mathbf{x}, \lambda) = \mathbf{x}^T \mathbf{x} - 1 $$ In the implementation below, the constant factor of $2$ is absorbed into $\lambda$.
class GradientDescentLagrange(BaseOptimizer):
    """A multivariate gradient descent optimizer with Lagrange multipliers"""
    def __init__(self, loss, lam, **kwargs):
        super().__init__(loss, **kwargs)
        self.lam = lam  # initial value of the Lagrange multiplier

    def update(self, X):
        grad = self.loss.grad(X)
        # Gradient ascent on the multiplier: its gradient is the constraint violation
        self.lam = self.lam + self.lr * (np.einsum('ij,ij->i', X, X) - 1)
        # Gradient descent on x (the factor of 2 is absorbed into lam)
        return X - self.lr * (grad + self.lam[..., None] * X)
# Initialize optimizer
optimizer = GradientDescentLagrange(loss, lr=0.1, lam=0.5, max_iter=2000, tol=1e-6, random_state=0, store_history=True)
# Initialize starting point
X0 = 6 * np.random.random(size=(100, 2)) - 3
# Fit optimizer
optimizer.fit(X0.copy())
<__main__.GradientDescentLagrange at 0x17f44e6d0>
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
xx, yy = np.meshgrid(x, y)
X = np.array([xx.ravel(), yy.ravel()]).T
Z = loss.loss(X)
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z)
plt.plot(*X0.T, '.')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Initial Guesses')
plt.figure(figsize=(8, 8))
Xs = np.array(optimizer.Xs)
plt.scatter(X[:, 0], X[:, 1], c=Z, zorder=-2)
plt.plot(Xs[0, :, 0], Xs[0, :, 1], '.');
plt.plot(Xs[:, :, 0], Xs[:, :, 1], zorder=-1);
plt.plot(Xs[-1, :, 0], Xs[-1, :, 1], '.b', markersize=10);
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Final Guesses')
Text(0.5, 1.0, 'Final Guesses')
## Make an interactive video
from ipywidgets import interact, interactive, fixed, interact_manual, Layout
import ipywidgets as widgets
Xs = np.array(optimizer.Xs)
def plotter(i):
    plt.figure(figsize=(6, 6))
    plt.scatter(X[:, 0], X[:, 1], c=Z)
    plt.plot(Xs[i, :, 0], Xs[i, :, 1], '.')
    # Optional overlay of the unit-circle constraint set:
    # unit_circle = np.random.random((2000, 2)) - 0.5
    # unit_circle = unit_circle / np.linalg.norm(unit_circle, axis=1, keepdims=True)
    # plt.plot(*unit_circle.T, '.k', markersize=0.5)
    plt.xlim([-3, 3])
    plt.ylim([-3, 3])
    plt.axis('off')
    plt.show()
interact(
plotter,
i=widgets.IntSlider(0, 0, Xs.shape[0] - 1, 1, layout=Layout(width='800px'))
)
interactive(children=(IntSlider(value=0, description='i', layout=Layout(width='800px'), max=2000), Output()), …
<function __main__.plotter(i)>
First-order methods widely-used today¶
- Stochastic gradient descent remains surprisingly effective. Remember that the stochasticity in ML comes from approximating our loss landscape from a subset of the data, not from the noise term in the update rule
- Adam, Adagrad, RMSprop, etc. are all variants of stochastic gradient descent. We track gradients over time, and use them to adapt the learning rate and gradient in various ways.
Second order methods¶
- So far, we have only been using methods in which we take the first derivative of the loss function.
- As we saw in the 1D case, we can converge much faster by using information from the second derivative of the loss function to adjust our learning rate
For a loss function $\mathcal{L}(\mathbf{x}) \in \mathbb{R}$ computed at location $\mathbf{x} \in \mathbb{R}^d$, the second derivative is given by the Hessian matrix, $\mathbf{H} = \nabla^2 \mathcal{L}(\mathbf{x}) \in \mathbb{R}^{d \times d}$
Writing this out in matrix form, we have $$ \mathbf{H} = \begin{bmatrix} \frac{\partial^2 \mathcal{L}(\mathbf{x})}{\partial x_1^2} & \frac{\partial^2 \mathcal{L}(\mathbf{x})}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 \mathcal{L}(\mathbf{x})}{\partial x_1 \partial x_d} \\ \frac{\partial^2 \mathcal{L}(\mathbf{x})}{\partial x_2 \partial x_1} & \frac{\partial^2 \mathcal{L}(\mathbf{x})}{\partial x_2^2} & \cdots & \frac{\partial^2 \mathcal{L}(\mathbf{x})}{\partial x_2 \partial x_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 \mathcal{L}(\mathbf{x})}{\partial x_d \partial x_1} & \frac{\partial^2 \mathcal{L}(\mathbf{x})}{\partial x_d \partial x_2} & \cdots & \frac{\partial^2 \mathcal{L}(\mathbf{x})}{\partial x_d^2} \end{bmatrix} $$
Recalling our intuition from the 1D case, we can see that the Hessian matrix is a measure of how steeply curved the loss function appears at a given point. If the Hessian is positive definite, then the loss function is locally convex, and we can use the second derivative to adjust our learning rate.
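For instance, a quick local-convexity test is to check the signs of the Hessian eigenvalues. A small illustrative example (this matrix is made up, not taken from the landscape):
# A point is in a locally convex region when all Hessian eigenvalues are positive
H_example = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
print(np.linalg.eigvalsh(H_example))                # both positive
print(np.all(np.linalg.eigvalsh(H_example) > 0))    # True -> locally convex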
The Hessian of the Random Gaussian well landscape¶
Recall that our test landscape is a sum of Gaussian wells, $$ \mathcal{L} = \sum_{p=1}^P \mathcal{L}_p = -\sum_{p=1}^P \exp\left(-\frac{(\mathbf{x} - \boldsymbol{\mu}_{p})^2}{2\sigma^2}\right) $$ where $\mathbf{x} \in \mathbb{R}^d$ is the location in the landscape, $\boldsymbol{\mu}_p \in \mathbb{R}^d$ is the center of the $p$th well, and $\sigma$ is the width of the well.
The gradient of the loss function is given by $$ \nabla \mathcal{L} = \sum_{p=1}^P \exp\left(-\frac{(\mathbf{x} - \boldsymbol{\mu}_{p})^2}{2\sigma^2}\right) \frac{(\mathbf{x} - \boldsymbol{\mu}_{p})}{\sigma^2} $$
The Hessian matrix is given by $$ \mathbf{H} = \sum_{p=1}^P \exp\left(-\frac{(\mathbf{x} - \boldsymbol{\mu}_{p})^2}{2\sigma^2}\right) \left(\frac{1}{\sigma^2} \mathbf{I} - \frac{(\mathbf{x} - \boldsymbol{\mu}_{p})(\mathbf{x} - \boldsymbol{\mu}_{p})^T}{\sigma^4}\right) $$ Note the order of the last product, which is an outer product and not a contracting inner product.
We add our derived implementation of the Hessian matrix to our loss object using subclassing.
class RandomLossLandscapeWithHessian(RandomLossLandscape):
    """
    Subclass of the Random Gaussian Loss Landscape that adds an analytic Hessian calculation
    """
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Everything is passed to the RandomLossLandscape constructor

    def _hessian_gaussian_well(self, X, width=1):
        """Hessian of a single gaussian well, for displacements X of shape (n, d, n_wells)"""
        oprod = np.einsum('ijk,imk->ijmk', X, X)     # outer products, (n, d, d, n_wells)
        iden = np.eye(X.shape[1])[None, ..., None]   # identity, (1, d, d, 1)
        # The factors of 2 and 4 come from differentiating exp(-(u / w)^2) twice,
        # consistent with the convention used in _gaussian_well
        return (-2 * iden + 4 * oprod / width**2) / width**2 * self._gaussian_well(X, width)[:, None, None, :]

    def hessian(self, X):
        """Weighted sum of the per-well Hessians, shape (n, d, d)"""
        return np.einsum(
            '...i,i->...',
            self._hessian_gaussian_well(X[..., None] - self.locs.T[None, :], self.widths),
            self.coeffs
        )
loss = RandomLossLandscapeWithHessian(random_state=0, n_wells=8)
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
xx, yy = np.meshgrid(x, y)
X = np.array([xx.ravel(), yy.ravel()]).T
Z = loss(X) # same as loss.loss(X) because class is callable
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z)
plt.plot(xx.ravel()[Z.argmin()], yy.ravel()[Z.argmin()], '*r', markersize=10)
plt.axis('off')
plt.figure(figsize=(8, 8))
eig_dirs = np.linalg.eigvalsh(loss.hessian(X))
signed_log = lambda x: np.sign(x) * np.log(1 + np.abs(x))
plt.scatter(X[:, 0], X[:, 1], c=np.log10(np.abs(np.mean(eig_dirs, axis=1))))
plt.plot(xx.ravel()[Z.argmin()], yy.ravel()[Z.argmin()], '*r', markersize=10)
plt.axis('off')
plt.figure(figsize=(8, 8))
x = np.linspace(-3, 3, 15)
y = np.linspace(-3, 3, 15)
xx, yy = np.meshgrid(x, y)
X = np.array([xx.ravel(), yy.ravel()]).T
Z = loss.loss(X)
plt.quiver(X[:, 0], X[:, 1], loss.grad(X)[:, 0], loss.grad(X)[:, 1], Z, scale=1e0)
# plt.streamplot(x, y, loss.grad(X)[:, 0].reshape(100, 100), loss.grad(X)[:, 1].reshape(100, 100), color=Z.reshape(100, 100))
plt.axis('off')
(-3.3, 3.3, -3.3, 3.3)
The multivariate Newton's method¶
Newton's method is a second order method for unconstrained optimization
The update rule is given by $$ \mathbf{x} \leftarrow \mathbf{x} - \mathbf{H}^{-1} \nabla \mathcal{L}(\mathbf{x}) $$ where $\mathbf{H}$ is the Hessian evaluated at the current point $\mathbf{x}$
We can interpret this as scaling the size of a gradient descent step by the curvature of the loss function, and rotating the step to align with the eigenvectors of the Hessian matrix. Aligning with the eigenvectors of the Hessian results in straighter paths to the minimum.
class MultivariateNewtonsMethod(BaseOptimizer):
    """A multivariate Newton's method optimizer"""
    def update(self, X):
        grad = self.loss.grad(X)
        # Adding a small ridge to the diagonal of the Hessian helps with
        # numerical stability when inverting
        hess = self.loss.hessian(X) + 1e-12 * np.eye(X.shape[1])[None, ...]
        # Invert the Hessian of each batch element
        ihess = np.linalg.inv(hess)
        # Because we want to operate on the last axis, we use einsum for the
        # batched matrix-vector product (note that Newton's method takes the
        # full step and does not use the learning rate)
        return X - np.einsum('ink,ik->in', ihess, grad, optimize=True)
# Initialize optimizer
optimizer = MultivariateNewtonsMethod(loss, max_iter=1000, tol=1e-6, random_state=0, store_history=True)
# Initialize starting point
X0 = 6 * np.random.random(size=(100, 2)) - 3
# X0 = np.random.random(size=(100, 2))
# Fit optimizer
optimizer.fit(X0.copy())
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
xx, yy = np.meshgrid(x, y)
X = np.array([xx.ravel(), yy.ravel()]).T
Z = loss.loss(X)
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z)
plt.plot(*X0.T, '.')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Initial Guesses')
plt.figure(figsize=(8, 8))
Xs = np.array(optimizer.Xs)
plt.scatter(X[:, 0], X[:, 1], c=Z, zorder=-2)
plt.plot(Xs[0, :, 0], Xs[0, :, 1], '.');
plt.plot(Xs[:, :, 0], Xs[:, :, 1], zorder=-1);
plt.plot(Xs[-1, :, 0], Xs[-1, :, 1], '.b', markersize=10);
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Final Guesses')
Text(0.5, 1.0, 'Final Guesses')
plt.figure()
plt.plot(optimizer.losses, color=(0.7, 0.7, 0.7), lw=1, alpha=0.2)
plt.plot(np.mean(optimizer.losses, axis=1), 'k', lw=3)
plt.xlabel('Iteration')
plt.ylabel('Loss')
Text(0, 0.5, 'Loss')
## Make an interactive video
from ipywidgets import interact, interactive, fixed, interact_manual, Layout
import ipywidgets as widgets
Xs = np.array(optimizer.Xs)
def plotter(i):
    plt.figure(figsize=(6, 6))
    plt.scatter(X[:, 0], X[:, 1], c=Z)
    plt.plot(Xs[i, :, 0], Xs[i, :, 1], '.')
    plt.xlim([-3, 3])
    plt.ylim([-3, 3])
    plt.axis('off')
    plt.show()
interact(
plotter,
i=widgets.IntSlider(0, 0, Xs.shape[0] - 1, 1, layout=Layout(width='800px'))
)
interactive(children=(IntSlider(value=0, description='i', layout=Layout(width='800px'), max=929), Output()), _…
<function __main__.plotter(i)>
The results are underwhelming¶
Newton's method behaves erratically here because our objective function is non-convex, so convergence is not guaranteed. Notice that points close to the deepest well had no problem; it's the points in between wells that struggle, because these are nearly flat regions with extremely low curvature.
The Hessian matrix is not positive definite in some regions, so Newton's method can overshoot or even step uphill there. We can use quasi-Newton methods, but they are likewise not guaranteed to converge to the global minimum.
We can try adding momentum to the update rule
The spectrum of the Hessian¶
- The eigenvalues of the Hessian matrix are a measure of how steeply the loss function changes in each direction
- For a harmonic potential, they give us the stiffness of the potential in each direction
Taylor expand the potential around a nearby point $\mathbf{x}'$ $$ \mathcal{L}(\mathbf{x}) \approx \mathcal{L}(\mathbf{x'}) + \nabla \mathcal{L}(\mathbf{x'})^T (\mathbf{x} - \mathbf{x'}) + \frac{1}{2} (\mathbf{x} - \mathbf{x'})^T \mathbf{H}(\mathbf{x'}) (\mathbf{x} - \mathbf{x'}) + ... $$ where $\mathbf{x} - \mathbf{x'} \in \mathbb{R}^d$ is a small displacement from the expansion point.
Imagine that $\mathbf{x} - \mathbf{x'}$ is a small displacement along one of the eigenvector axes of a multivariate harmonic potential. The size of the second-order term is directly proportional to the eigenvalue associated with that direction.
But different directions have different eigenvalues, so the size of the second-order term depends on the direction of the displacement.
For very oblong potentials, the difference along various directions can be dramatic. This can be quantified by the condition number, the ratio of the largest to smallest eigenvalue magnitudes: $$ \kappa(\mathbf{H}) = \frac{\left|\lambda_{\text{max}}(\mathbf{H})\right|}{\left|\lambda_{\text{min}}(\mathbf{H})\right|} $$
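We can evaluate this directly from our landscape's batched Hessian. A short sketch (the two probe points are arbitrary choices):
# Condition number of the Hessian at a couple of probe points
X_probe = np.array([[0.0, 0.0], [2.0, 2.0]])
eigs = np.linalg.eigvalsh(loss.hessian(X_probe))  # shape (2, 2), ascending order
kappa = np.abs(eigs).max(axis=1) / np.abs(eigs).min(axis=1)
print(kappa)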
Optimally, we'd have a separate learning rate for each direction in space, but this becomes computationally expensive in high dimensions. Instead, we can use the condition number to adapt the learning rate.
One option is to adjust the learning rate based on the condition number
$$ \mathbf{x} \leftarrow \mathbf{x} - \frac{\eta}{\kappa(\mathbf{H}) + \epsilon} \nabla \mathcal{L}(\mathbf{x}) $$ where $\eta$ is the initial learning rate, $\kappa(\mathbf{H})$ is the condition number of the Hessian matrix, and $\epsilon$ is a small constant to prevent division by zero.
Another option is to adjust the learning rate based on only the largest Hessian eigenvalue (we can use the power method to find this quickly) $$ \mathbf{x} \leftarrow \mathbf{x} - \frac{\eta}{\lambda_{\text{max}}} \nabla \mathcal{L}(\mathbf{x}) $$ where $\lambda_{\text{max}}$ is the largest eigenvalue of the Hessian matrix
Another option is to scale using the smallest eigenvalue as well: $$ \mathbf{x} \leftarrow \mathbf{x} - \frac{\eta}{|\lambda_{\text{max}}| + |\lambda_{\text{min}}|} \nabla \mathcal{L}(\mathbf{x}) $$ where $\lambda_{\text{min}}$ is the smallest eigenvalue of the Hessian matrix
class GradientDescentAdaptive(BaseOptimizer):
    """A gradient descent optimizer that scales the learning rate by the local curvature"""
    def __init__(self, loss, **kwargs):
        super().__init__(loss, **kwargs)

    def update(self, X):
        grad = self.loss.grad(X)
        hess = self.loss.hessian(X) + 1e-12 * np.eye(X.shape[1])[None, ...]
        # Largest absolute Hessian eigenvalue for each batch element, (n_batch, 1);
        # the small constant prevents division by zero in flat regions
        eigvals = np.linalg.eigvalsh(hess)
        scale = np.abs(eigvals).max(axis=1, keepdims=True) + 1e-12
        return X - self.lr * grad / scale
# Initialize optimizer
optimizer = GradientDescentAdaptive(loss, max_iter=2000, tol=1e-6, random_state=0, store_history=True)
# Initialize starting point
X0 = 6 * np.random.random(size=(100, 2)) - 3
# Fit optimizer"
optimizer.fit(X0.copy())
<__main__.GradientDescentAdaptive at 0x17a746a10>
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
xx, yy = np.meshgrid(x, y)
X = np.array([xx.ravel(), yy.ravel()]).T
Z = loss.loss(X)
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z)
plt.plot(*X0.T, '.')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Initial Guesses')
plt.figure(figsize=(8, 8))
Xs = np.array(optimizer.Xs)
plt.scatter(X[:, 0], X[:, 1], c=Z, zorder=-2)
plt.plot(Xs[0, :, 0], Xs[0, :, 1], '.');
plt.plot(Xs[:, :, 0], Xs[:, :, 1], zorder=-1);
plt.plot(Xs[-1, :, 0], Xs[-1, :, 1], '.r', markersize=10);
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Final Guesses')
Text(0.5, 1.0, 'Final Guesses')
## Make an interactive video
from ipywidgets import interact, interactive, fixed, interact_manual, Layout
import ipywidgets as widgets
Xs = np.array(optimizer.Xs)
def plotter(i):
    plt.figure(figsize=(6, 6))
    plt.scatter(X[:, 0], X[:, 1], c=Z)
    plt.plot(Xs[i, :, 0], Xs[i, :, 1], '.')
    plt.xlim([-3, 3])
    plt.ylim([-3, 3])
    plt.axis('off')
    plt.show()
interact(
plotter,
i=widgets.IntSlider(0, 0, Xs.shape[0] - 1, 1, layout=Layout(width='800px'))
)
interactive(children=(IntSlider(value=0, description='i', layout=Layout(width='800px'), max=2000), Output()), …
<function __main__.plotter(i)>
Quasi-Newton methods¶
Computing the Hessian matrix is computationally expensive in high dimensions, and computing its inverse is even more so
Recall that solving a linear system took $\mathcal{O}(N^3)$ operations due to matrix inversion
We can reduce this cost by approximating the Hessian matrix with a low-rank approximation
A popular algorithm in this class is L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm). This uses an iterative method to approximate the inverse Hessian at each step
This requires an interior loop within our optimization, so we hope that the inner approximation loop converges quickly and terminates after a few steps
class LBFGS(BaseOptimizer):
    """A limited-memory quasi-Newton optimizer. For clarity, we form each batch
    element's dense approximate inverse Hessian from a short history of steps and
    gradient changes; production L-BFGS instead uses a two-loop recursion that
    never forms the matrix explicitly.
    """
    def __init__(self, loss, memory=10, **kwargs):
        super().__init__(loss, **kwargs)
        self.memory = memory
        self.grad_hist = []  # history of gradient changes y_k
        self.s_hist = []     # history of steps s_k

    def update(self, X):
        grad = self.loss.grad(X)
        if hasattr(self, 'X_prev'):
            s = X - self.X_prev        # step taken, (n_batch, n_dim)
            y = grad - self.grad_prev  # change in gradient, (n_batch, n_dim)
            if len(self.grad_hist) >= self.memory:
                self.grad_hist.pop(0)
                self.s_hist.pop(0)
            self.grad_hist.append(y)
            self.s_hist.append(s)
        Hk = self.get_inverse_hessian_approximation(X)
        pk = -np.einsum('bij,bj->bi', Hk, grad)  # batched matrix-vector product
        self.X_prev, self.grad_prev = X.copy(), grad.copy()
        return X + self.lr * pk

    def get_inverse_hessian_approximation(self, X):
        """
        Compute the approximate inverse Hessian for each batch element by applying
        the BFGS update formula over the stored (step, gradient-change) history
        """
        n_batch, d = X.shape
        H = np.tile(np.eye(d), (n_batch, 1, 1))  # initial approximation (identity)
        if len(self.s_hist) > 0:
            # Scale the initial approximation using the most recent curvature pair
            s, y = self.s_hist[-1], self.grad_hist[-1]
            gamma = np.einsum('bi,bi->b', s, y) / (np.einsum('bi,bi->b', y, y) + 1e-12)
            H = H * np.abs(gamma)[:, None, None]
        for s, y in zip(self.s_hist, self.grad_hist):  # oldest to newest
            sy = np.einsum('bi,bi->b', s, y)
            # Guard against tiny curvature (our landscape is not convex)
            rho = 1.0 / np.where(np.abs(sy) > 1e-12, sy, 1e-12)
            # BFGS update: H <- (I - rho s y^T) H (I - rho y s^T) + rho s s^T
            V = np.eye(d)[None, ...] - rho[:, None, None] * np.einsum('bi,bj->bij', y, s)
            H = np.transpose(V, (0, 2, 1)) @ H @ V + rho[:, None, None] * np.einsum('bi,bj->bij', s, s)
        return H
loss = RandomLossLandscapeWithHessian(random_state=0, n_wells=8)
# Initialize optimizer
optimizer = LBFGS(loss, lr=0.1, max_iter=2000, tol=1e-6, random_state=0, store_history=True)
# Initialize starting point
X0 = 6 * np.random.random(size=(100, 2)) - 3
# Fit optimizer
optimizer.fit(X0.copy())
<__main__.LBFGS at 0x29c759810>
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
xx, yy = np.meshgrid(x, y)
X = np.array([xx.ravel(), yy.ravel()]).T
Z = loss.loss(X)
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z)
plt.plot(*X0.T, '.')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Initial Guesses')
plt.figure(figsize=(8, 8))
Xs = np.array(optimizer.Xs)
plt.scatter(X[:, 0], X[:, 1], c=Z, zorder=-2)
plt.plot(Xs[0, :, 0], Xs[0, :, 1], '.');
plt.plot(Xs[:, :, 0], Xs[:, :, 1], zorder=-1);
plt.plot(Xs[-1, :, 0], Xs[-1, :, 1], '.r', markersize=10);
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Final Guesses')
Text(0.5, 1.0, 'Final Guesses')
Bakeoff: Which method is best?¶
We can compare the performance of the different methods by running them on the same problem
We compute two metrics: the final loss value, and the total amount of walltime required to reach that loss value
A more accurate measure of performance might count individual operations and memory usage, rather than aggregate walltime, but this should give us an idea of the tradeoffs involved
import time
loss = RandomLossLandscapeWithHessian(random_state=0, n_wells=8)
X0 = 6 * np.random.random(size=(500, 2)) - 3
optimizer_kwargs = {"max_iter": 5000, "lr": 0.1, "tol": 1e-6, "random_state": 0, "store_history": True}
optimizer_list = [
GradientDescent(loss, **optimizer_kwargs),
GradientDescentMomentum(loss, **optimizer_kwargs),
StochasticGradientDescent(loss, noise=0.2, **optimizer_kwargs),
# ProjectedGradientDescent(loss, **optimizer_kwargs),
# GradientDescentLagrange(loss, lam=0.5, **optimizer_kwargs),
MultivariateNewtonsMethod(loss, **optimizer_kwargs),
GradientDescentAdaptive(loss, **optimizer_kwargs),
LBFGS(loss, **optimizer_kwargs),
]
all_losses = []
all_walltimes = []
for optimizer in optimizer_list:
    start_time = time.time()
    optimizer.fit(X0.copy())
    stop_time = time.time()
    final_loss = np.mean(optimizer.losses, axis=1)[-1]
    all_losses.append(final_loss)
    all_walltimes.append(stop_time - start_time)
all_losses = np.array(all_losses)
all_walltimes = np.array(all_walltimes)
optimizer_names = np.array([type(o).__name__ for o in optimizer_list])
plt.figure(figsize=(6, 6))
sorted_losses = np.argsort(all_losses)
plt.barh(optimizer_names[sorted_losses], all_losses[sorted_losses])
plt.xlabel('Final Loss')
plt.ylabel('Optimizer')
plt.figure(figsize=(6, 6))
sorted_times = np.argsort(all_walltimes)
plt.barh(optimizer_names[sorted_times], all_walltimes[sorted_times])
plt.xlabel('Wall Time')
plt.ylabel('Optimizer')
plt.figure(figsize=(6, 6))
plt.scatter(all_losses, all_walltimes)
## label optimizers
for i in range(len(optimizer_names)):
    plt.annotate(optimizer_names[i], (all_losses[i], all_walltimes[i]))
plt.xlabel('Final Loss')
plt.ylabel('Wall Time')
Text(0, 0.5, 'Wall Time')
What to do without any derivative functions?¶
What about calculations with intermediate variables?¶
- Our loss function required a sum over Gaussian wells, and so we had to write elaborate matrix-vector products by hand to compute the gradient and Hessian
Finite difference derivatives¶
- Symbolic derivatives are much more practical than finite differences
- Computing them by hand is usually possible, but tedious
- Finite differences scale unfavorably in high dimensions
Any tricks to do it faster?¶
- Computational tricks for fast multivariate chain rule: Backpropagation, automatic differentiation
- Instead of writing down the full input-output derivative by hand, we can record the sequence of matrix-vector products in a computation graph, and then traverse that graph in reverse to apply the chain rule.
- Exact like symbolic derivatives, but not explicit---we just store the order of the chain rule
- Most complex functions are compositions of matrix-vector products and unary functions like $\sin$, $\tanh$, etc
- Basic idea: the multivariate chain rule can be written as a graph; traversing this graph can be optimized to reduce calls to nodes by caching intermediate results on the forward pass (see the sketch after this list)
- We'll look at tools implementing these ideas behind the scenes soon, but we won't go over the mathematics until the deep learning section of the course.
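As a teaser, here is a minimal sketch of the idea for a single composition (the function and names are illustrative): the forward pass evaluates and caches intermediates, and the backward pass multiplies the local derivatives in reverse order.
# Reverse-mode differentiation of f(x) = sin(x)**2 as a two-node chain
def f_and_grad(x):
    # Forward pass: evaluate and cache intermediate results
    a = np.sin(x)  # node 1
    y = a ** 2     # node 2
    # Backward pass: apply the chain rule in reverse order
    dy_da = 2 * a
    da_dx = np.cos(x)
    return y, dy_da * da_dx

y_val, g = f_and_grad(1.0)
print(g, np.sin(2.0))  # both equal sin(2x) at x = 1, since 2 sin(x) cos(x) = sin(2x)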