Optimization in many dimensionsĀ¶
Preamble: Run the cells below to import the necessary Python packages
This notebook created by William Gilpin. Consult the course website for all content and GitHub repository for raw files and runnable online code.
import numpy as np
from IPython.display import Image, display
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext autoreload
%autoreload 2
## Set nicer colors
plt.rcParams['image.cmap'] = 'PuBu'
plt.rcParams['axes.prop_cycle'] = plt.cycler(color=[[1.0, .3882, .2784]])
plt.rcParams['lines.markersize'] = 10
The autoreload extension is already loaded. To reload it, use: %reload_ext autoreload
Optimization in a complex landscapeĀ¶
Previously, we considered optimization of functions that depend on a single variable. In this notebook, we will consider optimization of functions that depend on many variables.
Our multivariate landscape is a sum of Gaussian wells, $$ \mathcal{L} = \sum_{p=1}^P \mathcal{L}_p = -\sum_{p=1}^P A_p \exp\left(-\frac{(\mathbf{x} - \boldsymbol{\mu}_{p})^2}{2\sigma^2}\right) $$ where $\mathbf{x} \in \mathbb{R}^D$ is the position in the landscape, $\boldsymbol{\mu}_p \in \mathbb{R}^D$ is the location of the $p^{th}$ well, and $\sigma \in \mathbb{R}^D$ is the width of the wells. The weights $\sum_p A_p = 1$, $A_p \geq 0$ are the relative depths of the wells.
Recall that, during optimization, we usually think of our "cost" as being the number of calls to the function we are trying to optimize. Here, to implement this landscape, we will make calculating its value at a single point as fast as possible using vectorization. Since the well positions are all fixed, and the landscape is a sum over them, we we can vectorize the other sum. Additionally, if we want to calculate the landscape value at many points at once, we can use broadcasting, or "batching," because the different evaluations are independent of each other.
class RandomLossLandscapeWithoutGradient:
"""
Creates a random two-dimensional loss landscape with multiple circular gaussian wells
Args:
d (int): number of dimensions for the loss landscape
n_wells (int): number of gaussian wells
random_state (int): random seed
"""
def __init__(self, d=2, n_wells=3, random_state=None):
# Fix the random seed
self.random_state = random_state
np.random.seed(random_state)
# Select random well weights, locations, and widths
self.coeffs = np.random.random(n_wells)
self.coeffs /= np.sum(self.coeffs)
self.locs = np.random.randn(n_wells, d)
self.widths = np.random.rand(n_wells)[None, :]
def _gaussian_well(self, X, width=1):
"""A single gaussian well centered at 0 with width `width`"""
return -np.exp(-np.sum((X / width) ** 2, axis=1))
def loss(self, X):
"""
Compute the loss landscape at points X
Args:
X (np.ndarray): points at which to compute the loss landscape. This should
be of shape (n_batch, n_dim)
width (float): width of the gaussian wells
Returns:
np.ndarray: loss landscape at points X
Notes:
The loss landscape is computed as the sum of the individual gaussian wells
The shape of the argument to np.sum is (n_batch, n_wells)
"""
# return np.einsum(
# '...i,i->...',
# self._gaussian_well(X[..., None] - self.locs.T[None, :], self.widths),
# self.coeffs
# )
# print(X[..., None].shape)
# print(self.locs.T[None, :].shape)
# print(self.widths.shape)
# print((self._gaussian_well(X[..., None] - self.locs.T[None, :])).shape)
# print(self._gaussian_well(X[..., None] - self.locs.T[None, :], self.widths).shape)
return np.sum(
self._gaussian_well(X[..., None] - self.locs.T[None, :], self.widths) * self.coeffs,
axis=1
)
def __call__(self, X):
return self.loss(X)
loss = RandomLossLandscapeWithoutGradient(random_state=0, n_wells=8)
## We have to make a list of points at which to plot the landscape. NumPy's meshgrid is
## a built-in utility for this purpose.
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
xx, yy = np.meshgrid(x, y)
X = np.array([xx.ravel(), yy.ravel()]).T
print(X.shape)
Z = loss(X) # same as loss.loss(X) because class is callable
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z)
plt.plot(xx.ravel()[Z.argmin()], yy.ravel()[Z.argmin()], '*', markersize=20)
plt.axis('off')
(10000, 2)
(-3.3, 3.3, -3.3, 3.3)
Recall that many optimization routines requre the gradient of the landscape. We can analytically calculate the gradient of this landscape
$$ \mathcal{L} = \sum_{p=1}^P \mathcal{L}_p = -\sum_{p=1}^P \exp\left(-\frac{(\mathbf{x} - \boldsymbol{\mu}_{p})^2}{2\sigma^2}\right) $$
$$ \nabla \mathcal{L} = \sum_{p=1}^P \frac{(\mathbf{x} - \boldsymbol{\mu}_{p})}{\sigma^2} \exp\left(-\frac{(\mathbf{x} - \boldsymbol{\mu}_{p})^2}{2\sigma^2}\right) $$
We can implement both the landscape and its gradient within a single loss object. As we saw in univariate optimization, we will likely find ourselves calling the loss function frequently, and so we will try to make our implementation as efficient as possible. In our implementation, we vectorize the computation of the loss and gradient over the $P$ wells using np.einsum
Some considerations when performing multivariate optimizationĀ¶
Having a smooth landscape is important for many optimization algorithms to work well, because gradient information is required for most local search methods.
Because differentiability is a concern, we expect that flatter regions in the landscape will pose challenges. In our Gaussian landscape, the regions far away from the $P$ centers will be flat, and so we expect optimizers initialized in these regions to struggle on this problem.
Computing analytic gradients is hardĀ¶
- Computing analytic gradients is hard, and often impossible. We can use finite differences to approximate the gradient along each dimension:
$$ \frac{\partial f}{\partial x_i} \approx \frac{f(x_1, \ldots, x_i + \epsilon, \ldots, x_D) - f(x_1, \ldots, x_i - \epsilon, \ldots, x_D)}{2\epsilon} $$
- Finite difference gradients are computationally expensive: In $d$ dimensions, we need to evaluate the function $\mathcal{O}(d)$ times to compute the gradient. This is impractical for high-dimensional problems, but it is acceptable for our 2D landscape.
- Symbolic gradients are useful, but prone to errors
- We use a gradcheck function to mitigate the risk of errors in our gradients.
- (Later in the course): Faster gradient estimation for more complex loss functions with automatic differentiation (the symbolic chain rule for arbitrary compositions of primitive functions).
def gradcheck(loss, x, eps=1e-9):
"""
This function computes the exact gradient and an approximate gradient using
finite differences
Args:
loss (callable): loss function. Must have a method `grad` that returns the
analytic gradient
x (np.ndarray): input to the loss function
eps (float): epsilon for finite differences
Returns:
grad (np.ndarray): analytic gradient
grad_num (np.ndarray): numerical gradient
"""
x = np.array(x)
grad = loss.grad(x)
grad_num = list()
for i in range(x.shape[0]):
x1, x2 = x.copy(), x.copy()
x1[i] += eps
x2[i] -= eps
grad_num.append((loss(x1) - loss(x2)) / (2 * eps))
grad_num = np.array(grad_num)
return grad, grad_num
grad, grad_num = gradcheck(loss, np.random.randn(2))
print(f"True gradient: {grad.squeeze()}\nApproximate gradient: {grad_num.squeeze()}")
print(np.allclose(grad, grad_num, atol=1e-1))
True gradient: [ 0.00512768 -0.0059314 ] Approximate gradient: [ 0.00512768 -0.0059314 ] True
Why not use global maximization?Ā¶
Why bother using gradient descent when we can just check every point in our system's domain? That's literally how we generated the plot above.
For a continuous field, we need to pick some minimum length scale over which we don't expect the objective to change rapidly. For discrete optimization, we just try all combinations
Assume we need $N$ points to sample along any linear dimension. The number of queries to the objective function scales as $N^d$, where $d$ is the dimensionality
- For the example we just used, we sampled $N = 100$ points along each dimension, resulting in $100^2 = 10^4$ function calls.
- Assuming $100$ points per axis, when $d = 3$ we have $10^6$ while for $d = 7$ it takes $10^{14}$, etc
- Let's assume it takes $10$ $\mu s$ per evaluation. The 2D global search takes $0.1$ s, the 3D search takes $10$ s, the $d = 5$ search takes $1.15$ days, and the $d = 7$ search takes $32$ years
What is a typical value of $d$? OpenAI's DallE image generation model contains $12$ billion parameters ($d \approx 10^{10}$).
x = np.linspace(-3, 3, 10)
y = np.linspace(-3, 3, 10)
xx, yy = np.meshgrid(x, y)
X = np.array([xx.ravel(), yy.ravel()]).T
Z = loss(X) # same as loss.loss(X) because class is callable
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z, s=2000)
plt.plot(xx.ravel()[Z.argmin()], yy.ravel()[Z.argmin()], '*r', markersize=10)
plt.axis('off')
(-3.3, 3.3, -3.3, 3.3)
Optimizing in many dimensionsĀ¶
- We will now implement a series of optimization algorithms to find the minima of our Gaussian landscape
- We will first define a base class for our optimizers, which will contain the common functionality of all optimizers. Many optimization algorithms are iterative, meaning that they start out at some (usually randomly-chosen) point on the landscape, and then incrementally update their position in order to reach the minimum.
- The constructor will take a loss function itself, a learning rate that controls the rate of iterative updating, a maximum number of iterations, a tolerance that determines how close the optimization needs to get to a minimum to stop, a random seed, and a flag to store the history of steps taken during optimization
- The base class will contain a
fit
method that will perform the optimization, and aupdate
method that will be implemented by each subclass
class BaseOptimizer:
"""A Multivariate Gradient Descent Optimizer
Args:
loss (callable): loss function. Must have a method `grad` that returns the
analytic gradient
lr (float): learning rate
max_iter (int): maximum number of iterations
tol (float): tolerance for stopping criterion
random_state (int): random seed
store_history (bool): whether to store the optimization trajectory
"""
def __init__(self, loss, lr=0.1, max_iter=1000, tol=1e-6, random_state=None, store_history=False):
self.loss = loss
self.lr = lr
self.max_iter = max_iter
self.tol = tol
self.random_state = random_state
np.random.seed(random_state)
self.store_history = store_history
if self.store_history:
self.Xs = []
self.losses = []
def fit(self, X):
"""Fit the optimizer to the data
Args:
X (np.ndarray): initial guess for the optimizer. Shape (n_dim,) or
(n_batch, n_dim) if we have multiple initial points
Returns:
self
"""
self.X = X
if self.store_history:
self.Xs = [self.X.copy()]
self.losses = [self.loss(X)]
for i in range(self.max_iter):
self.X = self.update(self.X)
if self.store_history:
self.Xs.append(self.X.copy())
self.losses.append(self.loss(self.X))
# Stop early if loss is not decreasing any more for *any* batch elements
if np.linalg.norm(self.loss.grad(self.X)) < self.tol:
break
return self
def update(self, X):
raise NotImplementedError("Implement this method in a subclass")
Multivariate gradient descentĀ¶
- A first-order optimization method because it only requires the first partial derivatives of the objective function
- If we assume that our current position is $\mathbf{x}$, then the gradient of the objective function at this point is $\nabla \mathcal{L}(\mathbf{x})$
- The gradient descent update rule becomes $$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla \mathcal{L}(\mathbf{x})$$ where $\eta$ is a step size parameter we call the "learning rate" that determines the size of the step we take in the direction of the gradient
class GradientDescent(BaseOptimizer):
"""A Multivariate Gradient Descent Optimizer"""
def update(self, X):
return X - self.lr * self.loss.grad(X)
# Initialize optimizer
optimizer = GradientDescent(loss, lr=4.3/10000, max_iter=2000, tol=1e-6, random_state=0, store_history=True)
# Initialize many starting points
X0 = 6 * np.random.random(size=(100, 2)) - 3
# Fit optimizer
optimizer.fit(X0.copy())
<__main__.GradientDescent at 0x17fc7b210>
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
xx, yy = np.meshgrid(x, y)
X = np.array([xx.ravel(), yy.ravel()]).T
Z = loss.loss(X)
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z)
plt.plot(*X0.T, '.k')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Initial Guesses')
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], c=Z)
plt.plot(*optimizer.X.T, '.')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.axis('off')
plt.title('Final Guesses')
Text(0.5, 1.0, 'Final Guesses')
We next plot the loss versus time for each of the starting points.
plt.figure()
plt.plot(optimizer.losses, color=(0.7, 0.7, 0.7), lw=1, alpha=0.2)
plt.plot(np.mean(optimizer.losses, axis=1), 'k', lw=3)
plt.xlabel('Iteration')
plt.ylabel('Loss')
Text(0, 0.5, 'Loss')