Convex Optimization¶
Preamble: Run the cells below to import the necessary Python packages
This notebook was created by William Gilpin. Consult the course website for all content, and the GitHub repository for raw files and runnable online code.
import numpy as np
from IPython.display import Image, display
import matplotlib.pyplot as plt
%matplotlib inline
Optimization¶
- The cost function, fitness, or empirical risk is usually well-defined
- Error of a neural network on training data
- The fitness of a genotype
- The stability of a folded protein configuration
- Whether a given configuration is a true ground state or metastable
- Loss landscapes in ML are approximated by finite data (empirical risk); ideally with more data the empirical risk converges to the true risk
- If we can query every point in input space, we can guarantee that we've found a global minimum
- If the function is convex, we can step towards minima in a principled way
- If the function is non-convex, we either have to perform many queries, or use external information to refine our search. Non-convex optimization over binary variables is NP-hard (difficulty scales exponentially with problem size)
- That means it's in the same difficulty class as Donkey Kong Country
xx = np.linspace(-1, 1, 500)
f_con = lambda x: 0.5 * x**2 - 0.2
f_noncon = lambda x : x**2 + 0.5 * np.sin(30 * x) - 0.2
plt.figure()
plt.plot(xx, f_con(xx))
plt.title('A convex optimization problem')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.figure()
plt.plot(xx, f_noncon(xx))
plt.title('A non-convex optimization problem')
plt.xlabel('x')
plt.ylabel('f(x)')
We've already seen an example of this: in the Conjugate Gradient Method, we took steps towards a fixed point representing the least-squares solution of an unconstrained linear problem.
One dimensional optimization: fixed point methods and rootfinding¶
- I'm given some function $f(x)$, and I want to calculate the global minimum (if convex), or at least the nearest local minimum to a starting point (if non-convex)
- If we use our standard approach of finding the nearest point at which $f' = 0$, we reduce the calculation to rootfinding:
$$ x^* = \arg \min_x f(x) \implies f'(x^*) = 0 $$
There are a few considerations when performing rootfinding:
- If the function $f(x)$ has a complex form, we can't necessarily find the roots analytically
- If we have a polynomial with several roots, we can find all of them simultaneously using the Durand-Kerner method, sketched below
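As a concrete illustration, here is a minimal sketch of the Durand-Kerner iteration, assuming a monic polynomial with distinct roots; the function name, complex seed, and tolerance below are illustrative choices, not part of the course code.
def durand_kerner(coeffs, max_iter=100, tol=1e-10):
    """Find all roots of a monic polynomial simultaneously.

    coeffs: coefficients in descending order, e.g. [1, -3, 3, -5]
        for x^3 - 3x^2 + 3x - 5 (the leading coefficient must be 1)
    """
    n = len(coeffs) - 1
    ## Seed the estimates with powers of a complex number that is neither
    ## real nor a root of unity (a standard choice for this method)
    roots = (0.4 + 0.9j) ** np.arange(n)
    for _ in range(max_iter):
        new_roots = roots.copy()
        for i in range(n):
            others = np.delete(roots, i)
            ## Weierstrass correction: divide p(z_i) by the product of
            ## distances to all other current root estimates
            new_roots[i] = roots[i] - np.polyval(coeffs, roots[i]) / np.prod(roots[i] - others)
        if np.max(np.abs(new_roots - roots)) < tol:
            return new_roots
        roots = new_roots
    return roots

durand_kerner([1, -3, 3, -5])  # one real root near 2.587, plus a conjugate pair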
Gradient descent¶
- A first-order method
- Rolling down a hill
- Requires an estimate of the derivative of the loss function. Minimizing the loss is equivalent to finding the root of its derivative
- Requires local information only, rather than global information (querying points all over the landscape)
Starting from an arbitrary point on the landscape, take steps along the direction of steepest descent
$$ x \leftarrow x - \eta \dfrac{df}{dx} $$
The learning rate $\eta$ is a "hyperparameter" that we choose based on problem knowledge. Too large, and we can get stuck in oscillating solutions around the optimum. Too small, and the system takes a long time to converge.
class GradientDescentOptimizer:

    def __init__(self, learning_rate=0.1, max_iter=1000, tolerance=1e-6, store_history=False):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.tolerance = tolerance
        self.store_history = store_history
        if self.store_history:
            self.history = list()

    def optimize(self, df, x0):
        """Minimize a function, given a callable df for its gradient and a starting point x0."""
        x = x0
        for i in range(self.max_iter):
            ## Step along the negative gradient, scaled by the learning rate
            x_new = x - self.learning_rate * df(x)
            ## Stop once successive iterates are closer than the tolerance
            if np.linalg.norm(x_new - x) < self.tolerance:
                break
            x = x_new
            if self.store_history:
                self.history.append(x_new)
        return x
## A simple convex loss (commented out; try swapping it in)
# loss = lambda x: 0.5 * x**2 - 0.2
# loss_grad = lambda x: x

## A non-convex loss and its gradient
loss = lambda x: x**2 + 0.05 * np.sin(30 * x) - 0.2
loss_grad = lambda x: 2 * x + 0.05 * 30 * np.cos(30 * x)
## A large learning rate oscillates around the optimum; try the commented-out value
# optimizer = GradientDescentOptimizer(store_history=True, learning_rate=2.0)
optimizer = GradientDescentOptimizer(store_history=True, learning_rate=0.005)
x_opt = optimizer.optimize(loss_grad, 0.5)
plt.figure()
plt.plot(xx, loss(xx))
plt.plot(x_opt, loss(x_opt), 'ro')
plt.xlabel('x')
plt.ylabel("Loss")
plt.figure()
plt.plot(optimizer.history)
plt.xlabel('Iteration')
plt.ylabel('x')
loss = lambda x: x**2 + 0.05 * np.sin(30 * x) - 0.2
loss_grad = lambda x: 2 * x + 0.05 * 30 * np.cos(30 * x)
## Solve optimization
optimizer = GradientDescentOptimizer(learning_rate=0.02, store_history=True)
x_opt = optimizer.optimize(loss_grad, 0.4)
## Plot loss
xx = np.linspace(-1, 1, 500)
plt.figure()
plt.plot(xx, loss(xx))
plt.plot(x_opt, loss(x_opt), 'ro')
plt.xlabel('x')
plt.ylabel("Loss")
plt.figure()
plt.plot(optimizer.history)
plt.xlabel('Iteration')
plt.ylabel('x')
## Solve optimization, this time from a different starting point
optimizer = GradientDescentOptimizer(learning_rate=0.02, store_history=True)
x_opt = optimizer.optimize(loss_grad, 0.7)
## Plot loss
xx = np.linspace(-1, 1, 500)
plt.figure()
plt.plot(xx, loss(xx))
plt.plot(x_opt, loss(x_opt), 'ro')
plt.xlabel('x')
plt.ylabel("Loss")
plt.figure()
plt.plot(optimizer.history)
plt.xlabel('Iteration')
plt.ylabel('x')
Modifying gradient descent for non-convex functions¶
We can add momentum to the optimizer, so that it retains a memory of the size and direction of previous steps
$$ v \leftarrow \eta \dfrac{df}{dx} + \alpha\, v $$
$$ x \leftarrow x - v $$
In the units on numerical integration, we will revisit the idea of optimization as solving an initial value problem
class GradientDescentMomentumOptimizer:

    def __init__(self, learning_rate=0.1, alpha=0.9, max_iter=1000, tolerance=1e-6, store_history=False):
        self.learning_rate = learning_rate
        self.alpha = alpha  # momentum coefficient; alpha = 0 recovers plain gradient descent
        self.max_iter = max_iter
        self.tolerance = tolerance
        self.store_history = store_history
        if self.store_history:
            self.history = list()

    def optimize(self, df, x0):
        """Minimize a function, given a callable df for its gradient and a starting point x0."""
        x = x0
        v = 0
        for i in range(self.max_iter):
            ## Blend the previous velocity with the current gradient step
            v = self.alpha * v + self.learning_rate * df(x)
            x_new = x - v
            if np.linalg.norm(x_new - x) < self.tolerance:
                break
            x = x_new
            if self.store_history:
                self.history.append(x_new)
        return x
## A simple convex loss (commented out; try swapping it in)
# loss = lambda x: 0.5 * x**2 - 0.2
# loss_grad = lambda x: x

## A non-convex loss and its gradient
loss = lambda x: x**2 + 0.05 * np.sin(30 * x) - 0.2
loss_grad = lambda x: 2 * x + 0.05 * 30 * np.cos(30 * x)
## Solve optimization
optimizer = GradientDescentMomentumOptimizer(learning_rate=0.3, alpha=0.2, store_history=True)
x_opt = optimizer.optimize(loss_grad, 0.5)
## Plot loss
xx = np.linspace(-1, 1, 500)
plt.figure()
plt.plot(xx, loss(xx))
plt.plot(x_opt, loss(x_opt), 'ro')
plt.xlabel('x')
plt.ylabel("Loss")
plt.figure()
plt.plot(optimizer.history)
plt.xlabel('Iteration')
plt.ylabel('x')
Second-order methods and Newton's method¶
Can we improve convergence using information about the local landscape geometry?
Newton's method¶
Suppose that our initial guess is $x_k$ and our objective function is $f(x)$. We perform a second-order Taylor expansion around the guess, $$ f(x_k + h) \approx f(x_k) + f'(x_k)\, h + \frac{1}{2} f''(x_k)\, h^2. $$ We want to minimize this approximated local function in $h$, $$ 0 = \frac{\mathrm{d}}{\mathrm{d}h}\left( f(x_k) + f'(x_k)\, h + \frac{1}{2} f''(x_k)\, h^2 \right) = f'(x_k) + f''(x_k)\, h. $$ Solving this equation, we arrive at $$ h^* = -\frac{f'(x_k)}{f''(x_k)}. $$
Having optimized the approximant, we now update the position of $x$, $$ x_{k+1} = x_k + h^* = x_k - \frac{f'(x_k)}{f''(x_k)}. $$
In a single step of Newton's method, we approximate the function locally as a parabola, and we then jump to the optimum of that parabolic approximant. Newton's method therefore consists of optimizing a global function by optimizing a series of local approximants.
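Here is a minimal sketch of that update loop, assuming we can evaluate $f'$ and $f''$ directly; the function name and stopping criterion are illustrative.
def newton_minimize(df, d2f, x0, max_iter=100, tol=1e-10):
    """Minimize a function via Newton's method, given callables for f' and f''."""
    x = x0
    for _ in range(max_iter):
        ## Jump to the minimum of the local parabolic approximant
        x_new = x - df(x) / d2f(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

## On the convex loss from earlier, f(x) = 0.5 x^2 - 0.2, the parabolic
## approximant is exact, so Newton's method lands on the minimum in one step
newton_minimize(lambda x: x, lambda x: 1.0, x0=0.5)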
What if we don't know the derivatives of $f(x)$?¶
- Replacing the exact derivatives with approximations built from function evaluations gives what is sometimes called the "secant" method
- In many types of problems, we are best served by having a loss function with a differentiable form
- As a fallback, we can use finite difference operators to approximate $f'$ and $f''$
$$ f'(x) \approx \dfrac{f(x + \Delta x / 2) - f(x - \Delta x / 2)}{\Delta x} $$ $$ f''(x) \approx \dfrac{f(x + \Delta x) - 2 f(x) + f(x - \Delta x)}{\Delta x^2} $$ where $\Delta x / x \ll 1$
Here we have used central finite differences, but we could also use forward finite differences $$ f'(x) \approx \dfrac{f(x + \Delta x) - f(x)}{\Delta x} $$ $$ f''(x) \approx \dfrac{f(x + 2 \Delta x) - 2 f(x + \Delta x) + f(x)}{\Delta x^2} $$ Or, analogously, backwards finite differences.
There also exist higher-order finite difference approximations, as well as multivariate generalizations. Notice that the stencil coefficients for the first and second derivatives are binomial coefficients, i.e. rows of Pascal's triangle (with alternating signs).
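Combining these stencils with Newton's method gives a derivative-free variant. The sketch below uses the central-difference formulas above; the function name and step size $\Delta x$ are illustrative choices.
def newton_minimize_fd(f, x0, dx=1e-5, max_iter=100, tol=1e-8):
    """Newton's method with central finite differences standing in for f' and f''."""
    x = x0
    for _ in range(max_iter):
        d1 = (f(x + dx / 2) - f(x - dx / 2)) / dx         # central difference for f'
        d2 = (f(x + dx) - 2 * f(x) + f(x - dx)) / dx**2   # central difference for f''
        x_new = x - d1 / d2  # Newton step with approximate derivatives
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

## Note the five evaluations of f per iteration, even in 1D
newton_minimize_fd(lambda x: 0.5 * x**2 - 0.2, x0=0.5)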
Why not always do this?¶
- Finite difference approximations are not exact, and they tend to fail when we need them most (rapidly varying functions)
- Finite difference approximations become computationally expensive as we increase dimensionality. Even in 1D, each Newton step requires five calls to the objective function just to build first-order finite difference estimates of the first and second derivatives
Image("../resources/newton.png", width=800)
# Source: https://jermwatt.github.io/machine_learning_refined/notes/4_Second_order_methods/4_4_Newtons.html