# Convex OptimizationÂ¶

Preamble: Run the cells below to import the necessary Python packages

*This notebook created by William Gilpin. Consult the course website for all content and GitHub repository for raw files and runnable online code.*

```
import numpy as np
from IPython.display import Image, display
import matplotlib.pyplot as plt
%matplotlib inline
```

### OptimizationÂ¶

- The cost function, fitness, or empirical risk is usually well-defined
- Error of a neural network on training data

- The fitness of a genotype

- The stability of a folded protein configuration

- Whether a given configuration is a true ground state or metastable

- Loss landscapes in ML are approximated by finite data (empirical risk); ideally with more data the empirical risk converges to the true risk

If we can query every point in input space, we can guarantee that we've found a global minimum

If the function is convex, we can step towards minima in a principled way

Non-convex, we either have to perform many queries, or have external information that we can use to refine our search. Non-convex optimization over binary variables is NP-hard (difficulty scales exponentially with problem size)

- That means it's in the same difficulty class as Donkey Kong Country

```
xx = np.linspace(-1, 1, 500)
f_con = lambda x: 0.5 * x**2 - 0.2
f_noncon = lambda x : x**2 + 0.5 * np.sin(30 * x) - 0.2
plt.figure()
plt.plot(xx, f_con(xx))
plt.title('A convex optimization problem')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.figure()
plt.plot(xx, f_noncon(xx))
plt.title('A non-convex optimization problem')
plt.xlabel('x')
plt.ylabel('f(x)')
```

Text(0, 0.5, 'f(x)')

We've already seen an example: in the Conjugate Gradient Method, we took steps towards a fixed point that represented the least-squares solution to the unconstrained linear problem

```
```

```
```

```
```

# One dimensional optimization: fixed point methods and rootfindingÂ¶

- I'm given some function $f(x)$, and I want to calculate the global minimum (convex) or at least the nearest local minimum to a start point (for nonconvex)
- If use our standard approach of finding the nearest point at which $f' = 0$, reducing the calculation to rootfinding:

$$ x^* = \arg \min_x f(x) \implies f'(x^*) = 0 $$

```
```

```
```

# Gradient descentÂ¶

- A first-order method
- Rolling down a hill
- Requires an estimate of the derivative of the loss function. Minimizing the loss is equivalent to finding the root of its derivative
- Requires local information only, rather than global information (querying points all over the landscape)

Starting from an arbitrary point on the landscape, take steps along the direction of steepest descent

$$ x \leftarrow x - \eta \dfrac{df}{dx} $$

The learning rate is a "hyperparameter" that we choose based on problem knowledge. Too large, and we can get stuck in oscillating solutions around the optima. Too small, and the system takes a long time to converge.

```
class GradientDescentOptimizer:
def __init__(self, learning_rate=0.1, max_iter=1000, tolerance=1e-6, store_history=False):
self.learning_rate = learning_rate
self.max_iter = max_iter
self.tolerance = tolerance
self.store_history = store_history
if self.store_history:
self.history = list()
def optimize(self, df, x0):
x = x0
for i in range(self.max_iter):
x_new = x - self.learning_rate * df(x)
if np.linalg.norm(x_new - x) < self.tolerance:
break
x = x_new
if self.store_history:
self.history.append(x_new)
return x
# def loss(x):
# return 0.5 * x**2 - 0.2
loss = lambda x: 0.5 * x**2 - 0.2
loss_grad = lambda x: x
# loss = lambda x: x**2 + 0.05 * np.sin(30 * x) - 0.2
# loss_grad = lambda x: 2 * x + 0.05 * 30 * np.cos(30 * x)
# optimizer = GradientDescentOptimizer(store_history=True, learning_rate=2.0)
optimizer = GradientDescentOptimizer(store_history=True)
x_opt = optimizer.optimize(loss_grad, 0.5)
plt.figure()
plt.plot(xx, loss(xx))
plt.plot(x_opt, loss(x_opt), 'ro')
plt.xlabel('x')
plt.ylabel("Loss")
plt.figure()
plt.plot(optimizer.history)
plt.xlabel('Iteration')
plt.ylabel('x')
```

Text(0, 0.5, 'x')

### Modifying gradient descent for non-convex functionsÂ¶

We can add *momentum* to the optimizer, so that it keeps a memory of past timesteps and bearings

$$ v \leftarrow \eta \dfrac{df}{dx} + \alpha\, v $$

$$ x \leftarrow x - v $$

Next class / next unit: optimization as solving an initial value problem

```
loss = lambda x: x**2 + 0.05 * np.sin(30 * x) - 0.2
loss_grad = lambda x: 2 * x + 0.05 * 30 * np.cos(30 * x)
## Solve optimization
optimizer = GradientDescentOptimizer(learning_rate=0.02, store_history=True)
x_opt = optimizer.optimize(loss_grad, 0.4)
## Plot loss
xx = np.linspace(-1, 1, 500)
plt.figure()
plt.plot(xx, loss(xx))
plt.plot(x_opt, loss(x_opt), 'ro')
plt.xlabel('x')
plt.ylabel("Loss")
plt.figure()
plt.plot(optimizer.history)
plt.xlabel('Iteration')
plt.ylabel('x')
```

Text(0, 0.5, 'x')

```
## Solve optimization
optimizer = GradientDescentOptimizer(learning_rate=0.02, store_history=True)
x_opt = optimizer.optimize(loss_grad, 0.7)
## Plot loss
xx = np.linspace(-1, 1, 500)
plt.figure()
plt.plot(xx, loss(xx))
plt.plot(x_opt, loss(x_opt), 'ro')
plt.xlabel('x')
plt.ylabel("Loss")
plt.figure()
plt.plot(optimizer.history)
plt.xlabel('Iteration')
plt.ylabel('x')
```

Text(0, 0.5, 'x')

```
class GradientDescentMomentumOptimizer:
def __init__(self, learning_rate=0.1, alpha=0.9, max_iter=1000, tolerance=1e-6, store_history=False):
self.learning_rate = learning_rate
self.alpha = alpha
self.max_iter = max_iter
self.tolerance = tolerance
self.store_history = store_history
if self.store_history:
self.history = list()
def optimize(self, df, x0):
x = x0
v = 0
for i in range(self.max_iter):
v = self.alpha * v + self.learning_rate * df(x)
x_new = x - v
if np.linalg.norm(x_new - x) < self.tolerance:
break
x = x_new
if self.store_history:
self.history.append(x_new)
return x
## Solve optimization
optimizer = GradientDescentMomentumOptimizer(learning_rate=0.1, alpha=0.3, store_history=True)
x_opt = optimizer.optimize(loss_grad, 0.9)
## Plot loss
xx = np.linspace(-1, 1, 500)
plt.figure()
plt.plot(xx, loss(xx))
plt.plot(x_opt, loss(x_opt), 'ro')
plt.xlabel('x')
plt.ylabel("Loss")
plt.figure()
plt.plot(optimizer.history)
plt.xlabel('Iteration')
plt.ylabel('x')
```

Text(0, 0.5, 'x')

```
```

```
```

```
```

## Second-order methods and Newton's methodÂ¶

Can we improve convergence using information about the local landscape geometry?

### Newton's methodÂ¶

Suppose that our initial guess is $x_k$ and our objective function is $f(x)$. We perform a second-order Taylor expansion around the guess, $$ {\displaystyle f(x_{k}+ h)\approx f(x_{k})+f'(x_{k})h +{\frac {1}{2}}f''(x_{k})h^{2}.} $$ We want to minimize this approximated local function in $h$, $$ {\displaystyle \displaystyle 0={\frac {\rm {d}}{{\rm {d}} h}}\left(f(x_{k})+f'(x_{k})h+{\frac {1}{2}}f''(x_{k})h^{2}\right)=f'(x_{k})+f''(x_{k})h,} $$ Solving this equation, we arrive at $$ h^* = -{\frac {f'(x_{k})}{f''(x_{k})}}, $$

Having optimized the approximant, we now update the position of $x$, $$ x_{k+1} = x_{k} + h^* = x_{k} - {\frac {f'(x_{k})}{f''(x_{k})}}. $$

Basically, in a single timestep Newton's method we approximate the function locally as a parabola, and we then jump to the optimum of that parabolic approximant. Newton's method therefore consists of optimizing a global function by optimizing a series of local approximants.

### What if we don't know the derivatives of $f(x)$?Â¶

- This is sometimes called the "secant" method
- In many types of problems, we are best served by having a loss function with a differentiable form
- As a fallback, we can use finite difference operators to approximate $f'$ and $f''$

$$ f'(x) \approx \dfrac{f(x + \Delta x / 2) - f(x - \Delta x / 2)}{\Delta x} $$ $$ f''(x) \approx \dfrac{f(x + \Delta x) - 2 f(x) + f(x - \Delta x)}{\Delta x^2} $$ where $\Delta x / x \ll 1$

Here we have used central finite differences, but we could also use forward finite differences $$ f'(x) \approx \dfrac{f(x + \Delta x) - f(x)}{\Delta x} $$ $$ f''(x) \approx \dfrac{f(x + 2 \Delta x) - 2 f(x + \Delta x) + f(x)}{\Delta x^2} $$ Or, analogously, backwards finite differences.

There also exist higher-order finite difference approximations, as well as multivariate generalizations. Notice how the coefficients of the first and second derivatives are Pascal's triangle.

### Why not always do this?Â¶

- Finite difference approximations are not exact, tend to fail when we need them most (rapidly changing functions)
- Finite difference approximations are computationally expensive as we increase dimensionality. In 1D, we have to call the objective function five times in Newton's method just to get first-order finite difference operators for the first and second derivatives of the potential

```
Image("../resources/newton.png", width=800)
# Source: https://jermwatt.github.io/machine_learning_refined/notes/4_Second_order_methods/4_4_Newtons.html
```