
Preamble and Imports

The University of Texas at Austin


## Preamble / required packages
import numpy as np
import scipy.sparse as sp
np.random.seed(0)

# Import local plotting functions and in-notebook display functions
import matplotlib.pyplot as plt
from IPython.display import Image, display
%matplotlib inline

import warnings
# Comment this out to activate warnings
warnings.filterwarnings('ignore')

Supervised learning

  • Given an input $X$, construct a function that assigns it a label $\hat{y}$.
  • The function $\hat{y} = f(X; \theta)$ is learned from many instances of “labelled” data comprising $X$ and $y$ pairs. The “weights” or parameters $\theta$ define the function space over which we are searching.
Image("../resources/supervised_learning.gif", width=1000)
# https://github.com/ReiCHU31/Cat-Dog-Classification-Flask-App
  • For this course, we will see supervised learning as the process of learning a generator for a dataset: given a set of points, can we approximate the underlying process that produced those points?
    • Forecasting: given some past values, predict future values
    • Regression: Given a known generator with unknown parameters (like a quadratic Hamiltonian with unknown amplitudes), infer those amplitudes
    • Classification (the most famous example in ML): Given examples of labelled images, states, etc., predict the class of unlabelled data. We can think of the learned decision boundary as defining a generator of new examples belonging to that class (see conditional GANs, etc.)

Linear Regression and the Ising Model

Consider the 1D Ising model with nearest-neighbor interactions

$$H[\boldsymbol{S}]=-J\sum_{j=1}^L S_{j}S_{j+1}$$

on a chain of length $L$ with periodic boundary conditions and Ising spin variables $S_j=\pm 1$. In one dimension, this paradigmatic model has no phase transition at finite temperature.

For a more detailed introduction, we refer the reader to one of the many textbooks on the subject (see, for instance, Goldenfeld, Lubensky, or Baxter).

Learning the Ising model

Suppose you are performing an experiment on an unknown magnetic material. You observe a large number of spin configurations and are able to compute their Ising energies. If you don’t know the value of $J$, and thus the Hamiltonian, can you infer it given a data set of $i=1\ldots n$ points of the form $\{(H[\boldsymbol{S}^i],\boldsymbol{S}^i)\}$?

Generate some data from a simulation

  • In the real world, we won’t have access to this generator. Here because we made our own data, we have a ground truth, or oracle, against which to compare our results.
### define Ising model parameters
# system size
L = 40



class IsingModel:
    """
    The Ising model with ferromagnetic interactions that encourage nearest neighbors
    to align
    """
    def __init__(self, L, random_state=None):
        self.L = L
        self.random_state = random_state

        self.J = np.zeros((L, L),)
        # Ferromagnetic interaction between nearest-neighbors
        for i in range(L):
            self.J[i, (i + 1) % L] = -1.0  # off first diagonal

    def sample(self, n_samples=1):
        if self.random_state is not None:
            np.random.seed(self.random_state)
        return np.random.choice([-1, 1], size=(n_samples, self.L))

    def energy(self, state):
        # Compute S^T J S for one or more spin configurations
        return np.einsum("...i,ij,...j->...", state, self.J, state)

model_experiment = IsingModel(L)

# create 10000 random Ising states
states = model_experiment.sample(n_samples=10000)


# calculate Ising energies
energies = model_experiment.energy(states)

print("Input data has shape: ", states.shape)
print("Labels have shape: ", energies.shape)
Input data has shape:  (10000, 40)
Labels have shape:  (10000,)

Choosing the right model for the problem

First of all, we have to decide on a model class (the space of possible Hamiltonians) to use to fit the data. In the absence of any prior knowledge, one sensible choice is the all-to-all Ising model

$$H_\mathrm{model}[\boldsymbol{S}^i] = - \sum_{j=1}^L \sum_{k=1}^L J_{j,k}S_{j}^iS_{k}^i.$$

Notice that this model is uniquely defined by the non-local coupling strengths $J_{jk}$, which we want to learn. Importantly, this model is linear in $\mathbf{J}$, which makes it possible to use linear regression to fit the values of $\mathbf{J}$.

Our regression inputs are vector microstates $\mathbf{X}$, while our outputs are scalar energies $H$. Because our Hamiltonian is a function of spin-spin interactions, in order to apply linear regression we would like to recast this model in the form

$$H_\mathrm{model}^i \equiv \mathbf{X}^i \cdot \mathbf{J},$$

where the vectors $\mathbf{X}^i$ represent all two-body interactions $\{S_{j}^i S_{k}^i\}_{j,k=1}^L$, and the index $i$ runs over the samples in the data set. To make the analogy complete, we can also represent the dot product by a single index $p = \{j,k\}$, i.e. $\mathbf{X}^i \cdot \mathbf{J} = X^i_p J_p$. Note that the regression model does not include the minus sign, so we expect to learn negative $J$’s.

Our inductive bias

If we had known nothing at all about the generator of our data, we might have just passed the input spin vectors $\mathbf{S}$ as the inputs to our model. However, in this case we have prior knowledge that our Hamiltonian acts on nearest-neighbor interactions. Rather than leaving it to our model to figure this out, we can encode our “inductive bias” by feeding in training data that already has the form of a neighbor matrix

states = states[:, :, None] * states[:, None, :] # outer product creates neighbor matrix

# Data matrix / design matrix always has shape (n_samples, n_features)
X_all = np.reshape(states, (states.shape[0], -1)) 

# Match our labels
y_all = energies
X_all.shape
(10000, 1600)
y_all.shape
(10000,)
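
As a quick sanity check (a minimal sketch using the arrays defined above), the flattened pairwise products dotted with the flattened true coupling matrix should reproduce the simulated energies exactly:

# Sanity check: the design matrix dotted with the flattened true coupling
# matrix reproduces the simulated energies
print(np.allclose(X_all @ model_experiment.J.ravel(), y_all))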

Least squares fitting

  • We’ll start by training a linear model that maps our nearest-neighbor matrices to the energies we observed in our experiments
# define number of samples
n_samples = 400

# define train and test data sets
X_train, X_test = X_all[:n_samples], X_all[n_samples : 3 * n_samples // 2]
y_train, y_test = y_all[:n_samples], y_all[n_samples : 3 * n_samples // 2]
from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
plt.figure(figsize=(6, 6))
plt.plot(y_train, y_pred_train, ".k")
plt.xlabel("True energy (Training Data)")
plt.ylabel("Predicted energy (Training Data)")

What about data that the model hasn’t seen before?

y_pred_test = model.predict(X_test)
plt.figure(figsize=(6, 6))
plt.plot(y_test, y_pred_test, ".k")
plt.xlabel("True energy (Test Data)")
plt.ylabel("Predicted energy (Test Data)")

Overfitting

  • A high training score just tells us that our model class is capable of expressing patterns found in the training data
  • For any dataset, there exists a way to get 100% training accuracy as long as we have access to memory equal to the size of the training dataset (a 1-NN lookup table; see the sketch below)
  • We need to either regularize (so the training data can’t be perfectly fit) or use a test dataset to see how good our model actually is
  • A reasonable heuristic when choosing model complexity is to find one that can just barely overfit the training data (this suggests sufficient expressive power)
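
As an illustration of the lookup-table point above, here is a minimal sketch reusing the train/test split defined earlier: a 1-nearest-neighbor regressor memorizes the training set and scores perfectly on it, while its test score shows how well pure memorization generalizes.

# A 1-NN "lookup table": perfect on memorized training data (when training
# inputs are distinct), evaluated separately on held-out test data
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=1)
knn.fit(X_train, y_train)
print("1-NN train R^2:", knn.score(X_train, y_train))
print("1-NN test R^2:", knn.score(X_test, y_test))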

Evaluating the performance

Regression scoring: the coefficient of determination $R^2$

In what follows the model performance (in-sample and out-of-sample) is evaluated using the so-called coefficient of determination, which is given by:

$$\begin{align} R^2 &= 1-\frac{u}{v},\\ u &= \sum_i \left(y_{pred}^i - y_{true}^i\right)^2,\\ v &= \sum_i \left(y_{true}^i - \langle y_{true}\rangle\right)^2 \end{align}$$

The best possible score is 1.0, but it can also be negative. A constant model that always predicts the expected value of $y$, $\langle y_{true}\rangle$, disregarding the input features, would get an $R^2$ score of 0.
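
To connect this formula to the code below, here is a minimal check (using the fitted model and test split from above) that computing $R^2$ by hand agrees with sklearn’s model.score:

# Compute R^2 by hand and compare against model.score
y_pred = model.predict(X_test)
u = np.sum((y_pred - y_test) ** 2)
v = np.sum((y_test - np.mean(y_test)) ** 2)
print(1 - u / v, model.score(X_test, y_test))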

print("Train error was:", model.score(X_train, y_train))
print("Test error was:", model.score(X_test, y_test))
Train error was: 1.0
Test error was: 0.49574373554401063

But raw score doesn’t tell the whole story

  • We can get a great fit, but our model might have a lot of free parameters
  • There might be multiple valid coupling matrices JJ that explain the observed data
  • Our model might be predictive but not interpretable or physically meaningful
  • We either need more data, better data (sample rarer states), or a better model

Summary so far...

We want to fit a generic spin glass model of the form

$$H_\mathrm{model}[\boldsymbol{S}^i] = - \sum_{j=1}^L \sum_{k=1}^L J_{j,k}S_{j}^iS_{k}^i.$$

to a dataset that, unbeknownst to us, comes from the nearest-neighbor Ising model:

$$H[\boldsymbol{S}]=-J\sum_{j=1}^L S_{j}S_{j+1}$$

We are given examples of microstates $\mathbf{S}^i$ and their associated energies $H^i$, and we want to use machine learning in the form of regression to determine the values of $J_{jk}$.

couplings_estimated = np.array(model.coef_).reshape((L, L))
couplings_true = model_experiment.J

plt.figure(figsize=(6, 6))
plt.imshow(couplings_true, cmap="RdBu_r", vmin=-1, vmax=1)
plt.colorbar()
plt.title("True coupling matrix $J$")

plt.figure(figsize=(6, 6))
plt.imshow(couplings_estimated, cmap="RdBu_r", vmin=-1, vmax=1)
plt.colorbar()
plt.title("Estimated coupling matrix $J$")

Regularization:

  • We constrain the model’s allowed space of valid representations in order to select for more parsimonious models
  • Operationally, regularizers/constraints reduce the “effective” number of parameters, and thus complexity, of our model
  • Imposing preferred basis functions or symmetries can be forms of regularization
  • Can also help the model avoid looking at regions in representation space that are dead-ends, thus simplifying training

Ridge regression and Lasso:

  • We can think of our least-squares problem as choosing the optimal $\mathbf{J}$ that minimizes the following objective function, the squared error between the model energies and the observed energies
    $\mathcal{L} = \sum_i (\mathbf{X}^i \cdot \mathbf{J} - H^i)^2$
  • where $i$ indexes the training examples, which have predicted energies $\mathbf{X}^i \cdot \mathbf{J}$ and observed energies $H^i$
  • Ridge and Lasso are common regularizers that act on the weights (trainable parameters) of our model
  • If $\mathbf{J}$ is our trainable linear regression weight matrix (and, in this context, our best estimate for the Ising interaction matrix), then the losses, or penalties, associated with these regularizers have the following forms:
    $\mathcal{L}_{Lasso} = \sum_i (\mathbf{X}^i \cdot \mathbf{J} - H^i)^2 + \lambda \sum_{ij} | J_{ij} |$

and

$$\mathcal{L}_{Ridge} = \sum_i (\mathbf{X}^i \cdot \mathbf{J} - H^i)^2 + \lambda \sum_{ij} J_{ij}^2$$

where the hyperparameter $\lambda$ determines the “strength” of the penalty terms.

The ridge penalty (sometimes called the $L2$ penalty because of its functional form) discourages any particular weight in the coefficient matrix from becoming too large. Ridge is usually used when we want to impose a degree of smoothness or regularity across how a model treats its various inputs; models that take continuous data as inputs (such as time series or images) may benefit from the ridge term. In contrast, the Lasso or $L1$ penalty encourages sparsity in weight space: it incentivizes models where most coefficients go to zero, thereby reducing the model’s dependence on features.
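
To see this qualitative difference directly, here is a small sketch (the penalty strength of 1.0 is arbitrary) comparing how many coefficients each regularizer leaves nonzero on the training set defined earlier:

# Ridge shrinks weights but rarely zeroes them; Lasso drives many weights to exactly zero
from sklearn.linear_model import Ridge, Lasso

ridge_demo = Ridge(alpha=1.0).fit(X_train, y_train)
lasso_demo = Lasso(alpha=1.0).fit(X_train, y_train)
print("Ridge nonzero coefficients:", np.sum(np.abs(ridge_demo.coef_) > 1e-8))
print("Lasso nonzero coefficients:", np.sum(np.abs(lasso_demo.coef_) > 1e-8))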

from sklearn import linear_model
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
# import seaborn
%matplotlib inline

# set up Lasso and Ridge Regression models
# leastsq = linear_model.LinearRegression()
# ridge = linear_model.Ridge()
# lasso = linear_model.Lasso()
model_ols = linear_model.LinearRegression()
model_l2 = linear_model.Ridge()
model_l1= linear_model.Lasso()

# set regularisation strength values
lambdas = np.logspace(-4, 5, 10)

# define error lists
train_error_ols, test_error_ols = list(), list()
train_error_l2, test_error_l2 = list(), list()
train_error_l1, test_error_l1 = list(), list()

#Initialize coefficients for ridge regression and Lasso
coeffs_ols, coeffs_ridge, coeffs_lasso = list(), list(), list()

for lam in lambdas:
    ### ordinary least squares
    model_ols.fit(X_train, y_train) # fit model 
    coeffs_ols.append(model_ols.coef_) # store weights
    # use the coefficient of determination R^2 as the performance of prediction.
    train_error_ols.append(model_ols.score(X_train, y_train))
    test_error_ols.append(model_ols.score(X_test, y_test))

    ### ridge regression
    model_l2.set_params(alpha=lam) # set regularisation strength
    model_l2.fit(X_train, y_train) # fit model
    coeffs_ridge.append(model_l2.coef_) # store weights
    train_error_l2.append(model_l2.score(X_train, y_train))
    test_error_l2.append(model_l2.score(X_test, y_test))

    ### lasso
    model_l1.set_params(alpha=lam) # set regularisation strength
    model_l1.fit(X_train, y_train) # fit model
    coeffs_lasso.append(model_l1.coef_) # store weights
    train_error_l1.append(model_l1.score(X_train, y_train))
    test_error_l1.append(model_l1.score(X_test, y_test))




    ### plot Ising interaction J
    J_leastsq = np.array(model_ols.coef_).reshape((L, L))
    J_ridge = np.array(model_l2.coef_).reshape((L, L))
    J_lasso = np.array(model_l1.coef_).reshape((L, L))


    fig, axarr = plt.subplots(nrows=1, ncols=3)
    
    axarr[0].imshow(J_leastsq, cmap="RdBu_r", vmin=-1, vmax=1)
    axarr[0].set_title(f"OLS \n Train={train_error_ols[-1]:.3f}, Test={test_error_ols[-1]:.3f}", fontsize=16)
    axarr[0].tick_params(labelsize=16)

    axarr[1].imshow(J_ridge, cmap="RdBu_r", vmin=-1, vmax=1)
    axarr[1].set_title('Ridge $\\lambda=%.4f$\n Train$=%.3f$, Test$=%.3f$' % (lam, train_error_l2[-1], test_error_l2[-1]), fontsize=16)
    axarr[1].tick_params(labelsize=16)

    im = axarr[2].imshow(J_lasso, cmap="RdBu_r", vmin=-1, vmax=1)
    axarr[2].set_title('LASSO $\\lambda=%.4f$\n Train$=%.3f$, Test$=%.3f$' % (lam, train_error_l1[-1], test_error_l1[-1]), fontsize=16)
    axarr[2].tick_params(labelsize=16)

    divider = make_axes_locatable(axarr[2])
    cax = divider.append_axes("right", size="5%", pad=0.05)
    cbar=fig.colorbar(im, cax=cax)
    
    cbar.ax.set_yticklabels(np.arange(-1.0, 1.0+0.25, 0.25),fontsize=14)
    cbar.set_label('$J_{i,j}$',labelpad=15, y=0.5,fontsize=20,rotation=0)
    
    fig.subplots_adjust(right=2.0)
    
    plt.show()
# Plot our performance on both the training and test data
plt.figure(figsize=(6, 3))
plt.semilogx(lambdas, train_error_ols, "b", label="Train (OLS)")
plt.semilogx(lambdas, test_error_ols, "--b", label="Test (OLS)")
plt.title("Ordinary Least Squares", fontsize=16)
plt.xlabel(r"Regularization strength $\lambda$", fontsize=16)
plt.ylabel("Performance $R^2$", fontsize=16)
plt.legend()


plt.figure(figsize=(6, 3))
plt.semilogx(lambdas, train_error_l2, "r", label="Train (Ridge)", linewidth=1)
plt.semilogx(lambdas, test_error_l2, "--r", label="Test (Ridge)", linewidth=1)
plt.title("Ridge Regression", fontsize=16)
plt.xlabel(r"Regularization strength $\lambda$", fontsize=16)
plt.ylabel("Performance $R^2$", fontsize=16)
plt.legend()

plt.figure(figsize=(6, 3))
plt.semilogx(lambdas, train_error_l1, "g", label="Train (LASSO)")
plt.semilogx(lambdas, test_error_l1, "--g", label="Test (LASSO)")
plt.title("LASSO", fontsize=16)
plt.xlabel(r"Regularization strength $\lambda$", fontsize=16)
plt.ylabel("Performance $R^2$", fontsize=16)
plt.legend()

Understanding the results

(i) At $\lambda\to 0$, all three models overfit the data, as can be seen from the deviation of the test scores from unity (dashed lines) while the training curves stay at unity. At $\lambda\to\infty$, the penalized models (Ridge and LASSO) are driven toward zero weights and underfit, so both their training and test scores collapse; OLS has no penalty term and is unaffected by $\lambda$.

(ii) While the OLS and Ridge regression test curves are monotonic, the LASSO test curve is not, suggesting the optimal LASSO regularization parameter is $\lambda\approx 10^{-2}$. At this sweet spot, the Ising interaction weights $\mathbf{J}$ contain only nearest-neighbor terms (as did the model from which the data was generated).

Gauge degrees of freedom: recall that the uniform nearest-neighbor interaction strength $J_{j,k}=J$ which we used to generate the data was set to unity, $J=1$. Moreover, $J_{j,k}$ was NOT defined to be symmetric (we only used the $J_{j,j+1}$ but never the $J_{j,j-1}$ elements). The colorbar on the matrix-element plots above suggests that OLS and Ridge regression learn uniform symmetric weights $J=-0.5$. There is no mystery here: this amounts to taking into account both the $J_{j,j+1}$ and the $J_{j,j-1}$ terms, with the weight distributed symmetrically between them. LASSO, on the other hand, can break this symmetry (see the matrix-element plots for $\lambda=0.001$ and $\lambda=0.01$). Thus, different regularization schemes can lead to learning equivalent models, but in different gauges. Any information we have about the symmetry of the unknown model that generated the data should be reflected in the definition of the model and the regularization chosen.
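
The gauge freedom can be checked directly. In this minimal sketch (using the simulated data and true couplings from above), a symmetrized coupling matrix with half the strength on each off-diagonal predicts exactly the same energies as the asymmetric matrix used to generate the data, because the design matrix contains both $S_j S_k$ and $S_k S_j$:

# Symmetric and asymmetric "gauges" of the coupling matrix give identical energies
J_asym = model_experiment.J            # -1 on the upper first off-diagonal only
J_sym = 0.5 * (J_asym + J_asym.T)      # -0.5 on both off-diagonals
print(np.allclose(X_all @ J_asym.ravel(), X_all @ J_sym.ravel()))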

Hyperparameters

We can imagine that an even more general model would have both regularizers, each with different strengths

$$\mathcal{L}_{total} = \mathcal{L}_{least-squares} + \lambda_1 \mathcal{L}_{lasso} + \lambda_2 \mathcal{L}_{ridge}$$

This loss function is sometimes referred to as least-squares with an ElasticNet penalty.
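
The combined penalty is implemented in scikit-learn as ElasticNet, which is parametrized by an overall strength alpha and a mixing fraction l1_ratio rather than by separate $\lambda_1$ and $\lambda_2$. A minimal sketch on the train/test split from above (the hyperparameter values are arbitrary):

# ElasticNet combines the L1 and L2 penalties in a single model
from sklearn.linear_model import ElasticNet

model_en = ElasticNet(alpha=0.01, l1_ratio=0.5)  # arbitrary illustrative values
model_en.fit(X_train, y_train)
print("ElasticNet test R^2:", model_en.score(X_test, y_test))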

Why not always use regularizers?

  • The issue: we have two arbitrary factors, $\lambda_1$ and $\lambda_2$, which determine how important the L1 and L2 penalties are relative to the primary fitting term. These change the available solution space and thus the model class
  • These are not “fit” during training like ordinary parameters; rather, they are specified beforehand, perhaps with a bit of intuition or domain knowledge. They therefore represent hyperparameters of the model
  • Generally speaking, any “choices” we make (amount of data, model type, model parameters, neural network depth, etc.) are all hyperparameters. How do we choose these in a principled manner?
  • A major question in machine learning: how do we choose the best hyperparameters for a model?

Validation set

  • Hold out some data just for hyperparameter tuning, separate from the test set
  • Don’t tune on the test set: that leads to data leakage and thus overfitting
from sklearn import linear_model

# define train, validation and test data sets
X_train, X_val, X_test = X_all[:400], X_all[400 : 600], X_all[600 : 800]
y_train, y_val, y_test = y_all[:400], y_all[400 : 600], y_all[600 : 800]

# the hyperparameter values to check
lambdas = np.logspace(-6, 4, 9)

all_validation_losses = list()

for lam in lambdas:
    model_l1 = linear_model.Lasso(alpha=lam)
    model_l1.fit(X_train, y_train)
    validation_loss = model_l1.score(X_val, y_val)
    all_validation_losses.append(validation_loss)

best_lambda = lambdas[np.argmax(all_validation_losses)]

plt.semilogx(lambdas, all_validation_losses, label="Validation")

print("Best lambda on validation set", best_lambda)
Best lambda on validation set 0.005623413251903491
# fit best model 
model_l1 = linear_model.Lasso(alpha=best_lambda)
model_l1.fit(X_train, y_train)

y_test_predict = model_l1.predict(X_test)

plt.figure()
plt.plot(y_test, y_test_predict, ".k")
plt.xlabel("True value")
plt.ylabel("Predicted value")


# plot Ising interaction J
J_lasso = np.array(model_l1.coef_).reshape((L, L))
plt.figure()
plt.imshow(J_lasso,  cmap="RdBu_r", vmin=-1, vmax=1)

Cross-validation

  • We repeatedly split the training set into multiple sub-splits for training and validation
  • For example, if we have 100 training points, we can create five 80:20 “splits” and, for each hyperparameter value, average the validation score across the splits
  • If we perform $k$ sub-splits, we refer to the procedure as k-fold cross-validation
  • More elaborate splitting methods exist (random Monte Carlo, importance weighted, etc.)
Image("../resources/cross_val.png", width=1000)
# https://towardsdatascience.com/cross-validation-k-fold-vs-monte-carlo-e54df2fc179b
## Cross validation

from sklearn import linear_model

# define train, validation and test data sets
X_train, X_test = X_all[:600], X_all[600 : 800]
y_train, y_test = y_all[:600], y_all[600 : 800]

# the hyperparameter values to check
lambdas = np.logspace(-6, 4, 9)

all_validation_losses = list()

for lam in lambdas:

    all_val_loss_lam = list()

    for k in range(5):

        # Create the training and validation subsets from the training data
        X_train_k = np.concatenate([X_train[:k*100], X_train[(k+1)*100:]])
        y_train_k = np.concatenate([y_train[:k*100], y_train[(k+1)*100:]])
        X_val_k = X_train[k*100:(k+1)*100]
        y_val_k = y_train[k*100:(k+1)*100]


        model_l1 = linear_model.Lasso(alpha=lam)
        model_l1.fit(X_train_k, y_train_k)
        validation_loss = model_l1.score(X_val_k, y_val_k)
        all_val_loss_lam.append(validation_loss)

    all_validation_losses.append(np.mean(all_val_loss_lam))

best_lambda = lambdas[np.argmax(all_validation_losses)]
plt.figure()
plt.semilogx(lambdas, all_validation_losses, label="Validation")
print("Best lambda on validation set", best_lambda)
Best lambda on validation set 0.005623413251903491
# fit best model 
model_l1 = linear_model.Lasso(alpha=best_lambda)
model_l1.fit(X_train, y_train)

y_test_predict = model_l1.predict(X_test)

plt.figure()
plt.plot(y_test, y_test_predict, ".k")
plt.xlabel("True value")
plt.ylabel("Predicted value")


# plot Ising interaction J
J_lasso = np.array(model_l1.coef_).reshape((L, L))
plt.figure()
plt.imshow(J_lasso,  cmap="RdBu_r", vmin=-1, vmax=1)
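
The manual k-fold loop above can also be delegated to scikit-learn’s built-in cross-validation helpers. For example, LassoCV cross-validates over a grid of penalty strengths internally; a minimal sketch using the same training split and $\lambda$ grid:

# LassoCV performs k-fold cross-validation over a grid of alphas internally
from sklearn.linear_model import LassoCV

model_cv = LassoCV(alphas=lambdas, cv=5)
model_cv.fit(X_train, y_train)
print("Best lambda from LassoCV:", model_cv.alpha_)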

How many free parameters are in our model?

Linear regression:

$$y_i = \mathbf{A} \mathbf{X}_i$$
  • In our Ising example, the fitting parameters $\mathbf{A}$ corresponded to our coupling matrix $\mathbf{J}$
  • $\mathbf{X}_i$ were our spin microstates, and $y_i$ were their energies
  • $\mathbf{J} \in \mathbb{R}^{40 \times 40}$ because our inputs (microstates) are on an $L = 40$ lattice

What if we wanted to increase our model complexity?

Idea:

$$y_i = \mathbf{B} \mathbf{C} \mathbf{X}_i$$

where $\mathbf{B} \in \mathbb{R}^{40 \times p}$ and $\mathbf{C} \in \mathbb{R}^{p \times 40}$, with $p$ being a hyperparameter that controls the complexity of the model. This “hidden” or “latent” dimensionality allows us to have a more complex model.

However, the problem is that $\mathbf{B} \mathbf{C} \equiv \mathbf{A} \in \mathbb{R}^{40 \times 40}$, so we don’t gain any expressivity.

Solution:

$$y_i = \mathbf{B}\, \sigma(\mathbf{C} \mathbf{X}_i)$$

where $\sigma(\cdot)$ is an elementwise nonlinear function, like $\tanh(\cdot)$.
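
As a minimal illustration of this two-layer form (the weights here are random and untrained, with shapes chosen to match the $L=40$ Ising example, and $p$ is the hidden dimension):

# One hidden layer with an elementwise tanh nonlinearity: y = B @ tanh(C @ x)
p = 10                                   # hidden ("latent") dimension, a hyperparameter
C = np.random.randn(p, 40)               # maps the input microstate to p hidden features
B = np.random.randn(1, p)                # maps the hidden features to a scalar output

x = np.random.choice([-1, 1], size=40)   # an example spin microstate
y = B @ np.tanh(C @ x)                   # shape (1,)
print(y.shape)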

Forecasting is supervised learning

  • We can think of forecasting as a supervised learning problem.

  • Given a $d$-dimensional time series $X \in \mathbb{R}^{T \times d}$, we want to predict the next $T_p$ points in the future, $y \in \mathbb{R}^{d \times T_p}$

from dysts.flows import MackeyGlass

eq = MackeyGlass()
# eq.tau = 17
# eq.beta = 0.2
# eq.gamma = -0.1
# eq.n = 10.0

t, sol = eq.make_trajectory(130000, pts_per_period=200, return_times=True)
t, sol = t[::30], sol[::50]
dt = t[1] - t[0]
x = sol[:, 0]
# x += 0.1 * np.random.randn(len(x))
plt.plot(x)

print(eq.tau // dt)
53.0
def chunk_time_series(y0, lookback, horizon=1):
    """
    Given a time series, return a matrix and regression target corresponding to all
    possible chunks of the time series of length lookback. The regression target is the
    horizon steps ahead of the final value in each chunk.

    Args:
        y0 (np.ndarray): The time series
        lookback (int): The length of each chunk
        horizon (int): The number of steps ahead to predict
    """

    X_out, y_out = list(), list()
    for i in range(len(y0) - lookback - horizon):
        X_out.append(y0[i : i + lookback])
        y_out.append(y0[i + lookback + horizon])

    return np.array(X_out), np.array(y_out)


X_train, y_train = chunk_time_series(x[:-200], 50, horizon=30)
X_test, y_test = chunk_time_series(x[-200:], 50, horizon=30)


from sklearn.linear_model import LinearRegression, Ridge, Lasso

model = LinearRegression()
# model = Ridge(alpha=1e-1)
model = Lasso(alpha=1e-2)
model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
print("Train accuracy was:", model.score(X_train, y_train))

plt.figure(figsize=(6, 6))
# plt.plot(y_train, y_pred_train, ".k")
plt.plot(y_train)
plt.plot(y_pred_train)
plt.xlabel("True value (Training Data)")
plt.ylabel("Predicted value (Training Data)")

y_pred_test = model.predict(X_test)
print("Test accuracy was:", model.score(X_test, y_test))
plt.figure(figsize=(6, 6))
plt.plot(y_test)
plt.plot(y_pred_test)
plt.xlabel("True value (Test Data)")
Train accuracy was: 0.5202337403312125
Test accuracy was: 0.5215330103149215
eq.tau - 30*dt - .4
0.4812999999999986
.4/dt
10.726736390453192
30*dt
1.1187000000000014
plt.plot(dt * np.arange(len(model.coef_.squeeze())), model.coef_.squeeze()[::-1])
plt.plot(dt * np.arange(len(model.coef_.squeeze())), model.coef_.squeeze()[::-1], '.')
eq.make_trajectory
<bound method DynSysDelay.make_trajectory of MackeyGlass(name='MackeyGlass', params={'beta': 2, 'gamma': 1.0, 'n': 9.65, 'tau': 2.0}, random_state=None)>

Appendix: Kuramoto model

from scipy.integrate import solve_ivp

class KuramotoModel:
    """
    Simulate the Kuramoto model with N oscillators and an arbitrary coupling matrix
    """

    def __init__(self, N, K, omega, alpha=0.0):
        self.N = N
        self.K = K
        self.omega = omega
        self.alpha = alpha

    

    def rhs(self, t, theta):
        # print(np.dot(self.K, np.sin(theta - theta[:, None]).T).shape)

        phase_diff = theta.reshape((1, -1)) - theta.reshape((-1, 1)) 
        sin_diff = np.sin(phase_diff - self.alpha)
        return self.omega + 1 / self.N * np.sum(self.K * sin_diff, axis=1)
        # return self.omega + 1 / self.N * np.sum(self.K * np.sin(theta - theta[:, None]).T, axis=1)
    
    def simulate(self, theta0, t0, t1, dt):
        """
        Simulate the Kuramoto model using solve_ivp
        """
        sol = solve_ivp(self.rhs, (t0, t1), theta0, t_eval=np.arange(t0, t1, dt), vectorized=True, method="LSODA")
        return sol.t, sol.y
    
kmat = np.random.normal(size=(20, 20))
kmat = kmat + kmat.T
np.fill_diagonal(kmat, 0)

def make_block_coupling(n, m=2):
    """Make an n x n block coupling matrix with m blocks"""
    K = np.zeros((n, n))
    for i in range(m):
        K[i * n // m : (i + 1) * n // m, i * n // m : (i + 1) * n // m] = 1
    return K


# Abrams et al. 2008 parameters
nval = 1024
kmat1 = 0.6 *  make_block_coupling(nval, m=2)
kmat2 = 0.4 * np.rot90(make_block_coupling(nval, m=2))
kmat = kmat1 + kmat2
np.fill_diagonal(kmat, 0)
aval = np.pi / 2  - 0.1


eq = KuramotoModel(N=nval, omega=np.random.normal(loc=1, scale=0.00, size=nval), K=kmat, alpha=aval)

ic = np.random.uniform(0, 2 * np.pi, size=nval)
# ic[:10] = np.pi
# ic = np.linspace(0, 2 * np.pi, nval)

t, theta = eq.simulate(theta0=ic, t0=0, t1=500, dt=0.01)

plt.plot(t, np.sin(theta.T))
plt.xlabel("Time")
plt.ylabel("Phase")
plt.figure()
plt.plot(t, np.sin(theta.T)[:, -10:], color='b', alpha=0.1)

plt.figure()
plt.plot(t, np.sin(theta.T)[:,  :10], color='r', alpha=0.1)

Linearizing a differential equation

Suppose I have a differential equation of the form

$$\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x})$$

where $\mathbf{x}$ is a vector of variables, and $\mathbf{f}$ is a function that takes in $\mathbf{x}$ and returns a vector of the same dimension. In discrete time, we can approximate this with the forward Euler method

$$\mathbf{x}_{t+1} = \mathbf{x}_t + \Delta t \, \mathbf{f}(\mathbf{x}_t)$$

where $\Delta t$ is the time step. If we take $\Delta t \to 0$, we recover the continuous-time equation.
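
A minimal sketch of the forward Euler update for a generic right-hand side (the exponential-decay example is purely illustrative):

# Forward Euler: x_{t+1} = x_t + dt * f(x_t)
def euler_integrate(f, x0, dt, n_steps):
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        xs.append(xs[-1] + dt * f(xs[-1]))
    return np.array(xs)

# e.g. exponential decay dx/dt = -x
traj = euler_integrate(lambda x: -x, x0=[1.0], dt=0.01, n_steps=100)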

Example: Kuramoto model

The Kuramoto model is a paradigmatic model of synchronization, and is given by the following set of equations

$$\dot{\theta}_i = \omega_i + \frac{1}{N} \sum_{j=1}^N K_{ij}\sin(\theta_j - \theta_i)$$

where $\theta_i$ is the phase of oscillator $i$, $\omega_i$ is its natural frequency, and $K_{ij}$ is the coupling strength between oscillators $i$ and $j$. Applying Euler’s method, we get

$$\theta_{i,t+1} = \theta_{i,t} + \Delta t \left(\omega_i + \frac{1}{N} \sum_{j=1}^N K_{ij}\sin(\theta_{j,t} - \theta_{i,t})\right)$$

If we are given a time series of the form $\theta_{i,t}$, we can rearrange this equation to isolate a linear regression equation

$$\theta_{i,t+1} - \theta_{i,t} = \Delta t \left(\omega_i + \frac{1}{N} \sum_{j=1}^N K_{ij}\sin(\theta_{j,t} - \theta_{i,t})\right)$$

We define two new variables $Y$ and $Z$ that allow us to write this problem as a linear regression equation

$$Y_{ijk} = \sin(\theta_{jk} - \theta_{ik})$$
$$Z_{ik} = \frac{\theta_{i,k+1} - \theta_{i,k}}{\Delta t}$$

and we can write this as a linear regression equation

$$Z_{ik} = \omega_i + \sum_{j=1}^N \frac{K_{ij}}{N} Y_{ijk}$$

where, in the notation of our earlier linear regression, the weights are given by $\mathbf{A} = \mathbf{K}/N$ (with the natural frequencies $\omega_i$ entering as an intercept), and the inputs are given by the sine features $\mathbf{X}_k = \mathbf{Y}_k$.

# Abrams et al. 2008 parameters
nval = 100
kmat1 = 0.6 *  make_block_coupling(nval, m=2)
kmat2 = 0.4 * np.rot90(make_block_coupling(nval, m=2))
kmat = (kmat1 + kmat2) * 0.2
np.fill_diagonal(kmat, 0)
aval = 0.0


eq = KuramotoModel(N=nval, omega=np.random.normal(loc=1, scale=0.02, size=nval), K=kmat, alpha=aval)

ic = np.random.uniform(0, 2 * np.pi, size=nval)
t, theta = eq.simulate(theta0=ic, t0=0, t1=50, dt=0.01)

plt.plot(t, np.sin(theta.T));
Z.reshape(-1, nval).shape
(4999, 100)
Y[:-1].reshape(-1, nval ** 2).shape
(4999, 10000)

# Z_{ik} = \omega_i + \sum_{j=1}^N  \frac{K_{ij}}{N}  Y_{ijk}


X = theta.T # shape (n_timepoints, n_oscillators)

Y = np.sin(X[:, :, None] - X[:, None, :]) # shape (n_timepoints, n_oscillators, n_oscillators)

dt = np.diff(t)[0]
Z = np.diff(X, axis=0) / dt # shape (n_timepoints - 1, n_oscillators)


# perform linear regression to learn the coupling matrix with shape (n_oscillators, n_oscillators)

from sklearn.linear_model import LinearRegression, Ridge, Lasso

# model = LinearRegression()
# model = Ridge(alpha=0.1)
model = Lasso(alpha=0.01)

model.fit(Y[:-1].reshape(-1, nval ** 2), Z.reshape(-1, nval))


Lasso(alpha=0.01)
plt.imshow(model.coef_[0].reshape((nval, nval)))
plt.imshow(np.mean(model.coef_, axis=0).reshape((nval, nval)))
## Train a linear regression to find the K matrix

# Transform the phase time series into a linear regression problem:
# features are the pairwise sine terms Y_{ijk}, targets are the
# finite-difference phase velocities Z_{ik} derived above
X_lin = Y[:-1].reshape(-1, nval ** 2)
y_lin = Z.reshape(-1, nval)

# define train and test data sets
X_train, X_test = X_lin[:400], X_lin[400:600]
y_train, y_test = y_lin[:400], y_lin[400:600]

from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)