
Predicting one-dimensional cellular automata

The University of Texas at Austin

Preamble: Run the cells below to import the necessary Python packages

  • This notebook is modified from notebooks originally developed and used in Pankaj Mehta’s ML for physics course at Boston University. Please check out those notebooks and associated textbooks for additional details and exercises.
## Preamble / required packages
import numpy as np
np.random.seed(0)

# Import local plotting functions and in-notebook display functions
import matplotlib.pyplot as plt
from IPython.display import Image, display
%matplotlib inline

import warnings
# Comment this out to activate warnings
warnings.filterwarnings('ignore')


## Set nicer default colors
plt.rcParams['image.cmap'] = 'PuBu'
plt.rcParams['axes.prop_cycle'] = plt.cycler(color=[[1.0, .3882, .2784], 
                                                    [0.0, 0.6, 0.5], 
                                                    [1.0, .6471, 0.0], 
                                                    [0.0, 0.4471, 0.7412], 
                                                    [0.5, 0.5, 0.5], 
                                                    [0.7490, 0, 0.7490], 
                                                    [0.0, 0.0, 0.0]])
plt.rcParams['lines.markersize'] = 10

Supervised learning

  • Given an input $X$, construct a function that assigns it a label $\hat{y}$.

  • Regression: $\hat{y}$ is a continuous variable. For example, given a picture of a person, predict their age. In physics, a common example is to predict the energy of a particle given its momentum, or to forecast the next step in a time series.

  • Classification: $\hat{y}$ is a discrete variable. For example, given a picture of an animal, predict whether it is a cat or a dog. In physics, a common example is to predict whether a phase is ordered or disordered, or to detect whether a data point is background, signal, or anomaly.

The function $\hat{y} = f_\theta(X)$ is learned from many instances of labelled data comprising $X \in \mathbb{R}^{N_\text{data} \times N_\text{features}}$ and known $y \in \mathbb{R}^{N_\text{data}}$ pairs. The “weights” or parameters $\theta$ are adjusted during training.

[Figure: supervised learning schematic. Image from source.]

Supervised learning as inferring a generator

  • We can see supervised learning as the process of learning a generator for a dataset: given a set of points, can we approximate the underlying process that produced those points?
    • Forecasting: given some past values, predict future values
    • Regression: Given a known generator with unknown parameters (like a quadratic Hamiltonian with unknown amplitudes), infer those amplitudes
    • Classification (most famous example in ML): Given examples of labelled images, states, etc., predict the class of unlabelled data. We can think of the learned decision boundary as defining a generator of new examples belonging to that class (see conditional GANs, etc.)

Load experimental measurements of an unknown spin chain

  • Suppose we have a spin chain with $L$ spins evolving in time

  • Our data consists of $N_\text{data}$ measurements of binary microstates $\mathbf{s} \in \{0,1\}^L$ corresponding to the spin values.

microstates = np.load('../resources/spin_chain_microstates.npy', allow_pickle=True)[:500]
plt.figure(figsize=(8, 8))
plt.imshow(microstates.T)
plt.ylabel("Lattice Position")
plt.xlabel("Time")
plt.plot(microstates[0])
plt.plot(microstates[1])

Forecasting as a supervised learning problem

  • Looking at the data, we notice that the spins appear to exhibit temporal correlations: the state at time $t$ is correlated with the state at time $t-1$.

  • One hypothesis is that the spins are evolving according to Markovian dynamics, in which the state of the lattice at time $t$ depends only on the state at time $t-1$.

$$\mathbf{s}_i = f(\mathbf{s}_{i-1})$$
  • We don’t know the form of $f$, but we can try to learn it from the data. In order to frame this as a supervised learning problem, we therefore think of each data point as a given spin configuration $\mathbf{s}_i$ and its label as the next spin configuration $\mathbf{s}_{i+1}$.

  • Note how this concept of supervised learning differs from image classification: here, the label is not a single number, like an integer or a class label, but an entire vector with the same shape as the input data.

  • We will start by inspecting the data $\mathbf{s}_i$ and the regression target $\mathbf{s}_{i+1}$.

plt.plot(microstates[10], label="Time 10")
plt.plot(microstates[11], label="Time 11")
plt.xlabel("Lattice Position")
plt.ylabel("Spin Value")
plt.legend()

Training and testing splits

  • In order to frame this problem as supervised learning, we need to split the data into training and testing sets. We will use the first 80% of the data for training and the last 20% for testing.

  • By convention, in supervised learning we refer to the input data as $X$ and the target as $y$. In this case, $X \in \mathbb{Z}^L$ will be the spins at time $t$ and $y \in \mathbb{Z}^L$ will be the spins at time $t+1$.

X_all, y_all = microstates.copy()[:-1], microstates.copy()[1:]
print("X_all shape: ", X_all.shape)
print("y_all shape: ", y_all.shape, "\n")
X_all shape:  (499, 160)
y_all shape:  (499, 160) 


# Data matrix / design matrix always has shape (n_samples, n_features)
X_train, X_test = X_all[:400], X_all[400:]
y_train, y_test = y_all[:400], y_all[400:]
print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape, "\n")
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)
X_train shape:  (400, 160)
y_train shape:  (400, 160) 

X_test shape:  (99, 160)
y_test shape:  (99, 160)

A note on flattening

  • For our problem, the input data is a binary vector $\mathbf{s}(t)$ of length $L$, and the target is a binary vector $\mathbf{s}(t+1)$ of length $L$. Our training dataset $X \in \mathbb{R}^{N_\text{data} \times L}$ has the typical form of a supervised learning dataset.

  • However, if we had a multidimensional dataset, we would need to flatten it into 1D feature vectors, as sketched below. For example, if we were studying a two-dimensional system, each snapshot would be an $L \times L$ matrix, and we might have $N_\text{data}$ snapshots. Our training data would then be $X \in \mathbb{R}^{N_\text{data} \times L \times L}$, and we would need to flatten it into $X \in \mathbb{R}^{N_\text{data} \times L^2}$.
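
As a concrete sketch of this flattening step (the array here is a hypothetical stack of 2D snapshots, not our spin-chain data):

import numpy as np

# Hypothetical stack of 2D snapshots, shape (n_snapshots, L, L)
snapshots = np.random.randint(0, 2, size=(500, 32, 32))

# Flatten each L x L snapshot into a feature vector of length L^2
X_flat = snapshots.reshape(snapshots.shape[0], -1)
print(X_flat.shape)  # (500, 1024)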

Fitting a linear model with least squares

  • What is the simplest function we can think of that maps a vector $\mathbf{s}_i \in \mathbb{R}^L$ to another vector $\mathbf{s}_{i+1} \in \mathbb{R}^L$? Recall that the linear model is defined as

    $$\mathbf{s}_{i+1} = A \cdot \mathbf{s}_i$$

    where $A \in \mathbb{R}^{L \times L}$.

  • In our case, the features are the current microstates $\mathbf{s}_i$, and the targets are the subsequent microstates $\mathbf{s}_{i+1}$. We therefore view linear regression as a data-driven method for learning a Markovian dynamical system that evolves the spins in time.

  • Given a dataset $\mathbf{X} \in \mathbb{R}^{N_\text{data} \times L}$ and regression target $\mathbf{y} \in \mathbb{R}^{N_\text{data} \times L}$, we can use the least squares method to find the matrix $A$ that best fits the data. The least squares solution is given by

    $$A = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y}$$

    The matrix $\left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T$ is called the Moore-Penrose pseudoinverse: because $\mathbf{X}$ itself is not square, we can’t invert it directly, but applying the pseudoinverse to $\mathbf{y}$ yields the $A$ that minimizes the squared error between the predicted $\hat{\mathbf{s}}_{i+1}$ for each $\mathbf{s}_i$ and the true $\mathbf{s}_{i+1}$ (a numpy sketch of this solution appears after this list).

  • The best-fit predictions of the trained model are given by

    $$\hat{\mathbf{s}}_{i+1} = A \cdot \mathbf{s}_i$$

    Usually, in supervised learning we generically use $\hat{\mathbf{y}}$ to denote the predictions of the model, $\mathbf{y}$ to denote the true labels, and $\mathbf{X}$ to denote the input data. So we can also write the generic form of prediction as $\hat{\mathbf{y}} = A \cdot \mathbf{X}$.

Python syntax: the scikit-learn library

  • Rather than using numpy, we will use the Python machine learning library scikit-learn to perform the linear regression.

  • scikit-learn uses a consistent API for both simple models, like linear regression, and more complex models, like neural networks. Machine learning models are objects that are first instantiated with all hyperparameters, then trained on data using the fit method to determine the values of all parameters ($A$ in our case), and finally used to make predictions (here denoted $\hat{\mathbf{s}}_{i+1}$ or $\hat{\mathbf{y}}$) using the predict method.

from sklearn.linear_model import LinearRegression#, RidgeCV, Lasso

# model = RidgeCV()
# model = Lasso(alpha=6e-2)
model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)

# plt.imshow(model.coef_)

plt.plot(X_train[0])
# plt.plot(y_train[0])
plt.plot(y_pred_train[0])
plt.figure()
plt.imshow(y_train[:100])
plt.xlabel("Time")
plt.ylabel("Lattice Position")
plt.title("True values")

plt.figure()
plt.imshow(y_pred_train[:100])
plt.title("Predictions")

plt.figure()
plt.imshow(np.round(y_pred_train[:100]))
plt.title("Rounded Predictions")

plt.figure()
plt.imshow(y_train[-60:] - np.round(y_pred_train[-60:]))
plt.title("Difference between true and predicted values")

# plt.colorbar()

Scoring a regression model

We will use the $R^2$ score to evaluate the performance of our model. The $R^2$ score is defined as

$$R^2 = 1 - \frac{\sum_i (\mathbf{y}_i - \hat{\mathbf{y}}_i)^2}{\sum_i (\mathbf{y}_i - \bar{\mathbf{y}})^2}$$

where $i$ indexes datapoints, $\mathbf{y}_i$ are the true regression targets, $\hat{\mathbf{y}}_i$ are the predicted values, and $\bar{\mathbf{y}}$ is the elementwise mean of the true values. The $R^2$ score is a measure of how well the model explains the variance in the data. A score of 1 indicates a perfect fit, while a score of 0 indicates that the model is no better than predicting the mean of the data.

def coefficient_of_determination(y_true, y_pred):
    """The R^2 score, or coefficient of determination, is a measure of how well 
    future samples are likely to be predicted by the model.
    
    Args:
        y_true (np.ndarray): True values
        y_pred (np.ndarray): Predicted values

    Returns:
        float: R^2 score
    """
    ss_res = np.sum((y_true - y_pred) ** 2)  # Sum of squared residuals
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # Total sum of squares
    r2 = 1 - (ss_res / ss_tot)
    return r2

print("R^2 score on training data: ", coefficient_of_determination(y_train, y_pred_train))
R^2 score on training data:  0.8526570213563216
What about experiments that the model hasn’t seen before?
  • The test set corresponds to pairs of $X$ and $y$ that the model hasn’t seen before, which here are pairs of successive spin configurations held out from training.

  • We use the test set to evaluate the model’s performance on unseen data. If the model performs well on the test set, we can be more confident that it learned general features of the data, rather than just memorizing the training set.

y_pred_test = model.predict(X_test)

plt.figure()
plt.imshow(y_test[:100])
plt.title("True values")

plt.figure()
plt.imshow(y_pred_test[:100])
plt.title("Predictions")

plt.figure()
plt.imshow(np.round(y_pred_test[:100]))
plt.title("Rounded Predictions")

plt.figure()
plt.imshow(y_test[-60:] - y_pred_test[-60:])
plt.title("Difference between true and predicted values")
print("R^2 score on test data: ", coefficient_of_determination(y_test, y_pred_test))
R^2 score on test data:  0.5461927695908562

Overfitting

  • High train accuracy just tells us that our model class is capable of expressing patterns found in the training data

  • For any dataset, there exists a way to get 100% train accuracy as long as we have access to memory equal to the size of the training dataset (a 1-nearest-neighbor lookup table; see the demonstration after this list)

  • We therefore need to either regularize (training data can’t be perfectly fit) or use a test dataset to see how good our model actually is

  • A reasonable heuristic when choosing model complexity is to find one that can just barely overfit train (suggests sufficient power)
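
As a quick demonstration of the lookup-table point, a 1-nearest-neighbor regressor reaches a perfect train score on our spin data while telling us nothing until we check the test set (a sketch using scikit-learn's KNeighborsRegressor):

from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=1)
knn.fit(X_train, y_train)
# Each training point is its own nearest neighbor, so the train R^2 is 1.0
# (barring duplicate inputs with conflicting targets)
print(knn.score(X_train, y_train))
print(knn.score(X_test, y_test))  # generalization is what actually matters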

But raw score doesn’t tell the whole story

  • We can get a great fit, but our model might have a lot of free parameters
  • There might be multiple valid coupling matrices $J$ that explain the observed data
  • Our model might be predictive but not interpretable or physical
  • We either need more data, better data (sample rarer states), or a better model
Let’s look at the learned coupling matrix, or the weights of our fitted model
  • These store the “physics” that our model learned from the data

  • Recall that the number of trainable parameters is $L^2$, where $L$ is the number of spins in the chain. We had $L=160$ spins, so we have $L^2 = 25{,}600$ parameters to learn.

## Because the model maps L inputs to L outputs, the coefficients form an L x L coupling matrix
L = X_train.shape[1]
rules_estimated = np.array(model.coef_)

plt.figure(figsize=(6, 6))
plt.imshow(rules_estimated[:50, :50], cmap='RdBu_r')
plt.colorbar()
Let’s try repeating the model fitting several times on different subsets of our experimental data
plt.figure(figsize=(9, 9))
## Plot 3 x 3 subplots

n_samples = 20
for i in range(9):

    ## Pick random training data set
    selection_inds = np.random.choice(range(X_train.shape[0]), size=n_samples, replace=False)
    X_train_subset, y_train_subset = X_train[selection_inds], y_train[selection_inds]
    model = LinearRegression()
    model.fit(X_train_subset, y_train_subset)
    rules_estimated = np.array(model.coef_)
    
    ## Plot learned coupling matrix
    plt.subplot(3, 3, i + 1)
    plt.imshow(rules_estimated, cmap='RdBu_r')
    plt.axis('off')

# spacing between subplots
plt.subplots_adjust(wspace=0.1, hspace=0.1)
We can see that there is variance in the fitted models. While they have some similarities, the fitted parameters (weights) vary from trained replicate to replicate.
  • In machine learning, the main ways to reduce the variance of a model are to increase the amount of data (expensive), or to constrain the model class to have fewer (or effectively fewer) free parameters by introducing bias towards a particular solution.

Narrowing the model class with regularization

  • We constrain the model’s allowed space of valid representations in order to select for more parsimonious models

  • Operationally, regularizers/constraints reduce the “effective” number of parameters, and thus complexity, of our model

  • Imposing preferred basis functions or symmetries can be forms of regularization

Ordinary linear regression solves an optimization problem

  • We can think of our least-squares problem as choosing the optimal $A$ that minimizes the following objective function, the mean squared error between the model predictions and the true targets

    $$\mathcal{L}_\text{MSE}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{N_\text{data}} \sum_{i=1}^{N_\text{data}} \left( \mathbf{y}_i - \hat{\mathbf{y}}_i \right)^2$$
  • The ordinary least squares formula above minimizes this loss function for a given input dataset $X$ and paired targets $y$, under the constraint that the model is linear

Regularization

  • Common regularizers add penalties to the loss function above, which can be thought of as limiting the space of allowed solutions. Instead of all $L^2$ parameters in the matrix $A$ being free to vary, the regularizer imposes constraints on the allowed values of the parameters.

  • Ridge regression (aka $L_2$ regularization) discourages any particular weight in the coefficient matrix from becoming too large. Ridge imposes a degree of smoothness or regularity in how a model treats its various inputs. Models that take continuous data as inputs (such as time series or images) may benefit from the ridge term.

$$\mathcal{L}_\text{Ridge}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{N_\text{data}} \sum_{i=1}^{N_\text{data}} \left( \mathbf{y}_i - \hat{\mathbf{y}}_i \right)^2 + \lambda \sum_{i,j} A_{ij}^2$$

  • Lasso (aka $L_1$ regularization) encourages sparsity in weight space: it incentivizes models where most coefficients go to zero, thereby reducing the model’s dependence on features. Lasso is often used in feature selection, where we want to identify the most important features in a dataset.

$$\mathcal{L}_\text{Lasso}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{N_\text{data}} \sum_{i=1}^{N_\text{data}} \left( \mathbf{y}_i - \hat{\mathbf{y}}_i \right)^2 + \lambda \sum_{i,j} |A_{ij}|$$

Both of these penalties introduce a hyperparameter, $\lambda$, that controls the strength of the regularization. When $\lambda = 0$, the regularized loss function reduces to the ordinary least squares loss function. As $\lambda$ increases, the regularizer becomes more important relative to the data loss term, and the model is more constrained.
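
For the ridge penalty, the minimizer is still available in closed form (up to the convention for how $\lambda$ is scaled):

$$A = \left( \mathbf{X}^T \mathbf{X} + \lambda I \right)^{-1} \mathbf{X}^T \mathbf{y}$$

so the regularizer simply adds $\lambda$ to the diagonal of $\mathbf{X}^T \mathbf{X}$ before inversion, which also stabilizes the inverse when $\mathbf{X}^T \mathbf{X}$ is ill-conditioned. The Lasso penalty has no closed-form solution and is instead minimized iteratively (scikit-learn uses coordinate descent).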

Let’s try re-fitting the model with these different regularizers. We will vary $\lambda$, the strength of the regularization, and see how the learned dynamical system changes



# Instantiate models
from sklearn import linear_model
model_ols = linear_model.LinearRegression()
model_l2 = linear_model.Ridge()
model_l1= linear_model.Lasso()

# Set regularization range
lambdas = np.logspace(-6, 4, 8)

# Load data
# define subset of samples
# n_samples = 100
X_train, X_test = X_all[:400], X_all[400:]
y_train, y_test = y_all[:400], y_all[400:]

# define error lists
train_error_ols, test_error_ols = list(), list()
train_error_l2, test_error_l2 = list(), list()
train_error_l1, test_error_l1 = list(), list()

#Initialize coefficients for ridge regression and Lasso
coeffs_ols, coeffs_ridge, coeffs_lasso = list(), list(), list()

for lam in lambdas:
    ### ordinary least squares
    model_ols.fit(X_train, y_train) # fit model 
    coeffs_ols.append(model_ols.coef_) # store weights
    # use the coefficient of determination R^2 as the performance of prediction.
    train_error_ols.append(model_ols.score(X_train, y_train))
    test_error_ols.append(model_ols.score(X_test, y_test))

    ### ridge regression
    model_l2.set_params(alpha=lam) # set regularisation strength
    model_l2.fit(X_train, y_train) # fit model
    coeffs_ridge.append(model_l2.coef_) # store weights
    train_error_l2.append(model_l2.score(X_train, y_train))
    test_error_l2.append(model_l2.score(X_test, y_test))

    ### lasso
    model_l1.set_params(alpha=lam) # set regularisation strength
    model_l1.fit(X_train, y_train) # fit model
    coeffs_lasso.append(model_l1.coef_) # store weights
    train_error_l1.append(model_l1.score(X_train, y_train))
    test_error_l1.append(model_l1.score(X_test, y_test))




    ### plot Ising interaction J
    J_leastsq = np.array(model_ols.coef_).reshape((L, L))
    J_ridge = np.array(model_l2.coef_).reshape((L, L))
    J_lasso = np.array(model_l1.coef_).reshape((L, L))


    fig, axarr = plt.subplots(nrows=1, ncols=3)
    
    axarr[0].imshow(J_leastsq[:50, :50], cmap="RdBu_r", vmin=-1, vmax=1)
    axarr[0].set_title('OLS \n Train$=%.3f$, Test$=%.3f$' % (train_error_ols[-1], test_error_ols[-1]))
    axarr[0].tick_params(labelsize=16)

    axarr[1].imshow(J_ridge[:50, :50], cmap="RdBu_r", vmin=-1, vmax=1)
    axarr[1].set_title('Ridge $\\lambda=%.4f$\n Train$=%.3f$, Test$=%.3f$' % (lam, train_error_l2[-1], test_error_l2[-1]))
    axarr[1].tick_params(labelsize=16)

    im = axarr[2].imshow(J_lasso[:50, :50], cmap="RdBu_r", vmin=-1, vmax=1)
    axarr[2].set_title('LASSO $\\lambda=%.4f$\n Train$=%.3f$, Test$=%.3f$' % (lam, train_error_l1[-1], test_error_l1[-1]))
    axarr[2].tick_params(labelsize=16)
    
    # divider = make_axes_locatable(axarr[2])
    # cax = divider.append_axes("right", size="5%", pad=0.05, add_to_figure=True)
    # cbar=fig.colorbar(im, cax=cax)
    
    # cbar.ax.set_yticklabels(np.arange(-1.0, 1.0+0.25, 0.25),fontsize=14)
    # cbar.set_label('$J_{i,j}$',labelpad=15, y=0.5,fontsize=20,rotation=0)
    
    fig.subplots_adjust(right=2.0)
    
    plt.show()
# Plot our performance on both the training and test data
plt.figure(figsize=(6, 3))
plt.semilogx(lambdas, train_error_ols, "", label="Train (OLS)")
plt.semilogx(lambdas, test_error_ols, "--", label="Test (OLS)")
plt.legend(frameon=False)
plt.title("Ordinary Least Squares", fontsize=16)
plt.xlabel("Regularization strength $\lambda$")
plt.ylabel("Accuracy $R^2$")


plt.figure(figsize=(6, 3))
plt.semilogx(lambdas, train_error_l2, "", label="Train (Ridge)", linewidth=1)
plt.semilogx(lambdas, test_error_l2, "--", label="Test (Ridge)", linewidth=1)
plt.legend(frameon=False)
plt.title("Ridge Regression", fontsize=16)
plt.xlabel("Regularization strength $\lambda$")
plt.ylabel("Accuracy $R^2$")

plt.figure(figsize=(6, 3))
plt.semilogx(lambdas, train_error_l1, label="Train (LASSO)")
plt.semilogx(lambdas, test_error_l1, "--", label="Test (LASSO)")
plt.legend(frameon=False)
plt.title("LASSO", fontsize=16)
plt.xlabel("Regularization strength $\lambda$")
plt.ylabel("Accuracy $R^2$")

It looks like the best test performance occurred at an intermediate value of $\lambda$. This is a common phenomenon in machine learning: the model is so unconstrained at low $\lambda$ that it overfits the training data, while at high $\lambda$ the model is so constrained that it underfits the data, missing important features.

lambda_optimal = lambdas[np.argmax(test_error_l1)]
print("Optimal lambda for Lasso: ", lambda_optimal)
Optimal lambda for Lasso:  0.019306977288832496
model_l1.set_params(alpha=lambda_optimal)
model_l1.fit(X_train, y_train)

y_pred_test = model_l1.predict(X_test)

plt.figure()
plt.imshow(y_test[:100])
plt.title("True values")

plt.figure()
plt.imshow(y_pred_test[:100])
plt.title("Predictions")

# plt.figure()
# plt.imshow(np.round(y_pred_test[:100]))
# plt.title("Rounded Predictions")

plt.figure()
plt.imshow(y_test[-60:] - y_pred_test[-60:])
plt.title("Difference between true and predicted values")

Let’s take a look at the learned coupling matrix

plt.imshow(model_l1.coef_[:50, :50], cmap='RdBu_r')

What physics do we learn from the dynamics matrix?

When we perform linear regression, we learn a matrix $A$ that maps the dynamics of the spins from one time step to the next,

$$\mathbf{s}_{i+1} = A \cdot \mathbf{s}_i$$

where $\mathbf{s}_i \in \mathbb{R}^L$ is the state of the spins at time $i$, and $A \in \mathbb{R}^{L \times L}$ is the learned matrix.

Recall that when we discretized PDEs, a banded matrix indicated local spatial interactions, like the Laplacian acting on a lattice. The banded structure of the matrix $A$ therefore suggests that the spins are interacting with their nearest neighbors.

The accuracy of the linear fit also suggests that the dynamics are indeed Markovian, and that the state of the spins at time $i$ is sufficient to predict the state of the spins at time $i+1$. If we suspected that this wasn’t the case, we would instead fit a higher-order model,

$$\mathbf{s}_{i+1} = A \cdot \mathbf{s}_i + B \cdot \mathbf{s}_{i-1}$$

where $B$ is another matrix of weights. This would allow the spins to depend on their state at the previous two time steps, and so on. We could alternatively change the shape of our training data by concatenating the spins at multiple time steps into a single feature vector,

$$\mathbf{s}_{i+1} = A \cdot \text{vec}(\mathbf{s}_i, \mathbf{s}_{i-1})$$

where $\text{vec}(\mathbf{s}_i, \mathbf{s}_{i-1}) \in \mathbb{R}^{2L}$ is the concatenation of the spins at times $i$ and $i-1$.
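
A sketch of how this concatenated design matrix could be built from the microstates array loaded earlier:

# Rows of X2 are vec(s_i, s_{i-1}); the targets are s_{i+1}
X2 = np.hstack([microstates[1:-1], microstates[:-2]])  # shape (N - 2, 2L)
y2 = microstates[2:]                                   # shape (N - 2, L)
print(X2.shape, y2.shape)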


One-dimensional cellular automata

  • It turns out our experimental data was generated from a noisy simulation of a one-dimensional cellular automaton

  • Recall that a one-dimensional cellular automaton is a discrete dynamical system that evolves a discrete lattice of cells in discrete time steps. Each cell can be in one of $N$ states, and the state of each cell at time $t$ is determined by the states of its neighbors at time $t-1$ according to a rule.

  • We can see that our cellular automaton had two states per cell and a neighborhood of size 3, meaning that each cell updated based on the states of itself and its two nearest neighbors

  • The dynamical rule for a cellular automaton can be specified by a lookup table that gives the next state of a cell for each possible binary string denoting the configuration of its neighborhood. In our case, there are $2^3 = 8$ possible configurations of a cell and its two neighbors, so the lookup table has 8 entries.

The lookup table is essentially a Boolean function, which takes in the binary string of the cell and its two neighbors and returns the state of the cell at the next time step. For our particular cellular automaton, the lookup table is

000 -> 0
001 -> 1
010 -> 1
011 -> 1
100 -> 0
101 -> 0
110 -> 0
111 -> 0
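
In code, this lookup table is just a mapping from neighborhood strings to outputs. A minimal sketch of one update step with periodic boundaries (reading each neighborhood left-to-right is a convention; the appendix class below uses a convolution instead):

rule = {"000": 0, "001": 1, "010": 1, "011": 1,
        "100": 0, "101": 0, "110": 0, "111": 0}

def step(s):
    """Apply the lookup table to every cell, with periodic boundaries."""
    L = len(s)
    return [rule[f"{s[(i - 1) % L]}{s[i]}{s[(i + 1) % L]}"] for i in range(L)]

print(step([0, 0, 1, 0, 0]))  # [0, 1, 1, 0, 0]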

What happens if we instead have data from a cellular automaton with a different lookup table? Here’s a very similar cellular automaton with a different lookup table. This is a well-known cellular automaton called Rule 30:

000 -> 0
001 -> 1
010 -> 1
011 -> 1
100 -> 1
101 -> 0
110 -> 0
111 -> 0
microstates = np.load('../resources/rule30_microstates.npy', allow_pickle=True)[:500]
plt.figure(figsize=(10, 30))
plt.imshow(microstates.T[:, :100])
plt.ylabel("Lattice Position")
plt.xlabel("Time")

X_all, y_all = microstates.copy()[:-1], microstates.copy()[1:]
print("X_all shape: ", X_all.shape)
print("y_all shape: ", y_all.shape, "\n")

# Data matrix / design matrix always has shape (n_samples, n_features)
X_train, X_test = X_all[:400], X_all[400:]
y_train, y_test = y_all[:400], y_all[400:]
print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape, "\n")
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)
X_all shape:  (499, 50)
y_all shape:  (499, 50) 

X_train shape:  (400, 50)
y_train shape:  (400, 50) 

X_test shape:  (99, 50)
y_test shape:  (99, 50)
from sklearn.linear_model import Lasso

model = Lasso()
model.fit(X_train, y_train)
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

plt.figure()
plt.imshow(y_train[:100].T)
plt.title("True values")

plt.figure()
plt.imshow(y_pred_train[:100].T)
plt.title("Predictions")

print("R^2 score on training data: ", coefficient_of_determination(y_train, y_pred_train))
print("R^2 score on test data: ", coefficient_of_determination(y_test, y_pred_test))
R^2 score on training data:  0.001774999999999971
R^2 score on test data:  -0.0011755948469776012

What can’t our model fit?

  • We know that the generating process of our data is a one-dimensional cellular automaton, so it should be Markovian
$$\mathbf{s}_{i+1} = f(\mathbf{s}_i)$$
  • But what if the linear function class isn’t expressive enough to capture the dynamics of the cellular automaton?
$$\mathbf{s}_{i+1} \neq A \cdot \mathbf{s}_i$$
  • How could we generalize our model to capture more complex dynamics?

Can we add additional free parameters into our model?

Linear regression:

$$\mathbf{s}_{i+1} \neq A \cdot \mathbf{s}_i$$
  • The weight matrix is constrained by our input size: $\mathbf{A} \in \mathbb{R}^{L \times L}$ because our inputs (microstates) live on an $L$-site lattice, so we had $L^2$ free parameters
Idea:
$$\mathbf{s}_{i+1} = \mathbf{B} \cdot \mathbf{C} \cdot \mathbf{s}_i$$

where $\mathbf{B} \in \mathbb{R}^{L \times p}$ and $\mathbf{C} \in \mathbb{R}^{p \times L}$, with $p$ being a hyperparameter that controls the complexity of the model. This “hidden” or “latent” dimensionality allows us to have a more complex model.

However, the problem is that $\mathbf{B} \cdot \mathbf{C} \equiv \mathbf{A} \in \mathbb{R}^{L \times L}$, so we don’t gain any expressivity

Solution:

$$\mathbf{s}_{i+1} = \mathbf{B} \cdot \sigma(\mathbf{C} \cdot \mathbf{s}_i)$$

where $\sigma(\cdot)$ is an elementwise nonlinear function, like $\tanh(\cdot)$. Now the model doesn’t collapse to a single matrix, so we have $2 \times L \times p$ free parameters as well as the ability to capture nonlinear relationships among the spins

This is a one-layer neural network with a $p$-unit “hidden” layer. We can always go wider or deeper to further increase the model complexity.
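
A shape-level numpy sketch of this one-layer network, with random (untrained) weights, just to make the dimensions and parameter count concrete:

L, p = X_train.shape[1], 200
C = np.random.randn(p, L) / np.sqrt(L)  # input -> hidden weights
B = np.random.randn(L, p) / np.sqrt(p)  # hidden -> output weights

s_next = B @ np.tanh(C @ X_train[0])    # one forward pass
print(s_next.shape, 2 * L * p)          # output shape (L,) and parameter count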

from sklearn.neural_network import MLPRegressor

model = MLPRegressor(hidden_layer_sizes=(200,), activation='logistic',
                     max_iter=2000, learning_rate_init=0.01, random_state=0)
model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

plt.figure()
plt.imshow(y_train[:100].T)
plt.title("True values")

plt.figure()
plt.imshow(y_pred_train[:100].T)
plt.title("Predictions")

print("R^2 score on training data: ", coefficient_of_determination(y_train, y_pred_train))
R^2 score on training data:  0.8953384649747517

Hyperparameter tuning

We can imagine that an even more general model would have both regularizers, each with different strengths

$$\mathcal{L}_\text{total} = \mathcal{L}_\text{MSE} + \lambda_1 \sum_{i,j} |A_{ij}| + \lambda_2 \sum_{i,j} A_{ij}^2$$

This loss function is sometimes referred to as least-squares with an ElasticNet penalty.
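
scikit-learn implements this combined penalty as ElasticNet, parameterized by an overall strength alpha and a mixing fraction l1_ratio rather than by $\lambda_1$ and $\lambda_2$ separately; a minimal sketch with placeholder hyperparameter values:

from sklearn.linear_model import ElasticNet

# l1_ratio interpolates between pure ridge (0) and pure lasso (1)
model_en = ElasticNet(alpha=1e-2, l1_ratio=0.5)
model_en.fit(X_train, y_train)
print(model_en.score(X_test, y_test))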

Why not always use regularizers?

  • The issue: we have two arbitrary factors, $\lambda_1$ and $\lambda_2$, which determine how important the L1 and L2 penalties are relative to the primary fitting objective. These change the available solution space, and thus the model class
  • These are not “fit” during training like ordinary parameters; rather, they are specified beforehand, perhaps with a bit of intuition or domain knowledge. They therefore represent hyperparameters of the model
  • Generally speaking, any “choices” we make (amount of data, model type, model parameters, neural network depth, etc.) are hyperparameters. How do we choose these in a principled manner?
  • A major question in machine learning: How do we choose the best hyperparameters for a model?

Validation set

  • Hold out some data just for hyperparameter tuning, separate from the test set

  • We don’t tune hyperparameters on the test set; that leads to data leakage and thus overfitting

from sklearn import linear_model

# define train, validation and test data sets
X_train, X_val, X_test = X_all[:400], X_all[400 : 450], X_all[450 :]
y_train, y_val, y_test = y_all[:400], y_all[400 : 450], y_all[450 :]

# the hyperparameter values to check
lambdas = np.logspace(-6, 4, 100)

all_validation_scores = list()

for lam in lambdas:
    model_l1 = linear_model.Lasso(alpha=lam)
    model_l1.fit(X_train, y_train)
    validation_score = model_l1.score(X_val, y_val)  # R^2 on the held-out validation set
    all_validation_scores.append(validation_score)

best_lambda = lambdas[np.argmax(all_validation_scores)]

plt.semilogx(lambdas, all_validation_scores, label="Validation")
plt.axvline(best_lambda, color='k', linestyle='--', label="Best lambda")
plt.xlabel(r"Regularization strength $\lambda$")
plt.ylabel("Validation Performance $R^2$")

print("Best lambda on validation set", best_lambda)
Best lambda on validation set 0.022051307399030457
# fit best model 
model_l1 = linear_model.Lasso(alpha=best_lambda)
model_l1.fit(X_train, y_train)

y_test_predict = model_l1.predict(X_test)

plt.figure()
plt.imshow(y_test[:100].T, vmin=0, vmax=1)

plt.figure()
plt.imshow(y_test_predict[:100].T)


# # plot Ising interaction J
# J_lasso = np.array(model_l1.coef_).reshape((L, L))
# plt.figure()
# plt.imshow(J_lasso,  cmap="RdBu_r", vmin=-1, vmax=1)
# plt.gca().set_aspect(1)

Cross-validation

  • We repeatedly split up the training data into multiple sub-splits for training and validation
  • For example, if we have 100 train points, we can create five 80:20 “splits”, and average the best hyperparameter across the splits
  • If we perform $k$ subsplits, we refer to our procedure as k-fold cross-validation
  • More elaborate splitting methods exist (random Monte Carlo subsampling, importance-weighted splits, etc.)
[Figure. Image from source.]

## Cross validation

from sklearn import linear_model

# define train and test data sets
X_train, X_test = X_all[:400], X_all[400 :]
y_train, y_test = y_all[:400], y_all[400 :]

# the hyperparameter values to check
lambdas = np.logspace(-6, 4, 9)

all_validation_scores = list()


## Outer loop over hyperparameter values
for lam in lambdas:

    all_val_scores_lam = list()

    ## Inner loop over k-folds
    for k in range(4):

        # Create the training and validation subsets from the training data
        X_train_k = np.concatenate([X_train[:k*100], X_train[(k+1)*100:]])
        y_train_k = np.concatenate([y_train[:k*100], y_train[(k+1)*100:]])
        X_val_k = X_train[k*100:(k+1)*100]
        y_val_k = y_train[k*100:(k+1)*100]


        model_l1 = linear_model.Lasso(alpha=lam)
        model_l1.fit(X_train_k, y_train_k)
        validation_score = model_l1.score(X_val_k, y_val_k)
        all_val_scores_lam.append(validation_score)

    all_validation_scores.append(np.mean(all_val_scores_lam))

best_lambda = lambdas[np.argmax(all_validation_scores)]
plt.figure()
plt.semilogx(lambdas, all_validation_scores, label="Validation")
plt.xlabel(r"Regularization strength $\lambda$")
plt.ylabel("Validation Accuracy $R^2$")
print("Best lambda on validation set", best_lambda)
Best lambda on validation set 0.005623413251903491
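
scikit-learn automates this loop: GridSearchCV performs the same k-fold cross-validation internally and keeps the best hyperparameter. A sketch of the equivalent search:

from sklearn.model_selection import GridSearchCV

search = GridSearchCV(linear_model.Lasso(), {"alpha": lambdas}, cv=4)
search.fit(X_train, y_train)
print("Best lambda from GridSearchCV:", search.best_params_["alpha"])
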
# fit best model 
model_l1 = linear_model.Lasso(alpha=best_lambda)
model_l1.fit(X_train, y_train)

y_test_predict = model_l1.predict(X_test)

plt.figure()
plt.imshow(y_test[:100].T, vmin=0, vmax=1)

plt.figure()
plt.imshow(y_test_predict[:100].T)

# Plot interaction matrix
plt.figure()
plt.imshow(model_l1.coef_[:50, :50], cmap='RdBu_r')

Appendix

  • Build a programmatic CA model, which takes an instruction set in binary form and implements the corresponding neighborhood-3 cellular automaton.

  • Run this model for a few time steps, and visualize and save the results

class CellularAutomaton:
    """
    A base class for cellular automata. Subclasses must implement the step method.

    Parameters
        n (int): The number of cells in the system
        n_states (int): The number of states in the system
        mutation_rate (int): The number of spins that randomly flip at each time step
        random_state (None or int): The seed for the random number generator. If None,
            the random number generator is not seeded.
        initial_state (None or array): The initial state of the system. If None, a 
            random initial state is used.
        
    """
    def __init__(self, n, n_states, mutation_rate=0, random_state=None, initial_state=None):
        self.n_states = n_states
        self.mutation_rate = mutation_rate
        self.n = n
        self.random_state = random_state
        np.random.seed(random_state)

        ## The universe is a 1D array of integers
        if initial_state is None:
            self.initial_state = np.random.choice(self.n_states, size=self.n)
        else:
            self.initial_state = initial_state
        self.state = self.initial_state

        self.history = [self.state]

    def next_state(self):
        """
        Output the next state of the entire board
        """
        raise NotImplementedError

    def simulate(self, n_steps):
        """
        Iterate the dynamics for n_steps, and return the results as an array
        """
        for i in range(n_steps):
            self.state = self.next_state()
            ## Add thermal noise
            flip_inds = np.random.choice(len(self.state), self.mutation_rate)
            self.state[flip_inds] = 1 - self.state[flip_inds]
            self.history.append(self.state.copy())
        return self.state
    

from scipy.ndimage import convolve1d

class ProgrammaticCA(CellularAutomaton):

    def __init__(self, n, ruleset,  **kwargs):
        k = np.unique(ruleset).size
        super().__init__(n, k, **kwargs)
        self.ruleset = ruleset

        ## A special convolutional kernel for converting a binary neighborhood 
        ## to an integer
        self.powers = 2 ** np.arange(3)


    def next_state(self):
        """
        Apply the ruleset to every neighborhood simultaneously
        """
        # Convert each cell's 3-cell neighborhood into an integer index,
        # convolving with periodic boundary conditions
        rule_indices = convolve1d(self.state, self.powers, mode='wrap')

        ## Look up the rule output for each cell
        next_state = self.ruleset[rule_indices.astype(int)]

        return next_state


ruleset = np.array([0, 1, 1, 1, 0, 0, 0, 0]) # ruleset used to generate the spin-chain dataset
ca = ProgrammaticCA(160, ruleset, mutation_rate=4, random_state=0)
ca.simulate(3001)

X_all = np.array(ca.history)

plt.figure(figsize=(8, 8))
plt.imshow(X_all.T[:, :501])
plt.xlabel("Position")
plt.ylabel("Time")

# X_all.dump('../resources/spin_chain_microstates.npy')
ruleset = np.array([0, 1, 1, 1, 1, 0, 0, 0]) # Rule 30
ca = ProgrammaticCA(50, ruleset, mutation_rate=0, random_state=0)
ca.simulate(1001)

X_all = np.array(ca.history)

plt.figure(figsize=(8, 8))
plt.imshow(X_all.T[:, :301])
plt.xlabel("Position")
plt.ylabel("Time")

# X_all.dump('../resources/rule30_microstates.npy')

One-dimensional cellular automata

  • A one-dimensional cellular automaton is a discrete dynamical system that evolves a discrete lattice of cells in discrete time steps. Each cell can be in one of $N$ states, and the state of each cell at time $t$ is determined by the states of its neighbors at time $t-1$ according to a rule.

  • For example, a neighborhood-3 cellular automaton has a rule that specifies the state of a cell based on the states of itself and its two neighbors. The rule can be specified by a lookup table that gives the state of the cell for each possible configuration of the neighborhood.

  • We’ve seen this code before; we implemented a “compiled” cellular automaton function in the genetic algorithms module

One-dimensional cellular automata

  • We are going to consider one of the simplest possible physical systems: one-dimensional cellular automata

  • The update rule for a given $3 \times 1$ input can be thought of as a Boolean truth table, which maps 3 Boolean values to 1 Boolean value. There are therefore $2^3 = 8$ possible inputs, and so a given 1D cellular automaton consists of a rule table of length 8

  • How many possible unique 1D cellular automata are there? If each CA consists of 8 rules, then the total number of cellular automata corresponds to all possible ways of assigning one of 2 possible output values to these 8 rules. There are therefore $2^{2^3} = 256$ possible 1D binary cellular automata

  • On some level, we can think of there being 256 possible physical laws in a purely discrete-time, discrete-space “universe,” subject to the constraints that the field can take only two possible values and that the “speed of information” is at most one lattice site per time step.

  • In the 1980s, Stephen Wolfram proposed a classification scheme for all 256 cellular automata. For a given cellular automaton ruleset, we write the set of possible $3 \times 1$ neighborhoods as binary strings, sorted by their value when converted into base-10 integers:

111, 110, 101, 100, 011, 010, 001, 000

We then write out the outputs that each string maps onto under the particular cellular automaton ruleset; for the Rule 30 table shown earlier, this gives 0, 0, 0, 1, 1, 1, 1, 0.

We next interpret this sequence as a binary integer of length 8, here 00011110.

We conclude by converting this binary integer to base 10: $00011110_2 = 30$, which is why the representation above corresponds to “Rule 30”. Every 1D binary cellular automaton therefore corresponds to a unique rule number between 0 and 255.
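
Assuming the ruleset array is indexed by the base-10 value of each neighborhood, as in the ProgrammaticCA class above, the rule number is just a dot product with powers of two:

ruleset = np.array([0, 1, 1, 1, 1, 0, 0, 0])      # outputs for neighborhoods 0..7
rule_number = int(ruleset @ (2 ** np.arange(8)))  # 2 + 4 + 8 + 16 = 30
print(rule_number)  # 30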

Implementing one-dimensional binary cellular automata

  • To make the dynamics slightly more interesting, we will introduce noise into the system. Since the universe of a binary CA consists only of binary values, noise corresponds to random bit flips in the dynamics

Learning a cellular automaton from data

  • We saw above how to generate such data. We can see this as a vignette of the problems we’ve encountered so far in this course: we know the rules that govern a physical system, and we write code that implements those rules, or a close approximation

  • Often, however, we are interested in the inverse problem: We have experimental data, and we want to learn the physical rules that generated it.

  • Suppose that we are purely given a time series of cellular automaton states. Can we learn the rule directly from it?

  • Our data consists of a time series of $N_\text{data}$ timepoints. Because we want to learn the rules that generated this data, we split it into two groups: a set of timepoints, and the set of timepoints that immediately follow them in the time series. We refer to these datasets as $X$ and $y$

  • We can think of the learning problem as inferring the unknown function $\mathbf{f}(\cdot)$ such that $\mathbf{y} = \mathbf{f}(\mathbf{x})$