Predicting one-dimensional cellular automata
Preamble: Run the cells below to import the necessary Python packages
- This notebook is modified from notebooks originally developed and used in Pankaj Mehta's ML for physics course at Boston University. Please check out those notebooks and associated textbooks for additional details and exercises.
## Preamble / required packages
import numpy as np
np.random.seed(0)
# Import local plotting functions and in-notebook display functions
import matplotlib.pyplot as plt
from IPython.display import Image, display
%matplotlib inline
import warnings
# Comment this out to activate warnings
warnings.filterwarnings('ignore')
## Set nicer default colors
plt.rcParams['image.cmap'] = 'PuBu'
plt.rcParams['axes.prop_cycle'] = plt.cycler(color=[[1.0, .3882, .2784],
[0.0, 0.6, 0.5],
[1.0, .6471, 0.0],
[0.0, 0.4471, 0.7412],
[0.5, 0.5, 0.5],
[0.7490, 0, 0.7490],
[0.0, 0.0, 0.0]])
plt.rcParams['lines.markersize'] = 10
Supervised learning
Given an input $X$, construct a function that assigns it a label $\hat{y}$.
Regression: $\hat{y}$ is a continuous variable. For example, given a picture of a person, predict their age. In physics, a common example is to predict the energy of a particle given its momentum, or forecast the next step in a time series.
Classification: $\hat{y}$ is a discrete variable. For example, given a picture of an animal, predict whether it is a cat or a dog. In physics, a common example is to predict whether a phase is ordered or disordered, or to classify whether a measured event is background, signal, or an anomaly.
The function $\hat{y} = f_\theta(X)$ is learned from many instances of labelled data comprising $X \in \mathbb{R}^{N_\text{data} \times N_\text{features}}$ and known $y \in \mathbb{R}^{N_\text{data}}$ pairs. The "weights" or parameters $\theta$ are adjusted during training.
Supervised learning as inferring a generator
- We can see supervised learning as the process of learning a generator for a dataset: given a set of points, can we approximate the underlying process that produced those points?
- Forecasting: given some past values, predict future values
- Regression: Given a known generator with unknown parameters (like a quadratic Hamiltonian with unknown amplitudes), infer those amplitudes
- Classification (most famous example in ML): Given examples of labelled images, states, etc., predict the class of unlabelled data. We can think of the learned decision boundary as defining a generator of new examples belonging to that class (see conditional GANs, etc.)
Load experimental measurements of an unknown spin chain
Suppose we have a spin chain with $L$ spins evolving in time.
Our data consists of $N_\text{data}$ measurements of binary microstates $\mathbf{s} \in \{0,1\}^L$, where each entry corresponds to one spin on the lattice.
microstates = np.load('../resources/spin_chain_microstates.npy', allow_pickle=True)[:500]
plt.figure(figsize=(8, 8))
plt.imshow(microstates.T)
plt.ylabel("Lattice Position")
plt.xlabel("Time")
plt.plot(microstates[0])
plt.plot(microstates[1])
Forecasting as a supervised learning problem
Looking at the data, we notice that the spins appear to exhibit temporal correlations: the state at time $t$ is correlated with the state at time $t-1$.
One hypothesis is that the spins are evolving according to Markovian dynamics, in which the state of the lattice at time $t$ depends only on the state at time $t-1$.
$$ \mathbf{s}_i = f(\mathbf{s}_{i-1}) $$
We don't know the form of $f$, but we can try to learn it from the data. In order to frame this as a supervised learning problem, we therefore think of each data point as a given spin configuration $\mathbf{s}_{i}$ and its label as the next spin configuration $\mathbf{s}_{i+1}$.
Note how this concept of supervised learning differs from image classification: here, the label is not a single number, like an integer or a class label, but an entire vector with shape equal to the input data.
We will start by inspecting the data $\mathbf{s}_i$ and the regression target $\mathbf{s}_{i+1}$.
plt.plot(microstates[10], label="Time 10")
plt.plot(microstates[11], label="Time 11")
plt.xlabel("Lattice Position")
plt.ylabel("Spin Value")
plt.legend()
Training and testing splits
In order to frame this problem as supervised learning, we need to split the data into training and testing sets. We will use the first 80% of the data for training and the last 20% for testing.
By convention, in supervised learning we refer to the input data as $X$ and the target as $y$. In this case, $X \in \mathbb{Z}^L$ will be the spins at time $t$ and $y \in \mathbb{Z}^L$ will be the spins at time $t+1$.
X_all, y_all = microstates.copy()[:-1], microstates.copy()[1:]
print("X_all shape: ", X_all.shape)
print("y_all shape: ", y_all.shape, "\n")
X_all shape:  (499, 160)
y_all shape:  (499, 160)
# Data matrix / design matrix always has shape (n_samples, n_features)
X_train, X_test = X_all[:400], X_all[400:]
y_train, y_test = y_all[:400], y_all[400:]
print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape, "\n")
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)
X_train shape:  (400, 160)
y_train shape:  (400, 160)

X_test shape:  (99, 160)
y_test shape:  (99, 160)
A note on flattening
For our problem, the input data is a binary vector $\mathbf{s}(t)$ of length $L$, and the target is a binary vector $\mathbf{s}(t+1)$ of length $L$. Our training dataset $X \in \mathbb{R}^{N_\text{data} \times L}$ has the typical form of a supervised learning dataset.
However, if we had a multidimensional dataset, we would need to flatten each sample into a 1D vector. For example, if we were studying a two-dimensional system (such as the 2D Ising model), each snapshot of the system would be an $L \times L$ matrix, and we might have $N_\text{data}$ snapshots. Our training data would then be $X \in \mathbb{R}^{N_\text{data} \times L \times L}$, and we would need to flatten it into $X \in \mathbb{R}^{N_\text{data} \times L^2}$, as sketched below.
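As an illustration, the cell below flattens a made-up stack of random 2D snapshots into a standard data matrix; the array snapshots_2d and its dimensions are hypothetical and are not part of the spin-chain dataset loaded above.
# Hypothetical 2D example: n_data snapshots, each an L x L binary matrix
n_data, side_length = 100, 16
snapshots_2d = np.random.randint(0, 2, size=(n_data, side_length, side_length))
# Flatten each L x L snapshot into a single feature vector of length L^2
X_flat = snapshots_2d.reshape(n_data, -1)
print(snapshots_2d.shape, "->", X_flat.shape)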
Fitting a linear model with least squares
What is the simplest function we can think of that maps a vector $\mathbf{s}_{i} \in \mathbb{R}^{L}$ to another vector $\mathbf{s}_{i+1} \in \mathbb{R}^{L}$? Recall that the linear model is defined as $$ \mathbf{s}_{i + 1} = A \cdot \mathbf{s}_{i} $$ where $A \in \mathbb{R}^{L \times L}$.
In our case, the features are the current microstates $\mathbf{s}_{i}$, and the regression targets are the subsequent microstates $\mathbf{s}_{i+1}$. We therefore view linear regression as a data-driven method for learning a Markovian dynamical system that evolves the spins in time.
Given a dataset $\mathbf{X} \in \mathbb{R}^{N_\text{data} \times L}$ and regression target $\mathbf{y} \in \mathbb{R}^{N_\text{data} \times L}$, we can use the least squares method to find the matrix $A$ that best fits the data. The least squares solution is given by $$ A = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y} $$ The matrix $\left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T$ is called the Moore-Penrose pseudoinverse of $\mathbf{X}$: because $\mathbf{X}$ is generally not square, we can't invert it directly, and the pseudoinverse plays the role of an inverse in the least-squares sense. The resulting $A$ minimizes the squared error between the predicted $\hat{\mathbf{s}}_{i+1}$ for each $\mathbf{s}_i$ and the true $\mathbf{s}_{i+1}$.
The best-fit predictions of the trained model are given by $$ \mathbf{\hat{s}}_{i + 1} = A \cdot \mathbf{s}_{i} $$ In supervised learning, we generically use $\mathbf{\hat{y}}$ to denote the predictions of the model, $\mathbf{y}$ to denote the true labels, and $\mathbf{X}$ to denote the input data. With the convention that the rows of $\mathbf{X}$ are individual samples, we can also write the predictions for the whole dataset at once as $\mathbf{\hat{y}} = \mathbf{X} A$.
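Before moving to scikit-learn, here is a minimal numpy sketch of the least-squares solution above, applied to the training split defined earlier. It uses np.linalg.lstsq, which solves the least-squares problem in a numerically stable way instead of forming $\left( \mathbf{X}^T \mathbf{X} \right)^{-1}$ explicitly; unlike the LinearRegression model used below, this sketch does not include an intercept term.
# Solve the least-squares problem X_train @ A ≈ y_train for the (L, L) matrix A
A_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X_train, y_train, rcond=None)
print(A_lstsq.shape)
# One-step prediction for the first training snapshot
s_next_pred = X_train[:1] @ A_lstsq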
Python syntax: the scikit-learn library

Rather than using numpy directly, we will use the Python machine learning library scikit-learn to perform the linear regression. scikit-learn uses a consistent API for both simple models, like linear regression, and more complex models, like neural networks. Machine learning models are objects that are first instantiated with all hyperparameters, then trained on data using the fit method to determine the values of all parameters ($A$ in our case), and finally used to make predictions (here denoted $\mathbf{\hat{s}}_{i+1}$ or $\mathbf{\hat{y}}$) using the predict method.
from sklearn.linear_model import LinearRegression  # , RidgeCV, Lasso
# model = RidgeCV()
# model = Lasso(alpha=6e-2)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred_train = model.predict(X_train)
# plt.imshow(model.coef_)
plt.plot(X_train[0])
# plt.plot(y_train[0])
plt.plot(y_pred_train[0])
plt.figure()
plt.imshow(y_train[:100])
plt.xlabel("Time")
plt.ylabel("Lattice Position")
plt.title("True values")
plt.figure()
plt.imshow(y_pred_train[:100])
plt.title("Predictions")
plt.figure()
plt.imshow(np.round(y_pred_train[:100]))
plt.title("Rounded Predictions")
plt.figure()
plt.imshow(y_train[-60:] - np.round(y_pred_train[-60:]))
plt.title("Difference between true and predicted values")
# plt.colorbar()
Scoring a regression model
We will use the $R^2$ score to evaluate the performance of our model. The $R^2$ score is defined as $$ R^2 = 1 - \frac{\sum_i (\mathbf{y}_i - \hat{\mathbf{y}}_i)^2}{\sum_i (\mathbf{y}_i - \bar{\mathbf{y}})^2} $$ where $i$ indexes datapoints, $\mathbf{y}_i$ are the true regression targets, $\hat{\mathbf{y}}_i$ are the predicted values, and $\bar{\mathbf{y}}$ is the elementwise mean of the true values. The $R^2$ score is a measure of how well the model explains the variance in the data. A score of 1 indicates a perfect fit, while a score of 0 indicates that the model is no better than predicting the mean of the data.
def coefficient_of_determination(y_true, y_pred):
    """The R^2 score, or coefficient of determination, is a measure of how well
    future samples are likely to be predicted by the model.

    Args:
        y_true (np.ndarray): True values
        y_pred (np.ndarray): Predicted values

    Returns:
        float: R^2 score
    """
    # Residual sum of squares between true and predicted values
    ss_res = np.sum((y_true - y_pred) ** 2)
    # Total sum of squares about the mean of the true values
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot
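As a quick usage sketch (assuming the function is completed as above), we can score the trained linear model on both the training split and the held-out test split:
# Score the fitted model on the training and test splits using the function above
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)
print("Train R^2:", coefficient_of_determination(y_train, y_pred_train))
print("Test R^2:", coefficient_of_determination(y_test, y_pred_test))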