# Inferring Hamiltonians from experimental data with supervised learningÂ¶

Preamble: Run the cells below to import the necessary Python packages

- This notebook is modified from notebooks originally developed and used in Pankaj Mehta's ML for physics course at Boston University. Please check out those notebooks and associated textbooks for additional details and exercises.

```
## Preamble / required packages
import numpy as np
np.random.seed(0)
# Import local plotting functions and in-notebook display functions
import matplotlib.pyplot as plt
from IPython.display import Image, display
%matplotlib inline
import warnings
# Comment this out to activate warnings
warnings.filterwarnings('ignore')
```

# Supervised learningÂ¶

Given an input $X$, construct a function that assigns it a label $\hat{y}$.

**Regression:**$\hat{y}$ is a continuous variable. For example, given a picture of a person, predict their age. In physics, a common example is to predict the energy of a particle given its momentum, or forecast the next step in a time series.**Classification:**$\hat{y}$ is a discrete variable. For example, given a picture of an animal, predict whether it is a cat or a dog. In physics, a common example is to predict whether a phase is ordered or disordered, or to detect whether a signal point is a background, signal, or anomaly.

The function $\hat{y} = f_\theta(X)$ is learned from many instances of labelled data comprising $X \in \mathbb{N_\text{data} \times N_\text{features}}$ and known $y \in \mathbb{N_\text{data}}$ pairs. The "weights" or parameters $\theta$ are adjusted during *training*.

*Image from source*

## Supervised learning as inferring a generatorÂ¶

- We can see supervised learning as the process of learning a
*generator*for a dataset: given a set of points, can we approximate the underlying process that produced those points? - Forecasting: given some past values, predict future values

- Regression: Given a known generator with unknonw parameters (like a quadratic Hamiltonian with unknown amplitudes), infer those amplitudes

- Classification (most famous example in ML): Given examples of labelled images, states, etc, predict the class of unlabelled data. We can think of the learned decision boundary as defining a generator of new examples belonging to that class (see conditional GANs, etc)

```
```

```
```

## Spin glassesÂ¶

A spin glass has a Hamiltonian of the form

$$ H(\mathbf{s}^{(i)}) = - \sum_{jk} J_{jk} s_{j}^{(i)} \, s_{k}^{(i)}. $$ where $s_{j}^{(i)} \in \{-1, 1\}$ is the spin of the $j^{th}$ spin in the $i^{th}$ experimental sample $\mathbf{s}^{(i)}$, and $J_{jk}$ is the interaction strength between the $j^{th}$ and $k^{th}$ spins. We assume that there are $L$ spins, and that the sum is over all $L(L-1)/2$ pairs of spins.

In general, the coupling matrix is not symmetric $J_{jk} \neq J_{kj}$, and the diagonal elements are zero $J_{jj} = 0$ (no self-interactions)

Depending on the coupling matrix $J_{j,k}$, the system can be ferromagnetic, anti-ferromagnetic, or a spin glass. In a ferromagnet, all spins prefer to be aligned. In an anti-ferromagnet, all spins prefer to be anti-aligned. In a spin glass, there is no global preference for alignment or anti-alignment. Instead, the system is frustrated, and the spins cannot simultaneously minimize their energy.

```
```

### Lod experimental measurements of an unknown spin glassÂ¶

Suppose we have a spin glass with $L$ spins.

Our data consists of $N_\text{data}$ measurements of microstates $\mathbf{s}^{(i)}$, and their respective energies $E^{(i)} = H(\mathbf{s}^{(i)})$.

```
microstates = np.load('../resources/spin_microstates.npy', allow_pickle=True)
energies = np.load('../resources/spin_energies.npy', allow_pickle=True)
print("Microstates shape: ", microstates.shape)
print("Energies shape: ", energies.shape)
```

Microstates shape: (10000, 40) Energies shape: (10000,)

```
plt.figure(figsize=(10, 5))
plt.imshow(
microstates[np.argsort(energies)].T,
interpolation='nearest', aspect='auto', cmap='binary'
)
plt.ylabel("Spin position")
plt.xlabel("Microstate")
plt.title("Spin configurations sorted by energy")
## aspect ratio
# plt.gca().set_aspect('auto')
plt.figure(figsize=(10, 5))
plt.hist(energies, bins=100)
plt.xlabel("Energy")
plt.ylabel("Number of microstates")
```

Text(0, 0.5, 'Number of microstates')

## Can we infer the unknown $J_{jk}$, given only the samples and their energies?Â¶

Our model class is defined as our known energy function for each sample,

$$ H(\mathbf{s}^{(i)}) = - \sum_{jk} J_{jk} s_{j}^{(i)} \, s_{k}^{(i)}. $$

We can define the vector $\mathbf{X}^i$ with components

$$ \mathbf{X}^{(i)}_{jk} = s_j^{(i)} s_k^{(i)}. $$

or, in vector notation,

$$ \mathbf{X}^{(i)} = \mathbf{s}^{(i)} \otimes \mathbf{s}^{(i)}. $$

Then the energy of each sample is

$$ E^{(i)} = - \sum_{jk} J_{jk} \mathbf{X}^{(i)}_{jk}. $$

Notice that we've exploited our prior knowledge of the physics of this problem in order to put the problem into a linear form. This represents a sort of *feature engineering* where we've used an *inductive bias* to make the problem easier to solve.

### A note on flatteningÂ¶

Notice that our Hamiltonian contains a double sum, which indexes into two matrices. Rather than keeping track of two indices, we can flatten the matrices into vectors, and use a single index.

Flattening over features space is a common technique in machine learning, which we previously used in order to apply PCA to image data.

```
X_all = microstates[:, :, None] * microstates[:, None, :] # outer product creates neighbor matrix
print("X_all shape: ", X_all.shape)
# Data matrix / design matrix always has shape (n_samples, n_features)
X_all = np.reshape(X_all, (X_all.shape[0], -1))
print("X_all shape after flattening into data matrix: ", X_all.shape)
# Match our label shape
y_all = energies
print("y_all shape: ", y_all.shape)
```

X_all shape: (10000, 40, 40) X_all shape after flattening into data matrix: (10000, 1600) y_all shape: (10000,)

### Training and testing dataÂ¶

Rather than fitting our Hamiltonian model to all of the data, we will split the data into a training set and a test set.

We will fit the model to the training set, and then evaluate the trained model on the test set.

This is a common technique in machine learning, and is used to avoid overfitting.

```
# define subset of samples
n_samples = 400
# define train and test data sets
X_train, X_test = X_all[:n_samples], X_all[n_samples : 3 * n_samples // 2]
y_train, y_test = y_all[:n_samples], y_all[n_samples : 3 * n_samples // 2]
print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape, "\n")
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)
```

X_train shape: (400, 1600) y_train shape: (400,) X_test shape: (200, 1600) y_test shape: (200,)

# Fitting a linear model with least squaresÂ¶

Because our Hamiltonian is linear with respect to the

*outer product*state matrices, we expect that linear regression is a good way to infer the coupling matrix $J_{jk}$.Recall that the linear model is defined as

$$ \hat{y} = A \cdot \mathbf{x} $$ where $A$ is a matrix, then $\mathbf{x}$ is a vector of features, and $\hat{y}$ is a vector of predictions.

In our case, the features are the outer product state matrices $\mathbf{X}^{(i)}$, and the predictions are the energies $E^{(i)}$. We can directly solve for the matrix $A$ using the least squares method: $$ A = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y} $$

Recall that we need to use the Moore-Penrose pseudoinverse because the matrix $\mathbf{X}^T \mathbf{X}$ is not square, and the pseudoinverse is the closest thing to an inverse that we can get.

`scikit-learn`

Â¶

Rather than using numpy, we will use the Python machine learning library

`scikit-learn`

to perform the linear regression.`scikit-learn`

uses a consistent API for both simple models, like linear regression, and more complex models, like neural networks.

```
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred_train = model.predict(X_train)
```

```
plt.figure(figsize=(6, 6))
plt.plot(y_train, y_pred_train, ".k")
plt.xlabel("True energy (Training Data)")
plt.ylabel("Predicted energy (Training Data)")
plt.title("Linear regression on training data")
plt.gca().set_aspect('auto')
```

What about experiments that the model hasn't seen before?

```
y_pred_test = model.predict(X_test)
# y_pred_test = model.predict(X_test)
plt.figure(figsize=(6, 6))
plt.plot(y_test, y_pred_test, ".k")
plt.xlabel("True energy (Test Data)")
plt.ylabel("Predicted energy (Test Data)")
plt.title("Linear regression on test data")
plt.gca().set_aspect('auto')
```

### OverfittingÂ¶

High train accuracy just tells us that our model class is capable of expressing patterns found the training data

For all datasets, there exists a way to get 100% train accuracy as long as I have access to memory equal to the size of the training dataset (1-nearest-neighbor lookup table)

We therefore need to either regularize (training data can't be perfectly fit) or use a test dataset to see how good our model actually is

A reasonable heuristic when choosing model complexity is to find one that can just barely overfit train (suggests sufficient power)

```
```

```
```

```
```

### Scoring a trained regression modelÂ¶

We can summarize the performance of a regression model by computing the coefficient of determination, $R^2$. This is a measure of how much of the variance in the data is explained by the model. It is defined as $$ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \langle y \rangle)^2} $$ where $y_i$ is the true value of the $i^{th}$ sample, $\hat{y}_i$ is the predicted value of the $i^{th}$ sample, and $\langle y \rangle$ is the mean of the true values. An $R^2$ of 1 indicates that the model perfectly predicts the data, while an $R^2$ of 0 indicates that the model is no better than predicting the mean of the data.

```
print("Train error was:", model.score(X_train, y_train))
print("Test error was:", model.score(X_test, y_test))
```

Train error was: 1.0 Test error was: 0.4943462162181823

```
```

## But raw score doesn't tell the whole storyÂ¶

- We can get a great fit, but our model might have a lot of free parameters
- There might be multiple valid coupling matrices $J$ that explain the observed data
- Our model might be predictive but not interpretable, or physical
- We either need more data, better data (sample rarer states), or a better model

## Let's look at the learned coupling matrixÂ¶

```
## Because we flattened the data, we need to reshape the coefficients to get the couplings
L = microstates.shape[1]
couplings_estimated = np.array(model.coef_).reshape((L, L))
plt.figure(figsize=(6, 6))
plt.imshow(couplings_estimated, cmap='RdBu_r')
```

<matplotlib.image.AxesImage at 0x16bd20a90>

## Let's try repeating the model fitting several times on different subsets of our experimental dataÂ¶

```
plt.figure(figsize=(9, 9))
## Plot 3 x 3 subplots
n_samples = 200
for i in range(9):
## Pick random training data set
selection_inds = np.random.choice(range(X_all.shape[0]), size=n_samples, replace=False)
X_train, y_train = X_all[selection_inds], y_all[selection_inds]
model = LinearRegression()
model.fit(X_all[selection_inds], y_all[selection_inds])
couplings_estimated = np.array(model.coef_).reshape((L, L))
## Plot learned coupling matrix
plt.subplot(3, 3, i + 1)
plt.imshow(couplings_estimated, cmap='RdBu_r')
plt.axis('off')
# spacing between subplots
plt.subplots_adjust(wspace=0.1, hspace=0.1)
```

```
```

```
```

```
```

```
```

# Narrowing the model class with regularizationÂ¶

We constrain the model's allowed space of valid representations in order to select for more parsimonious models

Operationally, regularizers/constraints reduce the "effective" number of parameters, and thus complexity, of our model

Imposing preferred basis functions or symmetries can be forms of regularization

### Ridge regression and Lasso:Â¶

We can think of our least-squares problem as choosing the optimal $J$ that minimizes the following objective function, the mean squared error between the model energies and true energies $$ \mathcal{L} = \sum_{i=1}^{N_{data}} (\mathbf{X}^{(i)} \cdot \mathbf{J}- E^{(i)})^2 $$

where $i$ indicates different training examples, which have predicted energies given by $\mathbf{X}^i \cdot \mathbf{J}$ and observed energies of $H^i$

Common regularizers associated loss with the trainable parameters of the model.

**Ridge regression**is also known as $L2$ regularization, and it discourages any particular weight in the coefficient matrix from becoming too large. Ridge imposes a degree of smoothness or regularity across how a model treats its various inputs. Models that take continuous data as inputs (such as time series, or images), may benefit from the ridge term.- If $\mathbf{J}$ is our trainable linear regression weight matrix (and, in this context, our best estimate for the spin-spin interaction matrix), then we can modify the the losses as follows: $$ \mathcal{L}_{Lasso} = \sum_{i=1}^{N_{data}} (\mathbf{X}^{(i)} \cdot \mathbf{J}- E^{(i)})^2 + \lambda \sum_{jk} | J_{jk} | $$

**Lasso**is also known as $L1$ regularization, and it encourages*sparsity*in weight space: it incentivizes models were most coefficients go to zero, thereby reducing the models dependencies on features. Lasso is often used in feature selection, where we want to identify the most important features in a dataset.

$$ \mathcal{L}_{Ridge} = \sum_{i=1}^{N_{data}} (\mathbf{X}^{(i)} \cdot \mathbf{J}- E^{(i)})^2 + \lambda \sum_{jk} J_{jk}^2 $$ where the hyperparameter $\lambda$ determines the "strength" of the penalty terms.

```
```

Let's try re-fitting the model with these different regularizers. We will vary $\lambda$, the strength of the regularization, and see how the learned coupling matrix changes.

```
# Instantiate models
from sklearn import linear_model
model_ols = linear_model.LinearRegression()
model_l2 = linear_model.Ridge()
model_l1= linear_model.Lasso()
# Set regularization range
lambdas = np.logspace(-4, 5, 10)
# Load data
# define subset of samples
n_samples = 400
X_train, X_test = X_all[:n_samples], X_all[n_samples : 3 * n_samples // 2]
y_train, y_test = y_all[:n_samples], y_all[n_samples : 3 * n_samples // 2]
# define error lists
train_error_ols, test_error_ols = list(), list()
train_error_l2, test_error_l2 = list(), list()
train_error_l1, test_error_l1 = list(), list()
#Initialize coefficients for ridge regression and Lasso
coeffs_ols, coeffs_ridge, coeffs_lasso = list(), list(), list()
for lam in lambdas:
### ordinary least squares
model_ols.fit(X_train, y_train) # fit model
coeffs_ols.append(model_ols.coef_) # store weights
# use the coefficient of determination R^2 as the performance of prediction.
train_error_ols.append(model_ols.score(X_train, y_train))
test_error_ols.append(model_ols.score(X_test, y_test))
### ridge regression
model_l2.set_params(alpha=lam) # set regularisation strength
model_l2.fit(X_train, y_train) # fit model
coeffs_ridge.append(model_l2.coef_) # store weights
train_error_l2.append(model_l2.score(X_train, y_train))
test_error_l2.append(model_l2.score(X_test, y_test))
### lasso
model_l1.set_params(alpha=lam) # set regularisation strength
model_l1.fit(X_train, y_train) # fit model
coeffs_lasso.append(model_l1.coef_) # store weights
train_error_l1.append(model_l1.score(X_train, y_train))
test_error_l1.append(model_l1.score(X_test, y_test))
### plot Ising interaction J
J_leastsq = np.array(model_ols.coef_).reshape((L, L))
J_ridge = np.array(model_l2.coef_).reshape((L, L))
J_lasso = np.array(model_l1.coef_).reshape((L, L))
fig, axarr = plt.subplots(nrows=1, ncols=3)
axarr[0].imshow(J_leastsq, cmap="RdBu_r", vmin=-1, vmax=1)
axarr[0].set_title(f"OLS \n Train={train_error_ols[-1]}, Test={test_error_ols[-1]}")
## 3 sig figs
# axarr[0].set_title('OLS \n Train$=%.3f$, Test$=%.3f$' %(train_error_ols[-1],test_error_ols[-1]))
axarr[1].set_title('OLS $\lambda=%.4f$\n Train$=%.3f$, Test$=%.3f$' %(lam, train_error_ols[-1],test_error_ols[-1]))
axarr[0].tick_params(labelsize=16)
axarr[1].imshow(J_ridge, cmap="RdBu_r", vmin=-1, vmax=1)
axarr[1].set_title('Ridge $\lambda=%.4f$\n Train$=%.3f$, Test$=%.3f$' %(lam, train_error_l2[-1],test_error_l2[-1]))
axarr[1].tick_params(labelsize=16)
im=axarr[2].imshow(J_lasso, cmap="RdBu_r", vmin=-1, vmax=1)
axarr[2].set_title('LASSO $\lambda=%.4f$\n Train$=%.3f$, Test$=%.3f$' %(lam, train_error_l1[-1],test_error_l1[-1]))
axarr[2].tick_params(labelsize=16)
# divider = make_axes_locatable(axarr[2])
# cax = divider.append_axes("right", size="5%", pad=0.05, add_to_figure=True)
# cbar=fig.colorbar(im, cax=cax)
# cbar.ax.set_yticklabels(np.arange(-1.0, 1.0+0.25, 0.25),fontsize=14)
# cbar.set_label('$J_{i,j}$',labelpad=15, y=0.5,fontsize=20,rotation=0)
fig.subplots_adjust(right=2.0)
plt.show()
```