
Unsupervised learning: Clustering

The University of Texas at Austin
## Preamble / required packages
import numpy as np
np.random.seed(0)

## Import local plotting functions and in-notebook display functions
import matplotlib.pyplot as plt
# from IPython.display import Image, display
%matplotlib inline

import warnings
## Comment this out to activate warnings
warnings.filterwarnings('ignore')

Unsupervised learning

  • Two major types of unsupervised learning: clustering and dimensionality reduction (embedding).

  • Clustering: grouping similar data points together, based on their features.

  • Dimensionality reduction: mapping high-dimensional data to a lower-dimensional space, for visualization or for further analysis

Here, we will deal with a very high-dimensional dataset (many features per datapoint). We will first use embedding to reduce the dimensionality of the dataset, and we will then use clustering to group datapoints in the lower-dimensional space. Importantly, we don’t always have to use embedding and clustering together: we might embed data for other reasons (like visualization), or we might choose to cluster data directly in the high-dimensional space.

  • Important: in unsupervised learning, we do not use labels when fitting our model. The methods look for structure within the data itself. However, if we do have labels, we can compare them to the structure the algorithm finds, to see how well the clustering algorithm performs.

  • Sometimes the outputs of an unsupervised learning algorithm are called “pseudolabels” because they are not the true labels, but they can be used as a proxy for true labels in a downstream supervised learning analysis. Some methods leverage a small amount of labelled data to tune an unsupervised learning algorithm, which in turn is used to produce pseudolabels for a large amount of data. This allows supervised learning methods to be trained on larger datasets than would otherwise be accessible. Collectively these methods are known as “semi-supervised” learning methods.
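As a toy illustration of the pseudolabel idea (a minimal sketch on synthetic data, not part of the Ising analysis below), one can fit a clustering model on unlabelled data and then train an ordinary supervised classifier on the resulting pseudolabels:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

## Unlabelled synthetic data with three latent groups
X_toy, _ = make_blobs(n_samples=500, centers=3, random_state=0)

## Step 1: clustering produces pseudolabels without using any true labels
pseudolabels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_toy)

## Step 2: a supervised model is trained on the pseudolabels as if they were real labels
clf = LogisticRegression().fit(X_toy, pseudolabels)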

Ising Dataset

  • We now explore how to cluster the 2D Ising dataset and visualize the results using PCA.

  • The raw Ising dataset was generated via Monte Carlo simulations, and it contains snapshots of microstates at a temperature where the system is mostly ordered, a temperature where it is mostly disordered, and a temperature in the critical transition region.

  • For unsupervised learning we will use data from all three phases, but we will not use the labels for any of the operations. We will use the phase labels to evaluate the performance of our clustering algorithm.

  • In most real-world applications, we will not have labels. Here we only use labels to validate our clustering algorithm.

First let’s load the Ising dataset:

import pickle, os
from urllib.request import urlopen
def fetch_data():
    """
    Retrieve Ising dataset from the web and load into a numpy array

    The data is stored in a binary format in order to maximize its compressibility, but
    we need to convert the zeros to -1 to match the Ising model conventions.

    Returns:
        X_all (np.array): 2D array of shape (N, L*L) = (N, 1600), where N is the number
            of samples and L = 40; each row is a flattened Ising microstate
        y_all (np.array): 1D array of shape (N,) corresponding to the temperature of
            each microstate
    """

    # Fixed URL parameters
    url_main = "https://physics.bu.edu/~pankajm/ML-Review-Datasets/isingMC/"
    # The data consists of 16*10000 samples taken in T=np.arange(0.25,4.0001,0.25):
    data_file_name = "Ising2DFM_reSample_L40_T=All.pkl"
    # The labels are obtained from the following file:
    label_file_name = "Ising2DFM_reSample_L40_T=All_labels.pkl"

    ## Download and store in local variable
    X_all = pickle.load(urlopen(url_main + data_file_name))  
    X_all = np.unpackbits(X_all).reshape(-1, 1600)
    X_all = X_all.astype("int")

    # Convert 0,1 binary data to -1,1 spin configuration
    X_all[np.where(X_all == 0)] = -1

    # Labels correspond to temperature values for each sample
    y_all = np.hstack([[t]*10000 for t in np.arange(0.25, 4.01, 0.25)]) 
    return X_all, y_all


## Load all data
X, y = fetch_data()

Before doing any learning, let’s explore our data a bit. We will first plot some example spin configurations, and then we will look at the distribution of labels.



## Randomly subsample the data 

np.random.seed(0) # fix the random seed
idx = np.arange(len(X))
rand = np.random.choice(idx, size=8000, replace=False)

X_batch, y_batch = X[rand], y[rand]
print(f"X_batch shape: {X_batch.shape}")
print(f"y_batch shape: {y_batch.shape}")

### make a panel of 3x3 images
plt.figure(figsize=(12, 12))
for i in range(9):
    plt.subplot(3, 3, i+1)
    plt.imshow(X_batch[i].reshape(40, 40), cmap='binary')
    plt.title(f"Ising Model at T={y_batch[i]}")
    plt.axis('off')

plt.figure(figsize=(8, 4))
plt.hist(y_batch, bins=16);
plt.xlabel('Simulation Temperature');
plt.ylabel('Number of Samples');
X_batch shape: (8000, 1600)
y_batch shape: (8000,)

Embedding and dimensionality reduction

  • Our Ising model data lives in a feature space of $d = 40^2 = 1600$ dimensions.
  • We want to visualize and analyze the data in a lower-dimensional space.

Fit embeddings and plot the results

from sklearn.decomposition import PCA

## Fit PCA and t-SNE embedding methods
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_batch)
component1 = pca.components_[0]

## Plot the results
plt.figure()
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_batch, s=10, cmap="coolwarm")
plt.gca().set_title("PCA")
plt.xlabel('PC1');
plt.ylabel('PC2');
plt.show()

Our results match recently published works...

[Figure: PCA of Ising configurations. Image from Wang, Phys. Rev. B 94, 195105 (2016)]

Other embedding techniques exploit nonlinear transformations

  • PCA finds a lower-dimensional space whose coordinates are linear combinations of the input features.

  • t-SNE and UMAP find a lower-dimensional space in which each new variable is a nonlinear function of the input features.

  • PCA maximizes the variance captured in the lower-dimensional space.

  • t-SNE and UMAP instead try to preserve the local structure of the data (such as the distribution of pairwise distances) in the lower-dimensional space.

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from umap import UMAP

## Fit PCA and t-SNE embedding methods
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_batch)
component1 = pca.components_[0]

tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X_batch)

umap = UMAP(n_components=2)
X_umap = umap.fit_transform(X_batch)
## Plot the results
fig, ax = plt.subplots(1, 3, figsize=(15, 5))
ax[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y_batch, s=10, cmap="coolwarm")
ax[0].set_title("PCA")
ax[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_batch, s=10, cmap="coolwarm")
ax[1].set_title("t-SNE")
ax[2].scatter(X_umap[:, 0], X_umap[:, 1], c=y_batch, s=10, cmap="coolwarm")
ax[2].set_title("UMAP")
plt.show()

What physics can we learn from our unsupervised analysis? We can inspect the first principal component (PC1) learned by the fitted model:


plt.figure()
plt.imshow(component1.reshape(40, 40))

plt.figure()
plt.hist(component1, bins=50)
plt.xlabel('Component 1 Value');
plt.ylabel('Frequency');

Discussion

  • The components of the first PCA vector are tightly distributed around a single value. This implies that the PCA vector is essentially a constant vector.

  • Indeed, it turns out that PC1 aligns strongly with the magnetization order parameter. This is because PCA finds linear combinations of the features with the largest variance, and the magnetization is itself a linear combination of the input features (the spins),

$$M = \frac{1}{N} \sum_{i=1}^{N} s_i$$

  • The magnetization reflects the physical quantity that varies most strongly across the simulated temperatures, so it makes sense that it corresponds to the highest-variance direction. A quick numerical check follows this list.
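As a quick numerical check (a minimal sketch, reusing X_batch and X_pca from above), we can compare each sample’s magnetization to its coordinate along PC1; if the claim above holds, the two should be almost perfectly correlated:

## Compare per-sample magnetization with the PC1 coordinate
magnetization = X_batch.mean(axis=1)  # M = (1/N) sum_i s_i for each microstate
corr = np.corrcoef(magnetization, X_pca[:, 0])[0, 1]
print(f"Correlation between magnetization and PC1 coordinate: {corr:.3f}")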

Embeddings are widely used for exploratory data analysis

[Figure: low-dimensional embedding of C. elegans posture dynamics. Image from Stephens et al. (2008)]

[Figure: embedding of single-cell RNA-seq data from the Tabula Muris Consortium; 44,949 cells, ~20,000 genes per cell]

Clustering

  • Embedding methods are unsupervised learning techniques that map high-dimensional data to a lower-dimensional space. For example, our Ising data has 1600 features, but we can map it to a 2D space using PCA. So we can loosely think of embedding as a transformation in feature space that acts on all datapoints.

  • Clustering methods are unsupervised learning techniques that group similar data points together: if two high-dimensional datapoints are close together, they should typically be assigned to the same cluster. We can think of clustering as a labelling in the data space that involves comparing features between datapoints.

Automatically identifying phases using K-means clustering

We will now use the $K$-means algorithm to cluster the Ising data. The $K$-means algorithm is simple to implement and understand, and it is also very fast. The algorithm is as follows:

  1. Choose $K$ initial cluster centers
  2. Assign each point to the nearest cluster center
  3. Update the cluster centers to be the mean of the points in the cluster
  4. Repeat steps 2 and 3 until the cluster centers do not change

The $K$-means algorithm is guaranteed to converge to a local minimum. However, the choice of initial cluster centers can lead to different results.

Unlike some other clustering methods, the $K$-means algorithm requires that we specify the number of clusters $K$ in advance as a hyperparameter. However, there are external heuristics, like the silhouette score, that we can use to determine a reasonable number of clusters. A library version of $K$-means is sketched below, before we build our own implementation.
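For reference, scikit-learn ships an equivalent estimator; a minimal usage sketch (assuming the X_pca embedding computed above):

## Library version of K-means, for comparison with the implementation below
from sklearn.cluster import KMeans

sk_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_pca)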

class KMeansClusterer:
    """
    An implementation of the K Means clustering algorithm.
    """

    def __init__(self, n_clusters=2, max_iter=100, tol=1e-4, store_history=False, random_state=None):
        """
        Initialize a KMeansClusterer object.

        Args:
            n_clusters (int): number of clusters to fit
            max_iter (int): maximum number of iterations to run
            tol (float): tolerance for convergence
            store_history (bool): if True, record centers, labels, and loss at each iteration
            random_state (int or None): seed for reproducible center initialization
        """
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.tol = tol
        self.store_history = store_history
        self.random_state = random_state

    def fit(self, X):
        """
        Fit the KMeans clustering algorithm to the data.

        Args:
            X (np.array): 2D array of shape (N, D) containing the data to cluster

        Returns:
            self (KMeansClusterer): the fitted KMeansClusterer object
        """
        N, D = X.shape

        ## Initialize the cluster centers and assign all points to the first cluster
        if self.random_state is not None:
            np.random.seed(self.random_state)
        self.centers = np.random.choice(N, size=self.n_clusters, replace=False)
        self.centers = X[self.centers]
        self.labels = np.zeros(N)

        if self.store_history:
            self.history = []
            self.loss_history = []
            self.label_history = []

        for i in range(self.max_iter):

            ## Run the k-means algorithm
            new_centers, distortion = self._update_centers(X)

            if self.store_history:
                self.history.append(self.centers)
                self.loss_history.append(np.mean(np.min(distortion, axis=1)))
                self.label_history.append(self.labels.copy())

            ## Stop if the centers have converged
            if np.sum((new_centers - self.centers) ** 2) < self.tol:
                print("Converged after {} iterations".format(i))
                break
            self.centers = new_centers

        return self
    
    def _update_centers(self, X):
        """
        Update the cluster centers based on the latest assignments.

        Args:
            X (np.array): 2D array of shape (N, D) containing the data to cluster
        """
        distortion = np.sum((X[:, np.newaxis, :] - self.centers[np.newaxis, :, :]) ** 2, axis=2)
        self.labels = np.argmin(distortion, axis=1)
        new_centers = np.array([np.mean(X[self.labels == j], axis=0) for j in range(self.n_clusters)])
        return new_centers, distortion

    def predict(self, X):
        """
        Predict the labels of the data.

        Args:
            X (np.array): 2D array of shape (N, D) containing the data to cluster

        Returns:
            labels (np.array): 1D array of shape (N,) containing the predicted labels
        """
        return np.argmin(np.sum((X[:, np.newaxis, :] - self.centers[np.newaxis, :, :]) ** 2, axis=2), axis=1)

    def fit_predict(self, X):
        """
        Fit the KMeans clustering algorithm to the data and predict the labels.

        Args:
            X (np.array): 2D array of shape (N, D) containing the data to cluster

        Returns:
            labels (np.array): 1D array of shape (N,) containing the predicted labels
        """
        self.fit(X)
        return self.predict(X)
        
        
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.xlabel('PC1');
plt.ylabel('PC2');
plt.title('PCA embedding of Ising data before clustering')
model = KMeansClusterer(n_clusters=3, max_iter=100, tol=1e-4, store_history=True, random_state=14)
labels = model.fit_predict(X_pca)

plt.figure(figsize=(8, 6))
for label_val in np.unique(labels):
    plt.scatter(X_pca[labels==label_val, 0], X_pca[labels==label_val, 1])
plt.xlabel('PC1');
plt.ylabel('PC2');
plt.title('PCA embedding of Ising data after clustering')
plt.show()

plt.figure()
plt.plot(model.loss_history)
plt.xlabel('Iteration')
plt.ylabel('Loss (Inertia)')
Converged after 4 iterations
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

fig, ax = plt.subplots(figsize=(8, 6))

def init():
    return []

def update(i):
    ax.clear()
    labels = model.label_history[i]
    for label_val in np.unique(labels):
        ax.scatter(X_pca[labels == label_val, 0], X_pca[labels == label_val, 1])
    ax.plot(*model.history[i].T, '.k', markersize=20)
    ax.axis('off')
    return []

ani = FuncAnimation(fig, update, frames=range(len(model.history)), init_func=init, blit=True)

# Show as HTML video
HTML(ani.to_jshtml())
class KMeansClustererMomentum(KMeansClusterer):
    """
    An implementation of the K Means clustering algorithm with momentum.

    Attributes:
        momentum (float): the momentum parameter for the update
    """

    def __init__(self, *args, momentum=0.9, **kwargs):
        super().__init__(*args, **kwargs)
        self.momentum = momentum

    def _update_centers(self, X):
        """
        Update the cluster centers based on the latest assignments.

        Args:
            X (np.array): 2D array of shape (N, D) containing the data to cluster
        """
        distortion = np.sum((X[:, np.newaxis, :] - self.centers[np.newaxis, :, :]) ** 2, axis=2)
        self.labels = np.argmin(distortion, axis=1)
        new_centers = np.array([np.mean(X[self.labels == j], axis=0) for j in range(self.n_clusters)])
        new_centers = self.momentum * self.centers + (1 - self.momentum) * new_centers
        return new_centers, distortion
model = KMeansClustererMomentum(n_clusters=3, max_iter=100, tol=1e-4, store_history=True, random_state=14)
labels = model.fit_predict(X_pca)
Converged after 59 iterations
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

fig, ax = plt.subplots(figsize=(8, 6))

def init():
    return []

def update(i):
    ax.clear()
    labels = model.label_history[i]
    for label_val in np.unique(labels):
        ax.scatter(X_pca[labels == label_val, 0], X_pca[labels == label_val, 1])
    ax.plot(*model.history[i].T, '.k', markersize=20)
    ax.axis('off')
    return []

ani = FuncAnimation(fig, update, frames=range(len(model.history)), init_func=init, blit=True)

# Show as HTML video
HTML(ani.to_jshtml())

Are the observed clusters physically meaningful?

Let’s now look at the mean of each cluster. Since we have obtained the cluster labels, we can go back to the original space, rather than the lower-dimensional PCA space we have been working in so far.

For each pseudolabel (cluster ID), we will compute the mean magnetization of all the datapoints (microstates) in that cluster.

[np.mean(X_batch[labels==label_val]) for label_val in [0, 1, 2]]
[0.0010270856662933931, 0.9540760869565217, -0.9555427782888685]

Thus we find that the left- and right-most clusters are strongly magnetized (ordered phase), while the middle cluster is unmagnetized and corresponds to the high-temperature, disordered states.

Clustering accuracy and determining hyperparameters

  • We want to use clustering to explore datasets where we don’t have labels. However, for the Ising dataset, we do have labels in the form of temperatures at which simulations were completed.
  1. Can we use these labels to define what a good cluster or a good clustering assignment is?

  2. For cases where we don’t have access to labels, how would we determine if a clustering algorithm is doing a good job?

Evaluating clustering with known labels

If we have labels for our datapoints, then we can evaluate the performance of our clustering algorithm the same way we would evaluate a supervised learning algorithm like a classifier. In fact, the main difference between clustering and supervised classification is that a clustering model doesn’t get any label information to guide the function it learns. Instead, the clustering algorithm must infer the existence of discrete labels from the data itself.

A common method for scoring clustering with labels is the normalized mutual information.

$$\text{NMI}(y_{\text{true}}, y_{\text{pred}}) = \frac{2 \, I(W, C)}{H(W) + H(C)}$$

where the clustering $W = \{w_1, w_2, ..., w_K\}$ and the ground-truth classes $C = \{c_1, c_2, ..., c_M\}$ are sets of sets of indices into the pseudolabels and true labels, respectively: $w_i$ is the set of datapoints assigned pseudolabel $i$, and $c_j$ is the set of datapoints with true label $j$. For example, if $w_1 = \{1, 2, 5\}$, then the first cluster contains datapoints 1, 2, and 5.

We define $P(w_i, c_j)$ as the fraction of datapoints that are in both cluster $w_i$ and class $c_j$, with $N$ the total number of datapoints:

$$P(w_i, c_j) = \frac{|w_i \cap c_j|}{N}$$

where $|w_i \cap c_j|$ is the number of elements in the intersection of the two sets, i.e. the datapoints that are in cluster $w_i$ and also carry true label $c_j$. The marginal probabilities are $P(w_i) = |w_i| / N$ and $P(c_j) = |c_j| / N$.

The mutual information I(W,C)I(W, C) is defined as

$$I(W, C) = \sum_{i=1}^{K} \sum_{j=1}^{M} P(w_i, c_j) \log \left( \frac{P(w_i, c_j)}{P(w_i)\, P(c_j)} \right)$$

Notice that there is a double sum, meaning that all possible pairs of pseudolabels and true labels are considered. The entropies $H(\cdot)$ are defined as

$$H(W) = - \sum_{i=1}^{K} P(w_i) \log P(w_i), \qquad H(C) = - \sum_{j=1}^{M} P(c_j) \log P(c_j)$$

These entropies quantify the dispersion of the pseudolabels and true labels, respectively. For example, if all datapoints are in the same cluster, then there is a single cluster with $P(w_1) = 1$, and $H(W) = 0$.

  • Notice that the NMI score is invariant to permutations of the cluster labels, and it remains well defined when the number of detected clusters differs from the number of true labels. For this reason, it is preferable to metrics like raw label accuracy.

  • As we will see, when doing clustering there is usually no free lunch: if one chooses the wrong set of hyperparameters (to which results can be quite sensitive), the clustering algorithm will likely fail or return a trivial answer (for instance, all points assigned to the same cluster).

from collections import OrderedDict

def clustering(y):
    """
    Finds position of labels and returns a dictionary of cluster labels to data indices.

    Args:
        y (np.array): 1D array of shape (N,) containing the clustering assignments

    Returns:
        clustering (dict): dictionary of cluster labels to data indices
    """
    yu = np.sort(np.unique(y))
    clustering = OrderedDict()
    for ye in yu:
        clustering[ye] = np.where(y == ye)[0]
    return clustering

def entropy(c, n_sample):
    """
    Computes entropy of a clustering assignment.

    Args:
        c (dict): dictionary of cluster labels to data indices
        n_sample (int): number of samples

    Returns:
        h (float): entropy of the clustering assignment
    """
    h = 0.0
    for kc in c.keys():
        p=len(c[kc]) / n_sample
        h += p * np.log(p)
    h *= -1.0
    return h

# Normalized mutual information function
# Note that this deals with the label permutation problem
def NMI(y_true, y_pred):
    """ 
    Computes normalized mutual information: where y_true and y_pred are both clustering assignments

    Args:
        y_true (np.array): true clustering assignments
        y_pred (np.array): predicted clustering assignments

    Returns:
        nmi (float): the normalized mutual information
    """
    w = clustering(y_true)
    c = clustering(y_pred)
    n_sample = len(y_true)

    Iwc = 0.
    for kw in w.keys():
        for kc in c.keys():
            w_intersect_c=len(set(w[kw]).intersection(set(c[kc])))
            if w_intersect_c > 0:
                Iwc += w_intersect_c * np.log(n_sample * w_intersect_c/(len(w[kw]) * len(c[kc])))
    Iwc /= n_sample
    Hc = entropy(c, n_sample)
    Hw = entropy(w, n_sample)

    return 2 * Iwc / (Hc + Hw)

We can use this quantity to evaluate our Ising clustering results.

nmi = NMI(y_batch, labels)
print(f"NMI: {nmi}")

## Compare to a shuffled version of the labels
shuffled_labels = np.random.permutation(labels)
nmi = NMI(y_batch, shuffled_labels)
print(f"NMI: {nmi}")
NMI: 0.3466871043833046
NMI: 0.000994416700827419
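As a sanity check (a sketch assuming scikit-learn is available), the same quantity is implemented as sklearn.metrics.normalized_mutual_info_score, whose default arithmetic averaging matches the definition above:

## Cross-check our NMI implementation against scikit-learn
from sklearn.metrics import normalized_mutual_info_score

print(normalized_mutual_info_score(y_batch, labels))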

Let’s now try other clustering methods

  • K-means is a simple and fast clustering algorithm, which we implemented ourselves above

  • There are many other clustering algorithms available in the scikit-learn library. We will now try a few of them and compare their performance on the Ising dataset.

  • The score of a clustering is given by NMI

This code is adapted from the scikit-learn documentation

import time
import warnings


from sklearn import cluster, datasets, mixture
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler
from itertools import cycle, islice


np.random.seed(1)

## Split data
## Use a small subset of the PCA-embedded data to keep the comparison fast
X_batch, y_batch = X_pca[:400], y_batch[:400]

# ============
# Set up cluster parameters
# ============
## We define the hyperparameters for the clustering algorithms
default_base = {'quantile': .3,
                'eps': .3,
                'damping': .9,
                'preference': -200,
                'n_neighbors': 10,
                'n_clusters': 3}
algo_params = {'damping': .9, 'preference': -240, 'quantile': .2, 'n_clusters': 3}
i_dataset = 0
# update parameters with dataset-specific values
params = default_base.copy()
params.update(algo_params)
# normalize dataset for easier parameter selection
X_batch = StandardScaler().fit_transform(X_batch)
bandwidth = cluster.estimate_bandwidth(X_batch, quantile=params['quantile'])
# connectivity matrix for structured Ward
connectivity = kneighbors_graph(
    X_batch, n_neighbors=params['n_neighbors'], include_self=False)
# make connectivity symmetric
connectivity = 0.5 * (connectivity + connectivity.T)

# ============
# Create clusterer objects
# ============
ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
two_means = cluster.MiniBatchKMeans(n_clusters=params['n_clusters'])
ward = cluster.AgglomerativeClustering(
    n_clusters=params['n_clusters'], linkage='ward',
    connectivity=connectivity)
spectral = cluster.SpectralClustering(
    n_clusters=params['n_clusters'], eigen_solver='arpack',
    affinity="nearest_neighbors")
dbscan = cluster.DBSCAN(eps=params['eps'])
affinity_propagation = cluster.AffinityPropagation(
    damping=params['damping'], preference=params['preference'])
average_linkage = cluster.AgglomerativeClustering(
    linkage="average", affinity="cityblock",
    n_clusters=params['n_clusters'], connectivity=connectivity)
birch = cluster.Birch(n_clusters=params['n_clusters'])
gmm = mixture.GaussianMixture(
    n_components=params['n_clusters'], covariance_type='full')

clustering_algorithms = (
    ('MiniBatchKMeans', two_means),
    ('AffinityPropagation', affinity_propagation),
    ('MeanShift', ms),
    ('SpectralClustering', spectral),
    ('Ward', ward),
    ('AgglomerativeClustering', average_linkage),
    ('DBSCAN', dbscan),
    ('Birch', birch),
    ('GaussianMixture', gmm)
)


plt.figure(figsize=(10, 10))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05,
                    hspace=.01)
for i, (name, algorithm) in enumerate(clustering_algorithms):
    t0 = time.time()

    # catch warnings related to kneighbors_graph
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            message="the number of connected components of the " +
            "connectivity matrix is [0-9]{1,2}" +
            " > 1. Completing it to avoid stopping the tree early.",
            category=UserWarning)
        warnings.filterwarnings(
            "ignore",
            message="Graph is not fully connected, spectral embedding" +
            " may not work as expected.",
            category=UserWarning)
        algorithm.fit(X_batch)

    t1 = time.time()
    if hasattr(algorithm, 'labels_'):
        y_pred = algorithm.labels_.astype(np.int32)
    else:
        y_pred = algorithm.predict(X_batch)

    score = NMI(y_batch, y_pred)

    # if i_dataset == 0:
    #     plt.title(name, size=14)

    colors = np.array(list(islice(cycle(['#377eb8', '#ff7f00', '#4daf4a',
                                            '#f781bf', '#a65628', '#984ea3',
                                            '#999999', '#e41a1c', '#dede00']),
                                    int(max(y_pred) + 1))))

    plt.subplot(3, 3, i+1)

    plt.scatter(X_batch[:, 0], X_batch[:, 1], s=10, color=colors[y_pred])
    plt.xticks(())
    plt.yticks(())

    plt.title(name + "\n Time: " + ("%.2fs" % (t1 - t0)).lstrip("0") + " NMI:" + ("%.2f" % score).lstrip("0"), size=14, y=1.0, pad=-28)

# plt.show()

Remarks and discussion

We can see that the NMI doesn’t perfectly match our expectations: it can reward algorithms that assign more clusters than are needed to fully describe the dataset. For this reason, other metrics such as the silhouette score or the adjusted Rand index are sometimes used instead; a quick way to compute the latter is sketched below.
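For reference, a minimal sketch of the adjusted Rand index using scikit-learn (here applied to y_pred, the labels from the last algorithm fit in the loop above):

## Adjusted Rand index: another permutation-invariant comparison with the true labels
from sklearn.metrics import adjusted_rand_score

print(adjusted_rand_score(y_batch, y_pred))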

In this simple setting, DBSCAN outperforms the other methods both in computational time and in prediction accuracy for almost all cases. Only Spectral Clustering appears to perform better on the PCA-featurized Ising dataset.

When tuning a clustering method, it is important to understand what the implicit assumptions of the method are. For instance, methods based on density (local information) will typically fare well at clustering datasets with nontrivial shapes, as long as there is a meaningful distance metric relating datapoints. Methods based on long-range information ($K$-means, for instance) will typically perform poorly in such cases. Density-based methods will, however, have more difficulty dealing with datasets with large fluctuations in density. Another drawback of density-based methods is that they do not generalize well to high-dimensional spaces, due to large sampling noise in the density estimates.

How do we choose the number of clusters?

  • For our Ising dataset, we knew to pick $K = 3$ because we knew the ground-truth number of states.
  • In general, we don’t have the ground truth. How do we choose the number of clusters in a principled way?

Evaluating clusterings without known labels

The silhouette score is a measure of how well a datapoint fits into its assigned cluster. It is defined as:

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

where $a_i$ is the average distance between point $i$ and all other points in the same cluster,

$$a_i = \frac{1}{|C_i|} \sum_{j \in C_i} d(i, j)$$

and $b_i$ is the smallest average distance between $i$ and the points in any other cluster,

$$b_i = \min_{k :\, i \notin C_k} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j)$$

The silhouette score is bounded between -1 and 1, with 1 indicating that the datapoint is well within its cluster, and -1 indicating that the datapoint is poorly assigned to its cluster.

Elbow method

The elbow method is a heuristic that attempts to find the optimal number of clusters by looking for the “elbow” in a plot of a clustering-quality metric (here the silhouette score; the within-cluster distortion is another common choice) as a function of the number of clusters. The intuition is that the metric keeps improving as clusters are added, but at some point the improvement slows down sharply. That point is the “elbow” of the curve, and the optimal number of clusters is taken to be the number of clusters at the elbow.

This approach is very similar to how we determined the appropriate number of principal components to include in our PCA analysis.
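A minimal sketch of an elbow plot using the within-cluster distortion (inertia) as the metric, here with scikit-learn's KMeans on the PCA embedding (an assumption; any K-means implementation would do):

## Elbow plot: distortion (inertia) versus number of clusters
from sklearn.cluster import KMeans

ks = range(2, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_pca)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(list(ks), inertias, '.-')
plt.xlabel('k')
plt.ylabel('Distortion (inertia)')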

Gap statistic

The gap statistic is a heuristic that compares the log of the within-cluster sum of squares for the data to its expected value under a reference (null) distribution with no cluster structure (for instance, points drawn uniformly over the data’s bounding box). The intuition is that the within-cluster sum of squares always decreases as the number of clusters increases, but for genuinely clustered data it decreases faster than for the reference. The optimal number of clusters is the one that maximizes this gap; a minimal sketch follows below.
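A minimal sketch of the gap statistic, assuming scikit-learn's KMeans and a uniform reference distribution over the data's bounding box; this is illustrative rather than a tuned implementation:

## Gap statistic: compare log within-cluster dispersion to a null reference
from sklearn.cluster import KMeans

def within_cluster_dispersion(X, labels):
    """Sum of squared distances of each point to its cluster centroid."""
    return sum(
        np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
        for k in np.unique(labels)
    )

def gap_statistic(X, k, n_refs=10, seed=0):
    """Gap(k) = E[log W_k(reference)] - log W_k(data)."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    log_wk = np.log(within_cluster_dispersion(X, labels))
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_log_wk = []
    for _ in range(n_refs):
        X_ref = rng.uniform(lo, hi, size=X.shape)  # reference data with no cluster structure
        ref_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_ref)
        ref_log_wk.append(np.log(within_cluster_dispersion(X_ref, ref_labels)))
    return np.mean(ref_log_wk) - log_wk

## Larger gap values indicate a better-supported number of clusters
print([round(gap_statistic(X_pca[:500], k), 3) for k in (2, 3, 4)])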

Discussion

The elbow method and the gap statistic are heuristics that attempt to find the optimal number of clusters. There are other metrics we could use instead of the silhouette score, such as the distortion (the $K$-means objective, which measures how tightly points sit around their cluster centers). For these other metrics, we can still use the elbow method or the gap statistic to find the optimal number of clusters.

In hierarchical clustering, the number of clusters depends on the resolution at which we choose to analyze the data; a sketch follows below.
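A minimal sketch of this idea using SciPy's hierarchical clustering (an assumption; the cut heights below are arbitrary and data-dependent): cutting the same dendrogram at different heights yields different numbers of clusters.

## Hierarchical (Ward) clustering: the number of clusters depends on where we cut the tree
from scipy.cluster.hierarchy import linkage, fcluster

Z = linkage(X_pca[:500], method='ward')   # agglomerative linkage on a small subsample

for t in (10, 30, 60):                    # cut heights (arbitrary, for illustration)
    labels_t = fcluster(Z, t=t, criterion='distance')
    print(f"cut height {t}: {len(np.unique(labels_t))} clusters")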

# Silhouette score (implemented below as a centroid-based approximation)

def pairwise_distances(X, Y=None, metric='euclidean'):
    """
    Compute pairwise distance matrix between two arrays.
    
    Args:
        X (array-like): array of shape (n_samples, n_features)
        Y (array-like): array of shape (n_samples, n_features)

    Returns:
        D (array-like): array of shape (n_samples, n_samples)
    """
    if Y is None:
        Y = X
    return np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2))

def silhouette_score(X, labels):
    """Compute the mean Silhouette Coefficient of all samples."""
    unique_labels = np.unique(labels)
    centroids = np.array([X[labels == i].mean(axis=0) for i in unique_labels])
    a = np.array([np.mean(pairwise_distances(X[labels == i], centroids[unique_labels == i], metric='euclidean')) for i in unique_labels])
    b = np.array([np.mean(pairwise_distances(X[labels == i], centroids[unique_labels != i], metric='euclidean')) for i in unique_labels])
    return np.mean((b - a) / np.maximum(a, b))



model = KMeansClusterer(n_clusters=3, max_iter=100, tol=1e-4, store_history=True, random_state=14)
labels = model.fit_predict(X_pca)

silhouette_score(X_pca, labels)
Converged after 4 iterations
0.9336642586403251
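For comparison (a sketch assuming scikit-learn), the exact silhouette score based on all pairwise distances is available as sklearn.metrics.silhouette_score; because our version above uses cluster centroids instead, the two values will generally differ somewhat:

## Exact (pairwise-distance) silhouette score for the same clustering,
## evaluated on a random subsample to keep the pairwise distance matrix small
from sklearn.metrics import silhouette_score as sk_silhouette_score

print(sk_silhouette_score(X_pca, labels, sample_size=2000, random_state=0))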

We vary $K$ in the range $[2, 10]$ and plot the silhouette score as a function of $K$.


all_silhouette_scores = []
kvals = np.arange(2, 11)
all_labels = list()
for k in kvals:
    model = KMeansClusterer(n_clusters=k, max_iter=100, tol=1e-4, store_history=True, random_state=14)
    labels = model.fit_predict(X_pca)
    all_silhouette_scores.append(silhouette_score(X_pca, labels))
    all_labels.append(labels.copy())


plt.plot(kvals, all_silhouette_scores, '-')
plt.plot(kvals, all_silhouette_scores, '.', markersize=10)
plt.xlabel('k')
plt.ylabel('Silhouette score')
Converged after 3 iterations
Converged after 4 iterations
Converged after 24 iterations
Converged after 31 iterations
Converged after 16 iterations
Converged after 62 iterations
Converged after 56 iterations
Converged after 56 iterations
Converged after 53 iterations

Overfitting versus underfitting

from itertools import cycle, islice
colors = np.array(list(islice(cycle(['#377eb8', '#ff7f00', '#4daf4a',
                                            '#f781bf', '#a65628', '#984ea3',
                                            '#999999', '#e41a1c', '#dede00']),
                                    int(max(kvals) + 1))))
for i in range(len(kvals)):
    plt.subplot(3, 3, i+1)
    plt.scatter(X_pca[:, 0], X_pca[:, 1], s=10, color=colors[all_labels[i]])

Appendix

  • Leftover code from this tutorial
## Make an interactive video
from ipywidgets import interact, interactive, fixed, interact_manual, Layout
import ipywidgets as widgets

def plotter(i):
    plt.figure(figsize=(8, 6))
    labels = model.label_history[i]
    for label_val in np.unique(labels):
        plt.scatter(X_pca[labels==label_val, 0], X_pca[labels==label_val, 1])
    plt.plot(*model.history[i].T, '.k', markersize=20)
    plt.axis('off')
    plt.show()

interact(
    plotter, 
    i=widgets.IntSlider(0, 0, len(model.history) - 1, 1, layout=Layout(width='800px'))
)
References
  1. Wang, L. (2016). Discovering phase transitions with unsupervised learning. Physical Review B, 94(19), 195105. doi:10.1103/PhysRevB.94.195105
  2. Stephens, G. J., Johnson-Kerner, B., Bialek, W., & Ryu, W. S. (2008). Dimensionality and Dynamics in the Behavior of C. elegans. PLoS Computational Biology, 4(4), e1000028. doi:10.1371/journal.pcbi.1000028