Preamble: import these plotting utilities

InÂ [375]:

```
import numpy as np
# Wipe all outputs from this notebook
from IPython.display import Image, clear_output, display
clear_output(True)
# Import local plotting functions and in-notebook display functions
import matplotlib.pyplot as plt
%matplotlib inline
```

# A large matrix dataset: coauthorship among physicistsÂ¶

- Coauthorship among physicists based arXiv postings in
`astro-ph`

- The graph contains $N = 18772$ nodes, which correspond to unique authors observed over the period 1993 -- 2003
- If Author i and Author j coauthored a paper during that period, the nodes are connected
- In order to analyze this large graph, we will downsample it to a smaller graph with $N = 1000$ nodes representing the most highly-connected authors
- This dataset is from the Stanford SNAP database

InÂ [74]:

```
## Load the full coauthorship network
fpath = "../resources/ca-AstroPh.txt.gz"
# fpath = "../resources/ca-CondMat.txt.gz"
g = nx.read_edgelist(fpath)
## Create a subgraph of the 1000 most connected authors
subgraph = sorted(g.degree, key=lambda x: x[1], reverse=True)[:4000]
subgraph = [x[0] for x in subgraph]
g2 = g.subgraph(subgraph)
# rename nodes to sequential integers as they would appear in an adjacency matrix
g2 = nx.convert_node_labels_to_integers(g2, first_label=0)
pos = nx.spring_layout(g2)
# pos = nx.kamada_kawai_layout(g2)
# nx.draw_spring(g2, pos=pos, node_size=10, node_color='black', edge_color='gray', width=0.5)
nx.draw(g2, pos=pos, node_size=5, node_color='black', edge_color='gray', width=0.5, alpha=0.5)
plt.show()
```

# We can think of graphs as large, (often) sparse matricesÂ¶

- The adjacency matrix $A$ is a matrix of size $N \times N$ where $N$ is the number of nodes (authors)
- $A_{ij} = 1$ if there is an edge between nodes $i$ and $j$ (i.e., if authors $i$ and $j$ have coauthored a paper). Otherwise, $A_{ij} = 0$
- Co-authorship is an undirected graph, so $A$ is symmetric. That means that there are at most $N(N-1)/2$ unique edges in the graph.
- Usually, there are far fewer than that.

InÂ [75]:

```
A = nx.adjacency_matrix(g2).todense()
# find sparsity
n = A.shape[0]
density = np.sum(A != 0) / (n * n)
print("Density: {:.2f}%".format(density * 100))
```

Density: 1.26%

Let's look at this matrix a little more closely...

InÂ [83]:

```
plt.figure(figsize=(10, 10))
plt.spy(A, markersize=0.05, color='k') # a spy plot is a plot of the sparsity pattern of a matrix
```

Out[83]:

<matplotlib.lines.Line2D at 0x136f8bbb0>