A Short Tutorial on Matrix Derivative - An Introduction to Algorithms for Physicists

Liu, Yuezhang¶

I notice that the matrix derivative technique might be useful for the contents of the course, and fortunately I know a good tutorial introducing the technique, but unfortunately the materials are in Chineses, therefore I translate and summarize the main idea of the tutorial and hope it could be helpful.

The contents are heavily borrowed from the post by 长躯鬼侠 (ECE PhD at CMU) on Zhihu (Chinese Quora).

The relation between differential and matrix derivative¶

Consider the matrix function $f(X): \mathcal{X}\to\mathbb{R}$ , which maps a matrix $X\in\mathbb{R}^{n\times d}$ into a real number. This is the common setting for calculus of variantions (e.g., backpropagation of neural networks, optimal control), as the loss/control objective should always be a real number to be optimized and evaluated.

Recall why we are good at calculating the usual derivative: we basically calculate the derivative based on the composition of simple rules. For the matrix case, the chain rule of derivatives could be error-prone, that’s what makes the problem complicated. But fortunately, the composition of differentials still holds. Indeed, the matrix derivative and differential are connected by the following relation

df = \textrm{tr}\left(\dfrac{\partial f}{\partial X}^TdX\right),

(1)

where $df\in\mathbb{R}$ has the same shape with $f$ and $dX,\dfrac{\partial f}{\partial X}\in\mathbb{R}^{n\times d}$ has the same shape with $X$ , $\textrm{tr}$ donotes the trace operator. Note that since $df$ is a scalar, $\textrm{tr}(df)=df$ , thus we could add trace on both sides.

In all, to calculate the matrix derivative $\dfrac{\partial f}{\partial X}$ , our plan would be:

Take the differential of $f$ w.r.t. $X$ , by the composition of differentials.
Add trace on both sides, and arange the terms into the key relation by trace tricks.
Directly readout the derivative from the relation between differential and matrix derivative.

Composition of differentials¶

Addition: $d(X\pm Y) = dX\pm dY$ ; Matrix multiplication: $d(XY)=(dX)Y + XdY$ ; Transpose: $d(X^T)=(dX)^T$ ; Trace: $d\textrm{tr}(X)=\textrm{tr}(dX)$ .
Inverse: $dX^{-1}=-X^{-1}dXX^{-1}$ .
Determinant: $d|X|=\textrm{tr}(X^*dX)$ , where $X^*=\overline{X}^T$ denotes the conjugate transpose.
Element-wise multiplication: $d(X\odot Y)=dX\odot Y + X\odot dY$ .
Element-wise function: $d\sigma(X)=\sigma'(X)\odot dX$ .

Trace tricks¶

Scalar: $a=\textrm{tr}(a)$ .
Transpose: $\textrm{tr}(A^T)=\textrm{tr}(A)$ .
Linear: $\textrm{tr}(A\pm B)=\textrm{tr}(A)\pm \textrm{tr}(B)$ .
Commutativity of matrix multiplication: $\textrm{tr}(AB)=\textrm{tr}(BA)$ , where $A$ and $B^T$ have the same shape.
Commutativity of matrix/element-wise multiplication: $\textrm{tr}(A^T(B\odot C))=\textrm{tr}((A\odot B)^TC)$ .

Example: non-linear regression¶

Consider a system $Y=\sigma(WX)$ , $X\in\mathbb{R}^{d\times n}$ consists of $n$ data points, each is a $d$ -dimensional vector, $W\in\mathbb{R}^{m\times d}$ , $Y\in\mathbb{R}^{m\times n}$ , $\sigma(\cdot)$ is some nonlinear function. Denote the label matrix as $\hat{Y}\in\mathbb{R}^{m\times n}$ , then the MSE loss

L = \frac{1}{2n}tr\left[(Y-\hat{Y})^T(Y-\hat{Y})\right].

(2)

Calculate $\dfrac{\partial L}{\partial W}$ .

By composition of differentials

dL = \frac{1}{2n}\textrm{tr}\left[dY^T(Y-\hat{Y})+(Y-\hat{Y})^TdY\right]=\frac{1}{n}\textrm{tr}\left[(Y-\hat{Y})^TdY\right]

(3)

=\frac{1}{n}\textrm{tr}\left[(Y-\hat{Y})^Td\sigma(WX)\right] = \frac{1}{n}\textrm{tr}\left[(Y-\hat{Y})^T(\sigma'(WX)\odot (dW X))\right].

(4)

By trace trick 5

dL = \frac{1}{n}\textrm{tr}\left[[(Y-\hat{Y})\odot\sigma'(WX)]^TdW X\right].

(5)

By trace trick 4

dL = \frac{1}{n}\textrm{tr}\left[X[(Y-\hat{Y})\odot\sigma'(WX)]^TdW\right].

(6)

Therefore, from the relation between differential and matrix derivative, we have

\dfrac{\partial L}{\partial W} = \left[(Y-\hat{Y})\odot\sigma'(WX)\right]X^T.

(7)