When Dirac meets the factorizer: machine learning in bra-ket notation

Let’s get something straight from the beginning: this is a fun post! The title is catchy, I’ll admit that, but there are no generalisations claims here. I’m not going to give you a complete formulation of machine learning in Dirac notation. This post is the story of how one can get inspired by Dirac notation and start writing down matrices in a weird way which turns out to be kind of useful to understand the specific problem of matrix factorization. I’m not sure if the notation is elegant or if it can be of some use to do other things, I’ll leave this decision to future generations. I think it has some weird beauty in it (not as much as the original Dirac notation, of course). Before I start, a warning: this post is not suitable for people that are highly sensitive to notation abuse. I abuse Dirac notation a lot here, and may the gods of mathematics have mercy of my soul.

Dirac notation in a nutshell

Let me just start by telling you what Dirac notation is. Dirac notation is a simple and concise way to write vectors and matrices. I mean, it is actually beautifully defined in a formal way, follow the link if you want to know more. But, for our porpuses, all we need to know here is that it is a nice way to write down and manipulate vectors. Let’s look into some examples.

We might want to start with the simplest case: 2-dimensional vectors. Let assume we have a 2-dimesional vector space (we will indicate this vector space as 2$\mathcal{H}^{\otimes 2}$). Now, if we want to write a 2-dimensional vector, we need to pick a basis (e1,e2)$(\vec{e_1}, \vec{e_2})$ in 2$\mathcal{H}^{\otimes 2}$ and we can write our vector v⃗ $\vec{v}$ as

v⃗ =α1e1+α2e2

For example, we can pick the euclidian basis, which is orthonormal and in 2-d can be represented as

e1=[10]e2=[01]

If we substitute the column expression and we use basic vector algebra, we obtain a column representation of the vector v⃗ $\vec{v}$ as

v⃗ =[α1α2]

Suppose now we have two vectors v⃗ $\vec{v}$ and v⃗ $\vec{v}^\prime$ and we want to calculate the scalar product of these two vectors. This is given by

v⃗ v⃗ =[α1α2][α1α2]=α1α1+α2α2

where v⃗ $\vec{v}^{\dagger}$ is the row corresponding to the column v⃗ ,$\vec{v},$ i.e. its hermitian conjugate.

So far so good. However, this notation can be heavy, especially if we have many vectors to write down and multiply in more complex ways. Luckily, there’s another way to write down column and rows vectors, which proved itself more powerful and clear, at least in some context (quantum mechanics): Dirac notation. Dirac notation is actually a symbolism to write vectors in any vector space (discrete or continuous ) of any dimensions. Moreover, it’s also possible to write the vectors in the dual space associated with the vector space in an immediate way. Here, let’s just focus on discrete dimensional spaces, where a vector is represented as a column and the corresponding vector in the dual space is a row.

Let’s start with our 2-dimensional example. It’s actually very simple: whenever we have column vectors, for example the basis vectors e1,e2,${\vec{e_1},\vec{e_2}},$ we write them inside a “ket” as

|e1=[10]|e2=[01]

The vector v⃗ $\vec{v}$ is a column too, so we can write it inside a ket too, as

|v=α1|e1+α2|e2

Now, if we want to transpose |v$|v\rangle$, it becomes a row (corresponding to a vector in the dual space). When we have a row, we can write it in Dirac notation inside a “bra” as

v|=α1e1|+α2e2|

where αi$\alpha_i^*$ is the complex conjugate of αi,$\alpha_i,$ in case we have complex coefficients. If the coefficients are real, obviously αi=αi.$\alpha_i^* = \alpha_i.$ The bra basis is now basis of rows, with

e1|=[10]

e2|=[01]

If we define a second vector |v$|v^\prime\rangle$ and we want to write the scalar product between the two, it is simply written as

v|v

Moreover, we can also combine the bras and the kets the other way around, as

|vv|

We just need to make sure the vectors have the same dimension k$k$, and the expression above represents a k×k$k \times k$ matrix. That’s more or less all we need to know about Dirac notation to go forward. Remember, kets pointing on the right (|.$|.\rangle$) are columns, and bras pointing on the left (.|$\langle . |$) are rows. By the way the words bra and ket comes from splitting in two the word bra(c)ket (the c got lost in the way for some reason).

Customized Dirac notation

What’s next? Now that we have learnt how to represent vectors in Dirac notation, we would like to use it to represent matrices! How can we do that? The natural environment for Dirac notation is quantum mechanics. In quantum mechanics, there are basically two types of matrices: Matrices representing operators on vectors, which are usually written as U$U$ because they are squared (i.e. of dimesions n×n$n\times n$) and unitary (i.e UU=$U^\dagger U = \mathcal{I}$, where $\mathcal{I}$ is the n×n$n\times n$ identity matrix), and matrices representing states, which are usually written as ρ,$\rho,$ which are squared too, Hermitian (i.e. ρ=ρ$\rho^\dagger = \rho$) and trace-one (i.e. the diagonal elements sum up to 1).

When one needs to write down density matrices ρ$\rho$ (that’s how they are called, density matrices), one can simply write them in Dirac notation as

ρ=i|ψiψi|

where |ψi${ |\psi_i\rangle }$ are the eigenvectors of ρ.$\rho.$

When one needs to write down an operator acting on a vector, (i.e. a matrix multiplied by a vector), Dirac notation comes very handy too, as the operation can be written as

|φ=U|φ

On the side, note that unitary matrices preserve the norm of a vector (which is a very important property in quantum mechanics). This is very easy to see in Dirac notation by writing down the norm of |φ$|\varphi^\prime\rangle$ as

φ|φ=φ|UU|φ=φ|φ

with the last equation coming from the fact that U$U$ is unitary and UU=.$U^\dagger U = \mathcal{I}.$

So far so good. However, in machine learning, the matrices we are dealing with are rarely squared and definitely not unitary or trace-one. We need to find another way to write them down.

Writing datasets using Dirac notation

A very common type of matrix in machine learning is the dataset matrix. It usually consists of a collection of vectors (the features vectors) arranged in rows (or columns, usually rows however), where each vector is a data point. Now the question is: is there a way to exploit Dirac notation for vectors to write down such matrices in a compact and informative way?

That’s the way I wrote them when I was working on my matrix factorization problem. As I mentioned above, in machine learning data points are usually represented as rows. But row vectors can be simply written as bras in Dirac notation. We can write down a n×k$n \times k$ matrix representing a dataset of n$n$ sample vectors with k$k$ features each in Dirac notation as

M(n,k)r=kφ1|kφ2|kφn|

Note that I put a suffix k$^k$ to indicate the dimensionality of the vector on the left of our bra. This is my first abuse (aehm, modification) of Dirac notation, and it will come handy in the following.
Let’s say now that we like to represents our data points as columns instead that as rows. It’s not big deal, as we can use the same rule and write the dataset matrix Mc$M_c$ using kets instead of bras

M(k,n)c=[|φ1k|φ2k|φnk]

Easy right? just remember that a bra k.|$^k\langle . |$ represents a row of dimension k$k$ and a ket |.k$| . \rangle_k$ represents a column of dimensions k$k$ and we can recover our matrices by simple substitution.

However, this notation is still cumbersome. We need to write columns or rows where their elements are bra or kets, and this might be heavy. Let us now abuse Dirac notation again, and write down M(n,k)r$M_r^{(n,k)}$ as

M(n,k)r=nkφ|

This notation contains exactly the same information as the one above. Moreover, if we like column vectors more than rows, we can simply write M(k,n)c$M_c^{(k,n)}$ as

M(k,n)c=|φkn

Noticing that M(n,k)r$M_r^{(n,k)}$ is the transpose of M(k,n)c$M_c^{(k,n)}$, we have found a nice correspondence between Dirac notation for vectors and this ‘homebrewed’ Dirac notation for matrices, in which the transpose operation is a simple transformation from the ‘weird’ ket |φkn$\rfloor |\varphi\rangle_k \rfloor_n$ to the ‘weird’ bra nkφ|$^n\lceil ^k\langle \varphi| \lceil$. We can get rid of the suffix in M$M$ and write

M=|φkn

M=nkφ|

Now one question could be: why would we write a dataset in this way? Well, because the notation is informative. We can see immediately, at first look, that it is a dataset. The inner bra-ket notation tell us that we are dealing with a collection of sample vectors and informs us about their dimensionality (i.e. the number of features k$k$). The outer hooks notation tell us how many samples we have (n$n$) and that the vectors are arranged in a matrix. The direction of the symbol tell us whether the vectors are arranged as columns or rows.

Matrix multiplication

To appreciate the usefulness of this notation, let’s suppose we want to write down a n×n$n\times n$ matrix D$D$ with all the scalar products between the n$n$ vectors in the dataset. Now, whether we write the vectors as rows or as columns, the matrix D$D$ is always written as

D=MM=nkφ||φkn

Now we need to be careful about one thing: that the inner suffixes in the expression (in this case the k$k$ suffixes) are the same in both sides. After we have checked that,
we can introduce the first rule to apply when you combine the symbols: when you have the two symbols $\lceil$ and ${\rfloor}$ next to each others, you can get rid of them. We apply the first rule to D$D$,

D=nkφ|φkn

The dimensions of D$D$ can be inferred by the outer suffix in the expression. Indeed, another rule to apply when one combines these objects is that the resulting matrix dimension is always given by the outer suffixes of the combination.

For example, we can also combine them as

D=MM=|φknnkφ|

and we introduce the second rule you can apply to manipulate the symbols: if you have the symbol $\rfloor$ on the far left of your equation, you can move it to the far right (or vice-versa, if you have $\lceil$ on the far right, you can move it to the far left).
Applying the second rule followed by the first rule, we obtain

D=|φknnkφ|

Now, remember that the dimensions of the matrix is given by the outer suffixes, so D$D^\prime$ has dimensions k×k.$k\times k.$ The matrix D$D^\prime$ (divided by n$n$) represents the covariance matrix of the dataset (I think!??). As before, we have checked that the inner suffixes (in this case the n$n$ suffixes) are the same.

Matrix factorization

The nice thing about this notation is that we can also work with matrices of different dimensions. To see that, let’s consider the specific problem of matrix factorization.

In matrix factorization, we have a target matrix of dimensions n×m$n \times m$. For example, in the context of recommendations, this matrix can be the matrix of ratings given by users (represented by rows) to items (represented by column). The problem is to find latent vectors of dimension k$k$ representing users and items, and such that the scalar product between the vector representing user i$i$ and the vector representing item j$j$ gives the corresponding rating in the target matrix. In matrix notation, we can write

u11u21un1u12u22un2u1ku2kunkv11v12v1kv21v22v2kvm1vm2vmk=Rn,m

where u⃗ i=[ui1,ui2,,uik]$\vec{u}^i = [u^i_1,u^i_2, \cdots, u^i_k]$ is the latent vector for user i,$i,$ v⃗ j=[vj1,vj2,,vjk]$\vec{v}^j = [v^j_1,v^j_2, \cdots, v^j_k]$ is the latent vector for item j\$and\(Rn,m$j and \(R^{n,m}$ represent the n×m$n \times m$ rating matrix. Now, is there a more compact way to write down this expression? Of course there is, with our weird Dirac-like notation. Let’s go again through the steps. We write the expression above as

ku1|ku2|kun|[|v1k|v2k|vmk]=Rn,m

Then, we can write the users latent vectors matrix as nku|$^n\lceil ^k\langle u | \lceil$ and the items latent vectors matrix as |vkm.$\rfloor | v \rangle_k \rfloor_m.$ Noticing that the LHS of the equation is the matrix of the scalar products, we can write the equation as

nku||vkm=Rn,m

Using our rule to remove the $\lceil$ and $\rfloor$, the equation can be written as

nku|vkm=Rn,m

In the same way as the notation for the dataset matrices contain useful information about the structure of the dataset, the notation above tells us important things about the equation. The inner bra-ket notation tells us immediately that the entries of the matrix are the scalar product between users and items latent vectors of dimension k$k$. The outer hooks tells us how many users and how many items are represented, and we can check the correspondence with the rating matrix R.$R.$

Take home

These are the key points you should take home from this post, if you are crazy enough to try this notation:

• Dirac notation for bras and kets works as usually: if you have an object like kφ|φk,$^k\langle \varphi^\prime | \varphi \rangle_k,$ this still represents the scalar product of two vectors. We just needs to be careful and check if the indices are the same, since in machine learning this might not be the case.

• The two rules to check whether your expression makes sense or not are:

• Rule 1: The outer indices in the whole combination denote the dimensions of the resulting matrix.
• Rule 2: The inner indices in the expression must be the same.
• The two rules to make your expressions pretty are simple:

• Rule 1-bis: if you have the two symbols $\lceil$ and $\rfloor$ next to each other, in this order (the shapes must match like in tetris) and with no indices, you can remove them from the equation.
• Rule 2-bis: if you have the symbol $\rfloor$ on the far left of the equation (or the symbol $\lceil$ in the far right) you can move it on the far right (far left).