
The Johnson-Lindenstrauss lemma for the brave

December 23, 2020 - machine learning, functional analysis, transformer

If you are interested in dimensionality reduction, chances are that you have come across the Johnson-Lindenstrauss lemma. I learned about it while studying the Linformer paper, which contains a result on dimensionality reduction for the Transformer. Essentially, they prove that self-attention is low-rank, by appealing to the Johnson-Lindenstrauss lemma.

Online, you can find many elementary proofs of this lemma. Unfortunately, most of them have sacrificed some intuition and generality in order to simplify the proofs, for instance by using Gaussian distributions. I was left a bit unsatisfied by this, so I started to study the proof in the original paper [1]. According to the eponymous authors, the lemma is "a simple consequence of the isoperimetric inequality for the $n$-sphere". But what does that mean? That is the content of this post.

Statement of the lemma

The Johnson-Lindenstrauss lemma says the following. Given an integer $N$, there is a small dimension $k$ such that for any dimension $n$ and any $N$ points $x_1, \dots, x_N \in \mathbb{R}^n$, there exists a linear map $L : \mathbb{R}^n \to \mathbb{R}^k$ which is very close to preserving all distances between each pair of the $N$ points. The lemma specifies how small $k$ can be, namely $O(K(\tau)\log N)$. Here $\tau$ makes precise what it means to be close to distance-preserving. We will actually see that $K(\tau)$ can be taken to be $1/\tau^2$.

The formal statement of the lemma goes like this.

Lemma. Let $0 < \tau < 1$. Then there is a constant $K > 0$ so that if $x_1, \dots, x_N \in \mathbb{R}^n$ for $N \geq 2$ and $k \geq \lceil K \log N / \tau^2 \rceil$, there is a linear map $L : \mathbb{R}^n \to \mathbb{R}^k$ such that $$(1-\tau)\lVert x_i - x_j \rVert \leq \lVert L(x_i) - L(x_j) \rVert \leq (1+\tau)\lVert x_i - x_j \rVert$$ for all $1 \leq i, j \leq N$.

Before I start with the proof, observe that it is enough to find an $L$ which satisfies, for all $i, j$, $$(1-\tau)M\lVert x_i - x_j \rVert \leq \lVert L(x_i) - L(x_j) \rVert \leq (1+\tau)M\lVert x_i - x_j \rVert$$ for some constant $M$, because then $L$ can simply be rescaled to satisfy the condition of the lemma. So it is enough to focus on this slightly more relaxed statement.

Like its many adaptations, the original proof is probabilistic. That means that the strategy is to show that a random $L$ (sampled from a suitable space) satisfies the (relaxed) statement with nonzero probability; therefore, one must exist. A tighter estimate for the probability is nice to have, because it tells you how quickly you can find a suitable witness. It turns out that you can actually find one in randomized polynomial time, which is reassuring for applications in computer science.

Contribution

Besides presenting the original proof, I will derive an explicit witness $K$. Such a witness is not given in the original papers (at least not one that also works for small $N$). To obtain a reasonably low value for $K$, I did some of the estimates differently. This value is nice to have because there are different ones around in the literature, and not all of them apply to the lemma in the form stated here (see also the remark at the end). I do think there is room for improvement in the value I'm currently getting!

Summary of the proof

Before we dive in, we can summarise the structure of the proof in the following steps:

  1. Restate the goal in terms of the probability that a certain seminorm on $\mathbb{R}^n$ is almost constant on a finite set of points on the $(n-1)$-sphere.
  2. Estimate this probability from below using the isoperimetric inequality, which is the fact that a 'cap' on the $(n-1)$-sphere has the lowest perimeter-to-area ratio of all subsurfaces of the $(n-1)$-sphere.
  3. Find a lower bound for the constant in (1.) and use it to show that the probability is nonzero.
  4. Conclude.

To make the proof more accessible, I have replaced all measure-theoretic notation by the more intuitive probabilities or (normalized) surface areas. For measurable subsets $A \subset S^{n-1}$ of the $(n-1)$-sphere, I denote their normalized 'surface area' by $\text{vol}_{n-1}(A)$, referring to the unique rotation-invariant normalized (or probability) Borel measure on $S^{n-1}$. But you can think of it as surface area on a sphere.

The proof

As a preliminary, we need a definition of a random linear map $L : \mathbb{R}^n \to \mathbb{R}^k$, which we will just interpret as a rank $k$ matrix $L : \mathbb{R}^n \to \mathbb{R}^n$. Many elementary proofs use a Gaussian distribution on the components of the matrix, but the original paper does something a bit more elegant.

Namely, although there is no standard notion of a random matrix, there is a canonical choice of random orthogonal matrix. Recall that a matrix $U$ is orthogonal if $U^TU = UU^T = 1$. These are the linear operations on $\mathbb{R}^n$ that preserve inner products; they can be thought of as compositions of rotations around an axis and reflections along an axis. And it turns out that there is a well-defined and unique notion of random variable with values in $O(n)$, satisfying some nice conditions. For instance, this notion of randomness is invariant under postcomposition with a fixed orthogonal transformation. The mathematical reason behind this is that $O(n)$ forms a compact Lie group, on which there is a unique normalized Haar measure. But for the rest of this post, we will only need the fact that orthogonal matrices can be sampled and that they enjoy rotation invariance.

We can use this to obtain a notion of 'random rank $k$ projection' matrix $L$ as follows. Suppose $U \in O(n)$ is a random orthogonal matrix. Then define $$L := U^TQU$$ where $Q$ is the projection onto the first $k$ coordinates. That's it! Note that not all matrices have this shape, but we will see that it is enough to sample $L$ in this way. We can now take our problem to the unit sphere.
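To make this concrete, here is a minimal sketch in NumPy of how one could sample such an $L$. It uses the standard trick of extracting a Haar-distributed orthogonal matrix from the QR decomposition of a Gaussian matrix (with a sign correction on the diagonal of $R$); the function names are my own.

```python
import numpy as np

def sample_haar_orthogonal(n, rng):
    """Sample U from the rotation-invariant (Haar) distribution on O(n),
    via QR decomposition of a matrix with i.i.d. Gaussian entries."""
    Z = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(Z)
    # Fix the signs of the columns so the distribution is exactly Haar.
    return Q * np.sign(np.diag(R))

def random_rank_k_projection(n, k, rng):
    """L = U^T Q U: the projection onto the first k coordinates,
    conjugated by a random orthogonal matrix."""
    U = sample_haar_orthogonal(n, rng)
    Q = np.eye(n)
    Q[k:, k:] = 0.0  # keep only the first k coordinates
    return U.T @ Q @ U
```

One can check that the resulting $L$ is symmetric, idempotent ($L^2 = L$) and has rank $k$, as a projection should be.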

1. Reformulation to a problem on the $(n-1)$-sphere

We use the following notation for taking the (scaled) norm of the first $k$ components of an $n$-vector: $$r(y) = b\lVert Qy \rVert = b\sqrt{\sum_{i=1}^k y_i^2}$$ where $b > 0$ could be anything for now. Observe that $r(y) \leq b\lVert y \rVert$.

The difference vectors $x_i - x_j$ of the $N$ points can be projected down to the surface of the unit ball, the sphere $S^{n-1}$: $$B = \left\{\frac{x_i - x_j}{\lVert x_i - x_j \rVert} : 1 \leq i < j \leq N\right\} \subset S^{n-1}$$ Then our goal can be formulated as follows. We want to find $U \in O(n)$ such that for every $y \in B$, we have $$(1-\tau)M \leq r(Uy) \leq (1+\tau)M$$ for some constant $M$. Indeed, $\lVert L(x_i) - L(x_j) \rVert = \lVert QU(x_i - x_j) \rVert = r(Uy)\lVert x_i - x_j \rVert / b$ for $y = (x_i - x_j)/\lVert x_i - x_j \rVert$. This can be interpreted as: given a set of points on the $(n-1)$-sphere, find an orthogonal transformation so that the norms of the first $k$ coordinates of all transformed points are close to the same constant. In the following, we will show that a random $U \in O(n)$ satisfies the condition with positive probability.
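As a quick sanity check of this reformulation, continuing the snippet above (a sketch; $b = 1$ and the dimensions are arbitrary), one can sample points, form $B$, and look at how tightly the values $r(Uy)$ cluster:

```python
rng = np.random.default_rng(0)
n, k, N = 2048, 256, 20

X = rng.standard_normal((N, n))                # N arbitrary points in R^n
i, j = np.triu_indices(N, 1)
B = X[i] - X[j]                                # all pairwise differences...
B /= np.linalg.norm(B, axis=1, keepdims=True)  # ...projected onto S^{n-1}

U = sample_haar_orthogonal(n, rng)
r = np.linalg.norm((B @ U.T)[:, :k], axis=1)   # r(Uy) for all y in B
print(r.min(), r.max())                        # both close to sqrt(k/n) ~ 0.354
```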

2. Using the isoperimetric inequality to estimate the probability

To make things easier, we will first study $r(y)$ for $y$ randomly sampled from $S^{n-1}$. This is the same thing as studying $r(Uy_0)$ for fixed $y_0 \in S^{n-1}$ and random $U \in O(n)$, thanks to rotation invariance.

Seminorms concentrated near their median

The function $r$ defined above is a seminorm on $\mathbb{R}^n$. The precursor paper [2] proves an interesting property about (semi)norms that satisfy $r(y) \leq b\lVert y \rVert$ for some bound $b$, like our $r$. Namely, their value on random vectors on the unit sphere is increasingly concentrated around their median as $n$ increases.

By the median $M_r$ of $r$, we mean any value $M_r$ (there exists one) such that $$\mathbb{P}(r(y) \geq M_r) \geq \frac{1}{2} \text{ and } \mathbb{P}(r(y) \leq M_r) \geq \frac{1}{2}$$ (this is the slightly awkward mathematical definition of a median).

For a subset $A \subset S^{n-1}$, denote by $A_\epsilon$ the set of points in $S^{n-1}$ that have (geodesic) distance $< \epsilon$ to some point in $A$. In the next section, we will estimate the probability $$\mathbb{P}(y \in A_\epsilon)$$ from below for a suitable set $A$, using a geometric fact about spheres.

Isoperimetric inequality for spheres

Specifically, the authors make use of the isoperimetric inequality. This is the intuitively (at least for $n = 3$) obvious fact that a 'cap' on the $(n-1)$-sphere has the smallest perimeter-to-area ratio of all possible measurable subsurfaces of the $(n-1)$-sphere. Denote by $B(r, x_0)$ the 'cap' centred at a point $x_0$ of radius $r$ (measured in radians, so $0 < r < \pi$). For $n = 3$ and $r = \pi/2$, this is the cap which covers half the sphere, drawn as the blue area on the surface of the sphere below. The red band indicates all points in the complement that are a geodesic distance $< \epsilon$ from $B(r, x_0)$. Both areas combined form the set $B(r, x_0)_\epsilon$.

Then we use the isoperimetric inequality in the following form: when $A$ is a subsurface with $$\text{vol}_{n-1}(A) = \text{vol}_{n-1}(B(r, x_0)),$$ then $$\text{vol}_{n-1}(A_\epsilon) \geq \text{vol}_{n-1}(B(r, x_0)_\epsilon)$$ for all $\epsilon > 0$. This brings us to the median. Let $A^+ = \{y \in S^{n-1} \mid r(y) \geq M_r\}$, and let $B(\pi/2, x_0)$ be the 'half cap' as above. Since $\text{vol}_{n-1}(A^+) \geq 1/2 = \text{vol}_{n-1}(B(\pi/2, x_0))$ by the definition of the median, the isoperimetric inequality gives, for all $\epsilon$: $$\text{vol}_{n-1}(A^+_\epsilon) \geq \text{vol}_{n-1}(B(\pi/2, x_0)_\epsilon).$$ For $A^- = \{y \in S^{n-1} \mid r(y) \leq M_r\}$, we have the similar estimate $$\text{vol}_{n-1}(A^-_\epsilon) \geq \text{vol}_{n-1}(B(\pi/2, x_0)_\epsilon).$$ So for the union of the complements of the two areas, we get: $$\begin{aligned} \text{vol}_{n-1}\big((S^{n-1} \setminus A^-_\epsilon) \cup (S^{n-1} \setminus A^+_\epsilon)\big) &\leq 2\big(\text{vol}_{n-1}(S^{n-1}) - \text{vol}_{n-1}(B(\pi/2, x_0)_\epsilon)\big) \\ &= 2\big(1 - \text{vol}_{n-1}(B(\pi/2, x_0)_\epsilon)\big). \end{aligned}$$ But the left-hand side is precisely the area $\text{vol}_{n-1}(S^{n-1} \setminus A_\epsilon)$, where $A := \{y \in S^{n-1} \mid r(y) = M_r\}$. Hence we obtain our estimate: $$\text{vol}_{n-1}(A_\epsilon) \geq 2\,\text{vol}_{n-1}(B(\pi/2, x_0)_\epsilon) - 1,$$ which in terms of probabilities of random points on the sphere becomes $$\mathbb{P}(y \in A_\epsilon) \geq 2\,\mathbb{P}(y \in B(\pi/2, x_0)_\epsilon) - 1.$$ This result is also known as Lévy's lemma. Now, since the geodesic distance to $A$ is always bigger than the Euclidean distance, this implies: $$\mathbb{P}(M_r - b\epsilon \leq r(y) \leq M_r + b\epsilon) \geq 2\,\mathbb{P}(y \in B(\pi/2, x_0)_\epsilon) - 1$$ (you have to use the triangle inequality twice; here we need that $r$ is a seminorm).

Now, one can evaluate the right-hand side using ordinary calculus. I will not go over the calculations, but the authors use it to rewrite the estimate as $$\mathbb{P}(y \in A_\epsilon) \geq 2\gamma_n \int_0^\epsilon \cos^{n-2} t\, dt$$ and then estimate $$2\gamma_n \int_0^\epsilon \cos^{n-2} t\, dt \geq 1 - 4e^{-n\epsilon^2/2}.$$ The curious property this exhibits is that for higher-dimensional spheres, most of their 'surface area' is concentrated around their 'perimeter'. That is, for high $n$, the normalized area $$\text{vol}_{n-1}(B(x_0, \pi/2)_\epsilon)$$ gets close to $1$, even for small $\epsilon$, contrasting the intuition from the illustration above. For us, the estimate implies $$\mathbb{P}(M_r - b\epsilon \leq r(y) \leq M_r + b\epsilon) \geq 1 - 4e^{-n\epsilon^2/2}.$$ Note this holds for all $\epsilon$. Now say we can find a lower bound for $M_r$, e.g. $L \leq M_r$ for a constant $L$. Then by letting $\epsilon := \tau L/b$, we get: $$\begin{aligned} \mathbb{P}((1-\tau)M_r \leq r(Uy) \leq (1+\tau)M_r) &\geq \mathbb{P}(M_r - \tau L \leq r(Uy) \leq M_r + \tau L) \\ &= \mathbb{P}(M_r - \epsilon b \leq r(Uy) \leq M_r + \epsilon b) \\ &\geq 1 - 4e^{-n\epsilon^2/2} = 1 - 4e^{-n\tau^2 L^2/2b^2} \end{aligned}$$ for a fixed $y$ and random $U \in O(n)$. But recall we want the condition to hold for all $y \in B$ simultaneously, a set of at most $N(N+1)/2$ points (all pairs). In the worst case, each $y \in B$ has its own region of $U$'s where the above condition is violated. A union bound then gives the new estimate: $$\mathbb{P}((1-\tau)M_r \leq r(Uy) \leq (1+\tau)M_r \text{ for all } y \in B) \geq 1 - 2N(N+1)e^{-n\tau^2 L^2/2b^2}.$$ All we want is for this to be strictly greater than zero. What remains is to find a high enough lower bound $L$ for $M_r$.
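By rotation invariance, we can illustrate the concentration around the median numerically by sampling $y$ uniformly from the sphere, continuing the earlier snippets (a sketch; $b = 1$ and the dimensions are arbitrary):

```python
rng = np.random.default_rng(1)
n, k, trials = 2048, 128, 100_000

# Uniform points on S^{n-1}: normalized Gaussian vectors.
Y = rng.standard_normal((trials, n))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)
r = np.linalg.norm(Y[:, :k], axis=1)  # r(y) with b = 1
M = np.median(r)

for eps in (0.05, 0.10, 0.15):
    empirical = np.mean(np.abs(r - M) <= eps)
    bound = 1 - 4 * np.exp(-n * eps**2 / 2)
    print(f"eps={eps:.2f}: empirical={empirical:.5f}, bound={bound:.5f}")
```

The empirical probabilities sit well above the bound, which is expected: the bound is crude but dimension-free in exactly the way the proof needs.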

3. Finding a high enough lower bound for $M_r$

To find a lower bound for $M_r$, we derive some more statistical properties of $r(y)$. What can we say about its expected value over the $(n-1)$-sphere? Firstly, we know that $r$ is concentrated around the median, so it should probably be somewhere close to the median! Indeed, observe that $$\mathbb{P}(m \leq |r(y) - M_r|) \leq 4e^{-\frac{nm^2}{2b^2}}$$ for every integer $m \geq 0$, by choosing $\epsilon := m/b$. So: $$\begin{aligned} \mathbb{E}_{y \in S^{n-1}} |r(y) - M_r| &\leq \mathbb{E}_{y \in S^{n-1}} \lceil |r(y) - M_r| \rceil \\ &\leq \sum_{m=0}^\infty \mathbb{P}_{y \in S^{n-1}}(\lceil |r(y) - M_r| \rceil > m) \\ &\leq \sum_{m=0}^\infty \mathbb{P}_{y \in S^{n-1}}(|r(y) - M_r| \geq m) \\ &\leq 1 + \sum_{m=1}^\infty \min\{4e^{-\frac{nm^2}{2b^2}}, 1\} =: C(\sigma) \end{aligned}$$ Therefore we get: $$|M_r - \mathbb{E}_{y \in S^{n-1}}(r(y))| \leq \mathbb{E}_{y \in S^{n-1}} |r(y) - M_r| \leq C(\sigma),$$ writing $\sigma := b/\sqrt{n}$.

Working a bit harder, we can squeeze out a rough upper bound for $C(\sigma)$: $$\begin{aligned} C(\sigma) &= 1 + \sum_{m=1}^\infty \min\{4e^{-\frac{m^2}{2\sigma^2}}, 1\} \\ &\leq 1 + 4\int_0^\infty \min\{e^{-\frac{m^2}{2\sigma^2}}, 1/4\}\, dm \\ &= 1 + \sigma\sqrt{2\log 4} + 4\int_{\sigma\sqrt{2\log 4}}^\infty e^{-\frac{m^2}{2\sigma^2}}\, dm \\ &= 1 + \sigma\sqrt{2\log 4} + 4\sigma\sqrt{2\pi}\left(1 - \Phi(\sqrt{2\log 4})\right) \\ &\leq \sigma\sqrt{2\log 4} + 4\sigma\sqrt{2\pi} \cdot 0.048 + 1 \\ &\leq 2.15\sigma + 1 \end{aligned}$$ where $\Phi$ is the CDF of the standard normal distribution.
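As a sanity check (a sketch), one can sum the series for $C(\sigma)$ directly and compare it with the bound $2.15\sigma + 1$:

```python
import numpy as np

for sigma in (0.5, 1.0, 5.0, 50.0):
    m = np.arange(1, 100_000)
    C = 1 + np.sum(np.minimum(4 * np.exp(-m**2 / (2 * sigma**2)), 1.0))
    print(f"sigma={sigma}: C={C:.4f}, bound={2.15 * sigma + 1:.4f}")
```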

The next section gives another estimate for the expected value of $r(y)$.

A further two-sided estimate for the average of $r(y)$

Using Khintchine's inequality, the expectation can be estimated from above and from below: $$\begin{aligned} b\,\mathbb{E}_{y \in S^{n-1}, \pm}\left(\left|\sum_{i=1}^k \pm y_i\right|\right) &\leq \mathbb{E}_{y \in S^{n-1}}(r(y)) = b\,\mathbb{E}_{y \in S^{n-1}}\left(\sqrt{\sum_{i=1}^k y_i^2}\right) \\ &\leq \sqrt{2}\,b\,\mathbb{E}_{y \in S^{n-1}, \pm}\left(\left|\sum_{i=1}^k \pm y_i\right|\right) \end{aligned}$$ where $\pm$ denotes a uniformly drawn element of $\{-1, 1\}$. Using the symmetry of the sphere, we can pull the dependency on $k$ out of the expectation: $$\mathbb{E}_{y \in S^{n-1}, \pm}\left(\left|\sum_{i=1}^k \pm y_i\right|\right) = \mathbb{E}_{y \in S^{n-1}, \pm}\left(\left|\left\langle y, \sum_{i=1}^k \pm e_i\right\rangle\right|\right) = \sqrt{k}\,\mathbb{E}_{y \in S^{n-1}}\left(|\langle y, e_1 \rangle|\right)$$ because, by rotation invariance, $\sum_{i=1}^k \pm e_i$ can be replaced by any vector of length $\sqrt{k}$ (we pick $\sqrt{k}\,e_1$). So define $$a_n := \mathbb{E}_{y \in S^{n-1}}\left(|\langle y, e_1 \rangle|\right).$$ Then we obtain $$b\sqrt{k}\,a_n \leq \mathbb{E}_{y \in S^{n-1}}(r(y)) \leq b\sqrt{2k}\,a_n.$$ The trick is now that $r(y) = b$ for $k = n$. So then the above yields that $\sqrt{n}\,a_n \leq 1$ and $1 \leq \sqrt{2n}\,a_n$. So $1/\sqrt{2n} \leq a_n \leq 1/\sqrt{n}$ and: $$b\sqrt{k/2n} \leq \mathbb{E}_{y \in S^{n-1}}(r(y)) \leq b\sqrt{2k/n}.$$ Very nice. Now let's finish the proof.
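A quick Monte Carlo check of this two-sided bound, reusing the sphere-sampling trick from before (a sketch with $b = 1$; the dimensions are arbitrary):

```python
rng = np.random.default_rng(2)
n, k, trials = 512, 32, 200_000

Y = rng.standard_normal((trials, n))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)
mean_r = np.linalg.norm(Y[:, :k], axis=1).mean()
print(np.sqrt(k / (2 * n)), mean_r, np.sqrt(2 * k / n))
# prints roughly 0.177 <= 0.25 <= 0.354
```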

4. Finishing the proof

The two constraints on the expectation of $r(y)$ imply that $$\sigma\sqrt{k/2} - C(\sigma) \leq M_r$$ for $\sigma := b/\sqrt{n}$ as above.

Using this lower bound, we fill in the estimate obtained at the end of section 2: $$\begin{aligned} &\mathbb{P}\left((1-\tau)M_r \leq r(Uy) \leq (1+\tau)M_r \text{ for all } y \in B\right) \\ &\geq 1 - 2N(N+1)e^{-\frac{n\tau^2(\sigma\sqrt{k/2} - C(\sigma))^2}{2b^2}} \\ &= 1 - 2N(N+1)e^{-\frac{\tau^2(\sqrt{k/2} - C(\sigma)/\sigma)^2}{2}}. \end{aligned}$$ For $N > 2$, this is always positive when $(\sqrt{k/2} - C(\sigma)/\sigma)^2 > 6\log N/\tau^2$, or $$k > 2\left(\frac{\sqrt{6\log N}}{\tau} + C(\sigma)/\sigma\right)^2.$$ For $k > K\log N/\tau^2$, this holds for all $0 < \tau < 1$ if: $$K \geq 2\left(\sqrt{6} + \frac{C(\sigma)}{\sigma\sqrt{\log N}}\right)^2.$$ Using the upper bound for $C(\sigma)$, it is also enough when $$K \geq 2\left(\sqrt{6} + \frac{2.15\sigma + 1}{\sigma\sqrt{\log 3}}\right)^2$$ for $N \geq 3$. Since $b = \sqrt{n}\,\sigma$ was arbitrary, this gives a sufficient condition for $\sigma$ arbitrarily large, so a sufficient condition is also $$K \geq 2\left(\sqrt{6} + \frac{2.15}{\sqrt{\log 3}}\right)^2 = 40.51\ldots$$ This yields the concrete condition $k \geq \lceil 41\log N/\tau^2 \rceil$. Trivially, the lemma holds for $N = 2$. For higher $N$, it is of course possible to use a lower $K$.
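In code, the concrete bound is simply (a sketch; the function name is my own):

```python
import numpy as np

def jl_target_dimension(N, tau, K=41):
    """Dimension k from the concrete bound k >= ceil(K log N / tau^2)."""
    return int(np.ceil(K * np.log(N) / tau**2))

# e.g. 1000 points with distortion tau = 0.5, independently of n:
print(jl_target_dimension(1000, 0.5))  # -> 1133
```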

As a bonus, we also see that a suitable $U$ can be found in randomized polynomial time, since the failure term in the estimate is $O(1/N)$ under this condition.

5. Final remarks

  1. Many sources (like Wikipedia) state the lemma in terms of the norm squared rather than the norm. This makes things a bit easier, because it is easy to see that $\mathbb{E}(r(y)^2) = b^2 k/n$ (see the quick check after these remarks). But the result is not entirely the same. For example, the values for $K$ derived for this other version are not applicable to the version stated here. Also, I quite like that this version produces an almost-isometry in the most proper sense.
  2. The fact that there exists a probability distribution on the space of matrices such that the condition of the lemma has nonzero probability is sometimes called the distributional Johnson-Lindenstrauss lemma. The version presented here is essentially the same, where the distribution is induced by the normalized Haar measure on $O(n)$.
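For remark 1, the identity $\mathbb{E}(r(y)^2) = b^2 k/n$ follows from $\mathbb{E}(y_i^2) = 1/n$ by symmetry; a one-line numerical check (a sketch, $b = 1$):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, trials = 512, 32, 100_000
Y = rng.standard_normal((trials, n))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)
print((np.linalg.norm(Y[:, :k], axis=1) ** 2).mean(), k / n)  # both ~ 0.0625
```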
References

[1] Johnson, Lindenstrauss. "Extensions of Lipschitz mappings into a Hilbert space". Contemporary Mathematics, vol. 26, 1984.

[2] Figiel, Lindenstrauss, Milman. "The dimension of almost spherical sections of convex bodies". Acta Mathematica, vol. 139, 1977.