Median trick in high dimension

Tyler Chen

Introduction

The “median trick” (sometimes called “boosting”) is a clever way of turning a (scalar-valued) randomized algorithm that succeeds with some constant probability (e.g. $2/3$) into one that succeeds with arbitrarily high probability $1-\delta$ with only logarithmic overhead.

The idea is simple: suppose we have a randomized algorithm that produces an approximation $x$ to $x^*$ satisfying $\mathbb{P}[\, |x^*-x| < \varepsilon \,] > 2/3$. We can boost the success probability by running the algorithm multiple times and taking the median of the results.

  1. Run the algorithm $m$ times to get estimates $x_1, x_2, \ldots, x_m$
  2. Let $\widehat{x} = \operatorname{median}(x_1, \ldots, x_m)$

Fact. For $m = O(\log(1/\delta))$, we have that $\mathbb{P}[\, |x^*-\widehat{x}| < \varepsilon \,] > 1-\delta$.

Notably, the dependence on the target failure probability $\delta$ is only logarithmic.

The proof is simple and standard, so we will just provide a high-level overview.

Proof. By definition of the median, at least half of the estimates $x_i$ are at or below $\widehat{x}$ and at least half are at or above it. Thus, the only way for $\widehat{x}$ to fall outside of $[x^*-\varepsilon, x^*+\varepsilon]$ is if at least half of the estimates $x_i$ are outside of this interval. However, since each $x_i$ lies within this interval with probability at least $2/3$, it is (exponentially) unlikely that half of them fall outside of it. $~~\square$
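As a quick illustration, here is a minimal sketch of the trick. The base estimator is a hypothetical one chosen for this example: a crude Monte Carlo estimate of the mean of an Exp(1) distribution (true value $1$) from just a few samples.

```python
import random
import statistics

def crude_estimate(n=10):
    # Hypothetical base estimator: a Monte Carlo estimate of the mean
    # of an Exp(1) distribution (true value 1.0) from only n samples.
    # Any single run can easily be far off.
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

def median_trick(base_estimator, m):
    # Run the base estimator m times and return the median of the results.
    return statistics.median(base_estimator() for _ in range(m))

random.seed(0)
boosted = median_trick(crude_estimate, m=31)
```

An unusually bad run of `crude_estimate` only hurts `boosted` if it is joined by roughly $m/2$ other bad runs, which is exponentially unlikely in $m$.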

High dimensions

What about in high dimensions? I.e., what if we have a randomized algorithm that produces an approximation $\mathbf{x}$ to $\mathbf{x}^*\in\mathbb{R}^d$ satisfying $\mathbb{P}[\, \|\mathbf{x}^*-\mathbf{x}\| < \varepsilon \,] > 2/3$? While the median trick in one dimension appears in almost all course notes for a class on randomized algorithms, a statement of a high-dimensional analog is surprisingly difficult to find.

The simplest approach is to apply the one-dimensional estimator along each coordinate. The main issue with this is that $\|\mathbf{x}^* - \widehat{\mathbf{x}}\|$ being small is not necessarily the same thing as each coordinate being close (i.e., $\|\mathbf{x}^* - \widehat{\mathbf{x}}\|_\infty$ being small), so we will lose something in the accuracy from converting between norms. For instance, if $\|\cdot\|$ is the standard Euclidean norm, then we will lose a factor of $\sqrt{d}$ in the accuracy.
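To make the loss explicit: if the coordinate-wise boosted estimate $\widehat{\mathbf{x}}$ is accurate to $\varepsilon$ in every coordinate, i.e. $\|\mathbf{x}^*-\widehat{\mathbf{x}}\|_\infty \leq \varepsilon$, then the best we can say about the Euclidean error is
$$ \|\mathbf{x}^*-\widehat{\mathbf{x}}\|_2 = \Big( \sum_{k=1}^{d} (x^*_k - \widehat{x}_k)^2 \Big)^{1/2} \leq \sqrt{d}\, \|\mathbf{x}^*-\widehat{\mathbf{x}}\|_\infty \leq \sqrt{d}\, \varepsilon. $$
Moreover, for all $d$ coordinates to succeed simultaneously we need a union bound, so each coordinate must be boosted to failure probability $\delta/d$, costing $m = O(\log(d/\delta))$ repetitions per coordinate.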

A better estimator

Here is a better approach that only loses a factor of $3$ in the accuracy. I was introduced to this approach by Chris Musco, but I'd expect it's a folklore-type result.

  1. Make independent copies $\mathbf{x}_1, \ldots, \mathbf{x}_m$ of $\mathbf{x}$
  2. For all $i,j\in[m]$, define $d(i,j) = \|\mathbf{x}_i - \mathbf{x}_j\|$
  3. For all $i\in[m]$, define $c(i) = \operatorname{median}(d(i,1),\ldots, d(i,m))$
  4. Let $i^* = \operatorname{argmin}_{i\in[m]} c(i)$
  5. Output $\widehat{\mathbf{x}} = \mathbf{x}_{i^*}$

Fact. For $m = O(\log(1/\delta))$, we have that $\mathbb{P}[\, \|\mathbf{x}^*-\widehat{\mathbf{x}}\| < 3\varepsilon \,] > 1-\delta$.
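Steps 1–5 above can be sketched in a few lines. This is a minimal sketch, assuming the copies $\mathbf{x}_1, \ldots, \mathbf{x}_m$ have already been computed (as lists of floats) and that $\|\cdot\|$ is the Euclidean norm:

```python
import math
import statistics

def median_of_estimates(estimates):
    # Steps 2-5: return the estimate whose median distance to the
    # other estimates is smallest.
    m = len(estimates)

    def c(i):
        # c(i) = median of d(i, 1), ..., d(i, m), where d(i, j) is the
        # Euclidean distance between estimates i and j.
        return statistics.median(
            math.dist(estimates[i], estimates[j]) for j in range(m)
        )

    i_star = min(range(m), key=c)  # step 4: argmin of c(i)
    return estimates[i_star]       # step 5

# Four estimates near the truth (0, 0) and one far-off outlier: the
# selected estimate is one of the nearby points, never the outlier,
# since the outlier's median distance to the others is large.
estimates = [[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1], [10.0, 10.0]]
best = median_of_estimates(estimates)
```

Note that computing all pairwise distances costs $O(m^2)$ distance evaluations, which is cheap here since $m = O(\log(1/\delta))$.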

Proof. Let $G = \{ i : \|\mathbf{x}^* - \mathbf{x}_i\| \leq \varepsilon \}$. Since $\mathbb{E}[|G|] > 2m/3$, it follows from a standard Chernoff bound that $\Pr(|G| \leq m/2) \leq \delta$ for $m = O(\log(1/\delta))$. Therefore, it suffices to show that if $|G| > m/2$, then $\|\mathbf{x}^*-\widehat{\mathbf{x}}\| < 3\varepsilon$.

Suppose $|G| > m/2$. By the triangle inequality, for each $i,j\in[m]$,
$$ d(i,j) = \|\mathbf{x}_i - \mathbf{x}_j\| \leq \|\mathbf{x}^* - \mathbf{x}_i\| + \|\mathbf{x}^* - \mathbf{x}_j\|. $$
Therefore, for each $i\in G$,
$$ \big|\{ j\in[m] : d(i,j) \leq 2\varepsilon \}\big| > m/2, $$
and hence,
$$ c(i) = \operatorname{median}\big(d(i,1),\ldots, d(i,m)\big) \leq 2\varepsilon. $$

It follows that $c(i^*) = \min_i c(i) \leq 2\varepsilon$, and since $c(i^*) = \operatorname{median}(d(i^*,1),\ldots, d(i^*,m))$, we have
$$ \big|\{ j\in[m] : d(i^*,j) \leq 2\varepsilon \}\big| \geq m/2. $$
But also $|G| > m/2$. So, by the pigeonhole principle, there is at least one index $j^*\in G$ for which
$$ d(i^*,j^*) \leq 2\varepsilon. $$
Therefore, by the triangle inequality,
$$ \|\mathbf{x}^* - \mathbf{x}_{i^*}\| \leq \|\mathbf{x}^* - \mathbf{x}_{j^*}\| + \|\mathbf{x}_{i^*} - \mathbf{x}_{j^*}\| \leq \varepsilon + 2\varepsilon = 3\varepsilon, $$
as desired. $~~\square$