Stein's paradox

In 1961, James and Stein published the paper "Estimation with Quadratic Loss". Take normally distributed data with unknown mean \(\mu\) and variance \(1\). If you now draw a single random value \(x\) from this distribution and have to estimate the mean \(\mu\) from it, then intuitively \(x\) is a reasonable estimate for \(\mu\) (since the data is normally distributed, the randomly drawn \(x\) is probably close to \(\mu\)).

Now the experiment is repeated, this time with three independent, again normally distributed data sets, each with variance \(1\) and means \(\mu_1\), \(\mu_2\), \(\mu_3\). After drawing three random values \(x_1\), \(x_2\) and \(x_3\), one estimates (using the same procedure) \(\mu_1 = x_1\), \(\mu_2 = x_2\) and \(\mu_3 = x_3\).

The surprising result of James and Stein is that there is a better estimate for \( \left( \mu_1, \mu_2, \mu_3 \right) \) (i.e. for the three independent data sets combined) than \( \left( x_1, x_2, x_3 \right) \). The James-Stein estimator is:

$$ \begin{pmatrix}\hat{\mu}_1\\\hat{\mu}_2\\\hat{\mu}_3\end{pmatrix} = \left( 1-\frac{1}{x_1^2+x_2^2+x_3^2} \right) \begin{pmatrix}x_1\\x_2\\x_3\end{pmatrix} \neq \begin{pmatrix}x_1\\x_2\\x_3\end{pmatrix} $$
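As a minimal sketch, the formula above can be computed directly; the function name `james_stein` and the sample observation are illustrative choices, not part of the original paper:

```python
import numpy as np

def james_stein(x):
    """Shrink the observation vector x toward the origin.

    Implements (1 - 1/||x||^2) * x, the three-dimensional
    James-Stein estimator from the formula above.
    """
    x = np.asarray(x, dtype=float)
    return (1.0 - 1.0 / np.dot(x, x)) * x

# Example: one observation (x1, x2, x3) = (2, 1, -1), so ||x||^2 = 6
# and the shrinkage factor is 1 - 1/6 = 5/6.
print(james_stein([2.0, 1.0, -1.0]))
```

Every coordinate is multiplied by the same factor, so the estimate lies on the line between the observation and the origin.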

The expected squared error (risk) of this estimator is always smaller than the expected squared error \( E\left[ \left\| X - \mu \right\|^2 \right] \) of the usual estimator, whatever the true \(\mu\) is.

It is surprising and perhaps paradoxical that the James-Stein estimator shifts the usual estimator towards the origin (by a shrinkage factor) and nevertheless achieves a smaller total risk for every \(\mu\). This holds in dimensions \( \geq 3 \), but not in the one- or two-dimensional case.
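The risk comparison can be checked with a small Monte Carlo simulation; the particular true means, the seed and the trial count below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed
mu = np.array([1.0, -0.5, 2.0])       # example true means, chosen freely
n_trials = 100_000

# One observation per trial: X ~ N(mu, I_3)
X = rng.normal(loc=mu, scale=1.0, size=(n_trials, 3))

# Usual estimator: the observation itself. Its risk is the dimension, 3.
risk_usual = np.mean(np.sum((X - mu) ** 2, axis=1))

# James-Stein estimator: shrink each observation toward the origin
norms_sq = np.sum(X ** 2, axis=1)
X_js = (1.0 - 1.0 / norms_sq)[:, None] * X
risk_js = np.mean(np.sum((X_js - mu) ** 2, axis=1))

print(f"usual estimator risk:       {risk_usual:.3f}")
print(f"James-Stein estimator risk: {risk_js:.3f}")
```

The simulated risk of the usual estimator comes out near \(3\) (the sum of the three unit variances), while the James-Stein risk is strictly smaller; repeating the experiment with other values of \(\mu\) only changes how large the gap is.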

A nice geometric explanation of why this works is provided by Brown & Zhao. Note that this does not mean that you have a better estimate for every individual mean; you only have a better estimate in the sense of a smaller combined risk.