Bias-Variance-Noise Decomposition of the Mean Squared Error

Suppose we are evaluating a class of predictors on a data distribution $D$. This distribution generates pairs $(x, y)$ where $y = f(x) + \epsilon$ and $\epsilon$ is zero-mean Gaussian noise with variance $\sigma^2$. The loss function we choose is the squared error between the observed label $y$ and the prediction $h_S(x)$ of a predictor trained on a dataset $S$ of $m$ instances (i.e. $S$ is drawn from $D^m$). The loss is averaged over all possible training sets (equivalently, all possible predictors), all data pairs $(x, y)$, and all noise values $\epsilon$:

\[
\mathbb{E}_{(x, y) \sim D, S \sim D^m, \epsilon \sim \mathcal{N}(0, \sigma^2)} \left[ (y - h_S(x))^2 \right]
\]
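
Concretely (and purely for illustration), the data-generating process can be sketched in a few lines of Python; the target $f(x) = \sin x$ and the noise level $\sigma = 0.5$ below are arbitrary stand-ins for the unknowns in the setup:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5    # noise standard deviation (illustrative choice)
f = np.sin     # stand-in for the unknown target function f

def sample_dataset(m):
    """Draw S ~ D^m: m pairs (x, y) with y = f(x) + eps, eps ~ N(0, sigma^2)."""
    x = rng.uniform(0.0, 2.0 * np.pi, size=m)
    eps = rng.normal(0.0, sigma, size=m)
    return x, f(x) + eps
```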

For simplicity, we compute the loss at a single fixed input $x$, keeping the expectations over the training set $S$ and the noise $\epsilon$ in the label $y = f(x) + \epsilon$:

\begin{equation*}
\begin{split}
\mathbb{E}_{S \sim D^m, \epsilon \sim \mathcal{N}(0, \sigma^2)} \left[ (y - h_S(x))^2 \right] &= \mathbb{E}_{\epsilon} \left[ y^2 \right] - 2 \, \mathbb{E}_{\epsilon}[y] \, \mathbb{E}_{S} \left[ h_S(x) \right] + \mathbb{E}_{S} \left[ h_S(x)^2 \right]
\\
&= \mathbb{E}_{\epsilon} \left[ y^2 \right] - 2 f(x) \bar{h}(x) + \mathbb{E}_{S} \left[ h_S(x)^2 \right]
\end{split}
\end{equation*}

The cross term factorizes because the noise $\epsilon$ at the test point is independent of the training set $S$. The second line uses $\mathbb{E}_{\epsilon}[y] = f(x)$ and introduces the average predictor $\bar{h}(x) = \mathbb{E}_{S}[h_S(x)]$.

To decompose this quantity, we introduce the following lemma:

Lemma. Let $Z$ be a random variable drawn from a distribution $P$ with mean $\bar{Z} = \mathbb{E}_P[Z]$. Then:

\[
\mathbb{E}_P[Z^2] = \mathbb{E}_P[(Z - \bar{Z})^2] + \bar{Z}^2
\]

Proof.

\begin{equation*}
\begin{split}
\mathbb{E}_P[(Z - \bar{Z})^2] &= \mathbb{E}_P[Z^2 - 2 Z \bar{Z} + \bar{Z}^2]
\\
&= \mathbb{E}_P[Z^2] - 2 \, \mathbb{E}_P[Z] \, \bar{Z} + \bar{Z}^2
\\
&= \mathbb{E}_P[Z^2] - 2 \bar{Z}^2 + \bar{Z}^2
\\
&= \mathbb{E}_P[Z^2] - \bar{Z}^2
\end{split}
\end{equation*}

Rearranging the last line gives the claimed identity.
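
The lemma is just the familiar identity $\mathbb{E}[Z^2] = \operatorname{Var}(Z) + \bar{Z}^2$; a quick Monte Carlo sanity check (using an arbitrary Gaussian for $Z$, chosen purely for illustration) confirms it numerically:

```python
import numpy as np

# Check E[Z^2] = E[(Z - Zbar)^2] + Zbar^2 on an arbitrary distribution.
rng = np.random.default_rng(0)
Z = rng.normal(3.0, 2.0, size=1_000_000)

lhs = np.mean(Z**2)                                # E[Z^2]
rhs = np.mean((Z - Z.mean())**2) + Z.mean()**2     # variance + squared mean
print(lhs, rhs)                                    # both approx 2^2 + 3^2 = 13
```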

Applying this lemma twice, to $y$ (randomness in $\epsilon$) and to $h_S(x)$ (randomness in $S$), we have:

\begin{equation*}
\mathbb{E}_{\epsilon} \left[ y^2 \right] = \mathbb{E}_{\epsilon} \left[ (y - f(x))^2 \right] + f(x)^2
\end{equation*}

\begin{equation*}
\mathbb{E}_{S} \left[ h_S(x)^2 \right] = \mathbb{E}_{S} \left[ (h_S(x) - \bar{h}(x))^2 \right] + \bar{h}(x)^2
\end{equation*}

Hence,

\begin{equation*}
\begin{split}
\mathbb{E}_{S, \epsilon} \left[ (y - h_S(x))^2 \right] &= \mathbb{E}_{\epsilon} \left[ (y - f(x))^2 \right] + \mathbb{E}_{S} \left[ (h_S(x) - \bar{h}(x))^2 \right] + \left( f(x)^2 - 2 f(x) \bar{h}(x) + \bar{h}(x)^2 \right)
\\
&= \underbrace{\mathbb{E}_{\epsilon} \left[ (y - f(x))^2 \right]}_{\text{noise}} + \underbrace{\mathbb{E}_{S} \left[ (h_S(x) - \bar{h}(x))^2 \right]}_{\text{variance}} + \underbrace{(f(x) - \bar{h}(x))^2}_{\text{bias}^2}
\end{split}
\end{equation*}

Note that the noise term equals $\mathbb{E}_{\epsilon}[\epsilon^2] = \sigma^2$, the variance of the label noise.

In words,
+ Bias measures how well the chosen class of predictors can approximate $f$; it enters the decomposition as the squared gap between $f(x)$ and the average predictor $\bar{h}(x)$.
+ Variance measures how sensitive the learned predictor is to the particular training sample $S$.
+ Noise is inherent to the data-generating process: the observed label $y$ is not a faithful representative of the underlying function value $f(x)$, and no predictor can reduce this term below $\sigma^2$. (A numerical check of the full decomposition follows below.)
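
To see the decomposition hold numerically, here is a minimal Monte Carlo sketch. All specifics are illustrative assumptions, not part of the derivation: the target $f(x) = \sin x$, a deliberately misspecified linear predictor class (degree-1 polynomial fit), and the constants $\sigma = 0.5$, $m = 20$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, m, n_trials = 0.5, 20, 5000   # illustrative constants, not from the derivation
f = np.sin                           # hypothetical target function
x0 = 1.0                             # a single fixed test point x

# Train one predictor per independently drawn dataset S ~ D^m.
preds = np.empty(n_trials)
for t in range(n_trials):
    x = rng.uniform(0.0, 2.0 * np.pi, size=m)
    y = f(x) + rng.normal(0.0, sigma, size=m)
    coeffs = np.polyfit(x, y, deg=1)     # deliberately misspecified: a straight line
    preds[t] = np.polyval(coeffs, x0)    # h_S(x0)

h_bar = preds.mean()              # average predictor h_bar(x0)
variance = preds.var()            # E_S[(h_S(x0) - h_bar(x0))^2]
bias_sq = (f(x0) - h_bar) ** 2    # (f(x0) - h_bar(x0))^2
noise = sigma ** 2                # E_eps[(y - f(x0))^2]

# Left-hand side: average (y - h_S(x0))^2 over fresh noise and training sets.
y0 = f(x0) + rng.normal(0.0, sigma, size=n_trials)
mse = np.mean((y0 - preds) ** 2)

print(f"MSE: {mse:.4f}  noise + variance + bias^2: {noise + variance + bias_sq:.4f}")
```

Because a straight line cannot represent $\sin$, the bias$^2$ term stays bounded away from zero; increasing $m$ shrinks only the variance term, while the noise term $\sigma^2$ is irreducible.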
