22 Mean Square Error of an Estimator
- MLE is one procedure for finding an estimator of a parameter \(\theta\). But when several estimators of \(\theta\) are available, how do we decide which is “better”?
- The value of a parameter \(\theta\) is unknown, so it is impossible to determine if any single estimate of \(\theta\) is good or bad.
- An estimator is a random variable that returns different estimates of \(\theta\) for different samples. So even if an estimator produces “good” estimates of \(\theta\) for some samples, it might produce “bad” estimates of \(\theta\) for others.
- Therefore, we can never determine for any particular sample if an estimator produces a good estimate of the true unknown value of \(\theta\).
- Rather, we evaluate the estimation procedure: Over many samples, does an estimator tend to produce reasonable estimates of \(\theta\) for a variety of potential values of \(\theta\)?
- Remember: a statistic/estimator is a random variable whose sampling distribution describes how values of the estimator vary from sample to sample over many (hypothetical) samples.
- We can estimate the degree of this variability by simulating many hypothetical samples from an assumed population distribution and computing the value of the statistic for each sample (see the sketch after this list).
- However, in practice usually only a single sample is selected and a single value of the statistic is observed, and we don’t know what the population distribution is.
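For instance, here is a minimal sketch of this idea in Python with NumPy (not part of the original handout); the statistic (the sample mean), the assumed Poisson(2.3) population, the sample size, and the number of repetitions are all placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(seed=12345)

mu = 2.3            # assumed population parameter (placeholder)
n = 3               # sample size
n_rep = 100_000     # number of simulated samples

# Simulate many hypothetical samples and compute the statistic for each one
samples = rng.poisson(mu, size=(n_rep, n))
xbars = samples.mean(axis=1)   # value of the statistic (here, the sample mean) for each sample

# The simulated values approximate the sampling distribution of the statistic
print(xbars.mean())   # should be close to E(X-bar) = mu
print(xbars.std())    # should be close to SD(X-bar) = sqrt(mu / n)
```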
Example 22.1 Continuing the situation of estimating \(\mu\) for a Poisson(\(\mu\)) distribution based on a random sample \(X_1, \ldots, X_n\) of size \(n\). Consider the estimators \[\begin{align*} S^2 & = \frac{1}{n-1}\sum_{i=1}^n \left(X_i - \bar{X}\right)^2\\ \hat{\sigma}^2 & = \frac{1}{n}\sum_{i=1}^n \left(X_i - \bar{X}\right)^2 = \left(\frac{n-1}{n}\right)S^2 \end{align*}\] We have seen that \(S^2\) is an unbiased estimator of \(\mu\) but \(\hat{\sigma}^2\) is not. Does that mean we should prefer \(S^2\) over \(\hat{\sigma}^2\)?
Assume \(n=3\) and \(\mu=2.3\). Use simulation to approximate the distribution of \(\hat{\sigma}^2\) and its expected value and standard deviation.
Compare the simulation results to those for \(S^2\) (from a previous handout). For which estimator is the bias smaller? For which estimator is the variance smaller? Is there a clear preference between the two estimators when \(\mu = 2.3\)?
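Extending the sketch above to this part (Python with NumPy; not the handout's own solution, and the seed and number of repetitions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(seed=12345)

mu, n, n_rep = 2.3, 3, 100_000   # assumed parameter, sample size, number of simulated samples

samples = rng.poisson(mu, size=(n_rep, n))
s2 = samples.var(axis=1, ddof=1)          # S^2: divide by n - 1
sigma2_hat = samples.var(axis=1, ddof=0)  # sigma-hat^2: divide by n

for name, est in [("S^2", s2), ("sigma-hat^2", sigma2_hat)]:
    # the simulated mean approximates E(estimator); compare it to mu = 2.3 to assess bias
    print(f"{name}: mean of simulated values = {est.mean():.3f}, sd = {est.std():.3f}")
```

With \(n = 3\), the simulated mean of \(S^2\) should land near 2.3, while the simulated mean of \(\hat{\sigma}^2\) should land near \((2/3)(2.3) \approx 1.53\); on the other hand, since \(\hat{\sigma}^2 = \left(\frac{n-1}{n}\right)S^2\), its standard deviation is smaller than that of \(S^2\) by the factor \((n-1)/n\).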
- Estimators can be evaluated based on both bias and variability.
- Mean square error is a combined measure of both an estimator’s bias and its variability.
- The mean square error (MSE) of an estimator \(\hat{\theta}\) of a parameter \(\theta\) is \[ \text{MSE}_\theta(\hat{\theta}) = \textrm{E}\left(\left(\hat{\theta}-\theta\right)^2\right), \qquad \text{a function of $\theta$} \]
- Mean square error measures, on average, how far the estimator deviates from the parameter it is estimating.
- The MSE of an estimator is a function of the parameter \(\theta\). It can be shown[^1] that \[\begin{align*} \text{MSE}_\theta(\hat{\theta}) & = \textrm{Var}(\hat{\theta}) + \left(\textrm{E}(\hat{\theta})-\theta\right)^2\\ & = \textrm{Var}(\hat{\theta}) +\left(\text{bias}_\theta(\hat{\theta})\right)^2 \end{align*}\] (A derivation is sketched just after this list.)
- There can be many “reasonable” estimators of a parameter. Given two estimators, it’s often the situation that neither one has smaller MSE for all potential values of \(\theta\). Choosing an estimator often involves a tradeoff between bias and variability.
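One way to verify the decomposition above, using the computational variance formula recalled in the footnote applied to \(Y = \hat{\theta} - \theta\): \[\begin{align*} \text{MSE}_\theta(\hat{\theta}) & = \textrm{E}\left(\left(\hat{\theta}-\theta\right)^2\right)\\ & = \textrm{Var}\left(\hat{\theta}-\theta\right) + \left(\textrm{E}\left(\hat{\theta}-\theta\right)\right)^2\\ & = \textrm{Var}(\hat{\theta}) + \left(\textrm{E}(\hat{\theta})-\theta\right)^2\\ & = \textrm{Var}(\hat{\theta}) + \left(\text{bias}_\theta(\hat{\theta})\right)^2 \end{align*}\] The second line applies \(\textrm{E}(Y^2) = \textrm{Var}(Y) + (\textrm{E}(Y))^2\); the third uses the fact that subtracting the constant \(\theta\) shifts the mean of \(\hat{\theta}\) but does not change its variance.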
Use the simulation results to compute the MSE of \(S^2\) and \(\hat{\sigma}^2\) when \(\mu = 2.3\). Determine which estimator has smaller MSE when \(\mu=2.3\), but then explain why this information is not very useful.
Recall \[ \textrm{Var}(S^2) = \frac{2\mu^2}{n-1} + \frac{\mu}{n} \] Find and plot the MSE functions of \(S^2\) and \(\hat{\sigma}^2\). Which estimator is preferred? Discuss.
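A plotting sketch (Python with NumPy/Matplotlib; not part of the original handout), using the variance formula above together with the facts that \(S^2\) is unbiased while \(\hat{\sigma}^2 = \left(\frac{n-1}{n}\right)S^2\) has \(\textrm{Var}(\hat{\sigma}^2) = \left(\frac{n-1}{n}\right)^2\textrm{Var}(S^2)\) and bias \(-\mu/n\):

```python
import numpy as np
import matplotlib.pyplot as plt

n = 3                              # sample size from the example
mu = np.linspace(0.01, 10, 500)    # grid of potential parameter values

var_s2 = 2 * mu**2 / (n - 1) + mu / n                          # Var(S^2), from the formula above
mse_s2 = var_s2                                                # S^2 is unbiased, so MSE = variance
mse_sigma2_hat = ((n - 1) / n) ** 2 * var_s2 + (mu / n) ** 2   # variance plus squared bias

plt.plot(mu, mse_s2, label=r"$S^2$")
plt.plot(mu, mse_sigma2_hat, label=r"$\hat{\sigma}^2$")
plt.xlabel(r"$\mu$")
plt.ylabel("MSE")
plt.legend()
plt.show()

# Exact MSEs at mu = 2.3, for comparison with the simulation-based values
mu0 = 2.3
var0 = 2 * mu0**2 / (n - 1) + mu0 / n
print(var0, ((n - 1) / n) ** 2 * var0 + (mu0 / n) ** 2)
```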
Example 22.2 Continuing the situation of estimating \(\mu\) for a Poisson(\(\mu\)) distribution. Consider \(\bar{X}\) and the constant estimator 2.3 (the estimator that always returns the value 2.3, regardless of the sample).
- Find the MSE of the constant estimator 2.3.
- Find the MSE of \(\bar{X}\).
- Suppose \(n=3\). Plot the MSE functions of the two estimators (see the plotting sketch after this list). Does either estimator have a better MSE? Explain.
- Donny Don’t says: “neither estimator has smaller MSE for all values of \(\mu\). So we’ll use the constant estimator 2.3 when \(\mu\) is near 2.3 (between 1.57 and 3.36 if \(n=3\)) and we’ll use \(\bar{X}\) for other values of \(\mu\)”. Do you agree with Donny’s strategy? Do you see any problems with it?
- What happens to the MSEs as \(n\) increases? In particular, what happens to the range of values of \(\mu\) for which the constant estimator 2.3 has smaller MSE than \(\bar{X}\)?
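A plotting sketch for this example (Python with NumPy/Matplotlib; not part of the original handout), using the facts that the constant estimator 2.3 has zero variance and bias \(2.3 - \mu\), while \(\bar{X}\) is unbiased with \(\textrm{Var}(\bar{X}) = \mu/n\):

```python
import numpy as np
import matplotlib.pyplot as plt

n = 3                             # sample size; increase this to explore the last question
mu = np.linspace(0.01, 8, 500)    # grid of potential parameter values

mse_constant = (2.3 - mu) ** 2    # zero variance, so MSE is the squared bias
mse_xbar = mu / n                 # X-bar is unbiased, so MSE = Var(X-bar) = mu / n

plt.plot(mu, mse_constant, label="constant estimator 2.3")
plt.plot(mu, mse_xbar, label=r"$\bar{X}$")
plt.xlabel(r"$\mu$")
plt.ylabel("MSE")
plt.legend()
plt.show()

# Values of mu where the two MSE curves cross: solve (2.3 - mu)^2 = mu / n
print(np.sort(np.roots([1, -(2 * 2.3 + 1 / n), 2.3 ** 2])))   # roughly 1.57 and 3.36 when n = 3
```

Re-running the sketch with larger values of \(n\) is one way to see how the interval on which the constant estimator has smaller MSE changes with the sample size.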
- Given a parameter \(\theta\), it is never possible to find a single estimator that has the smallest MSE for all potential values of \(\theta\).
- Choosing an estimator often involves a tradeoff between bias and variability.
- MLEs generally have good properties. For many situations, roughly:
  - The MLE is asymptotically unbiased; that is, the bias is small when the sample size is large.
  - The variance of an MLE is about as small as it can be (for an unbiased estimator).
- But there are many reasonable estimation procedures besides MLE (e.g., Bayes estimators).
- Bias in estimation is not necessarily bad. Being willing to accept a little bit of bias can often lead to a beneficial reduction in variability.
- On the other hand, try to minimize bias in data collection: how the sample is selected, how the variables are measured, etc.
- A lot of the theory assumes we have a perfect random sample, which is never true in practice
[^1]: Recall: by definition, \(\textrm{Var}(Y)=\textrm{E}[(Y-\textrm{E}(Y))^2]\), but remember the useful computational formula \(\textrm{Var}(Y) = \textrm{E}(Y^2) - (\textrm{E}(Y))^2\). Apply this to \(Y=\hat{\theta}-\theta\), and remember that \(\hat{\theta}\) is random but \(\theta\) is not.