18 Maximum Likelihood Estimation: Principles
Example 18.1 Suppose we are willing to assume a Poisson(\(\mu\)) model for the number of cars sold in a day at a particular car dealership. But we don’t know the value of \(\mu\), so we’ll collect data and use it to estimate \(\mu\). Suppose first that we observe just a single day; let \(X\) be the number of cars sold on this day. (We’ll look at a sample of \(n\) days soon.)
- Does \(X\) take values on a discrete or continuous scale? What about \(\mu\)?
- First some probability questions. Assume \(\mu=2.3\). Find the pmf of \(X\) and sketch a plot of it; what is this a function of? Find and interpret the probability that in a single day there are 3 cars sold.
- Now back to some statistics questions. Now remember that in reality \(\mu\) is unknown. Suppose that \(x=3\) cars are sold in a single day. Now write the pmf plugging in everything we know; what is this a function of? Is this function a pmf?
- Compute and interpret the probability that in a single day there are 3 cars sold if \(\mu = 2.3\).
- Compute and interpret the probability that in a single day there are 3 cars sold if \(\mu = 1\).
- Compute and interpret the probability that in a single day there are 3 cars sold if \(\mu = 5\).
- Compute and interpret the probability that in a single day there are 3 cars sold if \(\mu = 3.5\).
- Suppose that \(x=3\) cars are sold in a single day. Based just on this day, which value — 1, 2.3, 3.5, or 5 — would you choose as your estimate for \(\mu\)? Why?
- We obviously have more choices for our estimate of \(\mu\) than just 1, 2.3, 3.5, or 5. Describe in principle the process you would follow to find the estimate of \(\mu\) based on a single day with 3 cars sold.
- Suppose that we observe \(x=3\) cars sold in a single day. Plot the likelihood function; be sure to label your axes. Based on your plot, what would be your estimate of \(\mu\)?
- Suppose that we observe \(x=3\) cars sold in a single day. Plot the log of the likelihood function. Compare the plots of the likelihood function and the log-likelihood; what do you notice about where the maximum value occurs?
- Now consider a general value of \(x\) cars sold in a single day. Carefully write the likelihood function, and the log-likelihood function. What are these functions of? How would you use them to determine your estimate of \(\mu\) given \(x\) cars sold in a single day? (A numerical sketch of these calculations follows the definitions below.)
- Consider a single observation \(X\) from a population \(p_\theta\) with parameter \(\theta\): Let \(X\sim p_\theta\). The likelihood function given \(X=x\) is defined as \[L(\theta) \equiv L_x(\theta) = p_\theta(x)\]
- That is, the likelihood function is the probability (or density for continuous data \(X\)) of observing the given sample data \(x\) viewed as a function of the unknown parameter(s) \(\theta\). The observed value of \(x\) (data) is treated as a fixed constant.
- The maximum likelihood estimate of \(\theta\) is the value of \(\theta\) for which the observed data has the highest likelihood of occurring.
- The maximum likelihood estimator (MLE) of \(\theta\) is the RV \(\hat{\theta}\) which maximizes the likelihood, i.e. the value \(\hat{\theta}\) which satisfies \[L(\hat{\theta}) \ge L(\theta) \text{ for all } \theta.\]
- Remember that an estimator \(\hat{\theta}\) is a function of \(X\), the observed data.
- It is often convenient to consider the log-likelihood function of \(\theta\): \[\ell(\theta) \equiv \ell_x(\theta) = \log L_x(\theta)\]
- Because \(\log\) is an increasing function it follows that \(\hat{\theta}\) is the MLE of \(\theta\) if and only if \[\ell(\hat{\theta}) \ge \ell(\theta) \text{ for all } \theta.\]
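One possible numerical sketch of the single-observation calculations above, in Python using scipy and matplotlib (the values 1, 2.3, 3.5, 5 and the observation \(x=3\) are the ones from Example 18.1):

```python
import numpy as np
from scipy.stats import poisson
import matplotlib.pyplot as plt

x = 3  # observed number of cars sold in a single day

# P(X = 3) under several candidate values of mu
for mu in [1, 2.3, 3.5, 5]:
    print(f"mu = {mu}: P(X = 3) = {poisson.pmf(x, mu):.4f}")

# Likelihood L(mu) = e^{-mu} mu^3 / 3!, viewed as a function of mu with x = 3 held fixed
mu_grid = np.linspace(0.01, 10, 500)
likelihood = poisson.pmf(x, mu_grid)
log_likelihood = np.log(likelihood)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(mu_grid, likelihood)
ax1.set_xlabel("mu")
ax1.set_ylabel("L(mu)")
ax2.plot(mu_grid, log_likelihood)
ax2.set_xlabel("mu")
ax2.set_ylabel("log L(mu)")
plt.show()

# Both curves peak at the same place: mu-hat = x = 3
print("argmax of likelihood:", mu_grid[np.argmax(likelihood)])
```

Notice that the grid maximizer of both the likelihood and the log-likelihood is (approximately) 3, the observed value of \(x\).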
Example 18.2 Suppose we are willing to assume a Poisson(\(\mu\)) model for the number of cars sold in a day at a particular car dealership. But we don’t know the value of \(\mu\), so we’ll collect data and use it to estimate \(\mu\). But now we estimate \(\mu\) based on a sample of \(n\) days. Let \(X_1, \ldots, X_n\) be the number of cars sold in a random sample of \(n\) days. For example, suppose \(n=3\) and we observe \(x_1 = 3\), \(x_2 = 0\), \(x_3=2\).
- First some probability questions. Assume \(\mu=2.3\) and \(n=3\). Find the joint pmf of \((X_1, X_2, X_3)\); what is this a function of? Find and interpret the probability of observing the sample \((x_1 = 3, x_2 = 0, x_3=2)\).
- Now back to some statistics questions. Now remember that in reality \(\mu\) is unknown. Suppose that the sample \((x_1 = 3, x_2 = 0, x_3=2)\) is observed. Now write the joint pmf plugging in everything we know; what is this a function of? Is this function a pmf?
- Compute the probability of observing the sample \((x_1 = 3, x_2 = 0, x_3=2)\) if \(\mu = 2.3\).
- Compute the probability of observing the sample \((x_1 = 3, x_2 = 0, x_3=2)\) if \(\mu = 1\).
- Compute the probability of observing the sample \((x_1 = 3, x_2 = 0, x_3=2)\) if \(\mu = 5\).
- Compute the probability of observing the sample \((x_1 = 3, x_2 = 0, x_3=2)\) if \(\mu = 3.5\).
- Suppose the sample \((x_1 = 3, x_2 = 0, x_3=2)\) is observed. Which value — 1, 2.3, 3.5, or 5 — would you choose as your estimate for \(\mu\)? Why?
- We obviously have more choices for our estimate of \(\mu\) than just 1, 2.3, 3.5, or 5. Describe in principle the process you would follow to find the estimate of \(\mu\) based on observing the sample \((x_1 = 3, x_2 = 0, x_3=2)\).
- Suppose that we observe the sample \((x_1 = 3, x_2 = 0, x_3=2)\). Plot the likelihood function; be sure to label your axes. Based on your plot, what would be your estimate of \(\mu\)?
- Suppose that we observe the sample \((x_1 = 3, x_2 = 0, x_3=2)\). Plot the log of the likelihood function. Compare the plots of the likelihood function and the log-likelihood; what do you notice about where the maximum value occurs?
- Now consider a general sample of size \(n\). Carefully write the likelihood function, and the log-likelihood function. What are these functions of? How would you use them to determine your estimate of \(\mu\) given the observed sample \(x_1, \ldots, x_n\)? (See the sketch after this list.)
- Consider a random sample of \(n\) values from a population \(p_\theta\) with parameter \(\theta\): Let \(X_1,\ldots,X_n\) be i.i.d. \(\sim p_\theta\).
- The likelihood function of \(\theta\), given \(X_1=x_1, \ldots, X_n=x_n\), is \[L(\theta)\equiv L_{x_1, \ldots, x_n}(\theta) = \prod_{i=1}^n p_\theta(x_i) = p_\theta(x_1)\times\cdots\times p_\theta(x_n)\]
- That is, the likelihood function is the joint density of the sample \(X_1,\ldots,X_n\), evaluated at the observed data values \(x_1,\ldots,x_n\), viewed as a function of the (unknown) parameter(s) \(\theta\). (See the note at the end of this section on the i.i.d. assumption.)
- Note that \(x_1, \ldots, x_n\) represent distinct values (like \(x_1=3, x_2=0, x_3=2\) in the example).
- The MLE of \(\theta\) is the value of \(\theta\) for which the observed sample data has the highest likelihood of occurring.
- The maximum likelihood estimator (MLE) of \(\theta\) is the RV \(\hat{\theta}\) which maximizes the likelihood, i.e. the value \(\hat{\theta}\) which satisfies \[L(\hat{\theta}) \ge L(\theta) \text{ for all } \theta.\]
- Remember that an estimator \(\hat{\theta}\) is a function of \(X_1, \ldots, X_n\).
- The log-likelihood function of \(\theta\) is \[\begin{aligned} \ell(\theta) \equiv \ell_{x_1, \ldots, x_n}(\theta)& = \log L(\theta)\\ & = \sum_{i=1}^n\log p_\theta(x_i) = \log p_\theta(x_1) + \cdots + \log p_\theta(x_n) \end{aligned}\]
- Because \(\log\) is an increasing function it follows that \(\hat{\theta}\) is the MLE of \(\theta\) if and only if \[\ell(\hat{\theta}) \ge \ell(\theta) \text{ for all } \theta.\]
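A similar sketch in Python for the sample \((x_1 = 3, x_2 = 0, x_3 = 2)\) in Example 18.2, assuming the days are i.i.d. as the model states, so the likelihood is the product of the three Poisson pmfs. Maximizing over a grid of \(\mu\) values recovers (approximately) the sample mean \(5/3\):

```python
import numpy as np
from scipy.stats import poisson

data = np.array([3, 0, 2])  # cars sold on each of n = 3 days

def likelihood(mu):
    # joint pmf of the observed sample, viewed as a function of mu
    return np.prod(poisson.pmf(data, mu))

def log_likelihood(mu):
    return np.sum(poisson.logpmf(data, mu))

# probability of observing exactly this sample under candidate values of mu
for mu in [1, 2.3, 3.5, 5]:
    print(f"mu = {mu}: L(mu) = {likelihood(mu):.5f}")

# maximize the log-likelihood over a fine grid of mu values
mu_grid = np.linspace(0.01, 10, 2000)
loglik = np.array([log_likelihood(mu) for mu in mu_grid])
print("grid MLE:   ", mu_grid[np.argmax(loglik)])  # close to the sample mean
print("sample mean:", data.mean())                 # 5/3
```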
Example 18.3 Suppose instead of observing the number of cars sold on individual days, you only knew that in a sample of 3 days there were 5 cars sold in total.
- Probability question. If \(X_1, \ldots, X_n\) are i.i.d. Poisson(\(\mu\)), what is the distribution of \(X_1+\cdots+X_n\)?
- Write the likelihood function for a sample of 3 days in which a total of 5 cars were sold. How does this compare to the likelihood function based on the sample \((3, 0, 2)\)?
- Find the MLE of \(\mu\) based on a sample of 3 days in which a total of 5 cars are sold.
- In many situations, the full sample data is not necessary to find the MLE; rather, a few descriptive statistics are often sufficient to evaluate the likelihood and determine the MLE. (A short derivation for the Poisson case follows this list.)
- Poisson(\(\mu\)): the MLE of \(\mu\) is the sample mean \(\bar{X}\)
- Binomial(\(n\), \(p\)), with \(n\) known: the MLE of \(p\) is the sample proportion \(\hat{p} = X/n\), where \(X\) is the number of “successes” in the sample
- Exponential(\(\lambda\)): the MLE of \(\lambda\) is the sample rate \(1/\bar{X}\); the usual situation is where the \(X\)’s measure times between “events” and so \(\bar{X}\) is the sample mean time between “events”.
- Normal(\(\mu\), \(\sigma\)), with both \(\mu\) and \(\sigma\) unknown: the MLEs of \(\mu\), \(\sigma^2\), and \(\sigma\), are, respectively \[\begin{align*} \hat{\mu} & = \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i\\ \hat{\sigma}^2 & = \frac{1}{n} \sum_{i=1}^n \left(X_i - \bar{X}\right)^2\\ \hat{\sigma} & = \sqrt{\frac{1}{n} \sum_{i=1}^n \left(X_i - \bar{X}\right)^2} \end{align*}\]
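For the Poisson claim above, here is a short calculus sketch showing why the MLE of \(\mu\) is the sample mean. It also answers Example 18.3, since the log-likelihood depends on the data only through the total \(\sum_{i=1}^n x_i\):
\[\begin{aligned}
\ell(\mu) & = \sum_{i=1}^n \log\left(e^{-\mu}\frac{\mu^{x_i}}{x_i!}\right) = -n\mu + \left(\sum_{i=1}^n x_i\right)\log\mu - \sum_{i=1}^n \log(x_i!)\\
\ell'(\mu) & = -n + \frac{1}{\mu}\sum_{i=1}^n x_i = 0 \quad\Longrightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}
\end{aligned}\]
The second derivative is \(-\left(\sum_i x_i\right)/\mu^2 < 0\), so this critical point is a maximum. With \(n = 3\) days and a total of \(\sum_i x_i = 5\) cars sold, this gives \(\hat{\mu} = 5/3\), the same answer as with the full sample \((3, 0, 2)\).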
Example 18.4 Assume a Poisson(\(\mu\)) model for the number of cars sold in a day at a particular car dealership. In 3 days, 5 cars are sold.
- Consider \(\sqrt{\mu}\). Is \(\sqrt{\mu}\) a parameter or a statistic? How would you interpret \(\sqrt{\mu}\) in this context?
- What do you think the MLE of \(\sqrt{\mu}\) is?
- Consider \(e^{-\mu}\). Is \(e^{-\mu}\) a parameter or a statistic? How would you interpret \(e^{-\mu}\) in this context?
- What do you think the MLE of \(e^{-\mu}\) is?
- Consider \(\mu e^{-\mu}\). Is \(\mu e^{-\mu}\) a parameter or a statistic? How would you interpret \(\mu e^{-\mu}\) in this context?
- What do you think the MLE of \(\mu e^{-\mu}\) is?
- Consider \(\mu^2 e^{-\mu}/2\). Is \(\mu^2 e^{-\mu}/2\) a parameter or a statistic? How would you interpret \(\mu^2 e^{-\mu}/2\) in this context?
- What do you think the MLE of \(\mu^2 e^{-\mu}/2\) is?
- We started by assuming a Poisson model, but how do we know that is a reasonable assumption? Suppose we observe the number of cars sold for a random sample of days. Suggest how you might use the sample data to investigate whether a Poisson model is appropriate.
- A parameter is any characteristic of a population distribution.
- For many populations, there are a few “main” parameters (like \(\mu\) for Poisson(\(\mu\)), or \((\mu, \sigma)\) for Normal(\(\mu\), \(\sigma\))), but any characteristic of a population is a parameter and can be estimated.
- Invariance property of MLEs. If \(\hat{\theta}\) is the MLE of \(\theta\), then \(g(\hat{\theta})\) is the MLE of \(g(\theta)\).
- That is, the “MLE of a function is the function of the MLE”
- The invariance principle says that if you find the MLE of the “main parameters” then you automatically get the MLEs of all parameters that are based on them. (See the sketch after this list.)
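A minimal sketch in Python of the invariance property applied to Example 18.4 (3 days, 5 cars sold in total, so \(\hat{\mu} = 5/3\)): the MLE of each transformed parameter is just the transformation applied to \(\hat{\mu}\). In context, \(e^{-\mu}\), \(\mu e^{-\mu}\), and \(\mu^2 e^{-\mu}/2\) are the probabilities of selling 0, 1, and 2 cars in a day.

```python
import numpy as np
from scipy.stats import poisson

mu_hat = 5 / 3  # MLE of mu: the sample mean (5 cars over 3 days)

# By invariance, the MLE of g(mu) is g(mu_hat) for any function g
print("MLE of sqrt(mu):            ", np.sqrt(mu_hat))
print("MLE of e^{-mu} = P(X=0):    ", np.exp(-mu_hat))       # equals poisson.pmf(0, mu_hat)
print("MLE of mu e^{-mu} = P(X=1): ", poisson.pmf(1, mu_hat))
print("MLE of mu^2 e^{-mu}/2 = P(X=2):", poisson.pmf(2, mu_hat))
```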
Example 18.5 Suppose that within a certain population, credit scores follow a Normal(\(\mu\), \(\sigma\)) distribution. In a random sample of 5 individuals from this population, the credit scores are 600, 630, 680, 700, 770.
- Compute the MLE of \(\mu\).
- Compute the MLE of \(\sigma\).
- Explain in words what it means for these parameters to be the MLE.
- Compute the MLE of the 16th percentile of credit scores for this population.
- Compute the MLE of the 90th percentile of credit scores for this population.
- Compute the MLE of the population proportion of individuals with a credit score above 750. (A sketch of these computations follows this list.)
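A sketch of these computations in Python with scipy. Note that the MLE of \(\sigma\) divides by \(n\), not \(n-1\), so it differs slightly from the usual sample standard deviation; the percentiles and tail probability then follow from invariance by plugging the MLEs into the Normal quantile and tail functions.

```python
import numpy as np
from scipy.stats import norm

scores = np.array([600, 630, 680, 700, 770])

mu_hat = scores.mean()                                 # MLE of mu: the sample mean
sigma_hat = np.sqrt(np.mean((scores - mu_hat) ** 2))   # MLE of sigma: divide by n, not n - 1

print("MLE of mu:   ", mu_hat)
print("MLE of sigma:", sigma_hat)

# By invariance, plug the MLEs into the Normal quantile/tail functions
print("MLE of 16th percentile:", norm.ppf(0.16, loc=mu_hat, scale=sigma_hat))
print("MLE of 90th percentile:", norm.ppf(0.90, loc=mu_hat, scale=sigma_hat))
print("MLE of P(score > 750): ", norm.sf(750, loc=mu_hat, scale=sigma_hat))
```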
- Maximum likelihood estimation procedures are widely applicable and maximum likelihood estimators have many nice theoretical properties.
- However, there are some drawbacks:
- Finding the MLE may require numerical maximization, which can be computationally intensive (especially when there are many parameters)
- MLEs may not be reliable in small samples. (Many of the nice theoretical properties of MLEs hold only for large samples.)
Note: We usually have i.i.d. observations, and so the joint density is just the product of the marginal densities, as we have defined it here. However, in more general situations the likelihood can be defined directly as the joint density of the observed data.