17  From Probability to Statistics

Example 17.1 In each of the following scenarios, one of (i) and (ii) is a probability question and one is a statistics question. Classify each question as “probability” or “statistics”. Then discuss some general features of the probability questions and some general features of the statistics questions.

  1. U.S. Satisfaction

    • Assume that 30% of Americans are satisfied with the way things are going in the U.S. If a random sample of 1000 Americans is selected, what are the chances that more than 330 Americans in the sample are satisfied with the way things are going in the U.S.?

    • In a random sample of 1000 Americans, 330 Americans are satisfied with the way things are going in the U.S. What percent of all Americans are satisfied with the way things are going in the U.S.?

  2. Great white shark length

    • Assume that lengths of female great white sharks follow a Normal distribution with mean 16 ft and standard deviation 3 ft. If a random sample of 40 female great white sharks is selected, what are the chances that the average length in the sample is between 16.5 and 17.5 ft?

    • In a sample of 40 female great white sharks, the sample mean length is 17.0 ft. What is the average length of female great white sharks?

  3. In a clinical trial, 200 subjects are randomly assigned to receive either an experimental vaccine or no treatment (e.g., a placebo), and then each subject is tested to see if they developed immunity.

    • If the vaccine is in reality not effective, what are the chances that the proportion of subjects who develop immunity is twice as large for those who received the vaccine as for those who did not?

    • Among the 100 subjects who received the vaccine, 20 developed immunity; among the 100 subjects who did not receive the vaccine, 10 developed immunity. Is there evidence that the vaccine is effective?

  4. Snap streaks

    • Assume that lengths of streaks on Snapchat follow an Exponential distribution, with mean 75 for teen users and mean 40 for adult users. A random sample of 200 Snapchat users is selected, 100 teens and 100 adults. What are the chances that the sample mean streak length for teens is within 20 of the sample mean streak length for adults?

    • In a random sample of 100 teen Snapchat users, the sample mean streak length is 80. In a random sample of 100 adult Snapchat users, the sample mean streak length is 35. In general, how much longer do streak lengths for teen Snapchat users tend to be than for adult Snapchat users?
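
The contrast between the two kinds of question in scenarios 1 and 2 can be sketched numerically. The block below is only an illustration, not a worked solution from the text. It assumes that the count of satisfied Americans in the sample can be modeled as Binomial(1000, 0.30), and it uses the fact that the mean of 40 independent Normal(16, 3) lengths is Normal with mean 16 and standard deviation \(3/\sqrt{40}\). The function and variable names (`binomial_pmf`, `prob_more_than_330`, and so on) are made up for this sketch.

```python
import math
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed on the log scale to avoid overflow."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )
    return math.exp(log_pmf)


# Scenario 1, first question: the population percentage (30%) is assumed known,
# and we ask how the sample count is likely to behave.
# Assumption: the sample count is modeled as Binomial(1000, 0.30).
n, p = 1000, 0.30
prob_more_than_330 = sum(binomial_pmf(k, n, p) for k in range(331, n + 1))

# Scenario 1, second question: the sample result (330 of 1000) is known,
# and the population percentage is the unknown. The sample proportion is one
# natural point estimate; it does not say how far off the estimate might be.
sample_proportion = 330 / 1000

# Scenario 2, first question: the population distribution Normal(16, 3) is assumed
# known, so the sample mean of 40 lengths is Normal(16, 3 / sqrt(40)).
xbar_dist = NormalDist(mu=16, sigma=3 / math.sqrt(40))
prob_between = xbar_dist.cdf(17.5) - xbar_dist.cdf(16.5)

# Scenario 2, second question: the sample mean (17.0 ft) is known,
# and the population mean is the unknown.
sample_mean_length = 17.0

print(f"P(more than 330 of 1000 satisfied | p = 0.30) = {prob_more_than_330:.4f}")
print(f"Point estimate of population proportion       = {sample_proportion:.2f}")
print(f"P(16.5 < sample mean < 17.5 | mean 16, SD 3)   = {prob_between:.4f}")
print(f"Point estimate of population mean length      = {sample_mean_length:.1f} ft")
```

In the first question of each pair, the calculation starts from an assumed population and produces a probability about a sample. In the second, the sample is given and the only output is an estimate of an unknown population quantity, with no guarantee about how close it is.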








Example 17.2 Jerry has just opened a car dealership in a new location. He assumes that the number of cars he’ll sell in a day has a Poisson(\(\mu\)) distribution. But he doesn’t know the value of \(\mu\), so he plans to use data from his first few days of business to estimate \(\mu\). Let \(X_1, \ldots, X_n\) represent the number of cars sold for a sample of \(n\) days.

  1. What plays the role of the population distribution? What is the parameter? How do you interpret the parameter in this context?




  2. Suggest one estimator of \(\mu\). (Hint: what does \(\mu\) represent for a Poisson distribution?)




  3. Suppose he collects data on \(n=3\) days and observes \(x_1=3\) cars sold on day 1, \(x_2=0\) cars sold on day 2, and \(x_3 = 2\) cars sold on day 3. Compute the value of your estimator from the previous part for this sample. That is, what is your estimate of \(\mu\) based on this sample?




  4. Suggest another estimator of \(\mu\). (Hint: what else does \(\mu\) represent for a Poisson distribution?)




  5. For a Poisson(\(\mu\)) distribution, \(\mu\) is both the population mean and the population variance. So one reasonable estimator of \(\mu\) is the sample mean \(\bar{X}\). Another reasonable estimator is the sample variance. But there are two commonly used formulas for the sample variance, each of which provides an estimator of \(\mu\): \[ \begin{aligned} \hat{\sigma}^2 & = \frac{1}{n}\sum_{i=1}^n\left(X_i-\bar{X}\right)^2\\ S^2 & = \frac{1}{n-1}\sum_{i=1}^n\left(X_i-\bar{X}\right)^2 \end{aligned} \] Compute the value of each of these estimators for the (3, 0, 2) sample. That is, what is your estimate of \(\mu\) based on this sample if you use each of these estimators?




  6. At his previous dealership at a different location, Jerry sold 2.3 cars per day on average. Consider the estimator of \(\mu\): \(\frac{n}{n+100}\bar{X}+ \frac{100}{n+100}(2.3)\). Explain in words what this estimator represents. In particular, what happens as \(n\) increases? Then compute the value of this estimator for the (3, 0, 2) sample.




  7. According to the Poisson(\(\mu\)) distribution, the probability of selling 0 cars in a day is \(p_0=e^{-\mu}\). Rearranging, \(\mu = -\log p_0\). Therefore, if \(\hat{p}_0\) is the sample proportion of days with 0 cars sold, then another estimator of \(\mu\) is \(-\log\hat{p}_0\). Compute the value of this estimator for the (3, 0, 2) sample.




  8. We have seen a few different, seemingly reasonable estimators of \(\mu\), and each produced a different estimate of \(\mu\) based on the same (3, 0, 2) sample data. Which of these numbers is the best estimate of \(\mu\)? That is, which of these estimates is closest to \(\mu\)?




  9. How might you investigate which of the estimators corresponds to the best estimation procedure?
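
The estimators in Example 17.2 can be evaluated on the (3, 0, 2) sample, and one possible way to compare them as procedures is simulation: pick a value of \(\mu\), generate many Poisson samples, and see how close each estimator tends to land. The block below is a minimal sketch of that idea, not the only way to carry out the comparison. The value `true_mu=2.0`, the number of repetitions, the seed, and all function names are assumptions made for illustration. Because Jerry does not know \(\mu\), a fuller investigation would repeat the simulation over a range of values. The estimator \(-\log\hat{p}_0\) is undefined when no day in the sample has 0 sales, so the sketch reports how often that happens and averages only over the samples where it is defined.

```python
import math
import random

PRIOR_MEAN = 2.3    # average daily sales at Jerry's previous dealership
PRIOR_WEIGHT = 100  # the constant 100 in the weighted estimator


def sample_mean(xs):
    return sum(xs) / len(xs)


def variance_divisor_n(xs):
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / len(xs)


def variance_divisor_n_minus_1(xs):
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)


def weighted_with_previous(xs):
    n = len(xs)
    w = n / (n + PRIOR_WEIGHT)
    return w * sample_mean(xs) + (1 - w) * PRIOR_MEAN


def neg_log_proportion_zero(xs):
    p0_hat = sum(1 for x in xs if x == 0) / len(xs)
    # Undefined when no day has 0 sales; flagged with NaN.
    return -math.log(p0_hat) if p0_hat > 0 else math.nan


ESTIMATORS = {
    "sample mean": sample_mean,
    "variance (divisor n)": variance_divisor_n,
    "variance (divisor n-1)": variance_divisor_n_minus_1,
    "weighted with 2.3": weighted_with_previous,
    "-log(prop. of zeros)": neg_log_proportion_zero,
}

# Estimates from the observed sample in the example.
observed = [3, 0, 2]
estimates = {name: f(observed) for name, f in ESTIMATORS.items()}
for name, value in estimates.items():
    print(f"{name:>24}: {value:.4f}")


def poisson_draw(mu, rng):
    """One Poisson(mu) draw via Knuth's multiplication method (fine for small mu)."""
    threshold = math.exp(-mu)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_mse(true_mu, n, reps=10_000, seed=0):
    """Return {name: (mean squared error, fraction of samples where defined)}."""
    rng = random.Random(seed)
    sq_errors = {name: [] for name in ESTIMATORS}
    for _ in range(reps):
        xs = [poisson_draw(true_mu, rng) for _ in range(n)]
        for name, f in ESTIMATORS.items():
            est = f(xs)
            if not math.isnan(est):
                sq_errors[name].append((est - true_mu) ** 2)
    return {
        name: (sum(v) / len(v) if v else math.nan, len(v) / reps)
        for name, v in sq_errors.items()
    }


# Hypothetical scenario: the true mu is 2.0 and n = 3 days are observed.
results = simulate_mse(true_mu=2.0, n=3)
for name, (mse, frac_defined) in results.items():
    print(f"{name:>24}: MSE = {mse:.3f} (defined in {frac_defined:.0%} of samples)")
```

Mean squared error is only one possible criterion, and the ranking of the estimators can change with the assumed \(\mu\) and with \(n\). For example, the weighted estimator should do well when the assumed \(\mu\) is near 2.3 and poorly when it is far from 2.3.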





  1. Many problems of interest involve multiple variables (e.g., height, weight, income), in which case the population distribution is a joint distribution. But it is simpler, both conceptually and notationally, to consider just a single variable of interest.

  2. There are two main philosophical approaches to statistical inference that differ mainly in whether they emphasize “fixed number” or “unknown”. The frequentist approach treats a parameter as a fixed number. The Bayesian approach treats an unknown (that is, uncertain) parameter as a random variable with a probability distribution that describes the degree of uncertainty. Both approaches are valid, each with advantages and disadvantages. The frequentist approach has been historically more prevalent, but the Bayesian approach has gained in popularity in the last 30 or so years.

  3. There is an abuse of notation in this expression: \(p_\theta\) on the left represents the joint distribution of \((X_1, \ldots, X_n)\), a function with \(n\) inputs; \(p_\theta\) on the right represents the marginal distribution of each \(X_i\) (the same for every \(i\), since the \(X_i\) are identically distributed), a function with a single input.