Homework 8 solutions
See code and results in Colab notebook.
Problem 1
In roulette, a bet on a single number has a 1/38 probability of success and pays 35-to-1. That is, if you bet 1 dollar, your net winnings are -1 with probability 37/38 and +35 with probability 1/38. Consider betting on a single number on each of \(n\) spins of a roulette wheel. Let \(\bar{X}_n\) be your average net winnings per bet.
For each of the values \(n = 10\), \(n = 100\), \(n = 1000\):
- Compute \(\text{E}(\bar{X}_n)\)
- Compute \(\text{SD}(\bar{X}_n)\)
- Coding required. Run a simulation to determine if the distribution of \(\bar{X}_n\) is approximately Normal
- Coding required. Use simulation to approximate \(\text{P}(\bar{X}_n >0)\), the probability that you come out ahead after \(n\) bets
- If \(n=1000\) use the Central Limit Theorem to approximate \(\text{P}(\bar{X}_{1000} >0)\), the probability that you come out ahead after 1000 bets.
- The casino wants to determine how many bets on a single number are needed before they have (at least) a 99% probability of making a profit. (Remember, the casino profits if you lose; that is, if \(\bar{X}_n <0\).) Use the Central Limit Theorem to determine the minimum number of bets (keeping in mind that \(n\) must be an integer). You can assume that whatever \(n\) is, it’s large for the CLT to kick in.
Solution
Let \(X\) be the winnings on a single bet. The population mean is \[ \mu = \text{E}(X) = (35)(1/38) + (-1)(37/38) = -2/38 \approx -0.05 \] The population variance is \[ \sigma^2 = \text{Var}(X) = \text{E}(X^2) - (\text{E}(X))^2 = (35)^2(1/38) + (-1)^2(37/38) - (-2/38)^2 = 33.2 \] The population SD is \(\sigma = 5.76\) dollars.
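As a quick check on this arithmetic, the mean and SD can be computed directly from the two-outcome distribution. A minimal sketch (the variable names are my own):

```python
# Net winnings for a $1 single-number bet: +35 with prob 1/38, -1 with prob 37/38
values = [35, -1]
probs = [1/38, 37/38]

mu = sum(v * p for v, p in zip(values, probs))       # E(X) = -2/38 ≈ -0.053
ex2 = sum(v**2 * p for v, p in zip(values, probs))   # E(X^2)
sigma = (ex2 - mu**2) ** 0.5                         # SD(X) ≈ 5.76
```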
- \(\text{E}(\bar{X}_n)=\mu = -2/38\) for any \(n\)
- \(\text{SD}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}} = \frac{5.76}{\sqrt{n}}\), so
the SDs are approximately 1.822, 0.576, and 0.182 for \(n\) = 10, 100, 1000, respectively. As the sample size increases, the sample-to-sample variability of sample means decreases.
- See simulation results in the Colab notebook. The distribution does not look anything like Normal for \(n=10\) or \(n=100\). The shape of the sample-to-sample distribution of sample means looks approximately Normal for \(n=1000\) (though there is still a discrete-versus-continuous issue).
- See simulation results in the Colab notebook. Your expected winnings are negative so the game is biased against you. In the long run, you lose, and the casino wins. So you don’t want to play in the long run. In the short run, there is a tradeoff: the more games you play the more chances you have to win at least once, and winning at least once can offset your losses. If you just play 10 games, the probability of winning at least once doesn’t offset your losses as effectively as when you play 100 games. In the short run, your probability of coming out ahead increases as you play more games, but eventually it starts to decrease with \(n\) as you play more and more games. (There is some weird discreteness in terms of number of possibilities that happens so the pattern is a little more complicated.) However, remember that \(\text{E}(\bar{X}_n)\) is negative for any \(n\). For example, consider \(n=35\). Suppose that every day you go to the casino and play 35 games. Then on about 60% of days you’ll end up ahead for that day. However, even though you end up ahead on more days than not, your winnings on the days you end up ahead are not enough to offset your losses on the other days. Over many days of 35 bets every day, you will lose about 35(2/38) = 1.84) dollars on average per day.
- Assuming \(n=1000\) is large enough for the CLT to kick in, \(\bar{X}_n\) has an approximately Normal distribution with mean \(-2/38\) and SD \(5.76/\sqrt{1000}=0.182\). A value of 0 is \((0 - (-2/38))/0.182 = 0.289\) SDs above the mean. From the empirical rule, the probability is between 0.31 and 0.5; software shows it’s 0.386.
- Assuming \(n\) is large enough for the CLT to kick in, \(\bar{X}_n\) has an approximately Normal distribution with mean \(-2/38\) and SD \(5.76/\sqrt{n}\). We want to find \(n\) so that \(\text{P}(\bar{X}_n <0)= 0.99\); that is, we want 0 to be the 99th percentile. For a Normal distribution, the 99th percentile is 2.33 SDs above the mean, so we want \(n\) with \(0 = -2/38 + 2.33(5.76/\sqrt{n})\). Solving for \(n\) gives \(n = (2.33(5.76)/(2/38))^2\), which works out to 64878 when computed with unrounded values of \(z\) and \(\sigma\). (Roulette spins happen about once every minute at a casino. Even if the casino only has a single wheel with a single bet on each spin, the casino would easily clear more than the required number of bets in a day. Your bets are essentially free money for the casino.)
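The simulation itself lives in the Colab notebook; a rough sketch of both the simulation and the CLT calculations (using numpy, with function and variable names of my own choosing) could look like this:

```python
import math
import numpy as np

rng = np.random.default_rng(42)

mu = -2 / 38
sigma = math.sqrt(35**2 * (1/38) + 37/38 - mu**2)     # ≈ 5.76

def simulate_prob_ahead(n, reps=100_000):
    """Approximate P(Xbar_n > 0) by simulating reps players making n bets each."""
    wins = rng.binomial(n, 1/38, size=reps)   # winning bets per player
    total = 36 * wins - n                     # 35*wins minus (n - wins) dollars lost
    return np.mean(total > 0)

# CLT approximation of P(Xbar_1000 > 0) via the standard Normal cdf
z = (0 - mu) / (sigma / math.sqrt(1000))
p_clt = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))    # ≈ 0.386

# Minimum n for the casino's 99% profit probability (z_{0.99} ≈ 2.326348)
n_min = math.ceil((2.326348 * sigma / (2/38)) ** 2)   # 64878
```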
Problem 2
The standard measurement of the alcohol content of drinks is alcohol by volume (ABV), which is given as the volume of ethanol as a percent of the total volume of the drink. In a sample of 67 brands of beer, the mean ABV is 4.61 percent and the standard deviation of ABV is 0.754 percent.
- First, to get some practice, answer the following questions by computing by hand using only the information provided here.
- Coding required. Then, use Python to produce the summaries and analysis needed to answer the questions. Be sure to also include appropriate plots of the beer data.
- Compute a 95% confidence interval for the appropriate population mean.
- Write a clearly worded sentence reporting your confidence interval in context.
- Is 4.5% a plausible value of the parameter? Explain briefly.
- One of the brands of beer in the sample is O’Doul’s, a non-alcoholic beer. The ABV for O’Doul’s is 0.4% (it has a bit of alcohol.) Suppose O’Doul’s is removed from the data set. Compute the sample mean ABV of the remaining 66 brands.
- The sample SD of ABV of the remaining 66 brands is 0.55 percent. Explain intuitively why this value is smaller than the sample SD of all 67 brands.
- Compute the 95% confidence interval based on the sample with O’Doul’s removed. Compare to the original interval, both in terms of center of the CI and its width. Explain briefly.
- Based on the interval based on the sample with O’Doul’s removed, is 4.5% a plausible value of the parameter? Explain briefly.
- Which of the analyses is more appropriate: with or without O’Doul’s? Explain your reasoning.
Solution
- The SE of \(0.754/\sqrt{67}= 0.092\) measures the sample-to-sample variability of sample mean ABV over many samples of 67 brands of beer each. The 95% CI is \(4.61 \pm 2 \times 0.092 \Rightarrow 4.61 \pm 0.184 \Rightarrow [4.43, 4.79].\)
- We estimate with 95% confidence that the population mean ABV for brands of beer is between 4.42 and 4.79 percent.
- 4.5 percent is in the confidence interval, so yes, it is a plausible value of the parameter (if 95% confidence represents our criterion for plausibility).
- The original sample mean of 67 values is 4.61, so the sum of the 67 values is \(67\times 4.61 = 308.87\). Removing the outlier leaves 66 values with a sum of \(308.87-0.4 = 308.47\), so the sample mean of the 66 values is \(308.47/66 = 4.67\). The outlier pulled the mean down, so the mean is larger without it.
- The sample SD of ABV was larger with O’Doul’s; the outlier pulled up the average distance from the mean.
- O’Doul’s was bringing down the mean, so removing it increases the sample mean and the center of the confidence interval. Removing O’Doul’s also results in a smaller sample SD, and so a narrower CI.
- Technically, no, since the value 4.5 percent lies outside the CI. (Though if we really want to assess the plausibility of a particular value like 4.5, we would want to compute a p-value to measure the strength of the evidence; here we’re basically just saying that the p-value is less than 0.05.)
- In general, EXCLUDING OUTLIERS WITHOUT A COMPELLING REASON IS BAD. However, in this case the population we are interested in is brands of beer, probably meaning beer with actual alcohol in it. So it makes sense to only consider a sample of alcoholic beers, and so I think the more appropriate analysis is the one with the non-alcoholic beer excluded.
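Both intervals above can be reproduced from the summary statistics alone. A sketch (using a 2-SE margin, as in the hand calculations):

```python
import math

# Summary statistics given in the problem (ABV in percent)
n_all, mean_all, sd_all = 67, 4.61, 0.754

se_all = sd_all / math.sqrt(n_all)                     # ≈ 0.092
ci_all = (mean_all - 2 * se_all, mean_all + 2 * se_all)

# Remove O'Doul's (ABV 0.4) and recompute the mean from the total
mean_no = (n_all * mean_all - 0.4) / (n_all - 1)       # ≈ 4.67
sd_no = 0.55                                           # given in the problem
se_no = sd_no / math.sqrt(n_all - 1)                   # ≈ 0.068
ci_no = (mean_no - 2 * se_no, mean_no + 2 * se_no)
```

Note that `ci_no` lies entirely above 4.5, while `ci_all` contains it, matching the two plausibility conclusions.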
Problem 3
Do non-smokers have better lungs than smokers? One measure of lung function is forced expiratory volume (FEV), the amount of air (in liters) an individual can exhale in the first second of forceful breath. Larger values of FEV are indicative of better functioning lungs. In a study, FEV was measured for a group of 654 subjects, along with whether they smoked and other variables like age (years). Note: I’m purposely not yet giving you much information about how the sample was collected.
- Coding required. Use Python to summarize the FEV data. Summarize the distribution of FEV both overall and separately for smokers and non-smokers.
- Suppose you want to fill in the blanks in the following sentence: 95% of FEV values in the population are between [blank] and [blank]. Assuming this sample is representative, provide reasonable values that fill in the blanks, and explain your reasoning. Hint: it is better to give a correct rough estimate than a precise but incorrect answer, so think before trying to compute anything. (Since you don’t have much information about the sample, you don’t have to clarify yet what “population” is appropriate.)
- Using appropriate summary statistics, compute by hand a 95% confidence interval for the appropriate difference in population means.
- Coding required. Use Python to compute the confidence interval and compare to what you computed by hand.
- Write a clearly worded sentence interpreting the confidence interval in context.
- Assuming this is a representative sample, are we reasonably confident that one group (smokers or non-smokers) tends to have better FEV? Why? Which group?
- Coding required. You should have observed something surprising in the previous parts. Investigate some of the other variables in the data set and produce a few appropriate plots or summaries that provide an explanation for the surprising result. Write a clearly worded sentence or two summarizing your explanation.
Solution
See Colab notebook for results. FEV for smokers tends to be higher than for non-smokers (which is the opposite of what you would expect).
This is NOT a confidence interval question. Find the 2.5th and 97.5th percentiles of the values: between 1.3 and 4.6. But just eyeballing from the histogram, it seems like most values are between about 1 and 5 liters.
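In the notebook, those percentiles come from a call like the one below. The data-loading line and the column name `fev` are assumptions, with a synthetic stand-in here so the snippet runs on its own:

```python
import numpy as np

# In the notebook: fev = data["fev"].to_numpy()  (column name assumed)
# Synthetic stand-in, roughly matching the sample's center and spread:
rng = np.random.default_rng(1)
fev = rng.normal(2.6, 0.87, size=654).clip(min=0.8)

lo, hi = np.percentile(fev, [2.5, 97.5])  # middle 95% of sample values
```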
Observed difference in means: \(3.28 - 2.56 = 0.72\). The SE is \(\sqrt{\frac{0.85^2}{589}+\frac{0.75^2}{65}} = 0.1\). The endpoints of the 95% CI are \(0.72 \pm 2 \times 0.1 = 0.72 \pm 0.2\), an interval of [0.52, 0.92].
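As a check on the hand calculation, using the group summary statistics (group sizes and SDs as given above; the variable names are my own):

```python
import math

# Smokers: n=65, mean 3.28, SD 0.75; non-smokers: n=589, mean 2.56, SD 0.85
n1, m1, s1 = 65, 3.28, 0.75
n2, m2, s2 = 589, 2.56, 0.85

diff = m1 - m2                               # 0.72
se = math.sqrt(s1**2 / n1 + s2**2 / n2)      # ≈ 0.10
ci = (diff - 2 * se, diff + 2 * se)          # ≈ (0.52, 0.92)
```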
See Colab notebook; results are comparable.
We estimate with 95% confidence that the population mean FEV for smokers is between 0.51 and 0.91 liters greater than the population mean FEV for non-smokers. (We’ll clarify “population” below.)
Smokers, because of the previous part: the entire confidence interval lies above 0, so we are reasonably confident that smokers tend to have higher FEV.
See Colab notebook for some plots. The 654 subjects in this study were all children! Their ages varied from 3 to 19 years old. This would certainly limit the “population” we would be willing to generalize to.
Regarding why smokers tend to have higher FEV:
- Smokers tend to be older.
- Older kids tend to have larger and better lung capacities.
- In statistical terms, age is a confounding variable. (We could use regression to control for age.)
Problem 4
(I had much longer instructions written, but I think they weren’t helpful. If you have questions about how the applet is working, don’t hesitate to ask.)
You are going to use a simulation applet that randomly generates confidence intervals for a population mean to help you understand some ideas. The applet calculates a confidence interval for each set of randomly generated data.
- In the box for “Statistic” select “Means”.
- In the box next to “Method” select “t”.
- Start with “Number of intervals” equal to 1 at first to see what is happening, but then change it to lots.
Experiment with different simulations; be sure to change
- Distribution
- Population Mean
- Population SD
- Sample size
- Confidence level
Then write a paragraph or two or some bullet points of your main observations. Based on this applet, what can you say about how confidence intervals work? How does changing the inputs affect the confidence intervals and the Results?
Solution
- Simulated coverage level is generally close to confidence level, unless sample size is small and population is not Normal
- Increasing population SD makes the intervals wider (but you would have no control over this in practice)
- Increasing the sample size makes the intervals narrower (that is, more precise estimates for the same level of confidence)
- Increasing the confidence level makes the intervals wider
- Since sample means are unbiased estimators of the population mean, when the population mean changes, the sample means follow it.
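These observations can be checked outside the applet too. Here is a minimal coverage simulation under one assumed setting (Normal population, \(n = 30\), 95% t intervals; the critical value 2.045 is \(t_{0.975}\) with 29 df):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sd, n, reps = 10, 3, 30, 2000
t_crit = 2.045                        # t_{0.975} with 29 degrees of freedom

covered = 0
for _ in range(reps):
    sample = rng.normal(mu, sd, size=n)
    xbar, s = sample.mean(), sample.std(ddof=1)
    half = t_crit * s / np.sqrt(n)
    covered += (xbar - half) <= mu <= (xbar + half)

coverage = covered / reps             # simulated coverage, close to 0.95
```

Rerunning with a skewed population or a small \(n\) (as the applet allows) is where the coverage starts to drift from the nominal level.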