On February 2nd, we left Shmoop headquarters in the morning to collect some very special data. This was the day we would finally settle the question: how much wood can a woodchuck chuck?
We randomly selected woodchucks to use in the study, gave them a pile of wood to chuck, and let them go at it. We recorded their individual results, and used them to calculate a sample mean. It was surprisingly high.
We, of course, were thrilled. People always asked how much a woodchuck would chuck if a woodchuck could chuck wood. It turns out that no one had asked them to even try. We went to sleep that night dreaming a satisfied dream about our data.
When we woke up, it was February 2nd again, and our data was gone. It was one of those February 2nds, was it? We weren't going to let something like a closed time loop stop us. At least we remembered the sample mean we had already calculated.
So, we set out to repeat our data collection for the first time. We made a new random selection of our woodchucks. They chucked that wood, we recorded the results, found our new sample mean, and woke up on February 2nd again.
This has been going on for a while now. It's starting to get a bit monotonous, but at least we were able to remember the sample means from every trial. If one sample mean is a good estimate for a population parameter, just think of how good a few hundred sample means are.
That's when we got the idea to put all of our results—all of the sample means we had found—on a single graph together. This made a sampling distribution, a distribution of possible sample estimates we could get from sampling the population. The next day, we woke up on February 3rd. The time loop was broken.
So Much Sampling, So Little Time
If we were to go out and collect sample after sample in exactly the same way, we would expect to get slightly different results each time. We'd sample different individuals at random, and so our measurements would vary some. We would be sampling the same population every time, though, so we would expect all of our estimates to be good estimates of the true population parameter.
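To see what that repeated sampling might look like without a time loop, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the population is a made-up list of 5,000 numbers (not real woodchuck data), and the sample size of 25 and the 300 trials are arbitrary choices.

```python
import random
import statistics


def simulate_sample_means(population, sample_size, n_trials, seed=0):
    """Draw n_trials random samples and return the mean of each one."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_trials):
        # Each trial picks a fresh set of individuals, without replacement.
        sample = rng.sample(population, sample_size)
        means.append(statistics.mean(sample))
    return means


# Hypothetical population: 5,000 made-up measurements (an assumption,
# standing in for a population we could never measure in full).
population_rng = random.Random(1)
population = [population_rng.gauss(100, 15) for _ in range(5000)]

# The list of sample means is our simulated sampling distribution.
sample_means = simulate_sample_means(population, sample_size=25, n_trials=300)

print("population mean:      ", statistics.mean(population))
print("mean of sample means: ", statistics.mean(sample_means))
print("spread of sample means:", statistics.stdev(sample_means))
```

Each sample mean lands a little differently, but they cluster around the population mean, and they vary much less than the individual measurements do. Plotting a histogram of `sample_means` would give the kind of graph that broke our time loop.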
In real life, almost no one ever collects samples like this. That's time and money that could be spent on sampling some new population, or on fast cars and boatloads of cake. But a sampling distribution would still be nice to have. It would let us ask, "How typical was our result?"
Yep. That sure would be nice to have. Oh well. We shouldn't sit around wishing for something that we will obviously never see in the next section.
Example 1
What is a sampling distribution? How do we find it? What assumptions do we have to make?
Example 2
We went to a local fishing tournament and found that the average number of fish caught per person was 13. Before the tournament started, our hypothesis was that people would catch only 5 fish on average. We ran some simulations to find the sampling distribution. Out of 250 trials, 12 of them were as extreme or more so than our sample mean. Can we reject the null hypothesis?
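The arithmetic behind this kind of question can be sketched in a few lines of Python. The counts (12 out of 250) come from the example; the significance level of 0.05 is our assumption, since the example does not state one, and `simulated_p_value` is just a helper name made up for this sketch.

```python
def simulated_p_value(n_extreme, n_trials):
    """Fraction of simulated trials at least as extreme as the observed result."""
    return n_extreme / n_trials


p_value = simulated_p_value(n_extreme=12, n_trials=250)

# Assumed significance level; the example itself does not give one.
alpha = 0.05

print("p-value:", p_value)
print("reject the null at alpha = 0.05?", p_value < alpha)
```

Under that assumed cutoff, 12 out of 250 is a proportion of 0.048, which squeaks in just below 0.05. With a stricter level such as 0.01, the same result would not be enough to reject.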
Example 3
Bob and Jim want to know if one of them is a faster runner than the other. They ran the same length of track 10 times each and found their average times. Turns out that, on average, Bob ran the track 6 seconds faster than Jim. Using resampling and simulation, they found that the margin of error is ±8 seconds. What can Bob and Jim conclude?
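One way to reason about a margin of error is to turn it into an interval and check whether "no difference" falls inside it. The sketch below does that for the numbers in this example; the helper name `interval_from_margin` is made up for illustration, and we are assuming the difference is measured as Jim's average time minus Bob's.

```python
def interval_from_margin(estimate, margin):
    """Return the (low, high) range covered by estimate plus or minus margin."""
    return (estimate - margin, estimate + margin)


# Observed difference: Jim's average time minus Bob's, in seconds (assumed sign).
low, high = interval_from_margin(estimate=6, margin=8)

# A difference of 0 would mean neither runner is faster.
contains_zero = low <= 0 <= high

print("plausible range for the difference:", low, "to", high)
print("is zero difference plausible?", contains_zero)
```

The range runs from -2 to 14 seconds, so it includes 0. On this reading, the data cannot rule out that Bob and Jim are equally fast.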
Exercise 1
Researchers conducted an experiment to see if adding racing stripes to cars makes them go faster. They had two sets of randomly selected cars, those with and those without racing stripes, make laps around the same track. The average times of the two groups are being compared.
What null hypothesis should be used for the sampling distribution?
Exercise 2
We think our favorite die is a little lopsided. It seems to land on 6 more often than we would expect. We're going to roll the die 50 times and record how many 6s we get. When we create our sampling distribution, what null hypothesis should we use?
Exercise 3
Simulation of a sampling distribution found a margin of error of ±4. If the sample mean from the data was 30, can we reject a null hypothesis of 35?
Exercise 4
The mean of a sample is 0.24. A simulation of the sampling distribution is run 10,000 times, and 44 of the trials have an estimate as large or larger than the observed sample mean. Can we reject the null hypothesis for α = 0.05?
Exercise 5
What is the most likely value for the null hypothesis of this simulated sampling distribution?