It's time we sat down and had The Talk, about where data comes from. When someone is very interested in a topic, the stork comes along and gives them a little bundle of data for them to analyze as their very own. The End.
Not buying it, huh? Well, you're probably old enough to learn the truth now. It's a story of the Populations and the Samples. No birds or bees involved.
Stay On Target
If we want to collect some data, the first order of business is to decide what exactly we want to study. Going out and measuring things at random is a good way to get lost, beaten up, or arrested. What is our target, the thing that we want to learn about?
A population is all of the individuals that we could collect data from. This is a real Humpty Dumpty definition: the words mean exactly what we want them to mean, no more, no less. When we say "all" of the individuals, we do mean every single one, even if they would be impossible to find or measure. A population is more of an idea than something we actually work with.
And when we say "individuals," we don't necessarily mean people. If we were studying hummingbird calls, then all the hummingbirds would be our population. Or, if we were only interested in the length of the bird calls, then it would be the bird calls themselves that would be our population. We can think about the individuals of a population as being "items of interest." By the way, you should come see our band, Items of Interest, this Saturday.
Some examples of populations are:
- All of the students that read Shmoop
- All of the weightlifters in Pocatello, Idaho
- Every family with at least two kids
- All of the news articles written about Shmoop
Obviously, it can be hard, or even impossible, to study every individual in a population. That's why we won't even try. Instead, we'll take a sample, a subset of the total population, and study that instead. This is actually the whole point of statistics—to be able to use a sample to make some conclusions about the population as a whole. And you thought it was all about mathematicians trying to trick people into paying attention to them.
When we sample a population, we're trying to learn about some parameter of the population as a whole. For instance, we might be curious about the average GPA of the students who read Shmoop. That average, taken over every last Shmoop reader, is our parameter. We could ask 1000 Shmoopers, a sample of the whole, about their GPA. The average GPA of our sample can be used as an estimate of the parameter in the whole population. We think that the estimate and the GPA would both be pretty good.
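Here's a minimal sketch of the GPA idea in Python. We obviously don't have the real GPAs of every Shmoop reader, so the population below is completely made up (the size of 50,000 and the bell-shaped spread around 3.0 are our assumptions, not real data). The point is only to show a sample mean sitting next to the parameter it's trying to estimate.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch gives the same numbers every run

# Hypothetical population: 50,000 made-up GPAs, clipped to the 0.0-4.0 range.
population = [min(4.0, max(0.0, random.gauss(3.0, 0.5))) for _ in range(50_000)]

# The parameter: the mean of the entire population.
# In real life we almost never get to compute this.
population_mean = statistics.mean(population)

# The estimate: the mean of a random sample of 1000 individuals.
sample = random.sample(population, 1000)
sample_mean = statistics.mean(sample)

print(f"population mean (parameter): {population_mean:.3f}")
print(f"sample mean (estimate):      {sample_mean:.3f}")
```

In this pretend world we get to peek at the parameter, which is cheating. Out in the wild, the sample mean is all we'd have to go on.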
We can use all kinds of measurements as parameters and estimates. We can find the sample mean and use it to estimate the population mean, like in our GPA example. Yes, we can talk about multiple means at the same time. This gets confusing, so we have different symbols for the sample mean vs. the population mean: x̄ vs. μ. Oh hey, we've seen these two before.
Other measurements we can use as parameters and estimates are proportions, the median, and the standard deviation. In each case, we use the values found from a sample to create an estimate for the population as a whole.
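The same trick works for those other measurements. The sketch below again uses a made-up population of GPAs (our assumption, not real data), and the helper `proportion_at_least` and the 3.5 cutoff are just names and numbers we picked for illustration. It computes a proportion, a median, and a standard deviation from a sample and lines each one up against the population value.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: 20,000 made-up GPAs, clipped to the 0.0-4.0 range.
population = [min(4.0, max(0.0, random.gauss(3.0, 0.5))) for _ in range(20_000)]
sample = random.sample(population, 1000)


def proportion_at_least(values, cutoff):
    """Fraction of values that are at or above the cutoff."""
    return sum(v >= cutoff for v in values) / len(values)


# Estimates: calculated from the sample only.
sample_stats = {
    "proportion with GPA >= 3.5": proportion_at_least(sample, 3.5),
    "median": statistics.median(sample),
    "standard deviation": statistics.stdev(sample),  # sample standard deviation
}

# Parameters: calculated from the whole population (usually impossible).
population_stats = {
    "proportion with GPA >= 3.5": proportion_at_least(population, 3.5),
    "median": statistics.median(population),
    "standard deviation": statistics.pstdev(population),  # population version
}

for name in sample_stats:
    print(f"{name}: estimate {sample_stats[name]:.3f}, "
          f"parameter {population_stats[name]:.3f}")
```

Note that Python's `statistics` module has two standard deviation functions: `stdev` is meant for a sample, and `pstdev` is meant for an entire population.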
Random Sampling
Not every sample is going to be a good sample. If we only ask the chess club members for their GPA, we're going to get a biased picture if we try to use that as an estimate. Nothing personal, chess people, but you're not very representative of the class as a whole.
The way to get an unbiased estimate is to create the sample using random sampling. We're not talking monkey ninja pirate zombie types of random, though. Two things have to be true for a sample to count as random:
- Every individual in the population has an equal chance of being selected.
- The selection of each individual is independent from all the others.
It's random like rolling dice, where every face has an equal chance of being rolled, and every roll of the die is independent of the rolls before and after it. As long as our sample is large enough, the results will tend to be representative of the population as a whole.
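If we'd rather let a computer roll the dice, here's one way it might look. The roster of ten students is hypothetical, and we're using `random.sample` from Python's standard library, which draws without replacement so nobody gets picked twice. That's not quite the same as fully independent draws, but when the population is much bigger than the sample, the difference is tiny. The loop checks the "equal chance" rule by drawing lots of samples and counting how often each student shows up.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical roster of 10 individuals.
population = [f"student_{i}" for i in range(10)]

# One random sample of 3 individuals, drawn without replacement.
print(random.sample(population, 3))

# Draw many samples and count how often each individual is selected.
trials = 10_000
counts = Counter()
for _ in range(trials):
    counts.update(random.sample(population, 3))

# With 3 picks out of 10, each individual should be selected
# in close to 3/10 of the samples.
for name in population:
    print(f"{name}: picked in {counts[name] / trials:.3f} of samples")
```

No student gets special treatment here, which is exactly the point.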
Actually getting a random sample can be tough, though. If you're sampling wild flowers (maybe because you have a hot date and forgot to get a gift, you dog), it would be tempting to pull over to the side of the road and grab a whole clump of flowers. However, all the flowers away from the road are less likely to be picked, and flowers growing together are more likely to be picked. If your date wanted a random sample of flowers, and why wouldn't they, they're going to be disappointed.
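To see how much the roadside grab can throw things off, here's a rough simulation. Everything about this field is invented for illustration: the 5,000 flowers, the 100 meter depth, and especially our assumption that red flowers are rarer within 5 meters of the road. The sketch compares the share of red flowers in a grab-the-closest sample with the share in a random sample of the same size.

```python
import random

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical field: each flower is (distance from road in meters, is_red).
# Assumption for this sketch: red flowers are rarer close to the road.
field = []
for _ in range(5_000):
    distance = random.uniform(0, 100)
    chance_red = 0.1 if distance < 5 else 0.5
    field.append((distance, random.random() < chance_red))


def share_red(flowers):
    """Proportion of the given flowers that are red."""
    return sum(is_red for _, is_red in flowers) / len(flowers)


true_share = share_red(field)  # the parameter

# Convenience sample: the 200 flowers closest to the road.
roadside = sorted(field, key=lambda flower: flower[0])[:200]

# Random sample: 200 flowers, each with an equal chance of being picked.
random_pick = random.sample(field, 200)

print(f"whole field:     {true_share:.3f}")
print(f"roadside sample: {share_red(roadside):.3f}")
print(f"random sample:   {share_red(random_pick):.3f}")
```

The roadside sample only tells us about roadside flowers. How badly it misses depends on how different the roadside is from the rest of the field, which is something we usually don't know ahead of time.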
Summary
Who loves a recap? We do.
- A population is everything that we want to study. If we're interested in how many cereal boxes have prizes inside of them, then all the cereal boxes are our population.
- A parameter is some measurement of the population. The proportion of all cereal boxes that have prizes would be a parameter. We almost never know the true value of the parameter, which is why we're trying to estimate it.
- We estimate the parameter value by creating a random sample. We don't just grab every box of Lucky Charms within reach; we select boxes at random, so our results won't be biased. We can use the proportion calculated from the sample to estimate the parameter value.
- Non-random sampling can only be used to draw conclusions about groups similar to the sample. We can't use it to make estimates for the total population.