Advanced Statistics—Semester B

Got a hypothesis? There's a test for that.

  • Course Length: 16 weeks
  • Course Type: AP
  • Category:
    • Math
    • High School
    • College Prep

Schools and Districts: We offer customized programs that won't break the bank. Get a quote.


This course has been granted a-g certification, which means it has met the rigorous iNACOL Standards for Quality Online Courses and will now be honored as part of the requirements for admission into the University of California system.


Have you ever had that dream where you show up for school late, find out there's a test that day you had forgotten all about, realize you're naked, and then turn into a bird and fly away? It's not the most pleasant experience, that's for sure. Dreaming about it isn't nearly as bad as living it, though—but we don't like to remember that day.

Wait, we're getting off track. We aren't here to help you analyze your dreams to find their deep-seated causes. We're here to help you learn about statistics: the study of analyzing data and making inferences about a population. It's the stuff that dreams (the good kind) are made of.

Shmoop has the inside scoop on all things stats. In Semester B of this year-long Statistics course, we're going to focus on statistical inference. That's the art, science, and mathematics of taking a relatively small sample from a population and using it to come to some big conclusions. We'll multiply your knowledge with the multiplicative rule, sample the delights of the sampling distribution, and boost your self-confidence with confidence intervals.

Take a look at what we'll be covering:

  • Statistics (like politics) is the art of the possible, so it's no surprise that probability plays a big role in it. We'll give you the ground rules that all random events obey.
  • Using our new probability knowledge, we'll dig into random variables and probability distributions. Like rolling dice or flipping coins, sampling data from a population involves a little uncertainty. At the same time, we don't expect to roll a 7 or land right on the edge; some outcomes are more likely than others.
  • The sampling distribution is the living, beating heart of statistical inference. Metaphorically speaking, of course. Which means it's not living, beating, or a heart. It is important for making any inferences from a sample, though.
  • At this point, we finally have the tools we need to make the leap from our incomplete sample to the total population. Want to know what the average length of a honey badger is? Just take a sample, measure the sample's average, and use the techniques of estimation to figure out a range of plausible values for the population average. Well, assuming you survive even a sample of honey badger attacks.
  • We'll also spend plenty of time talking about hypothesis tests. No, they're not the ones you were dreaming about. Instead, we use them to make a decision or come to a conclusion about the population. They're a great way to distill our data down to a simple yea-or-nay vote.

On top of all that, this course has an activity in every pot and a chicken in every lesson. Or something like that. Plus, we've got easy-to-follow readings, practice problems coming out of our ears, and a towel for wiping the ear wax off of those problems. Sorry about that, we didn't know where else to keep them.

FYI: This course is a two-semester course. You can check out the first half here.

Technology Requirements

A graphing calculator is used throughout the course (and on the AP Stats exam). We give plenty of instruction on using the TI-84 specifically, but any graphing calculator from this list will do the trick.

Required Skills

A strong working knowledge of algebra, up through Algebra 2.


Unit Breakdown

8 Advanced Statistics—Semester B - Formal Probability

Put away your tuxedos and fancy dresses; this unit isn't that kind of formal event. Instead, we're gussying up the place by inviting a bunch of equations over. This unit is all about predicting the probability of different events by using the right formulas. Why run a ton of simulations for the statistical equivalent of tying your shoes? Save your CPU cycles for the hard problems.

9 Advanced Statistics—Semester B - Probability Distributions of Random Variables

Variables are like a box full of mystery. What's going to be inside? Have you tried shaking it? We think that x sounds like a 12. Well, random variables up the mystery to another level, because they can take on different values each time we use them. Like when rolling dice, the random variable X could be a different number 1 through 6 for each roll. If you're thinking, "Wow, that would be a great way to represent the possible results from sampling a population. And we could graph them, too, to see how likely different outcomes are," then...why are you taking this course again?

10 Advanced Statistics—Semester B - Sampling Distributions

So you've taken a sample, measured some variable you're interested in, and calculated an estimate like a mean or proportion. So...what if you did it again? And again and again and again? If you did it enough times, you could graph all those estimates and create its sampling distribution. While that might sound like a Guinness Book of World Records waste of time, the concepts behind the sampling distribution are key to understanding why statistics works at all.

11 Advanced Statistics—Semester B - Estimation and Confidence Intervals

If you've ever wondered how much wood a woodchuck could chuck if a woodchuck could chuck wood, then this is the unit for you. Using a confidence interval, we can estimate, with 95% confidence, the average amount of wood chucking your average woodchuck gets up to. Turns out that it's 1.3 ± 2.6 pieces of wood. In this unit, we'll focus on estimating values from one sample at a time.

12 Advanced Statistics—Semester B - Hypothesis Testing: The Basics

Hypothesis tests are the jelly to the confidence interval's peanut butter. Instead of using our data to pin down an estimate for some value, we ask the data a question. Is the parameter larger than 0? Smaller than 27? Something, anything other than -14? Then we let the data loose until it comes back with an answer. In this unit, we'll cover how to set up our null and alternative hypotheses, how to interpret our results, and how to minimize the chances of making any Errors. Yes, that capitalization is intentional.

13 Advanced Statistics—Semester B - Hypothesis Testing for Comparing Samples and Categorical Data

A lot of interesting questions involve making a comparison between two things. Which tastes better, vanilla or chocolate ice cream? Who's stronger, Batman or Superman? Who has longer lines, Walmart on Black Friday or the DMV on a typical Tuesday? This unit will cover how to run a hypothesis test when we have two samples and need to know which one is harder, better, faster, stronger. We'll also dip our toes into the chi-squared tests, which help us sort through categorical data.

14 Advanced Statistics—Semester B - More Advanced Statistical Tests and Other Topics

We'll wrap up the course with a grab-bag smattering of different topics. We've got the return of linear regression—now with more hypothesis testing and confidence intervals. We'll also dive into making confidence intervals to estimate the difference between two samples. And we'll learn that the real statistics were the friends we made along the way.


Recommended prerequisites:

  • Algebra II—Semester A
  • Algebra II—Semester B
  • Advanced Statistics—Semester A

  • Sample Lesson - Introduction

    Lesson 9.07: Normal Distribution



    LEGOland roller coaster
    These kids are learning how to diagram sentences on the Shmoopercoaster.
    (Source)

    Things are about to get Normal.

    Wait. What?

    Isn't that supposed to be "things are about to get weird?" Yeah. That's what it's supposed to be, but that's not what's about to happen. We're about to get Normal. Not crazy. Normal.

    Somebody out there is probably saying, "But you're Shmoop. You're the opposite of normal. That's why we're here, after all, because Shmoop isn't normal school. It's like school if it was on a rollercoaster."

    Don't fret. We're still the same, old irreverent Shmoop. We'll still make stupid jokes, deliver content with a healthy dose of skepticism and sarcasm, and make butt-kicking videos. We just need to share some info about the Normal Distribution and all the amazing things it can do for us. That's what we meant by, "Things are about to get Normal."

    Besides, we couldn't handle being normal for long.


    Sample Lesson - Reading

    Reading 9.07: The New Normal

    These discrete probability distributions have been running around all willy-nilly like they own the place. They think they're the only ones that matter. They're completely drunk on their power and out of control.

    Luckily, there's a new sheriff in town called the Normal distribution, and she's a continuous probability distribution that won't take any guff from these discrete troublemakers. We're no longer stuck just counting things. Now we can work with any number, not just whole ones, and find the probability of landing in any range of values.

    The Normal distribution (the capital N isn't necessary, but it's common to see it like that) is a bell-shaped curve that sits completely above the x-axis. Its shape is completely determined by some old friends, the mean, μ, and the standard deviation, σ.

    The peak on the bell sits exactly at the value of the mean, with the bell spread symmetrically on either side of it. The left and right ends technically go on forever in each direction and never reach the x-axis. That point where the curve changes from a steep hill to a shallower one on each side is exactly one standard deviation away on each side of the mean. Typically, we mark the x-axis from -3σ to 3σ in units of σ.

    What we mean by that whole "-3σ to 3σ" thing for our tick marks is this: Since the shape of the Normal curve is utterly determined by the mean (μ) and the standard deviation (σ), we use those values to label the x-axis. Take a Normal scenario with a mean of 100 and a standard deviation of 10. We'll have our peak at 100, and a symmetric curve sloping down identically on either side.

    Our change-in-curvature points will be ten units away from the mean (-1σ and 1σ on either side of the mean, or 90 and 110). At 20 units on either side of the mean (80 and 120), we'll have another set of tick marks to represent -2σ and 2σ. Lastly, when we're at 70 and 130 (or -3σ and 3σ on either side), we'll have our last set of tick marks. Unlike the ticks at -1σ and 1σ, these last two sets don't line up with any nifty features in the curve like the change-in-curvature points.

    Since we had that special pair of points at one standard deviation on either side, we just keep marking the x-axis off in those same intervals.
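
If you'd rather let a computer do the marking, here's a minimal sketch of those tick marks for the mean-100, standard-deviation-10 scenario above. The function name `sigma_ticks` is our own invention for illustration, not anything official.

```python
def sigma_ticks(mu, sigma, k=3):
    """Return x-axis tick values from mu - k*sigma to mu + k*sigma in steps of sigma."""
    return [mu + i * sigma for i in range(-k, k + 1)]


# Mean of 100 and standard deviation of 10, as in the reading
ticks = sigma_ticks(100, 10)
print(ticks)  # [70, 80, 90, 100, 110, 120, 130]
```

The middle value is the mean, and each step to the left or right is one more standard deviation.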

    Shmoopers out there with some Calculus under their belts will recognize that "curvature change" point as an inflection point. Shmoopers out there without some Calculus under their belts? Guess what? We call it an inflection point. Either way, we don't care much about its inflection-ness, but connections should always be made when they can be.
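
For the curious, here's a rough numerical check that the curve really does switch curvature one standard deviation from the mean. It's a sketch under a couple of assumptions: we use the standard formula for the Normal density, and the helper names (`normal_pdf`, `second_difference`) and the step size `h` are ours, chosen just for this illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    """Height of the Normal curve with mean mu and standard deviation sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def second_difference(x, mu, sigma, h=0.01):
    """Numerical stand-in for the second derivative: negative means the curve
    bends downward (like a hilltop), positive means it bends upward."""
    left = normal_pdf(x - h, mu, sigma)
    mid = normal_pdf(x, mu, sigma)
    right = normal_pdf(x + h, mu, sigma)
    return (left - 2 * mid + right) / (h * h)


mu, sigma = 100, 10  # the example scenario from the reading

# Just inside one standard deviation (109.5) versus just outside it (110.5)
inside = second_difference(mu + sigma - 0.5, mu, sigma)
outside = second_difference(mu + sigma + 0.5, mu, sigma)
print(inside < 0, outside > 0)  # True True
```

The sign flips somewhere between 109.5 and 110.5, which is right where the reading says the inflection point lives.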

    Statisticians and scientists have measured kajillions of things over the years, plotted them, and noticed similarities between some sets of data. The heights of women, for example, tend to have a peak middle value representing the mean height of women. As we increase or decrease heights on either side of that mean, we see fewer and fewer women with these more and more extreme heights.

    Plotting the heights of thousands or even millions of women produced a bell-shaped curve. When the mean and standard deviation were calculated, the mean fell smack-dab at the peak, and that curvature changeroo point was exactly one standard deviation away on each side. Turns out the heights of men also have the same shape, with a different mean and standard deviation, of course.

    The overall shapes of the graphs were the same. Tons of biological measures (height, lengths of small intestine, sizes of weevils, etc.) turn out to have the same overall shape, where the peak sits at the mean with the change-in-curvature point one standard deviation away. This shape was analyzed to determine a function that represents it and given a shiny, new name. Ta da! The Normal distribution was born.

    It's probably no shock that there are lots of other things besides living things that are Normally distributed. Because there are.

    Important Safety Tips for Sketching Normal Curves

    Now, this may sound silly, but you Shmoopers out there need to get really good at sketching this shape, marking off the x-axis appropriately, and then using that plot to help you answer some questions. The AP Stats test will almost certainly ask you to draw and label at least one Normal curve. Some things to watch out for:

    • Try really hard for symmetry around the mean.
    • Try not to accidentally make the tails turn up or down. They approach the x-axis but never reach it, and they certainly don't head back up.
    • Try to get the curvature change point the same distance away on each side of the mean. Use that distance to mark off the other values of σ from -3 to 3. On either end, -3σ and 3σ should be very near the tails with wee little areas to the outside of them.

    Practice sketching this. Really. We mean it.

    Let's Dissect this Distribution

    Repeated study of the different areas of the Normal distribution has led to the following ultra-super-mega-important findings, collectively called the 68-95-99.7 Rule. We'd have called it Steve, but it wasn't up to us. Still, it's an awesomely descriptive name. Its other name, The Empirical Rule, is pretty common but not as descriptive.

    Here's what it tells us.

    • Between -σ and σ, 68% of all the data points in the distribution make their home.
    • Between -2σ and 2σ, 95% of all the data points in the distribution make their home.
    • Between -3σ and 3σ, 99.7% of all the data points in the distribution make their home.

    Check out this plot for a visual.

    This is true for every Normal distribution, no matter the value of μ and σ. In other words, if we know the data is Normally distributed, we know how much of the data sits in each of those three regions, no matter the context of the problem.
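
Don't just take our word for it. Here's a small sketch that computes those three areas for a few different values of μ and σ. It leans on `math.erf` from Python's standard library to get the area under the Normal curve; the helper names and the particular (μ, σ) pairs are our own picks for illustration.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Area under the Normal curve to the left of x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


def share_within(k, mu=0.0, sigma=1.0):
    """Fraction of the distribution within k standard deviations of the mean."""
    return normal_cdf(mu + k * sigma, mu, sigma) - normal_cdf(mu - k * sigma, mu, sigma)


# Three very different Normal distributions, same three areas
for mu, sigma in [(0, 1), (100, 10), (4, 0.014)]:
    shares = [round(100 * share_within(k, mu, sigma), 1) for k in (1, 2, 3)]
    print(f"mu={mu}, sigma={sigma}: {shares}")
```

Each row should land very close to 68, 95, and 99.7, whatever the mean and standard deviation happen to be.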

    Given the symmetry, we can easily find how much of the data is between -1σ and μ, for example. If 68% is from -σ to σ, then half that, or 34%, runs from -1σ to μ. We can play these games all over the plot if we want. It's not a bad idea to make a plot to refer to, listing all the percentages in each σ-length region. In fact, let's do it:
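
Here's one way to tally the percentage in each σ-length region using nothing but the 68-95-99.7 Rule and symmetry. The dictionary names are ours; the arithmetic is just halving and subtracting.

```python
# Percent of data within 1, 2, and 3 standard deviations of the mean,
# straight from the 68-95-99.7 Rule
within = {1: 68.0, 2: 95.0, 3: 99.7}

# Symmetry lets us split every band evenly between the left and right sides
regions = {
    "mean to 1 sigma": within[1] / 2,
    "1 sigma to 2 sigma": (within[2] - within[1]) / 2,
    "2 sigma to 3 sigma": (within[3] - within[2]) / 2,
    "beyond 3 sigma": (100 - within[3]) / 2,
}

for name, pct in regions.items():
    print(f"{name}: {pct:.2f}% on each side")
```

That works out to 34%, 13.5%, 2.35%, and 0.15% on each side of the mean. Double them all and add them up, and we're back to 100%.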

    Also, some really old friends heard about our new distribution and decided to stop by and see it. Median and mode missed their buddy, mean, who's smack-dab in the middle of our Normal distribution. As it turns out, so are the median and mode.

    Since the distribution is symmetric, half the data lies to the right of the mean, and half lies to the left of the mean. The point with half the data above and half the data below is the median. The peak represents the value with the greatest probability of occurring, i.e. the most likely value we can have. It's the mode, as well. All three coexist in one happy place where the peak of the Normal distribution is located on the x-axis.

    It's Context Time

    Our new sheriff is getting bored with all this talk. She needs some action. Let's see how she stacks up against this scenario.

    "Plastic lids for soda pop cups with a 4-inch diameter top produced on injection molding machines are not all uniform in size. Minute differences in the temperature of the plastic, temperature of the mold, composition of the plastic, and scads of other factors contribute to produce lids whose diameters are Normally distributed around a mean value of 4 in., with a standard deviation of 0.014 in. A lid can be between 3.986 in. and 4.028 inches and still fit on the cup. Let's find the percentage of lids that will fit."

    Typically these problems will ask us to sketch and label a Normal curve for this context and shade the portion of the graph which we're interested in. Get used to this part. It should be done every time we use a Normal distribution.

    Place a point above the x-axis representing the peak (maybe three or four lines on lined paper above the x-axis). Label the spot on the x-axis under that point as μ = 4 or just 4. Mark off three equal distances on each side of the mean. Those are our standard deviations. Add 0.014 to 4 three times to get the values above the mean (4.014, 4.028, and 4.042). Subtract 0.014 from 4 three times to get the values below the mean (3.986, 3.972, and 3.958).

    Sketch the Normal curve on each side, so that the curvature changes right above 3.986 and 4.014. Make sure the tails extend a wee bit beyond 3.958 and 4.042. Our Normal curve needs to have values specific to this context, which is why we added and subtracted 0.014 to and from the mean. It also needs to have the correct shape.

    Practice sketching this shape. We mean it.

    We're curious about the percentage of lids between 3.986 in. and 4.028 in. Shade in that area under the curve.

    There's going to be somewhere between 68% and 95% of the data between those values because we don't quite have the whole 95% between -2σ and 2σ, and we have more than the 68% between -σ and σ.

    First off, 47.5% of the data are between the mean and +2 standard deviations. That's half of the 95%. And 34% of the data are between the mean and -1 standard deviation, as we stated above. We've got the 47.5% on the right of the mean and the 34% on the left of the mean. All together that's 81.5%.

    So 81.5% of those lids will fit. That's a pretty awful success rate if we're basically creating one out of five lids that we have to junk.
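
As a sanity check on the lid arithmetic, here's a short sketch that redoes the Empirical Rule estimate and then compares it with the area computed directly from the Normal curve. The direct calculation uses `math.erf` and goes a step beyond what the reading does by hand, so treat it as a bonus rather than the expected method. Variable names are ours.

```python
import math

mu, sigma = 4.0, 0.014     # lid diameters from the scenario, in inches
low, high = 3.986, 4.028   # range of diameters that still fit the cup

# How many standard deviations each cutoff sits from the mean
z_low = (low - mu) / sigma     # about -1
z_high = (high - mu) / sigma   # about +2

# Empirical Rule: half of 68% on the left of the mean, half of 95% on the right
rule_estimate = 68 / 2 + 95 / 2


def normal_cdf(z):
    """Area under the standard Normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Area under the curve between the two cutoffs, as a percentage
exact = 100 * (normal_cdf(z_high) - normal_cdf(z_low))

print(f"Empirical Rule estimate: {rule_estimate}%")
print(f"Area from the curve: {exact:.1f}%")
```

The rule gives 81.5% and the curve gives about 81.9%. The small gap comes from 68 and 95 being rounded values.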

    One last little thing before we're out. While there are many sets of data that are actually Normally distributed, there are also scads more that are nearly Normally distributed. If data are close to a Normal curve in shape, we often use the Normal distribution to analyze them because it's a handy and easy-to-use tool. If the data aren't close to a Normal curve in shape, we should use some other method to analyze them.

    How do we know if data are Normally distributed? Typically, we might have to make a histogram and peek at the shape. We may even have to go so far as finding the mean and standard deviation and seeing what percentage of data falls in our different standard deviation intervals. The third example Click-Through addresses this scenario, so check that for more details.
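
Here's a rough sketch of that percentage check. Heads up: the data below are made up. We simulate one bell-shaped data set and one flat one with Python's `random` module so there's something to compare, and the function name `shares_within_sd` is our own.

```python
import random
import statistics


def shares_within_sd(data):
    """Fraction of the data within 1, 2, and 3 standard deviations of the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    n = len(data)
    return [sum(1 for x in data if abs(x - mean) <= k * sd) / n for k in (1, 2, 3)]


random.seed(0)  # fixed seed so the simulated data are the same on every run

# Hypothetical data: one roughly Normal set and one flat (uniform) set
bell_data = [random.gauss(70, 4) for _ in range(10_000)]
flat_data = [random.uniform(58, 82) for _ in range(10_000)]

print("bell-shaped:", [round(s, 3) for s in shares_within_sd(bell_data)])
print("flat:       ", [round(s, 3) for s in shares_within_sd(flat_data)])
```

The bell-shaped set should come out near 0.68, 0.95, and 0.997. The flat set lands nowhere near 0.68 for the first interval and has everything inside two standard deviations, which is a good hint that the Normal distribution is the wrong tool for it.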

    Recap

    The Normal distribution is a continuous probability distribution that often very closely models many biological and manufacturing processes, as well as oodles of other real-world situations. The graph of the Normal distribution is a symmetric, bell-shaped curve with its peak at an x-value equal to the mean. The shape of the curve is completely determined by the value of the standard deviation. In fact, at the point where the curve changes from a steep curve to a shallow one, we're one standard deviation away from the mean on either side.

    The 68-95-99.7 (or Empirical) Rule tells us what percentage of data falls in different regions on the graph.

    • 68% of all the data points in the distribution fall between -σ and σ.
    • 95% of all the data points in the distribution fall between -2σ and 2σ.
    • 99.7% of all the data points in the distribution fall between -3σ and 3σ.

    The following video is just a recap of what we just recapped:


    Sample Lesson - Activity

    Activity 9.07b: Livin' La Vida Normal

    In this activity, we'll toss two different scenarios at ya. Both of these scenarios include data that are Normal or roughly Normal. Your job will be to sketch a whole mess of shaded Normal curves and use those curves to help you answer some questions. All sketches should include labels and a title.

    Scenario 1: It is thought that adult men have a mean height of 70 inches (μX = 70) with a standard deviation of 4 inches (σX = 4).

    Step 1: On a sheet of paper, sketch a Normal curve for this scenario. The x-axis should cover a little more than the region from -3 to 3 standard deviations and should be marked off with numbers appropriate to the scenario. Indicate the percentage of data falling between each pair of opposite standard deviations.

    Step 2: Answer the following questions. For each question, you must include a properly sketched Normal curve with the appropriate areas shaded. You should also explain how you arrived at each answer. Number your answers, so we can tell what's what.

    1. What percentage of adult men are over 6 feet 10 inches tall?
    2. What percentage of adult men are less than 5 feet 6 inches tall?
    3. What percentage of adult men are above average in height?
    4. Based on the distribution, what height would you say is the cutoff below which an adult man would be considered unusually short? How did you use the distribution to choose that value? What percentage of adult men would be considered unusually short by your definition?

    Scenario 2: Way back in 1910, an experiment to determine the wheat yields in identically treated, identically sized sections was run. It was found that the yields of the different sections were roughly Normal in nature. The frequency table gives the frequencies of different sized yields in bushels.

    (Source)

    Bushels      Frequency
    2.5-2.75     1
    2.75-3.0     6
    3.0-3.25     25
    3.25-3.5     54
    3.5-3.75     86
    3.75-4.0     101
    4.0-4.25     94
    4.25-4.5     76
    4.5-4.75     37
    4.75-5.0     13
    5.0-5.25     7

    Step 3: On a different sheet of paper than the one you used for Scenario 1, sketch a Normal curve for this scenario. The x-axis should cover a little more than the region from -3 to 3 standard deviations and should be marked off with numbers appropriate to the scenario. Indicate the percentage of data falling between each pair of opposite standard deviations.

    Step 4: Answer the following questions. For each question, you must include a properly sketched Normal curve with the appropriate areas shaded. You should also explain how you arrived at each answer. Number your answers again, pretty please.

    5. What percentage of sections should have produced fewer than 3.03 bushels according to the Normal distribution? How does that value compare to the actual percentage producing less than 3 bushels?

    6. What percentage of sections should have produced between 3.49 and 4.41 bushels according to the Normal distribution? How does that compare to the actual percentage producing between 3.5 and 4.5 bushels?

    7. What percentage of sections should have produced between 3.03 and 4.87 bushels according to the Normal distribution? How does that compare to the actual percentage producing between 3 and 5 bushels?

    8. What percentage of sections should have produced between 2.57 and 5.33 bushels according to the Normal distribution? How does that compare to the actual percentage producing between 2.5 and 5.25 bushels?

    9. Do you believe there's enough evidence to suggest the data are roughly Normal in shape? Explain why or why not.

    Keep all of the sketches and questions together by scenario, please. Copy, scan, or take pix of all your artwork and answers and random doodles, and fling them via the Internet to us at the upload button below.


    Sample Lesson - Activity

    1. Which of the following statements about the Normal distribution are true?

      I. A Normal curve is any symmetric, bell-shaped distribution.
      II. The shape of a Normal distribution is completely determined by the mean and standard deviation of the data.
      III. Normal curves are discrete probability distributions.

    2. Which graph below correctly displays the Normal distribution of the lengths of bolts produced on a manufacturing line with a mean of 3 cm and a standard deviation of 0.03 cm?

    3. Which statements about the Empirical Rule are true?

      I. 68% of the data lie between -1 and 1 standard deviation.
      II. 95% of the data lie between -3σ and 3σ.
      III. 13.5% of the data lie between 1 and 2 standard deviations.

    4. Students at James Woods Senior High School took the SAT last year. They averaged an 1104 combined score with a standard deviation of 241. The data from the tests showed an approximately Normal distribution. About what percentage of students scored above 1345?

    5. The studs on the top of LEGO bricks need to fit a very narrow profile of sizes, or else the bricks won't stay connected or can't be connected at all. These studs have a mean diameter of 4.8 mm. The machines that produce the bricks do so with a standard deviation of 0.023 mm. The stud sizes are Normally distributed. If the studs are less than 4.754 mm in diameter or more than 4.869 mm in diameter, they're unusable and must be melted down and re-used. What percentage of bricks must be melted down?