Imagine you are working in an agricultural sciences lab, where you have been collaborating with a local farmer to develop new varieties of fruits and vegetables. It is your job to analyze the fruits and vegetables that are harvested each year and track any changes that occur from one plant generation to the next. Today you are measuring the sugar content of the latest crop of tomatoes to test how sweet they are. You have 25 tomatoes and find that the mean sugar content is 32 milligrams (mg) of sugar per gram (g) of tomato with a standard deviation of 4 mg/g (see Introduction to Descriptive Statistics for more information about calculating a mean and standard deviation).
Just as you finish these calculations, your collaborator from the farm shows up to ask about the latest results. Specifically, she wants to know the mean sugar content for this year’s entire tomato harvest (Figure 1). You look at your data and wonder what to tell her. How do the mean and standard deviation of the sample relate to the mean and standard deviation of the entire harvest?
As you will see in this module, your current predicament is very familiar to scientists. In fact, it is very unlikely that the mean and standard deviation of your 25-tomato sample are exactly the same as the mean and standard deviation of the entire harvest. Fortunately, you can use techniques from a branch of statistics known as “inferential statistics” to use your smaller subset of measurements to learn something about the sugar content of the entire tomato harvest. These and other inferential statistics techniques are an invaluable tool for scientists as they analyze and interpret their data.
What are inferential statistics?
Many statistical techniques have been developed to help scientists make sense of the data they collect. These techniques are typically categorized as either descriptive or inferential. While descriptive statistics (see Introduction to Descriptive Statistics) allow scientists to quickly summarize the major characteristics of a dataset, inferential statistics go a step further by helping scientists uncover patterns or relationships in a dataset, make judgments about data, or apply information about a small dataset to a larger group. They are part of the process of data analysis used by scientists to interpret and make statements about their results (see Data Analysis and Interpretation for more information).
The inferential statistics toolbox available to scientists is quite large and contains many different methods for analyzing and interpreting data. As an introduction to the topic, we will give a brief overview of some of the more common methods of statistical inference used by scientists. Many of these methods involve using smaller subsets of data to make inferences about larger populations. Therefore, we will also discuss ways in which scientists can mitigate systematic errors (sampling bias) by selecting subsamples (often simply referred to as “samples”) that are representative of the larger population. This module describes inferential statistics in a qualitative way.
Populations versus subsamples
When we use the word “population” in our everyday speech, we are usually talking about the number of people, plants, or animals that live in a particular area. However, to a scientist or statistician this term can mean something very different. In statistics, a population is defined as the complete set of possible observations. If a physicist conducts an experiment in her lab, the population is the entire set of possible results that could arise if the experiment were repeated an infinite number of times. If a marine biologist is tracking the migration patterns of blue whales in the Northeast Pacific Ocean, the population would be the entire set of migratory journeys taken by every single blue whale that lives in the Northeast Pacific. Note that in this case the statistical population is the entire set of migration events – the variable being observed – and not the blue whales themselves (the biological population).
Based on this definition of a population, you might be thinking how impractical, or even impossible, it could be for a scientist to collect data about an entire population. Just imagine trying to tag thousands of blue whales or repeating an experiment indefinitely! Instead, scientists typically collect data for a smaller subset – a “subsample” – of the population. If the marine biologist tags and tracks only 92 blue whales, this more practical subsample of migration data can then be used to make inferences about the larger population (Figure 2).
But this raises an important point about statistical inference: By selecting only a subsample of a population, you are not identifying with certainty all possible outcomes. Instead, as the name of the technique implies, you are making inferences about a large number of possible outcomes. As you will see later in this module, addressing the uncertainty associated with these inferences is an important part of inferential statistics.
The importance of random sampling
When using a subsample to draw conclusions about a much larger population, it is critical that the subsample reasonably represents the population it comes from. Scientists often use a process called “simple random sampling” to collect representative subsample datasets. Random sampling does not mean the data are collected haphazardly but rather that the probability of each individual in the population being included in the subsample is the same. This process helps scientists ensure that they are not introducing unintentional biases into their sample that might make their subsample less representative of the larger population.
Let’s think about this in the context of our original tomato example. To make inferences about the entire tomato harvest, we need to make sure our 25-tomato subsample is as representative of the entire tomato harvest as possible. To collect a random subsample of the tomato harvest, we could use a computer program, such as a random number generator, to randomly select different locations throughout the tomato field and different days throughout the harvesting season at which to collect subsample tomatoes. This randomization ensures that there is no bias inherent in the subsample selection process. In contrast, a biased sample might only select tomatoes from a single day during the harvesting period or from just one small area of the field.
If the sugar content of tomatoes varies throughout the season or if one area of the field gets more sun and water than another, then these subsamples would hardly be representative of the entire harvest. (For more information about the importance of randomization in science, see Statistics in Science.) You may also notice that the process of random sampling requires that some minimum number of samples be collected to ensure that the subsample accounts for all of the possible conditions that can affect the research. Determining the ideal sample size for an experiment can depend on a number of factors, including the degree of variation within a population and the level of precision required in the analysis. When designing an experiment, scientists consider these factors to choose an appropriate number of samples to collect.
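The random selection of field locations and harvest days described above can be sketched in a few lines of Python. This is only an illustration: the field layout (40 rows by 50 columns), the 30-day season, and the function name `pick_sampling_plan` are assumptions made for the example, not details of the tomato study.

```python
import random

def pick_sampling_plan(n_rows, n_cols, n_days, n_samples, seed=None):
    """Choose n_samples distinct (row, column, day) units, each equally likely."""
    rng = random.Random(seed)
    # List every possible unit so that each one has the same chance of selection.
    units = [(row, col, day)
             for row in range(n_rows)
             for col in range(n_cols)
             for day in range(n_days)]
    # sample() draws without replacement, so no unit is picked twice.
    return rng.sample(units, n_samples)

# Hypothetical field: 40 rows x 50 columns, harvested over 30 days.
plan = pick_sampling_plan(n_rows=40, n_cols=50, n_days=30, n_samples=25, seed=1)
for row, col, day in plan[:3]:
    print(f"row {row}, column {col}, harvest day {day}")
```

Because every location-and-day combination is listed before the draw, no part of the field or season is favored, which is the property that distinguishes simple random sampling from haphazard collection.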
Another example of simple random sampling comes from a wildlife study about songbirds living on an island off the coast of California (Langin et al. 2009). To understand how the songbirds were being affected by climate change, researchers wanted to know how much food was available to the songbirds throughout the island. They knew that these particular songbirds ate mostly insects off of oak tree leaves – but imagine trying to find and measure the mass of every insect living on every oak tree on an island!
To collect a representative subsample, the researchers randomly selected 12 geographic coordinates throughout the island, collected a single branch from the oak tree closest to each coordinate in a randomly selected direction, and then measured the total insect mass on each branch. They then repeated this procedure every two weeks, randomly selecting locations each time and taking care to collect the same size branch at every location (Figure 3).
This carefully constructed procedure helped the researchers avoid biasing their subsample. By randomly selecting multiple locations, they ensured that branches from more than one tree would be selected and that one tree would not be favored over the others. Repeating the sampling procedure also helped limit bias. If insects were very abundant during the summer but hard to find in the winter, then sampling only one time or during one season would not be likely to generate a representative snapshot of year-round insect availability. Despite its name, the process of simple random sampling is not so “simple” at all! It requires careful planning to avoid introducing unintended biases.
Estimating statistical parameters
Once scientists have collected an appropriately random or unbiased subsample, they can use the subsample to make inferences about the population through a process called “estimation.” In our original example about tomatoes and sugar content, we reported the mean (32 mg/g) and standard deviation (4 mg/g) for the sugar content of 25 tomatoes. These are called “statistics” and are a property of the subsample. We can use these subsample statistics to estimate “parameters” for the population, such as the population mean or population standard deviation.
Notice that we refer to the population mean as a parameter while the subsample mean is called a statistic. This reflects the fact that any given population has only one true mean, while the subsample mean can change from one subsample to the next. Suppose you measured the sugar content of a different set of 25 tomatoes from the same harvest. This subsample’s mean and standard deviation will probably be slightly different from the first subsample due to variations in sugar content from one tomato to the next. Yet either set of subsample statistics could be used to estimate the population mean for the entire harvest.
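The difference between a fixed parameter and a varying statistic can be seen in a small simulation. The sketch below invents a harvest of 10,000 tomatoes (an assumed size, with sugar content drawn from a normal distribution purely for illustration) and then draws two separate 25-tomato subsamples from it.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical harvest: 10,000 tomatoes with sugar content near 32 mg/g.
harvest = [rng.gauss(32.0, 4.0) for _ in range(10_000)]
population_mean = statistics.mean(harvest)  # the parameter: one fixed value

# Two independent simple random subsamples of 25 tomatoes each.
subsample_a = rng.sample(harvest, 25)
subsample_b = rng.sample(harvest, 25)
mean_a = statistics.mean(subsample_a)  # a statistic
mean_b = statistics.mean(subsample_b)  # another statistic

print(f"population mean: {population_mean:.2f} mg/g")
print(f"subsample A mean: {mean_a:.2f} mg/g")
print(f"subsample B mean: {mean_b:.2f} mg/g")
```

The population mean is computed once and does not change, while the two subsample means differ from it and from each other, even though both subsamples come from the same harvest.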
To estimate population parameters from subsample statistics, scientists typically use two different types of estimates: point estimates and interval estimates. Often these two estimations are used in tandem to report a plausible range of values for a population parameter based on a subsample dataset.
A point estimate of a population parameter is simply the value of a subsample statistic. For our tomatoes, this means that the subsample mean of 32 mg/g could be used as a point estimate of the population mean. In other words, we are estimating that the population mean is also 32 mg/g. Given that the subsample statistic will vary from one subsample to another, point estimates are not commonly used by themselves as they do not account for subsample variability.
An interval estimate of a population parameter is a range of values in which the parameter is thought to lie. Interval estimates are particularly useful in that they reflect the uncertainty related to the estimation (see our Uncertainty, Error, and Confidence module) and can be reported as a range of values surrounding a point estimate. One common tool used in science to generate interval estimates is the confidence interval. Confidence intervals take into consideration both the variability and total number of observations within a subsample to provide a range of plausible values around a point estimate. A confidence interval is calculated at a chosen confidence level, which expresses how confident we can be that an interval constructed in this way captures the true population parameter. We could calculate a confidence interval estimate using our 25-tomato subsample, which has a mean of 32 mg/g and a standard deviation of 4 mg/g. When calculated at the 95% confidence level, this interval estimate would be reported as 32 ± 2 mg/g, meaning that the population mean is likely to lie somewhere between 30 mg/g and 34 mg/g.*
*While the standard deviation describes the spread of the individual observations in the sample, the confidence interval describes a narrower range: the plausible values of the population mean, given the subsample.
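One common way to arrive at an interval like the one above is a t-based confidence interval for the mean. The module does not state which formula was used, so the sketch below is an assumption: it takes the margin to be a critical value from the t distribution multiplied by the standard error (the standard deviation divided by the square root of the sample size). The critical value 2.064 is the two-sided 95% value for 24 degrees of freedom from a standard t table.

```python
import math

def confidence_interval(mean, sd, n, t_crit):
    """Return (lower, upper, margin) for a t-based interval around the mean."""
    standard_error = sd / math.sqrt(n)   # how much the sample mean varies
    margin = t_crit * standard_error     # half-width of the interval
    return mean - margin, mean + margin, margin

# Two-sided 95% critical value for n - 1 = 24 degrees of freedom (t table).
T_CRIT_95_DF24 = 2.064

lower, upper, margin = confidence_interval(32.0, 4.0, 25, T_CRIT_95_DF24)
print(f"{lower:.1f} to {upper:.1f} mg/g (margin {margin:.2f} mg/g)")
```

Under these assumptions the margin comes out to about 1.65 mg/g, which is consistent with the ± 2 mg/g quoted above once rounded to a whole number. Note that the margin shrinks as the sample size grows and widens as the standard deviation grows.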
Comparing multiple subsamples
Another technique that scientists often employ is to compare two or more subsamples to determine how likely it is that they have similar population parameters. Let’s say that you want to compare your current tomato harvest to a previous year’s harvest. This year the mean sugar content was 32 ± 2 mg/g but last year the mean sugar content was only 26 ± 3 mg/g. While these two numbers seem quite different from each other, how can you be confident that the difference wasn’t simply due to random variation in your two subsamples?
In cases like this, scientists turn to a branch of statistical inference known as statistical hypothesis testing. When comparing two subsamples, scientists typically consider two simple hypotheses: Either the two subsamples come from similar populations and are essentially the same (the null hypothesis) or the two subsamples come from different populations and are therefore “significantly” different from one another (the alternative hypothesis). In statistics, the word “significant” is used to designate a level of statistical robustness. A “significant” difference implies that the difference can be reliably detected by the statistical test but says nothing about the scientific importance, relevance, or meaningfulness of the difference.
To determine whether the sugar content of your two tomato harvests is indeed significantly different, you could use a statistical hypothesis test such as Student’s t-test to compare the two subsamples. Conducting a t-test provides a measure of statistical significance that can be used to either reject or fail to reject the null hypothesis. The test produces a probability value (also called a p-value), which is the probability of observing a difference at least as large as the one measured if the null hypothesis were true. This p-value is compared against a chosen significance level, which in science is often 0.05. This means that in order for a result to be deemed “statistically significant,” there must be less than a 5% probability of seeing such a result by chance alone when no real difference exists. If you conduct a t-test on your two tomato samples and calculate a p-value less than 0.05, you can reject the null hypothesis and report that the sugar content is significantly different from one year to the next.
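The comparison can be sketched as a pooled (equal-variance) Student’s t-test computed from summary statistics. Only this year’s values (mean 32 mg/g, standard deviation 4 mg/g, 25 tomatoes) and last year’s mean (26 mg/g) come from the example; last year’s standard deviation of 7 mg/g and sample size of 25 are hypothetical values chosen for illustration. Instead of computing a p-value, the sketch compares the t statistic with 2.011, the two-sided 0.05 critical value for 48 degrees of freedom from a standard t table.

```python
import math

def pooled_t_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t statistic and degrees of freedom from summary statistics."""
    df = n1 + n2 - 2
    # Pooled variance: a weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df

# This year: values from the example. Last year: sd and n are hypothetical.
t_stat, df = pooled_t_statistic(32.0, 4.0, 25, 26.0, 7.0, 25)

T_CRIT_05_DF48 = 2.011  # two-sided 0.05 critical value for 48 df (t table)
significant = abs(t_stat) > T_CRIT_05_DF48
print(f"t = {t_stat:.2f}, df = {df}, significant at 0.05: {significant}")
```

With these made-up inputs the t statistic is about 3.72, which exceeds the critical value, so the null hypothesis would be rejected at the 0.05 level. When the two standard deviations differ as much as they do here, many analysts would prefer Welch’s variant of the t-test, which does not assume equal variances.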
What if you now wanted to compare the sugar content of all tomato harvests from the last 20 years? Theoretically, you could conduct pairwise t-tests among all of the different subsamples, but this approach can lead to trouble. With every t-test, there is always a chance, however small, that the null hypothesis is incorrectly rejected and a so-called “false positive” result is produced. Running many t-tests on the same data increases the likelihood that at least one of them produces a false positive. When comparing three or more samples, scientists instead use methods like “analysis of variance,” or ANOVA, which compare multiple samples all at once to reduce the chance of introducing error into the statistical analysis.
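A rough calculation shows how quickly false positives accumulate. The sketch below counts the pairwise comparisons among 20 harvests and estimates the chance of at least one false positive, under the simplifying assumption that the tests are independent (pairwise tests that share harvests are not strictly independent, so treat the result as an approximation).

```python
from math import comb

def familywise_error_rate(n_groups, alpha=0.05):
    """Number of pairwise tests and the chance of at least one false positive."""
    n_tests = comb(n_groups, 2)  # every distinct pair of groups
    # Probability that none of the tests errs is (1 - alpha) ** n_tests.
    return n_tests, 1 - (1 - alpha) ** n_tests

n_tests, error_rate = familywise_error_rate(20)
print(f"{n_tests} pairwise tests, chance of a false positive: {error_rate:.4f}")
```

Twenty harvests require 190 pairwise tests, and under this approximation at least one false positive is all but guaranteed, which is why a single test across all groups is preferred.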
Finding relationships among variables
As you continue to analyze all of your tomato data, you notice that the tomatoes seem to be sweeter in warmer years. Are you making this up, or could there actually be a relationship between tomato sweetness and the weather? To analyze these kinds of mutual relationships between two or more variables, scientists can use techniques in inferential statistics to measure how much the variables correlate with one another. A strong correlation between two variables means that the variables change, or vary, in similar ways. For example, medical research has shown people with high-salt diets tend to have higher blood pressure than people with low-salt diets. Thus, blood pressure and salt consumption are said to be correlated.
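One widely used measure of correlation is the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it for a handful of invented yearly values; the temperatures and sugar contents are hypothetical and serve only to show the calculation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: mean growing-season temperature (deg C) and
# mean sugar content (mg/g) for five harvest years.
temperature = [21.0, 22.5, 23.0, 24.5, 26.0]
sugar = [26.0, 28.0, 27.5, 31.0, 32.0]

r = pearson_r(temperature, sugar)
print(f"r = {r:.2f}")
```

A value of r close to 1 indicates that the two variables rise together, a value close to -1 indicates that one falls as the other rises, and a value near 0 indicates little linear relationship.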
When scientists analyze the relationships between two or more variables, they must take great care to distinguish between correlation and causation. A strong correlation between two variables can signify that a relationship exists, but it does not provide any information about the nature of that relationship. It can be tempting to look for cause-and-effect relationships in datasets, but correlation among variables does not necessarily mean that changes in one variable caused or influenced changes in the other. While two variables may show a correlation if they are directly related to each other, they could also be correlated if they are both related to a third unknown variable. Moreover, two variables can sometimes appear correlated simply by chance. The total revenue generated by arcades and the number of computer science doctorates awarded in the United States change in very similar ways over time and can therefore be said to correlate (Figure 4). The two variables are highly correlated with each other, but we cannot conclude that changes in one variable are causing changes in the other. Causation must ultimately be determined by the researcher, typically through the discovery of a reasonable mechanism by which one variable can directly affect the other.
Although correlation does not imply causation on its own, researchers can still establish cause-and-effect relationships between two variables. In these kinds of relationships, an independent variable (one that is not changed by any other variables being studied) is said to cause an effect on a dependent variable. The dependent variable is named for the fact that it will change in response to an independent variable – its value is literally dependent on the value of the independent variable. The strength of such a relationship can be analyzed using a linear regression, which shows the degree to which the data collected for two variables fall along a straight line. This statistical operation could be used to examine the relationship between tomato sweetness (the dependent variable) and a number of weather-related independent variables that could plausibly affect the growth, and therefore sweetness, of the tomatoes (Figure 5).
When the independent and dependent variable measurements fall close to a straight line, the relationship between the two variables is said to be “strong” and you can be more confident that the two variables are indeed related. When data points appear more scattered, the relationship is weaker and there is more uncertainty associated with the relationship.
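As a minimal sketch of a linear regression, the code below fits a straight line by ordinary least squares and reports R², the fraction of the variation in the dependent variable that the line accounts for. The temperature and sugar values are hypothetical, invented only for this example.

```python
def fit_line(xs, ys):
    """Least-squares slope, intercept, and R-squared for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Compare the scatter around the line with the scatter around the mean.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical data: temperature (independent) and sugar content (dependent).
temperature = [21.0, 22.5, 23.0, 24.5, 26.0]
sugar = [26.0, 28.0, 27.5, 31.0, 32.0]

slope, intercept, r_squared = fit_line(temperature, sugar)
print(f"sugar = {intercept:.2f} + {slope:.2f} * temperature, R^2 = {r_squared:.2f}")
```

An R² near 1 means the points fall close to the fitted line (a “strong” relationship), while an R² near 0 means the points are widely scattered around it.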
Statistical inference with qualitative data
So far we have only considered examples in which the data being collected and analyzed are quantitative in nature and can be described with numbers. Instead of describing tomato sweetness quantitatively by experimentally measuring the sugar content, what if you asked a panel of taste-testers to rate the sweetness of the tomatoes on a scale from “not at all sweet” to “very sweet”? This would give you a qualitative dataset based on observations rather than numerical measurements (Figure 6).
The statistical methods discussed above would not be appropriate for analyzing this kind of data. If you tried to assign numerical values, one through four, to each of the responses on the taste-test scale, the meaning of the original data would change. For example, we cannot say with certainty that the difference between “3 - sweet” and “4 - very sweet” is really exactly the same as the difference between “1 - not at all sweet” and “2 - somewhat sweet.”
Rather than trying to make qualitative data more quantitative, scientists can use methods in statistical inference that are more appropriate for interpreting qualitative datasets. These methods often test for statistical significance by comparing the overall shape of the distributions of two or more subsamples – for instance, the location and number of peaks in the distribution or overall spread of the data – instead of using more quantitative measures like the mean and standard deviation. This approach is perfect for analyzing your tomato taste-test data. By using a statistical test that compares the shapes of the distributions of the taste-testers’ responses, you can determine whether or not the results are significantly different and thus whether one tomato harvest actually tastes sweeter than the other.
Proceed with caution!
Inferential statistics provides tools that help scientists analyze and interpret their data. The key here is that the scientists – not the statistical tests – are making the judgment calls. The way that the term “significance” is used in statistical inference can be a major source of confusion. In statistics, significance indicates how reliably a result can be observed if a statistical test is repeated over and over. A statistically significant result is not necessarily relevant or important; it is the scientist who determines the importance of the result. (For a broader discussion of statistical significance, see our module Statistics in Science.)
One additional pitfall is the close relationship between statistical significance and subsample size. As subsamples grow larger, it becomes easier to reliably detect even the smallest differences among them. Sometimes well-meaning scientists are so excited to report statistically significant results that they forget to ask whether the magnitude, or size, of the result is actually meaningful.
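The link between subsample size and significance can be illustrated with the two-sample t statistic. In the sketch below, the true difference between two groups is held at a small, hypothetical 0.5 mg/g with a standard deviation of 4 mg/g in each group, and only the number of tomatoes per group changes. The equal group sizes and equal standard deviations are simplifying assumptions.

```python
import math

def two_sample_t(mean_difference, sd, n_per_group):
    """t statistic for two equal-sized groups that share the same sd."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return mean_difference / standard_error

# Same small difference, increasingly large subsamples.
for n in (25, 250, 2500):
    t_stat = two_sample_t(0.5, 4.0, n)
    print(f"n = {n:>4} per group: t = {t_stat:.2f}")
```

The t statistic rises from roughly 0.44 to roughly 4.4 as the groups grow from 25 to 2,500 tomatoes, crossing the usual threshold of about 2 along the way. The difference itself never changes, so whether 0.5 mg/g matters is a scientific question that the test cannot answer.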
Statistical inference is a powerful tool, but like any tool it must be used appropriately. Misguided application or interpretation of inferential statistics can lead to distorted or misleading scientific results. On the other hand, proper application of the methods described in this module can help scientists gain important insights about their data and lead to amazing discoveries. So use these methods wisely, and remember: It is ultimately up to scientists to ascribe meaning to their data.