by Anthony Carpi, Ph.D., Anne E. Egger, Ph.D.
Modern science is often based on statements of statistical significance and probability. For example: 1) studies have shown that the probability of developing lung cancer is almost 20-times greater in cigarette smokers compared to non-smokers (ACS, 2004); 2) there is a significant likelihood of a catastrophic meteorite impact on Earth sometime in the next 200,000 years (Bland, 2005); and 3) first-born male children exhibit IQ test scores that are 2.82 points higher than second-born males, a difference that is significant at the 95% confidence level (Kristensen & Bjerkedal, 2007). But why do scientists speak in terms that seem obscure? If cigarette smoking causes lung cancer, why not simply say so? If we should immediately establish a colony on the moon to escape extraterrestrial disaster, why not inform people? And if older children are smarter than their younger siblings, why not let them know?
The reason is that none of these latter statements accurately reflect the data. Scientific data rarely lead to absolute conclusions. Not all smokers die from lung cancer: some smokers decide to quit, thus reducing their risk; some may die prematurely from cardiovascular disease or from other diseases besides lung cancer; and some may simply never contract the disease. All data exhibit variability, and it is the role of statistics to quantify this variability and allow scientists to make more accurate statements about their data.
A common misconception is that statistics provide a measure of proof that something is true, but they actually do no such thing. Instead, statistics provide a measure of the probability of observing a certain result. This is a critical distinction. For example, the American Cancer Society has conducted several massive studies of cancer in an effort to make statements about the risks of the disease in US citizens. Cancer Prevention Study I enrolled approximately 1 million people between 1959 and 1960, and Cancer Prevention Study II was even larger, enrolling 1.2 million people in 1982. Both of these studies found much higher rates of lung cancer among cigarette smokers compared to non-smokers; however, not all individuals who smoked contracted lung cancer (and, in fact, some non-smokers did contract lung cancer). Thus, the development of lung cancer is a probability-based event, not a simple cause-and-effect relationship. Statistical techniques allow scientists to put numbers to this probability, moving from a statement like “if you smoke cigarettes, you are more likely to develop lung cancer” to the one that started this module: “the probability of developing lung cancer is almost 20-times greater in cigarette smokers compared to non-smokers.” The quantification of probability offered by statistics is a powerful tool used widely throughout science, but it is frequently misunderstood.
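A statement such as "almost 20 times greater" is a relative risk: the rate of disease in one group divided by the rate in another. The following is a minimal sketch of that arithmetic. The counts are hypothetical and chosen only for illustration; they are not data from the Cancer Prevention Studies, and the function name is an assumption of this example.

```python
def relative_risk(cases_exposed, total_exposed, cases_unexposed, total_unexposed):
    """Ratio of the disease rate in the exposed group to the rate in the unexposed group."""
    risk_exposed = cases_exposed / total_exposed
    risk_unexposed = cases_unexposed / total_unexposed
    return risk_exposed / risk_unexposed


# Hypothetical counts (NOT data from the Cancer Prevention Studies).
smokers_with_cancer, smokers_total = 400, 100_000
nonsmokers_with_cancer, nonsmokers_total = 20, 100_000

rr = relative_risk(smokers_with_cancer, smokers_total,
                   nonsmokers_with_cancer, nonsmokers_total)
print(f"Relative risk: {rr:.1f}")
```

Note that in these made-up numbers most smokers never develop the disease and some non-smokers do, yet the relative risk is still large. The ratio describes a difference in probability between groups, not a certain outcome for any individual.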
Figure 1: The field of statistics has its roots in calculations of the probable outcomes of games of chance.
The field of statistics dates to 1654 when a French gambler, Antoine Gombaud, asked the noted mathematician and philosopher Blaise Pascal about how one should divide the stakes among players when a game of chance is interrupted prematurely. Pascal posed the question to the lawyer and mathematician Pierre de Fermat, and over a series of letters, Pascal and Fermat devised a mathematical system that not only answered Gombaud’s original question, but laid the foundations of modern probability theory and statistics.
From its roots in gambling, statistics has grown into a field of study that involves the development of methods and tests that are used to quantitatively define the variability inherent in data, the probability of certain outcomes, and the error and uncertainty associated with those outcomes (see our Data: Uncertainty, Error, and Confidence module). As such, statistical methods are used extensively throughout the scientific process, from the design of research questions through data analysis and to the final interpretation of data. The specific statistical methods used vary widely between different scientific disciplines; however, the reasons that these tests and techniques are used are similar across disciplines. This module does not attempt to introduce the many different statistical concepts and tests that have been developed, but rather provides an overview of how various statistical methods are used in science. More information about specific statistical tests and methods can be found in the links section of this module.
Many people misinterpret statements of likelihood and probability as a sign of weakness or uncertainty in scientific results. However, the use of statistical methods and probability tests in research is an important aspect of science that adds strength and certainty to scientific conclusions. For example, in 1843, John Bennet Lawes, an English entrepreneur, founded the Rothamsted Agriculture Experimental Station in Hertfordshire, England to investigate the impact of fertilizer application on crop yield. Lawes was motivated to do so because he had established one of the first artificial fertilizer factories a year earlier. For the next 80 years, researchers at the Station conducted experiments in which they applied fertilizers, planted different crops, kept track of the amount of rain that fell, and measured the size of the harvest at the end of each growing season. By the turn of the century, the Station had a vast collection of data but few useful conclusions: one fertilizer would outperform another one year but underperform the next, certain fertilizers appeared to affect only certain crops, and the differing amounts of rainfall that fell each year continually confounded the experiments (Salsburg, 2001). The data were essentially useless because there were a large number of uncontrolled variables.
Figure 2: A building at the Rothamsted Research Station
In 1919, the Rothamsted Station hired a young statistician by the name of Ronald Aylmer Fisher to try to make some sense of the data. Fisher’s statistical analyses suggested that the relationship between rainfall and plant growth was far more statistically significant than the relationship between fertilizer type and plant growth. But the agricultural scientists at the station weren’t out to test for weather – they wanted to know which fertilizers were most effective for which crops. No one could remove weather as a variable in the experiments, but Fisher realized that its effects could essentially be separated out if the experiments were designed appropriately. In order to share his insights with the scientific community, he published two books: Statistical Methods for Research Workers in 1925 and The Design of Experiments in 1935. By highlighting the need to consider statistical analysis during the planning stages of research, Fisher revolutionized the practice of science and transformed the Rothamsted Station into a major center for research on statistics and agriculture, which it still is today.
Figure 3: An original figure from Fisher's The Design of Experiments showing the arrangement of treatment groups and yields of barley in an experiment at the Rothamsted station in 1927 (Fisher, 1935). Letters in parentheses denote control plots not treated with fertilizer (I) or those treated with different fertilizers (s = sulfate of ammonia, m = chloride of ammonia, c = cyanamide, and u = urea) with or without the addition of superphosphate (p). Subscripted numbers in parentheses indicate relative quantities of fertilizer used. Numbers at the bottom of each block indicate the relative yield of barley from the plot.
Controls: The use of controls is based on the concept of variability. Since any phenomenon has some measure of variability, controls allow the researcher to measure natural, random, or systematic variability in a similar system and use that estimate as a baseline for comparison to the observed variable or phenomenon. At Rothamsted, a control would be a crop that did not receive the application of fertilizer (see plots labeled I in Figure 3). The variability inherent in plant growth would still produce plants of varying heights and sizes. The control then could provide a measure of the impact that weather or other variables could have on crop growth independent of fertilizer application, thus allowing the researchers to statistically remove this as a factor.
Randomization: Statistical randomization helps to manage bias in scientific research. Unlike the common use of the word random, which implies haphazard or disorganized, statistical randomization is a precise procedure in which units being observed are assigned to a treatment or control group in a manner that takes into account the potential influence of confounding variables. This allows the researcher to quantify the influence of these confounding variables by observing them in both the control and treatment groups. For example, before Fisher, fertilizers were applied along different crop rows at Rothamsted, some of which fell entirely along the edge of fields. Yet edges are known to affect agricultural yield, and so it was difficult in many cases to distinguish edge effects from fertilizer effects. Fisher introduced a process of randomly assigning different fertilizers to different plots within a field in a single year while assuring that not all of the treatment (or control) plots for any particular fertilizer fell along the edge of the field (see Figure 3).
Replication: Fisher also advocated for replicating experimental trials and measurements such that the range of variability inherently associated with the experiment or measurement could be quantified and the robustness of the results could be evaluated. At Rothamsted this meant planting multiple plots with the same crop and applying the same fertilizer to each of those plots (see Figure 3). Further, this meant repeating similar applications in different years so that the variability of different fertilizer applications as a function of different weather conditions could be quantified.
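The three ideas above (controls, randomization, and replication) can be sketched together as a simple randomized block layout, in which every treatment, including the untreated control, appears once in each block and its position within the block is left to chance. This is an illustrative sketch only: the function name and the number of blocks are assumptions, the treatment labels loosely follow Figure 3, and the code does not reproduce Fisher's actual field layout.

```python
import random


def randomized_block_design(treatments, n_blocks, seed=None):
    """Return one shuffled ordering of all treatments per block.

    Every treatment (including the control) appears once in each block,
    so each is replicated n_blocks times, and its position within a
    block is decided by chance rather than by the experimenter.
    """
    rng = random.Random(seed)
    layout = []
    for _ in range(n_blocks):
        block = list(treatments)
        rng.shuffle(block)
        layout.append(block)
    return layout


# Labels loosely follow Figure 3; the layout itself is hypothetical.
treatments = ["control", "sulfate of ammonia", "chloride of ammonia",
              "cyanamide", "urea"]

for i, block in enumerate(randomized_block_design(treatments, n_blocks=4, seed=1), start=1):
    print(f"Block {i}: {block}")
```

Because positions are shuffled independently in each block, it is unlikely that any one fertilizer ends up on the field edge in every block, which is what lets edge effects be separated from fertilizer effects in the analysis.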
In general, scientists design research studies based on the nature of the question they are seeking to investigate, but they refine their research plan in line with many of Fisher’s statistical concepts to increase the likelihood that their findings will be useful. The incorporation of these techniques facilitates the analysis and interpretation of data, another place where statistics are used.
A multitude of statistical techniques have been developed for data analysis, but they generally fall into two groups: descriptive and inferential.
Descriptive Statistics: Descriptive statistics allow a scientist to quickly sum up major attributes of a dataset using measures such as the mean, median, and standard deviation. These measures provide a general sense of the group being studied, allowing scientists to place the study within a larger context. For example, Cancer Prevention Study I (CPS-I), mentioned earlier, was a prospective mortality study initiated in 1959. Researchers conducting the study reported the age and demographics of participants, among other variables, to allow a comparison between the study group and the broader population of the United States at the time. Adults participating in the study ranged from 30 to 108 years of age, with the median age reported as 52 years. The study subjects were 57% female, 97% white, and 2% black. By comparison, the median age in the United States in 1959 was 29.4 years, obviously much younger than the study group since CPS-I did not enroll anyone under 30 years of age. Further, 51% of U.S. residents were female in 1960, 89% white, and about 11% black. One recognized shortcoming of CPS-I, easily identifiable from the descriptive statistics, was that with 97% of participants categorized as white, the study did not adequately assess disease profiles in minority groups of the U.S.
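As a small illustration of these summary measures, the sketch below computes the mean, median, and standard deviation with Python's standard `statistics` module. The list of ages is hypothetical and is not drawn from CPS-I.

```python
import statistics

# Hypothetical ages for a small study group (not CPS-I data).
ages = [30, 34, 41, 47, 52, 52, 58, 63, 71, 108]

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
stdev_age = statistics.stdev(ages)  # sample standard deviation

print(f"mean:   {mean_age:.1f}")
print(f"median: {median_age:.1f}")
print(f"stdev:  {stdev_age:.1f}")
```

In this made-up sample the single 108-year-old pulls the mean above the median, which is one reason studies often report the median age of participants.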
Inferential Statistics: Inferential statistics are used to model patterns in data, make judgments about data, identify relationships between variables in datasets, and make inferences about larger populations based on smaller samples of data. It is important to keep in mind that from a statistical perspective the word “population” does not have to mean a group of people, as it does in common language. A statistical population is the larger group that a data set is used to make inferences about – this can be a group of people, corn plants, meteor impacts, oil field locations, or any other group of measurements as the case may be.
Transferring results from small sample sizes to large populations is especially important with respect to scientific studies. For example, while Cancer Prevention Studies I and II enrolled approximately 1 million and 1.2 million people, respectively, they represented a small fraction of the 179 and 226 million people who were living in the United States in 1960 and 1980. Common inferential techniques include regression, correlation, and point estimation/testing. For example, Petter Kristensen and Tor Bjerkedal (2007) examined IQ test scores in a group of 250,000 male Norwegian military personnel. Their analyses suggested that first-born male children had an average IQ test score 2.82 ± 0.07 points higher than second-born male children, a statistically significant difference at the 95% confidence level (Kristensen & Bjerkedal, 2007).
The phrase “statistically significant” is a key concept in data analysis, and it is commonly misunderstood. Many people assume that, like the common use of the word significant, calling a result statistically significant means that the result is important or momentous, but this is not the case. Instead, statistical significance is a statement about how easily chance alone could have produced the observed association or difference. In other words, tests of statistical significance describe the likelihood that an association or difference at least as large as the one observed would be seen even if there were no real association or difference actually present. The measure of significance is often expressed in terms of confidence, which has a similar meaning in statistics as it does in common language, but can be quantified. In Kristensen and Bjerkedal’s work, for example, the IQ difference between first- and second-born men was found to be significant at a 95% confidence level, meaning that if birth order truly made no difference, a gap this large would be expected to arise from random sampling alone less than 5% of the time. This does not mean that the difference is large or even important: 2.82 IQ points is a tiny blip on the IQ scale and hardly enough to declare first-borns geniuses in relation to their younger siblings. Nor do the findings imply that the outcome is 95% “correct.” Instead, they indicate that the observed difference is unlikely to be an artifact of random sampling variation, so a similar study in a different population of Norwegian men would be expected to show a difference in the same direction. A second-born Norwegian who has a higher IQ than his older brother does not disprove the research – it is just a statistically less likely outcome.
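One way to make this idea concrete is a permutation test, which asks directly how often a difference at least as large as the observed one appears when group labels are assigned at random, that is, when there is no real difference. The sketch below is an illustration of the concept, not the method Kristensen and Bjerkedal used; the scores are hypothetical, and the function name, number of shuffles, and seed are assumptions of this example.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate how often a difference in means at least as large as the
    observed one appears when group labels are shuffled at random,
    i.e. when there is no real difference between the groups."""
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_permutations


# Hypothetical test scores for two small groups (not the Norwegian data).
first_born = [104, 99, 110, 102, 107, 101, 105, 108]
second_born = [100, 97, 103, 99, 104, 98, 101, 102]

p = permutation_p_value(first_born, second_born)
print(f"Estimated p-value: {p:.3f}")
if p < 0.05:
    print("Significant at the 95% confidence level")
else:
    print("Not significant at the 95% confidence level")
```

The returned value is the fraction of random relabelings that produce a gap as large as the real one. A small fraction says that chance alone rarely produces such a gap; it says nothing about whether the gap is large enough to matter.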
Just as revealing as a statistically significant difference or relationship is the lack of a statistically significant difference. For example, researchers have found that the risk of dying from heart disease in men who have quit smoking for at least two years is not significantly different from the risk of the disease in male non-smokers (Rosenberg et al., 1985). So, the statistics show that while smokers have a significantly higher rate of heart disease than non-smokers, this risk falls back to baseline within just two years after having quit smoking.
Given the wide variety of possible statistical tests, it is easy to misuse statistics in data analysis, often to the point of deception. One reason for this is that statistics do not address systematic error that can be introduced into a study either intentionally or accidentally. For example, in one of the first studies that reported on the effects of quitting smoking, E. Cuyler Hammond and Daniel Horn found that individuals who smoked more than one pack of cigarettes a day but had quit smoking within the past year had a death rate of 198.0, significantly higher than the rate of 157.1 for individuals who were still smoking more than one pack a day at the time of their study (Hammond & Horn, 1958). Without a proper understanding of the study, one might conclude from the statistics that quitting smoking is actually dangerous for heavy smokers. However, Hammond later offered an explanation for this finding: “This is not surprising in light of the fact that recent ex-smokers, as a group, are heavily weighted with men in ill health” (Hammond, 1965). Thus, heavy smokers who had stopped smoking included many individuals who had quit because they were already diagnosed with an illness, adding systematic error to the sample set. Without a complete understanding of these facts, the statistics alone could be misinterpreted. The most effective use of statistics, then, is to identify trends and features within a dataset. These trends can then be interpreted by the researcher in light of his or her understanding of their scientific basis, possibly opening up opportunities for further study. Andrew Lang, a Scottish poet and novelist, famously summed up this aspect of statistical testing when he stated that, “An unsophisticated forecaster uses statistics as a drunken man uses lamp-posts - for support rather than for illumination.”
Another misconception of statistical testing is that statistical relationships and associations prove causation. In reality, identification of a correlation or association between variables does not mean that a change in one variable actually caused the change in another variable. For example, in 1950 Richard Doll and Austin Hill, British researchers who became known for conducting one of the first scientifically valid comparative studies (see our Research Methods: Comparison module) of smoking and the development of lung cancer, famously wrote about the correlation they uncovered, “This is not necessarily to state that smoking causes carcinoma of the lung. The association would occur if carcinoma of the lung caused people to smoke or if both attributes were end-effects of a common cause” (Doll & Hill, 1950). Doll and Hill went on to discuss the scientific basis of the correlation and the fact that the habit of smoking preceded the development of lung cancer in all of their study subjects, leading them to conclude “…that smoking is a factor, and an important factor, in the production of carcinoma of the lung.” As multiple lines of scientific evidence have accumulated regarding the association between smoking and lung cancer, scientists are now able to make very accurate statements about the statistical probability of risk associated with smoking cigarettes.
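Doll and Hill's caution about "end-effects of a common cause" can be illustrated with a short simulation in which two variables are correlated even though neither influences the other. This is a hypothetical sketch: the variable names, the noise model, and the sample size are assumptions made for illustration, and the data are simulated rather than taken from any study.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)

# Hypothetical hidden factor that drives both measured variables.
common_cause = [rng.gauss(0, 1) for _ in range(5000)]

# Neither a nor b influences the other: each is the common cause
# plus its own independent noise.
a = [c + rng.gauss(0, 1) for c in common_cause]
b = [c + rng.gauss(0, 1) for c in common_cause]

r = pearson_r(a, b)
print(f"correlation between a and b: {r:.2f}")
```

Under these assumptions the expected correlation is 0.5, a clear statistical association that arises entirely from the shared hidden factor. Telling such a case apart from real causation requires evidence beyond the correlation itself, such as the timing of exposure that Doll and Hill examined.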
Figure 4: Filtered and low tar cigarettes were advertised as less dangerous based on hollow statistics.
While statistics help uncover patterns, relationships, and variability in data, they can unfortunately be used to misrepresent data, relationships, and interpretations. For example, in the late 1950s, in light of the mounting comparative studies that demonstrated a causative relationship between cigarette smoking and lung cancer, the major tobacco companies began to investigate the viability of marketing alternative products that they could promote as “healthier” than regular cigarettes. As a result, filtered and light cigarettes were developed. The tobacco industry then sponsored and widely advertised research that suggested that the common cellulose acetate filter reduced tar in regular cigarettes by 42-46% and nicotine by 19-35%. Marlboro® filtered cigarettes claimed to have “22 percent less tar, 34 percent less nicotine” than other brands. The tobacco industry launched a similar advertising campaign promoting low tar cigarettes (6 to 12 mg tar compared to 12 to 16 mg in “regular” cigarettes) and ultra low tar cigarettes (under 6 mg) (Glantz et al., 1996). While the industry flooded the public with statistics on tar content, the tobacco companies did not advertise the fact that there was no research to indicate that tar or nicotine were the causative agents in the development of smoking-induced lung cancer. In fact, several research studies showed that the risks associated with low tar products were no different from those of regular products, and worse still, some studies showed that “low tar” cigarettes led to increased consumption of cigarettes among smokers (Stepney, 1980; NCI, 2001). Thus hollow statistics were used to mislead the public and detract from the real issue.
All measurements contain some uncertainty and error, and statistical methods help us quantify and characterize this uncertainty. This helps explain why scientists often speak in qualified statements. For example, no seismologist who studies earthquakes would be willing to tell you exactly when an earthquake is going to occur; instead, the U.S. Geological Survey issues statements like this, “There is … a 62% probability of at least one magnitude 6.7 or greater earthquake in the 3-decade interval 2003-2032 within the San Francisco Bay Region” (USGS, 2007). This may sound ambiguous, but it is in fact a very precise, mathematically-derived description of how confident seismologists are that a major earthquake will occur, and open reporting of error and uncertainty is a hallmark of quality scientific research.
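To see how precise such a statement is, the 30-year figure can be converted into an implied probability for a single year. The sketch below does this under a strong simplifying assumption, namely that earthquakes occur at a constant average rate (a Poisson process). That assumption is made here only for illustration and is not a description of how the USGS produced its forecast; the function name is also hypothetical.

```python
import math


def annual_probability(p_window, years):
    """Convert 'probability of at least one event in `years` years' into a
    per-year probability, ASSUMING a constant-rate Poisson process."""
    rate = -math.log(1.0 - p_window) / years  # expected events per year
    return 1.0 - math.exp(-rate)


p_year = annual_probability(0.62, 30)
print(f"Implied probability in any single year: {p_year:.3f}")
```

Under this assumption, a 62% chance over three decades corresponds to roughly a 3% chance in any given year. The single number quoted by the USGS therefore carries a definite, checkable meaning rather than a vague impression of risk.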
Today, science and statistical analyses have become so intertwined that many scientific disciplines have developed their own subsets of statistical techniques and terminology. For example, the field of biostatistics (sometimes referred to as biometry) involves the application of specific statistical techniques to disciplines in biology such as population genetics, epidemiology, and public health. The field of geostatistics has evolved to develop specialized spatial analysis techniques that help geologists map the location of petroleum and mineral deposits; these spatial analysis techniques have also helped Starbucks® determine the ideal distribution of coffee shops based on maximizing the number of customers visiting each store. Used correctly, statistical analysis goes well beyond finding the next oil field or cup of coffee to illuminating scientific data in a way that helps validate scientific knowledge.
ACS (2004). Cancer Facts & Figures - 2004. American Cancer Society, Atlanta, GA.
ACS (2007). Cancer Prevention Studies Overview. American Cancer Society, Atlanta, GA.
ACS (2008). Characteristics of American Cancer Society Cohorts. American Cancer Society, Atlanta, GA. Retrieved July 18, 2008.
Bland, P.A. (2005). The impact rate on Earth. Phil. Trans. R. Soc. A, 363:2793-2810.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.) Hillsdale, NJ: Lawrence Erlbaum Associates.
Doll, R., & Hill, A.B. (1950). Smoking and Carcinoma of the lung. British Medical Journal 2(4682), 739-748.
Fisher, R.A. (1935). The Design of Experiments. Oxford University Press, Oxford.
Glantz, S.A., Slade, J., Bero, L.A., Hanauer, P., Barnes, D.E. (1996) The Cigarette Papers. University of California Press, Berkeley, CA.
Hamilton, W.L., Norton, G.d., Ouellette, T.K., Rhodes, W.M., Kling, R., Connolly, G.N. (2004). Smokers’ responses to advertisements for regular and light cigarettes and potential reduced-exposure tobacco products. Nicotine & Tobacco Research, 6(Supp. 3):S353–S362.
Hammond, E.C., Horn, D. (1958). Smoking and death rates: report on forty-four months of follow-up of 187,783 men. 2. Death rates by cause. J Am Med Assoc. 166(11):1294-308.
Hammond, E.C. (1965). Evidence of the Effects of Giving Up Cigarette Smoking. American Journal of Public Health 55:682-691.
Kristensen, P., & Bjerkedal, T. (2007). Explaining the Relation between Birth Order and Intelligence. Science 316(5832), 1717.
National Center for Health Statistics (2006). Health, United States, 2006. NCHS, Centers for Disease Control and Prevention, U.S. Department of Health and Human Services, Hyattsville, MD.
NCI – National Cancer Institute (2001). Monograph 13: Risks Associated with Smoking Cigarettes with Low Tar Machine-Measured Yields of Tar and Nicotine. NCI, Tobacco Control Research, Document M914.
Rosenberg, L., Kaufman, D.W., Helmrich, S.P., Shapiro, S. (1985). The risk of myocardial infarction after quitting smoking in men under 55 years of age. N Engl J Med, 313:1511-1514.
Salsburg, D. (2001). The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. W.H. Freeman & Company, New York.
Silverstein, B., Feld, S., Kozlowski, L.T. (1980). The Availability of Low-Nicotine Cigarettes as a Cause of Cigarette Smoking among Teenage Females (in Research Notes) Journal of Health and Social Behavior, 21(4):383-388.
Stepney, R. (1980). Consumption of Cigarettes of Reduced Tar and Nicotine Delivery. Addiction, 75(1):81–88.
Fisher, R.A. (1935). Design of Experiments. Hafner Press, New York.
Anthony Carpi, Ph.D., Anne E. Egger, Ph.D. "Data: Statistics," Visionlearning Vol. POS-1 (2), 2008.