# Statistics in Science

Modern science is often based on statements of statistical significance and probability. For example: 1) studies have shown that the probability of developing lung cancer is almost 20 times greater in cigarette smokers compared to nonsmokers (ACS, 2004); 2) there is a significant likelihood of a catastrophic meteorite impact on Earth sometime in the next 200,000 years (Bland, 2005); and 3) first-born male children exhibit IQ test scores that are 2.82 points higher than second-born males, a difference that is significant at the 95% confidence level (Kristensen & Bjerkedal, 2007). But why do scientists speak in terms that seem obscure? If cigarette smoking causes lung cancer, why not simply say so? If we should immediately establish a colony on the moon to escape extraterrestrial disaster, why not inform people? And if older children are smarter than their younger siblings, why not let them know?

The reason is that none of these latter statements accurately reflects the data. Scientific data rarely lead to absolute conclusions. Not all smokers die from lung cancer – some smokers decide to quit, thus reducing their risk; some may die prematurely from cardiovascular disease or other diseases besides lung cancer; and some may simply never contract the disease. All data exhibit variability, and it is the role of statistics to quantify this variability and allow scientists to make more accurate statements about their data.

A common misconception is that statistics provide a measure of proof that something is true, but they actually do no such thing. Instead, statistics provide a measure of the probability of observing a certain result. This is a critical distinction. For example, the American Cancer Society has conducted several massive studies of cancer in an effort to make statements about the risks of the disease in US citizens. Cancer Prevention Study I enrolled approximately 1 million people between 1959 and 1960, and Cancer Prevention Study II was even larger, enrolling 1.2 million people in 1982. Both of these studies found much higher rates of lung cancer among cigarette smokers compared to nonsmokers; however, not all individuals who smoked contracted lung cancer (and, in fact, some nonsmokers did contract lung cancer). Thus, the development of lung cancer is a probability-based event, not a simple cause-and-effect relationship.

Statistical techniques allow scientists to put numbers to this probability, moving from a statement like "If you smoke cigarettes, you are more likely to develop lung cancer" to the one that started this module: "The probability of developing lung cancer is almost 20 times greater in cigarette smokers compared to nonsmokers." The quantification of probability offered by statistics is a powerful tool used widely throughout science, but it is frequently misunderstood.
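As a rough illustration of how a statement like "20 times greater" is put together, the sketch below computes a relative risk: the probability of disease in one group divided by the probability in another. The counts are hypothetical numbers chosen only to make the arithmetic easy to follow, not data from the Cancer Prevention Studies, and the function name `relative_risk` is an assumption of this example.

```python
def relative_risk(cases_exposed, total_exposed, cases_unexposed, total_unexposed):
    """Ratio of the disease probability in the exposed group to that in the unexposed group."""
    risk_exposed = cases_exposed / total_exposed
    risk_unexposed = cases_unexposed / total_unexposed
    return risk_exposed / risk_unexposed


# Hypothetical counts for illustration only (not real study data).
smokers_with_disease, smokers_total = 200, 10_000
nonsmokers_with_disease, nonsmokers_total = 10, 10_000

rr = relative_risk(smokers_with_disease, smokers_total,
                   nonsmokers_with_disease, nonsmokers_total)
print(f"Relative risk: {rr:.1f}")
```

Note that even with a large relative risk, most people in both hypothetical groups never develop the disease, which is why the result is stated as a probability instead of a certainty.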

## What is statistics?

The field of statistics dates to 1654 when a French gambler, Antoine Gombaud, asked the noted mathematician and philosopher Blaise Pascal about how one should divide the stakes among players when a game of chance is interrupted prematurely. Pascal posed the question to the lawyer and mathematician Pierre de Fermat, and over a series of letters, Pascal and Fermat devised a mathematical system that not only answered Gombaud's original question, but laid the foundations of modern probability theory and statistics.

From its roots in gambling, statistics has grown into a field of study that involves the development of methods and tests that are used to quantitatively define the variability inherent in data, the probability of certain outcomes, and the error and uncertainty associated with those outcomes (see our Uncertainty, Error, and Confidence module). As such, statistical methods are used extensively throughout the scientific process, from the design of research questions through data analysis and to the final interpretation of data.

The specific statistical methods used vary widely between different scientific disciplines; however, the reasons that these tests and techniques are used are similar across disciplines. This module does not attempt to introduce the many different statistical concepts and tests that have been developed, but rather provides an overview of how various statistical methods are used in science. More information about specific statistical tests and methods can be found under the Resources tab.

## Statistics in research design

Many people misinterpret statements of likelihood and probability as a sign of weakness or uncertainty in scientific results. However, the use of statistical methods and probability tests in research is an important aspect of science that adds strength and certainty to scientific conclusions. For example, in 1843, John Bennet Lawes, an English entrepreneur, founded the Rothamsted Experimental Station in Hertfordshire, England to investigate the impact of fertilizer application on crop yield. Lawes was motivated to do so because he had established one of the first artificial fertilizer factories a year earlier. For the next 80 years, researchers at the Station conducted experiments in which they applied fertilizers, planted different crops, kept track of the amount of rain that fell, and measured the size of the harvest at the end of each growing season.

By the turn of the century, the Station had a vast collection of data but few useful conclusions: One fertilizer would outperform another one year but underperform the next, certain fertilizers appeared to affect only certain crops, and the differing amounts of rainfall that fell each year continually confounded the experiments (Salsburg, 2001). The data were essentially useless because there were a large number of uncontrolled variables.

In 1919, the Rothamsted Station hired a young statistician by the name of Ronald Aylmer Fisher to try to make some sense of the data. Fisher's statistical analyses suggested that the relationship between rainfall and plant growth was far more statistically significant than the relationship between fertilizer type and plant growth. But the agricultural scientists at the station weren't out to test for weather – they wanted to know which fertilizers were most effective for which crops. No one could remove weather as a variable in the experiments, but Fisher realized that its effects could essentially be separated out if the experiments were designed appropriately.

In order to share his insights with the scientific community, Fisher published two books: *Statistical Methods for Research Workers* in 1925 and *The Design of Experiments* in 1935. By highlighting the need to consider statistical analysis during the planning stages of research, Fisher revolutionized the practice of science and transformed the Rothamsted Station into a major center for research on statistics and agriculture, which it still is today.

In *The Design of Experiments*, Fisher introduced several concepts that have become hallmarks of good scientific research, including the use of controls, randomization, and replication (Figure 3).

Controls: The use of controls is based on the concept of variability. Since any phenomenon has some measure of variability, controls allow the researcher to measure natural, random, or systematic variability in a similar system and use that estimate as a baseline for comparison to the observed variable or phenomenon. At Rothamsted, a control would be a crop that did not receive the application of fertilizer (see plots labeled I in Figure 3). The variability inherent in plant growth would still produce plants of varying heights and sizes. The control then could provide a measure of the impact that weather or other variables could have on crop growth independent of fertilizer application, thus allowing the researchers to statistically remove this as a factor.

Randomization: Statistical randomization helps to manage bias in scientific research. Unlike the common use of the word *random*, which implies haphazard or disorganized, statistical randomization is a precise procedure in which units being observed are assigned to a treatment or control group in a manner that takes into account the potential influence of confounding variables. This allows the researcher to quantify the influence of these confounding variables by observing them in both the control and treatment groups. For example, before Fisher, fertilizers were applied along different crop rows at Rothamsted, some of which fell entirely along the edge of fields. Yet edges are known to affect agricultural yield, and so it was difficult in many cases to distinguish edge effects from fertilizer effects – the edge effects would be considered a confounding variable. Fisher introduced a process of randomly assigning different fertilizers to different plots within a field in a single year while assuring that not all of the treatment (or control) plots for any particular fertilizer fell along the edge of the field (see Figure 3).

Replication: Fisher also advocated for replicating experimental trials and measurements. This way the range of variability inherently associated with the experiment or measurement could be quantified and the robustness of the results could be evaluated. At Rothamsted this meant planting multiple plots with the same crop and applying the same fertilizer to each of those plots (see Figure 3). Further, this meant repeating similar applications in different years so that the variability of different fertilizer applications as a function of different weather conditions could be quantified.
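The three ideas above can be sketched together in a few lines of code. The example below is a minimal illustration, not a reconstruction of Fisher's actual procedure: it assigns a control and three fertilizers to a grid of plots at random, replicates each treatment equally, and redraws any layout in which all plots of some treatment sit on the field edge. The function name `randomized_layout`, the treatment labels, and the 4 x 4 grid are all assumptions made for this example.

```python
import random


def randomized_layout(treatments, rows, cols, seed=None):
    """Randomly assign treatments to a rows x cols grid of plots.

    Each treatment is replicated equally. A layout is rejected and redrawn
    if any treatment has all of its plots on the edge of the field.
    """
    n_plots = rows * cols
    if n_plots % len(treatments) != 0:
        raise ValueError("plots must divide evenly among treatments")
    if (rows - 2) * (cols - 2) < len(treatments):
        raise ValueError("not enough interior plots for every treatment")

    reps = n_plots // len(treatments)          # replication: plots per treatment
    labels = [t for t in treatments for _ in range(reps)]
    rng = random.Random(seed)

    def is_edge(r, c):
        return r in (0, rows - 1) or c in (0, cols - 1)

    while True:
        rng.shuffle(labels)                    # randomization
        grid = [labels[r * cols:(r + 1) * cols] for r in range(rows)]
        interior = {grid[r][c]
                    for r in range(rows) for c in range(cols)
                    if not is_edge(r, c)}
        if interior == set(treatments):        # every treatment has an interior plot
            return grid


# "control" plots receive no fertilizer; A, B, C are hypothetical fertilizers.
layout = randomized_layout(["control", "A", "B", "C"], rows=4, cols=4, seed=1)
for row in layout:
    print(row)
```

Fixing the seed makes the layout reproducible, which is convenient for record keeping; a different seed gives a different but equally valid random assignment.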

In general, scientists design research studies based on the nature of the question they are seeking to investigate, but they refine their research plan in line with many of Fisher's statistical concepts to increase the likelihood that their findings will be useful. The incorporation of these techniques facilitates the analysis and interpretation of data, another place where statistics are used.

## Statistics in data analysis

A multitude of statistical techniques have been developed for data analysis, but they generally fall into two groups: descriptive and inferential.

Descriptive Statistics: Descriptive statistics allow a scientist to quickly sum up major attributes of a dataset using measures such as the mean, median, and standard deviation. These measures provide a general sense of the group being studied, allowing scientists to place the study within a larger context. For example, as mentioned earlier, Cancer Prevention Study I (CPS-I) was a prospective mortality study initiated in 1959. Researchers conducting the study reported the age and demographics of participants, among other variables, to allow a comparison between the study group and the broader population of the United States at the time. Adults participating in the study ranged from 30 to 108 years of age, with the median age reported as 52 years. The study subjects were 57% female, 97% white, and 2% black. By comparison, median age in the United States in 1959 was 29.4 years of age, obviously much younger than the study group since CPS-I did not enroll anyone under 30 years of age. Further, 51% of US residents were female in 1960, 89% white, and about 11% black. One recognized shortcoming of CPS-I, easily identifiable from the descriptive statistics, was that with 97% of participants categorized as white, the study did not adequately assess disease profiles in minority groups of the US.
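To make these measures concrete, the short sketch below summarizes a small list of ages using Python's standard `statistics` module. The ages are hypothetical values invented for this example and are not CPS-I data; the variable names are likewise assumptions.

```python
import statistics

# Hypothetical participant ages, for illustration only.
ages = [34, 41, 47, 52, 52, 58, 63, 70, 85]

summary = {
    "n": len(ages),
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),      # arithmetic average
    "median": statistics.median(ages),  # middle value when sorted
    "stdev": statistics.stdev(ages),    # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.1f}" if isinstance(value, float) else f"{name}: {value}")
```

A summary like this is what allows a reader to compare a study group with a broader population at a glance, as in the CPS-I comparison above.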

Inferential Statistics: Inferential statistics are used to model patterns in data, make judgments about data, identify relationships between variables in datasets, and make inferences about larger populations based on smaller samples of data. It is important to keep in mind that from a statistical perspective, the word "population" does not have to mean a group of people as it does in common language. A statistical population is the larger group that a dataset is used to make inferences about – this can be a group of people, corn plants, meteor impacts, oil field locations, or any other group of measurements as the case may be.

Transferring results from small sample sizes to large populations is especially important with respect to scientific studies. For example, while Cancer Prevention Studies I and II enrolled approximately 1 million and 1.2 million people, respectively, they represented a small fraction of the 179 and 226 million people who were living in the United States in 1960 and 1980. Common inferential techniques include regression, correlation, and point estimation/testing. For example, Petter Kristensen and Tor Bjerkedal (2007) examined IQ test scores in a group of 250,000 male Norwegian military personnel. Their analyses suggested that first-born male children had an average IQ test score 2.82 ± 0.07 points higher than second-born male children, a statistically significant difference at the 95% confidence level (Kristensen & Bjerkedal, 2007).

The phrase "statistically significant" is a key concept in data analysis, and it is commonly misunderstood. Many people assume that, like the common use of the word *significant*, calling a result statistically significant means that the result is important or momentous, but this is not the case. Instead, statistical significance concerns how easily chance alone could have produced the observed association or difference. Tests of statistical significance describe the likelihood that an association or difference at least as large as the one observed would be seen even if there were no real association or difference actually present. The measure of significance is often expressed in terms of confidence, which has the same meaning in statistics as it does in common language, but can be quantified.

In Kristensen and Bjerkedal's work, for example, the IQ difference between first- and second-born men was found to be significant at a 95% confidence level, meaning that a difference this large would be expected less than 5% of the time if birth order had no real association with IQ test scores. This does not mean that the difference is large or even important: 2.82 IQ points is a tiny blip on the IQ scale and hardly enough to declare first-borns geniuses in relation to their younger siblings. Nor do the findings imply that the outcome is 95% "correct." Instead, they indicate that the observed difference is unlikely to be a product of random sampling variation alone, so a similar study in a different population of Norwegian men would be expected to show a similar pattern. A second-born Norwegian who has a higher IQ than his older brother does not disprove the research – it is just a statistically less likely outcome.
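One way to see what a significance test measures is a permutation test, sketched below. It asks how often a difference in group means at least as large as the observed one appears when the group labels are shuffled at random, that is, when there is no real difference by construction. This is a swapped-in illustrative technique, not the analysis Kristensen and Bjerkedal performed, and the scores are simulated values, not their data. The function name `permutation_p_value` is an assumption of this example.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Fraction of random relabelings whose mean difference is at least
    as large (in absolute value) as the observed difference."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # break any real link between label and score
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_permutations


# Simulated scores for illustration only: two groups whose true means differ by 3.
rng = random.Random(1)
first_born = [rng.gauss(103, 15) for _ in range(300)]
second_born = [rng.gauss(100, 15) for _ in range(300)]

p = permutation_p_value(first_born, second_born)
print(f"Estimated p-value: {p:.3f}")
```

A value below 0.05 would correspond to significance at the 95% confidence level. A small p-value says nothing about whether the difference is large or important, only that chance alone rarely produces one of that size.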

Just as revealing as a statistically significant difference or relationship is the lack of a statistically significant difference. For example, researchers have found that the risk of dying from heart disease in men who have quit smoking for at least two years is not significantly different from the risk of the disease in male nonsmokers (Rosenberg et al., 1985). So, the statistics show that while smokers have a significantly higher rate of heart disease than nonsmokers, this risk falls back to baseline within just two years after having quit smoking.

## Limitations, misconceptions, and the misuse of statistics

Given the wide variety of possible statistical tests, it is easy to misuse statistics in data analysis, often to the point of deception. One reason for this is that statistics do not address systematic error that can be introduced into a study either intentionally or accidentally. For example, in one of the first studies that reported on the effects of quitting smoking, E. Cuyler Hammond and Daniel Horn found that individuals who smoked more than one pack of cigarettes a day but had quit smoking within the past year had a death rate of 198.0, significantly higher than the rate of 157.1 for individuals who were still smoking more than one pack a day at the time of their study (Hammond & Horn, 1958). Without a proper understanding of the study, one might conclude from the statistics that quitting smoking is actually dangerous for heavy smokers. However, Hammond later offered an explanation for this finding: "This is not surprising in light of the fact that recent ex-smokers, as a group, are heavily weighted with men in ill health" (Hammond, 1965). Heavy smokers who had stopped smoking therefore included many individuals who had quit because they were already diagnosed with an illness, adding systematic error to the sample set. Without a complete understanding of these facts, the statistics alone could be misinterpreted.

The most effective use of statistics, then, is to identify trends and features within a dataset. These trends can then be interpreted by the researcher in light of his or her understanding of their scientific basis, possibly opening up opportunities for further study. Andrew Lang, a Scottish poet and novelist, famously summed up this aspect of statistical testing when he stated, "An unsophisticated forecaster uses statistics as a drunken man uses lamp-posts – for support rather than for illumination."

Another misconception of statistical testing is that statistical relationships and associations prove causation. In reality, identification of a correlation or association between variables does not mean that a change in one variable actually caused the change in another variable. For example, in 1950 Richard Doll and Austin Hill, British researchers who became known for conducting one of the first scientifically valid comparative studies (see our Comparison in Research module) of smoking and the development of lung cancer, famously wrote about the correlation they uncovered:

This is not necessarily to state that smoking causes carcinoma of the lung. The association would occur if carcinoma of the lung caused people to smoke or if both attributes were end-effects of a common cause. (Doll & Hill, 1950)

Doll and Hill went on to discuss the scientific basis of the correlation and the fact that the habit of smoking preceded the development of lung cancer in all of their study subjects, leading them to conclude "...that smoking is a factor, and an important factor, in the production of carcinoma of the lung." As multiple lines of scientific evidence have accumulated regarding the association between smoking and lung cancer, scientists are now able to make very accurate statements about the statistical probability of risk associated with smoking cigarettes.
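The "common cause" possibility that Doll and Hill raised can be illustrated with a small simulation. In the sketch below, two variables never influence each other, yet they are strongly correlated because both depend on a shared third variable. All values are simulated, and the names `common`, `a`, `b`, and `pearson_r` are assumptions made for this example.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


rng = random.Random(42)

# A shared underlying factor drives both measured variables.
common = [rng.gauss(0, 1) for _ in range(5000)]
a = [c + rng.gauss(0, 0.5) for c in common]  # a depends on common, not on b
b = [c + rng.gauss(0, 0.5) for c in common]  # b depends on common, not on a

r_raw = pearson_r(a, b)

# Subtracting the shared factor leaves only the independent noise.
r_adjusted = pearson_r([x - c for x, c in zip(a, common)],
                       [y - c for y, c in zip(b, common)])

print(f"Correlation between a and b: {r_raw:.2f}")
print(f"Correlation after removing the common cause: {r_adjusted:.2f}")
```

The strong raw correlation disappears once the shared factor is accounted for, which is why a correlation alone cannot establish that one variable causes the other.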

While statistics help uncover patterns, relationships, and variability in data, they can unfortunately be used to misrepresent data, relationships, and interpretations. For example, in the late 1950s, in light of the mounting comparative studies that demonstrated a causative relationship between cigarette smoking and lung cancer, the major tobacco companies began to investigate the viability of marketing alternative products that they could promote as "healthier" than regular cigarettes. As a result, filtered and light cigarettes were developed. The tobacco industry then sponsored and widely advertised research that suggested that the common cellulose acetate filter reduced tar in regular cigarettes by 42-46% and nicotine by 19-35%. Marlboro® filtered cigarettes claimed to have "22 percent less tar, 34 percent less nicotine" than other brands. The tobacco industry launched a similar advertising campaign promoting low tar cigarettes (6 to 12 mg tar compared to 12 to 16 mg in "regular" cigarettes) and ultra low tar cigarettes (under 6 mg) (Glantz et al., 1996).

While the industry flooded the public with statistics on tar content, the tobacco companies did not advertise the fact that there was no research to indicate that tar or nicotine were the causative agents in the development of smoking-induced lung cancer. In fact, several research studies showed that the risks associated with low tar products were no different than regular products, and worse still, some studies showed that "low tar" cigarettes led to increased consumption of cigarettes among smokers (Stepney, 1980; NCI, 2001). Thus hollow statistics were used to mislead the public and detract from the real issue.

## Statistics and scientific research

All measurements contain some uncertainty and error, and statistical methods help us quantify and characterize this uncertainty. This helps explain why scientists often speak in qualified statements. For example, no seismologist who studies earthquakes would be willing to tell you exactly when an earthquake is going to occur; instead, the US Geological Survey issues statements like this: "There is ... a 62% probability of at least one magnitude 6.7 or greater earthquake in the 3-decade interval 2003-2032 within the San Francisco Bay Region" (USGS, 2007). This may sound ambiguous, but it is in fact a very precise, mathematically-derived description of how confident seismologists are that a major earthquake will occur, and open reporting of error and uncertainty is a hallmark of quality scientific research.
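To get a feel for what a 30-year probability implies on a shorter timescale, the sketch below converts the quoted 62% figure into an equivalent yearly probability. It rests on a strong simplifying assumption that is not part of the USGS statement: that each year carries the same probability, independent of every other year. The function name `annual_probability` is an assumption of this example, and the result is an illustration of the arithmetic, not a USGS forecast.

```python
def annual_probability(p_interval, years):
    """Constant yearly probability that reproduces p_interval over `years` years,
    assuming every year is independent and equally likely (a simplification)."""
    return 1 - (1 - p_interval) ** (1 / years)


p_30yr = 0.62  # probability quoted in the USGS statement above
p_year = annual_probability(p_30yr, 30)
print(f"Implied annual probability: {p_year:.3f}")

# Check: compounding the yearly value over 30 years recovers the original figure.
print(f"Recovered 30-year probability: {1 - (1 - p_year) ** 30:.2f}")
```

Under this simplification the yearly probability comes out at roughly 3%, which shows how a modest annual chance accumulates into a substantial probability over three decades.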

Today, science and statistical analyses have become so intertwined that many scientific disciplines have developed their own subsets of statistical techniques and terminology. For example, the field of biostatistics (sometimes referred to as biometry) involves the application of specific statistical techniques to disciplines in biology such as population genetics, epidemiology, and public health. The field of geostatistics has evolved to develop specialized spatial analysis techniques that help geologists map the location of petroleum and mineral deposits; these spatial analysis techniques have also helped Starbucks® determine the ideal distribution of coffee shops based on maximizing the number of customers visiting each store. Used correctly, statistical analysis goes well beyond finding the next oil field or cup of coffee to illuminating scientific data in a way that helps validate scientific knowledge.

### Summary

Scientific research rarely leads to absolute certainty. There is some degree of uncertainty in all conclusions, and statistics allow us to discuss that uncertainty. Statistical methods are used in all areas of science. The module explores the difference between (a) proving that something is true and (b) measuring the probability of getting a certain result. It explains how common words like "significant," "control," and "random" have a different meaning in the field of statistics than in everyday life.

### Key Concepts

Statistics are used to describe the variability inherent in data in a quantitative fashion, and to quantify relationships between variables.

Statistical analysis is used in designing scientific studies to increase consistency, measure uncertainty, and produce robust datasets.

There are a number of misconceptions that surround statistics, including confusion between statistical terms and the common language use of similar terms, and the role that statistics play in data analysis.