Chapter 13: Inferential Statistics

Understanding Null Hypothesis Testing

Learning Objectives

  • Explain the purpose of null hypothesis testing, including the role of sampling error.
  • Describe the basic logic of null hypothesis testing.
  • Describe the role of relationship strength and sample size in determining statistical significance and make reasonable judgments about statistical significance based on these two factors.

The Purpose of Null Hypothesis Testing

As we have seen, psychological research typically involves measuring one or more variables for a sample and computing descriptive statistics for that sample. In general, however, the researcher’s goal is not to draw conclusions about that sample but to draw conclusions about the population that the sample was selected from. Thus researchers must use sample statistics to draw conclusions about the corresponding values in the population. These corresponding values in the population are called  parameters . Imagine, for example, that a researcher measures the number of depressive symptoms exhibited by each of 50 clinically depressed adults and computes the mean number of symptoms. The researcher probably wants to use this sample statistic (the mean number of symptoms for the sample) to draw conclusions about the corresponding population parameter (the mean number of symptoms for clinically depressed adults).

Unfortunately, sample statistics are not perfect estimates of their corresponding population parameters. This is because there is a certain amount of random variability in any statistic from sample to sample. The mean number of depressive symptoms might be 8.73 in one sample of clinically depressed adults, 6.45 in a second sample, and 9.44 in a third—even though these samples are selected randomly from the same population. Similarly, the correlation (Pearson’s  r ) between two variables might be +.24 in one sample, −.04 in a second sample, and +.15 in a third—again, even though these samples are selected randomly from the same population. This random variability in a statistic from sample to sample is called  sampling error . (Note that the term error  here refers to random variability and does not imply that anyone has made a mistake. No one “commits a sampling error.”)
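To see sampling error concretely, the following minimal Python sketch draws three random samples of 50 from a single made-up population and prints the mean of each. The population values, the seeds, and the function name sample_means are assumptions chosen purely for illustration; they do not come from any real study.

```python
import random
import statistics

def sample_means(population, sample_size, n_samples, seed=0):
    """Draw repeated random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]

# Hypothetical population: symptom counts for 10,000 adults (made-up values).
pop_rng = random.Random(42)
population = [pop_rng.randint(0, 16) for _ in range(10_000)]

means = sample_means(population, sample_size=50, n_samples=3)
print("Population mean:", round(statistics.mean(population), 2))
print("Sample means:", [round(m, 2) for m in means])
```

Every sample comes from the same population, so any differences among the printed sample means reflect sampling error alone.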

One implication of this is that when there is a statistical relationship in a sample, it is not always clear that there is a statistical relationship in the population. A small difference between two group means in a sample might indicate that there is a small difference between the two group means in the population. But it could also be that there is no difference between the means in the population and that the difference in the sample is just a matter of sampling error. Similarly, a Pearson’s  r  value of −.29 in a sample might mean that there is a negative relationship in the population. But it could also be that there is no relationship in the population and that the relationship in the sample is just a matter of sampling error.

In fact, any statistical relationship in a sample can be interpreted in two ways:

  • There is a relationship in the population, and the relationship in the sample reflects this.
  • There is no relationship in the population, and the relationship in the sample reflects only sampling error.

The purpose of null hypothesis testing is simply to help researchers decide between these two interpretations.

The Logic of Null Hypothesis Testing

Null hypothesis testing  is a formal approach to deciding between two interpretations of a statistical relationship in a sample. One interpretation is called the   null hypothesis  (often symbolized  H 0  and read as “H-naught”). This is the idea that there is no relationship in the population and that the relationship in the sample reflects only sampling error. Informally, the null hypothesis is that the sample relationship “occurred by chance.” The other interpretation is called the  alternative hypothesis  (often symbolized as  H 1 ). This is the idea that there is a relationship in the population and that the relationship in the sample reflects this relationship in the population.

Again, every statistical relationship in a sample can be interpreted in either of these two ways: It might have occurred by chance, or it might reflect a relationship in the population. So researchers need a way to decide between them. Although there are many specific null hypothesis testing techniques, they are all based on the same general logic. The steps are as follows:

  • Assume for the moment that the null hypothesis is true. There is no relationship between the variables in the population.
  • Determine how likely the sample relationship would be if the null hypothesis were true.
  • If the sample relationship would be extremely unlikely, then reject the null hypothesis  in favour of the alternative hypothesis. If it would not be extremely unlikely, then  retain the null hypothesis .
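The three steps above can be sketched in code. The example below uses a permutation test, which is only one of many null hypothesis testing techniques and is chosen here because it follows the steps literally. The scores, the function name permutation_test, and its parameters are hypothetical.

```python
import random
import statistics

def permutation_test(group_a, group_b, n_shuffles=5000, alpha=0.05, seed=0):
    """Return (p_value, decision) for a difference between two group means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    # Step 1: assume the null hypothesis is true, so group labels are arbitrary.
    # Step 2: count how often shuffled labels give a difference at least as large.
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    p_value = at_least_as_extreme / n_shuffles

    # Step 3: reject the null hypothesis only if the result would be very unlikely.
    decision = "reject" if p_value < alpha else "retain"
    return p_value, decision

# Made-up scores for two small groups.
group_a = [12, 15, 14, 16, 13, 17, 15, 14]
group_b = [9, 11, 10, 12, 8, 10, 11, 9]
p_value, decision = permutation_test(group_a, group_b)
print(p_value, decision)
```

Because the two made-up groups barely overlap, almost no shuffle produces a difference as large as the observed one, and the sketch rejects the null hypothesis.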

Following this logic, we can begin to understand why Mehl and his colleagues concluded that there is no difference in talkativeness between women and men in the population. In essence, they asked the following question: “If there were no difference in the population, how likely is it that we would find a small difference of  d  = 0.06 in our sample?” Their answer to this question was that this sample relationship would be fairly likely if the null hypothesis were true. Therefore, they retained the null hypothesis—concluding that there is no evidence of a sex difference in the population. We can also see why Kanner and his colleagues concluded that there is a correlation between hassles and symptoms in the population. They asked, “If the null hypothesis were true, how likely is it that we would find a strong correlation of +.60 in our sample?” Their answer to this question was that this sample relationship would be fairly unlikely if the null hypothesis were true. Therefore, they rejected the null hypothesis in favour of the alternative hypothesis—concluding that there is a positive correlation between these variables in the population.

A crucial step in null hypothesis testing is finding the likelihood of the sample result if the null hypothesis were true. This probability is called the p value. A low p value means that the sample result would be unlikely if the null hypothesis were true and leads to the rejection of the null hypothesis. A high p value means that the sample result would be likely if the null hypothesis were true and leads to the retention of the null hypothesis. But how low must the p value be before the sample result is considered unlikely enough to reject the null hypothesis? In null hypothesis testing, this criterion is called α (alpha) and is almost always set to .05. If there is less than a 5% chance of a result as extreme as the sample result if the null hypothesis were true, then the null hypothesis is rejected. When this happens, the result is said to be statistically significant. If there is greater than a 5% chance of a result as extreme as the sample result when the null hypothesis is true, then the null hypothesis is retained. This does not necessarily mean that the researcher accepts the null hypothesis as true, only that there is not currently enough evidence to reject it. Researchers often use the expression “fail to reject the null hypothesis” rather than “retain the null hypothesis,” but they never use the expression “accept the null hypothesis.”

The Misunderstood  p  Value

The  p  value is one of the most misunderstood quantities in psychological research (Cohen, 1994) [1] . Even professional researchers misinterpret it, and it is not unusual for such misinterpretations to appear in statistics textbooks!

The most common misinterpretation is that the  p  value is the probability that the null hypothesis is true—that the sample result occurred by chance. For example, a misguided researcher might say that because the  p  value is .02, there is only a 2% chance that the result is due to chance and a 98% chance that it reflects a real relationship in the population. But this is incorrect . The  p  value is really the probability of a result at least as extreme as the sample result  if  the null hypothesis  were  true. So a  p  value of .02 means that if the null hypothesis were true, a sample result this extreme would occur only 2% of the time.

You can avoid this misunderstanding by remembering that the  p  value is not the probability that any particular  hypothesis  is true or false. Instead, it is the probability of obtaining the  sample result  if the null hypothesis were true.
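A small worked example may help fix the definition. Suppose, hypothetically, that a coin lands heads 16 times in 20 flips and the null hypothesis is that the coin is fair. The sketch below computes the exact probability of a result at least this far from the expected 10 heads, assuming the null hypothesis is true. The coin scenario and the function name binomial_p_value are illustrative assumptions, and measuring extremeness by distance from the expected count is one common convention that suits the symmetric case shown here.

```python
from math import comb

def binomial_p_value(heads, flips, p_null=0.5):
    """Two-sided exact p value: P(result at least as far from expected | H0)."""
    expected = flips * p_null
    observed_distance = abs(heads - expected)
    total = 0.0
    for k in range(flips + 1):
        # Add up the null-hypothesis probability of every outcome that is
        # at least as extreme as the one actually observed.
        if abs(k - expected) >= observed_distance:
            total += comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
    return total

p = binomial_p_value(heads=16, flips=20)
print(round(p, 4))  # 0.0118
```

The result says that a fair coin would produce an outcome this lopsided only about 1% of the time. It does not say that there is a 1% chance the coin is fair.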

Role of Sample Size and Relationship Strength

Recall that null hypothesis testing involves answering the question, “If the null hypothesis were true, what is the probability of a sample result as extreme as this one?” In other words, “What is the  p  value?” It can be helpful to see that the answer to this question depends on just two considerations: the strength of the relationship and the size of the sample. Specifically, the stronger the sample relationship and the larger the sample, the less likely the result would be if the null hypothesis were true. That is, the lower the  p  value. This should make sense. Imagine a study in which a sample of 500 women is compared with a sample of 500 men in terms of some psychological characteristic, and Cohen’s  d  is a strong 0.50. If there were really no sex difference in the population, then a result this strong based on such a large sample should seem highly unlikely. Now imagine a similar study in which a sample of three women is compared with a sample of three men, and Cohen’s  d  is a weak 0.10. If there were no sex difference in the population, then a relationship this weak based on such a small sample should seem likely. And this is precisely why the null hypothesis would be rejected in the first example and retained in the second.
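The two scenarios in this paragraph can be checked with a rough simulation. The sketch below repeatedly draws two groups from the same normally distributed population (so the null hypothesis is true by construction) and counts how often Cohen's d comes out at least as large as the observed value. The normal population, the number of simulations, and the function names are assumptions for illustration, not a standard procedure from this chapter.

```python
import random

def cohens_d(group_a, group_b):
    """Cohen's d: difference between means divided by the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = sum(group_a) / n_a, sum(group_b) / n_b
    ss_a = sum((x - mean_a) ** 2 for x in group_a)
    ss_b = sum((x - mean_b) ** 2 for x in group_b)
    pooled_sd = ((ss_a + ss_b) / (n_a + n_b - 2)) ** 0.5
    return (mean_a - mean_b) / pooled_sd

def simulated_p_value(observed_d, n_per_group, n_sims=1000, seed=0):
    """Share of null-hypothesis samples whose |d| is at least |observed_d|."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Under the null hypothesis both groups come from the same population.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if abs(cohens_d(group_a, group_b)) >= abs(observed_d):
            hits += 1
    return hits / n_sims

p_large_strong = simulated_p_value(observed_d=0.50, n_per_group=500)
p_small_weak = simulated_p_value(observed_d=0.10, n_per_group=3)
print(p_large_strong, p_small_weak)
```

With 500 per group, a d of 0.50 should almost never arise by chance, so the simulated p value is far below .05. With three per group, a d of 0.10 arises in most simulated samples, so the p value is far above .05.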

Of course, sometimes the result can be weak and the sample large, or the result can be strong and the sample small. In these cases, the two considerations trade off against each other so that a weak result can be statistically significant if the sample is large enough and a strong relationship can be statistically significant even if the sample is small. Table 13.1 shows roughly how relationship strength and sample size combine to determine whether a sample result is statistically significant. The columns of the table represent the three levels of relationship strength: weak, medium, and strong. The rows represent four sample sizes that can be considered small, medium, large, and extra large in the context of psychological research. Thus each cell in the table represents a combination of relationship strength and sample size. If a cell contains the word Yes, then this combination would be statistically significant for both Cohen’s d and Pearson’s r. If it contains the word No, then it would not be statistically significant for either. There is one cell where the decision for d and r would be different and another where it might be different depending on some additional considerations, which are discussed in Section 13.2 “Some Basic Null Hypothesis Tests.”

Table 13.1 How Relationship Strength and Sample Size Combine to Determine Whether a Result Is Statistically Significant

Sample Size | Weak relationship | Medium-strength relationship | Strong relationship
Small (N = 20) | No | No | d = Maybe; r = Yes
Medium (N = 50) | No | Yes | Yes
Large (N = 100) | d = Yes; r = No | Yes | Yes
Extra large (N = 500) | Yes | Yes | Yes

Although Table 13.1 provides only a rough guideline, it shows very clearly that weak relationships based on medium or small samples are never statistically significant and that strong relationships based on medium or larger samples are always statistically significant. If you keep this lesson in mind, you will often know whether a result is statistically significant based on the descriptive statistics alone. It is extremely useful to be able to develop this kind of intuitive judgment. One reason is that it allows you to develop expectations about how your formal null hypothesis tests are going to come out, which in turn allows you to detect problems in your analyses. For example, if your sample relationship is strong and your sample is medium, then you would expect to reject the null hypothesis. If for some reason your formal null hypothesis test indicates otherwise, then you need to double-check your computations and interpretations. A second reason is that the ability to make this kind of intuitive judgment is an indication that you understand the basic logic of this approach in addition to being able to do the computations.

Statistical Significance Versus Practical Significance

Table 13.1 illustrates another extremely important point. A statistically significant result is not necessarily a strong one. Even a very weak result can be statistically significant if it is based on a large enough sample. This is closely related to Janet Shibley Hyde’s argument about sex differences (Hyde, 2007) [2] . The differences between women and men in mathematical problem solving and leadership ability are statistically significant. But the word  significant  can cause people to interpret these differences as strong and important—perhaps even important enough to influence the college courses they take or even who they vote for. As we have seen, however, these statistically significant differences are actually quite weak—perhaps even “trivial.”

This is why it is important to distinguish between the  statistical  significance of a result and the  practical  significance of that result.  Practical significance refers to the importance or usefulness of the result in some real-world context. Many sex differences are statistically significant—and may even be interesting for purely scientific reasons—but they are not practically significant. In clinical practice, this same concept is often referred to as “clinical significance.” For example, a study on a new treatment for social phobia might show that it produces a statistically significant positive effect. Yet this effect still might not be strong enough to justify the time, effort, and other costs of putting it into practice—especially if easier and cheaper treatments that work almost as well already exist. Although statistically significant, this result would be said to lack practical or clinical significance.
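The gap between statistical and practical significance can be illustrated numerically. The sketch below holds a hypothetical, very weak effect fixed at d = 0.05 and shows how the approximate p value shrinks as the sample grows. It relies on a large-sample normal approximation for two equal-sized groups, which is an assumption made here for simplicity and not a test described in this section. The function name is hypothetical.

```python
from statistics import NormalDist

def approx_two_sided_p(d, n_per_group):
    """Large-sample (normal) approximation to the p value for a given Cohen's d."""
    # For two equal groups, the test statistic is roughly d * sqrt(n / 2).
    z = d * (n_per_group / 2) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

tiny_effect = 0.05  # hypothetical, very weak relationship
for n in (100, 1_000, 10_000):
    print(n, round(approx_two_sided_p(tiny_effect, n), 4))
```

The effect is equally weak in every row. Only the sample size changes, yet the largest sample pushes the p value below .05.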

Key Takeaways

  • Null hypothesis testing is a formal approach to deciding whether a statistical relationship in a sample reflects a real relationship in the population or is just due to chance.
  • The logic of null hypothesis testing involves assuming that the null hypothesis is true, finding how likely the sample result would be if this assumption were correct, and then making a decision. If the sample result would be unlikely if the null hypothesis were true, then it is rejected in favour of the alternative hypothesis. If it would not be unlikely, then the null hypothesis is retained.
  • The probability of obtaining the sample result if the null hypothesis were true (the  p  value) is based on two considerations: relationship strength and sample size. Reasonable judgments about whether a sample relationship is statistically significant can often be made by quickly considering these two factors.
  • Statistical significance is not the same as relationship strength or importance. Even weak relationships can be statistically significant if the sample size is large enough. It is important to consider relationship strength and the practical significance of a result in addition to its statistical significance.

Exercises

  • Discussion: Imagine a study showing that people who eat more broccoli tend to be happier. Explain for someone who knows nothing about statistics why the researchers would conduct a null hypothesis test.
  • Practice: Use Table 13.1 to decide whether each of the following results is statistically significant.
      • The correlation between two variables is r = −.78 based on a sample size of 137.
      • The mean score on a psychological characteristic for women is 25 (SD = 5) and the mean score for men is 24 (SD = 5). There were 12 women and 10 men in this study.
      • In a memory experiment, the mean number of items recalled by the 40 participants in Condition A was 0.50 standard deviations greater than the mean number recalled by the 40 participants in Condition B.
      • In another memory experiment, the mean scores for participants in Condition A and Condition B came out exactly the same!
      • A student finds a correlation of r = .04 between the number of units the students in his research methods class are taking and the students’ level of stress.

Long Descriptions

“Null Hypothesis” long description: A comic depicting a man and a woman talking in the foreground. In the background is a child working at a desk. The man says to the woman, “I can’t believe schools are still teaching kids about the null hypothesis. I remember reading a big study that conclusively disproved it years ago.”

“Conditional Risk” long description: A comic depicting two hikers beside a tree during a thunderstorm. A bolt of lightning goes “crack” in the dark sky as thunder booms. One of the hikers says, “Whoa! We should get inside!” The other hiker says, “It’s okay! Lightning only kills about 45 Americans a year, so the chances of dying are only one in 7,000,000. Let’s go on!” The comic’s caption says, “The annual death rate among people who know that statistic is one in six.”

Media Attributions

  • Null Hypothesis by XKCD  CC BY-NC (Attribution NonCommercial)
  • Conditional Risk by XKCD  CC BY-NC (Attribution NonCommercial)

References

  1. Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
  2. Hyde, J. S. (2007). New directions in the study of gender similarities and differences. Current Directions in Psychological Science, 16, 259–263.

parameters: Values in a population that correspond to variables measured in a study.

sampling error: The random variability in a statistic from sample to sample.

null hypothesis testing: A formal approach to deciding between two interpretations of a statistical relationship in a sample.

null hypothesis: The idea that there is no relationship in the population and that the relationship in the sample reflects only sampling error.

alternative hypothesis: The idea that there is a relationship in the population and that the relationship in the sample reflects this relationship in the population.

reject the null hypothesis: When the relationship found in the sample would be extremely unlikely, the idea that the relationship occurred “by chance” is rejected.

retain the null hypothesis: When the relationship found in the sample is likely to have occurred by chance, the null hypothesis is not rejected.

p value: The probability that, if the null hypothesis were true, the result found in the sample would occur.

α (alpha): How low the p value must be before the sample result is considered unlikely in null hypothesis testing.

statistically significant: When there is less than a 5% chance of a result as extreme as the sample result occurring and the null hypothesis is rejected.

Research Methods in Psychology - 2nd Canadian Edition Copyright © 2015 by Paul C. Price, Rajiv Jhangiani, & I-Chant A. Chiang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.



What is The Null Hypothesis & When Do You Reject The Null Hypothesis

Julia Simkus

Editor at Simply Psychology

BA (Hons) Psychology, Princeton University

Julia Simkus is a graduate of Princeton University with a Bachelor of Arts in Psychology. She is currently studying for a Master's Degree in Counseling for Mental Health and Wellness in September 2023. Julia's research has been published in peer reviewed journals.


Saul Mcleod, PhD

Editor-in-Chief for Simply Psychology

BSc (Hons) Psychology, MRes, PhD, University of Manchester

Saul Mcleod, PhD., is a qualified psychology teacher with over 18 years of experience in further and higher education. He has been published in peer-reviewed journals, including the Journal of Clinical Psychology.

Olivia Guy-Evans, MSc

Associate Editor for Simply Psychology

BSc (Hons) Psychology, MSc Psychology of Education

Olivia Guy-Evans is a writer and associate editor for Simply Psychology. She has previously worked in healthcare and educational sectors.


A null hypothesis is a statistical concept suggesting no significant difference or relationship between measured variables. It’s the default assumption unless empirical evidence proves otherwise.

The null hypothesis states no relationship exists between the two variables being studied (i.e., one variable does not affect the other).

The null hypothesis is the statement that a researcher or an investigator wants to disprove.

Testing the null hypothesis can tell you whether your results are due to the effects of manipulating the independent variable or due to random chance.

How to Write a Null Hypothesis

Null hypotheses (H0) start as research questions that the investigator rephrases as statements indicating no effect or relationship between the independent and dependent variables.

It is a default position that your research aims to challenge or confirm.

For example, if studying the impact of exercise on weight loss, your null hypothesis might be:

There is no significant difference in weight loss between individuals who exercise daily and those who do not.

Examples of Null Hypotheses

Research Question | Null Hypothesis
Do teenagers use cell phones more than adults? | Teenagers and adults use cell phones the same amount.
Do tomato plants exhibit a higher rate of growth when planted in compost rather than in soil? | Tomato plants show no difference in growth rates when planted in compost rather than soil.
Does daily meditation decrease the incidence of depression? | Daily meditation does not decrease the incidence of depression.
Does daily exercise increase test performance? | There is no relationship between daily exercise time and test performance.
Does the new vaccine prevent infections? | The vaccine does not affect the infection rate.
Does flossing your teeth affect the number of cavities? | Flossing your teeth has no effect on the number of cavities.

When Do We Reject The Null Hypothesis? 

We reject the null hypothesis when the data provide strong enough evidence to conclude that it is likely incorrect. This often occurs when the p-value (the probability of observing data at least as extreme as the sample data, given that the null hypothesis is true) is below a predetermined significance level.

If the collected data are inconsistent with what the null hypothesis predicts, a researcher can conclude that the data provide sufficient evidence against the null hypothesis, and thus the null hypothesis is rejected.

Rejecting the null hypothesis means concluding that a relationship exists between a set of variables and that the effect is statistically significant (p < 0.05).

If the data collected from the random sample are not statistically significant, then the researchers fail to reject the null hypothesis and cannot conclude that there is a relationship between the variables.

You need to perform a statistical test on your data in order to evaluate how consistent it is with the null hypothesis. A p-value is one statistical measurement used to validate a hypothesis against observed data.

Calculating the p-value is a critical part of null-hypothesis significance testing because it quantifies how strongly the sample data contradicts the null hypothesis.

The level of statistical significance is often expressed as a p-value between 0 and 1. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.


Usually, a researcher uses a confidence level of 95% or 99% (a significance level of 0.05 or 0.01) as a general guideline to decide whether to reject or keep the null.

When your p-value is less than or equal to your significance level, you reject the null hypothesis.

In other words, smaller p-values are taken as stronger evidence against the null hypothesis. Conversely, when the p-value is greater than your significance level, you fail to reject the null hypothesis.

In this case, the sample data provide insufficient evidence to conclude that the effect exists in the population.
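The decision rule described here is simple enough to write down directly. The following is a minimal sketch; the function name decide and its return strings are hypothetical, and the default significance level of 0.05 is only the common convention mentioned above.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 when p <= alpha, else fail to reject."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p value must lie between 0 and 1")
    return "reject H0" if p_value <= alpha else "fail to reject H0"

for p in (0.003, 0.05, 0.20):
    print(p, decide(p))
```

Note that the function never returns "accept H0". A stricter significance level, such as decide(0.03, alpha=0.01), turns a rejection into a failure to reject.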

Because you can never know with complete certainty whether there is an effect in the population, your inferences about a population will sometimes be incorrect.

When you incorrectly reject the null hypothesis, it’s called a type I error. When you incorrectly fail to reject it, it’s called a type II error.
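Both kinds of error can be observed in a simulation, where the truth about the population is known in advance. The sketch below runs many imaginary studies with a two-sample z test that assumes a known population standard deviation of 1. That test, the effect size of 0.3, the group size of 30, and the function name are all assumptions made for illustration.

```python
import random
from statistics import NormalDist

def rejection_rate(true_difference, n_per_group, alpha=0.05, n_sims=2000, seed=0):
    """Fraction of simulated studies in which a two-sample z test rejects H0."""
    rng = random.Random(seed)
    normal = NormalDist()
    standard_error = (2 / n_per_group) ** 0.5  # population SD assumed to be 1
    rejections = 0
    for _ in range(n_sims):
        mean_a = sum(rng.gauss(true_difference, 1) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        z = (mean_a - mean_b) / standard_error
        p_value = 2 * (1 - normal.cdf(abs(z)))
        if p_value <= alpha:
            rejections += 1
    return rejections / n_sims

# H0 is true (no difference): every rejection is a type I error.
type_1_rate = rejection_rate(true_difference=0.0, n_per_group=30)
# H0 is false (difference of 0.3 SD): every non-rejection is a type II error.
type_2_rate = 1 - rejection_rate(true_difference=0.3, n_per_group=30)
print(type_1_rate, type_2_rate)
```

When the null hypothesis is true, the type I error rate should land close to the chosen significance level. With a modest true effect and only 30 participants per group, type II errors are far more common.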

Why Do We Never Accept The Null Hypothesis?

The reason we do not say “accept the null” is because we are always assuming the null hypothesis is true and then conducting a study to see if there is evidence against it. And, even if we don’t find evidence against it, a null hypothesis is not accepted.

A lack of evidence only means that you haven’t proven that something exists. It does not prove that something doesn’t exist. 

It is risky to conclude that the null hypothesis is true merely because we did not find evidence to reject it. It is always possible that researchers elsewhere have disproved the null hypothesis, so we cannot accept it as true, but instead, we state that we failed to reject the null. 

One can either reject the null hypothesis, or fail to reject it, but can never accept it.

Why Do We Use The Null Hypothesis?

We can never prove with 100% certainty that a hypothesis is true; we can only collect evidence that supports a theory. However, testing a hypothesis can set the stage for rejecting or failing to reject this hypothesis within a certain confidence level.

The null hypothesis is useful because it can tell us whether the results of our study are due to random chance or the manipulation of a variable (with a certain level of confidence).

A null hypothesis is rejected if the observed data would be very unlikely to occur were the null hypothesis true, and it is not rejected if the observed outcome is consistent with the position held by the null hypothesis.

Rejecting the null hypothesis sets the stage for further experimentation to see if a relationship between two variables exists. 

Hypothesis testing is a critical part of the scientific method as it helps decide whether the results of a research study support a particular theory about a given population. Hypothesis testing is a systematic way of backing up researchers’ predictions with statistical analysis.

It helps provide sufficient statistical evidence that either favors or rejects a certain hypothesis about the population parameter. 

Purpose of a Null Hypothesis 

  • The primary purpose of the null hypothesis is to disprove an assumption. 
  • Whether rejected or not, the null hypothesis can help further progress a theory in many scientific cases.
  • A null hypothesis can be used to ascertain how consistent the outcomes of multiple studies are.

Do you always need both a Null Hypothesis and an Alternative Hypothesis?

The null (H0) and alternative (Ha or H1) hypotheses are two competing claims that describe the effect of the independent variable on the dependent variable. They are mutually exclusive, which means that only one of the two hypotheses can be true. 

While the null hypothesis states that there is no effect in the population, an alternative hypothesis states that there is an effect or relationship between two variables.

The goal of hypothesis testing is to make inferences about a population based on a sample. In order to undertake hypothesis testing, you must express your research hypothesis as a null and alternative hypothesis. Both hypotheses are required to cover every possible outcome of the study. 

What is the difference between a null hypothesis and an alternative hypothesis?

The alternative hypothesis is the complement to the null hypothesis. The null hypothesis states that there is no effect or no relationship between variables, while the alternative hypothesis claims that there is an effect or relationship in the population.

It is the claim that you expect or hope will be true. The null hypothesis and the alternative hypothesis are always mutually exclusive, meaning that only one can be true at a time.

What are some problems with the null hypothesis?

One major problem with the null hypothesis is that researchers typically will assume that failing to reject the null is a failure of the experiment. However, either outcome of a hypothesis test is informative. Even if the null is not refuted, the researchers will still learn something new.

Why can a null hypothesis not be accepted?

We can either reject or fail to reject a null hypothesis, but never accept it. If your test fails to detect an effect, this is not proof that the effect doesn’t exist. It just means that your sample did not have enough evidence to conclude that it exists.

We can’t accept a null hypothesis because a lack of evidence for an effect does not prove that the effect does not exist. Instead, we fail to reject it.

Failing to reject the null indicates that the sample did not provide sufficient evidence to conclude that an effect exists.

If the p-value is greater than the significance level, then you fail to reject the null hypothesis.

Is a null hypothesis directional or non-directional?

A hypothesis test can contain either a directional or a non-directional alternative hypothesis. A directional hypothesis is one that contains the less than (“<”) or greater than (“>”) sign.

A nondirectional hypothesis contains the not equal sign (“≠”).  However, a null hypothesis is neither directional nor non-directional.

A null hypothesis is a prediction that there will be no change, relationship, or difference between two variables.

The directional hypothesis or nondirectional hypothesis would then be considered alternative hypotheses to the null hypothesis.
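The choice between a directional and a non-directional alternative changes how the p-value is computed from the same test statistic. The sketch below assumes a test statistic that follows a standard normal distribution under the null hypothesis (a z statistic), and the value 1.80 is made up for illustration. The function name and dictionary keys are hypothetical.

```python
from statistics import NormalDist

def p_values_from_z(z):
    """Return p values for a z statistic under three alternative hypotheses."""
    cdf = NormalDist().cdf
    return {
        "greater": 1 - cdf(z),               # directional, ">"
        "less": cdf(z),                      # directional, "<"
        "two_sided": 2 * (1 - cdf(abs(z))),  # non-directional, "≠"
    }

results = p_values_from_z(1.80)
for name, p in results.items():
    print(name, round(p, 4))
```

For this made-up statistic, the directional ">" alternative gives a p-value below 0.05 while the non-directional alternative does not, which is why the alternative hypothesis should be chosen before the data are examined.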

Gill, J. (1999). The insignificance of null hypothesis significance testing.  Political research quarterly ,  52 (3), 647-674.

Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method.  American Psychologist ,  56 (1), 16.

Masson, M. E. (2011). A tutorial on a practical Bayesian alternative to null-hypothesis significance testing.  Behavior research methods ,  43 , 679-690.

Nickerson, R. S. (2000). Null hypothesis significance testing: a review of an old and continuing controversy.  Psychological methods ,  5 (2), 241.

Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test.  Psychological bulletin ,  57 (5), 416.


Null Hypothesis Examples

ThoughtCo / Hilary Allison

  • Ph.D., Biomedical Sciences, University of Tennessee at Knoxville
  • B.A., Physics and Mathematics, Hastings College

In statistical analysis, the null hypothesis assumes there is no meaningful relationship between two variables. Testing the null hypothesis can tell you whether your results are due to the effect of manipulating an independent variable or due to chance. It's often used in conjunction with an alternative hypothesis, which assumes there is, in fact, a relationship between two variables.

The null hypothesis is among the easiest hypotheses to test using statistical analysis, making it perhaps the most valuable hypothesis for the scientific method. By evaluating a null hypothesis in addition to another hypothesis, researchers can support their conclusions with a higher level of confidence. Below are examples of how you might formulate a null hypothesis to fit certain questions.

What Is the Null Hypothesis?

The null hypothesis states there is no relationship between the measured phenomenon (the dependent variable) and the independent variable, which is the variable an experimenter typically controls or changes. You do not need to believe that the null hypothesis is true to test it. On the contrary, you will likely suspect there is a relationship between a set of variables. One way to support that suspicion is to reject the null hypothesis. Rejecting a hypothesis does not mean an experiment was "bad" or that it didn't produce results. In fact, it is often one of the first steps toward further inquiry.

To distinguish it from other hypotheses, the null hypothesis is written as H0 (which is read as “H-nought,” "H-null," or "H-zero"). A significance test is used to determine how likely results at least as extreme as those observed would be if the null hypothesis were true. A confidence level of 95% or 99% (a significance level of 0.05 or 0.01) is common. Keep in mind, even if the confidence level is high, there is still a small chance of reaching the wrong conclusion about the null hypothesis, perhaps because the experimenter did not account for a critical factor or because of chance. This is one reason why it's important to repeat experiments.

Examples of the Null Hypothesis

To write a null hypothesis, first start by asking a question. Rephrase that question in a form that assumes no relationship between the variables. In other words, assume a treatment has no effect. Write your hypothesis in a way that reflects this.

| Question | Null Hypothesis |
|---|---|
| Are teens better at math than adults? | Age has no effect on mathematical ability. |
| Does taking aspirin every day reduce the chance of having a heart attack? | Taking aspirin daily does not affect heart attack risk. |
| Do teens use cell phones to access the internet more than adults? | Age has no effect on how cell phones are used for internet access. |
| Do cats care about the color of their food? | Cats express no food preference based on color. |
| Does chewing willow bark relieve pain? | There is no difference in pain relief after chewing willow bark versus taking a placebo. |

Other Types of Hypotheses

In addition to the null hypothesis, the alternative hypothesis is also a staple in traditional significance tests. It's essentially the opposite of the null hypothesis because it assumes the claim in question is true. For the first item in the table above, for example, an alternative hypothesis might be "Age does have an effect on mathematical ability."

Key Takeaways

  • In hypothesis testing, the null hypothesis assumes no relationship between two variables, providing a baseline for statistical analysis.
  • Rejecting the null hypothesis suggests there is evidence of a relationship between variables.
  • By formulating a null hypothesis, researchers can systematically test assumptions and draw more reliable conclusions from their experiments.

SciSpace Resources

Importance of Null Hypothesis in Research

Sumalatha G


The null hypothesis is a fundamental concept in statistical analysis and research methodology. It forms the basis of many statistical tests and is a critical component in the process of scientific discovery. But what is the meaning of the null hypothesis, and why is it so important in research? Let's delve into this topic to gain a comprehensive understanding.

What is Null Hypothesis?

The null hypothesis, often denoted as H0, is a statement in statistical inference that no effect or relationship exists in the population from which the observed data were drawn. In other words, it assumes that any difference or relationship you see in a set of data is due to chance.

The null hypothesis is the initial claim that researchers set out to test. It's a starting point that allows us to test specific relationships between variables in a study. The null hypothesis is not necessarily a claim that researchers believe is true, but rather, a claim that is assumed to be true for the purpose of testing statistical significance.

Formulating the Null Hypothesis

When formulating a null hypothesis, it's important to remember that it should be a clear, concise, and testable statement. It should also make a claim about the population parameters for the variables under study, not about the sample statistics.

For example, if a researcher wants to test whether a new drug has an effect on a disease, the null hypothesis might be "The new drug has no effect on the disease." This is a claim that can be tested by collecting and analyzing data.

The Importance of the Null Hypothesis in Research

The null hypothesis plays a crucial role in statistical hypothesis testing, a standard procedure in scientific research. It provides a benchmark against which the alternative hypothesis is tested and helps control for the effects of random variation. Here are several prominent ways in which the null hypothesis matters in research.

Foundation of Research Design

In crafting a robust research design, formulating a clear null hypothesis is paramount. It defines the scope of the study, delineates variables, and establishes the groundwork for subsequent analyses.

Statistical Testing Reliance

Null hypothesis testing, a common statistical method, relies on comparing observed data to what would be expected under the assumption of no effect. This statistical scrutiny is integral to drawing valid conclusions.

Clarifying Research Objectives

The null hypothesis sharpens the focus of the research objectives. Defining the absence of an anticipated effect compels researchers to construct precise and testable hypotheses aligned with their inquiries.

Validates the effects of the study

The null hypothesis is important because it provides a framework for evaluating whether something has an effect. By assuming that the null hypothesis is true, researchers can test their alternative hypothesis against it. If the data collected provide enough evidence to reject the null hypothesis, this suggests that the alternative hypothesis may be true.

Role in Statistical Significance

The null hypothesis is central to the concept of statistical significance. If the data collected in a study can be considered unlikely under the null hypothesis, then the null hypothesis is rejected, and the result is deemed statistically significant.

Statistical significance, however, does not necessarily imply practical significance. A result can be statistically significant and still not be of practical importance, depending on the context and the specific research question.

Testing the Null Hypothesis

Testing the null hypothesis involves collecting data and calculating a test statistic. The test statistic is then compared to a critical value, which is determined based on the significance level, the type of test being conducted, and the degrees of freedom.

If the test statistic is more extreme than the critical value, the null hypothesis is rejected. If not, there is not enough evidence to reject the null hypothesis. This does not prove that the null hypothesis is true, but rather, that there is not enough evidence to suggest that it is false.
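The comparison of a test statistic with a critical value can be sketched in a few lines of Python. This is a minimal illustration only, assuming a two-tailed one-sample z test with a known population standard deviation (a z test has no degrees of freedom; those apply to tests such as the t test). The scores, the population mean of 100, and the standard deviation of 15 are hypothetical values chosen for the example.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_test_decision(sample, pop_mean, pop_sd, alpha=0.05):
    """Two-tailed one-sample z test: compare |z| with the critical value."""
    n = len(sample)
    # Test statistic: distance of the sample mean from the null value,
    # measured in standard errors.
    z = (mean(sample) - pop_mean) / (pop_sd / sqrt(n))
    # Critical value determined by the significance level (two tails).
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return z, critical, abs(z) > critical

# Hypothetical data; H0: the population mean is 100 (SD assumed known, 15).
scores = [112, 105, 98, 120, 109, 115, 101, 108, 117, 111]
z, critical, reject = z_test_decision(scores, pop_mean=100, pop_sd=15)
print(f"z = {z:.2f}, critical value = {critical:.2f}, reject H0: {reject}")
```

With these made-up scores the test statistic is slightly more extreme than the critical value, so the null hypothesis would be rejected at the .05 level.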

Types of Errors in Hypothesis Testing

When testing the null hypothesis, there are two types of errors that can occur: Type I and Type II errors. A Type I error occurs when the null hypothesis is true, but is rejected. A Type II error occurs when the null hypothesis is false, but is not rejected.

Researchers must consider the potential for these errors when designing their studies and interpreting their results. The risk of these errors can be controlled to some extent by choosing an appropriate significance level and by increasing the sample size.
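A small simulation can make the link between the significance level and the Type I error rate concrete. The sketch below assumes a hypothetical setup in which the null hypothesis is true by construction (every sample is drawn from a population with mean 0 and standard deviation 1) and counts how often a two-tailed z test rejects it anyway. The function name and its defaults are illustrative choices, not part of any standard procedure.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def type_i_error_rate(n=30, alpha=0.05, n_studies=2000, seed=1):
    """Fraction of simulated studies that reject a null hypothesis that is true."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_studies):
        # H0 is true here: the population mean really is 0 (SD = 1).
        sample = [rng.gauss(0, 1) for _ in range(n)]
        z = mean(sample) / (1 / sqrt(n))
        if abs(z) > critical:
            rejections += 1  # a Type I error
    return rejections / n_studies

rate = type_i_error_rate()
print(f"Proportion of false rejections: {rate:.3f}")
```

The proportion of false rejections should land close to the chosen significance level, which is why lowering alpha reduces the risk of a Type I error.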

In conclusion, the null hypothesis is a fundamental concept in statistical analysis and research methodology. It provides a benchmark for testing statistical significance and helps control for the effects of random variation. While it is a simple concept, understanding the null hypothesis and its role in research is crucial for any researcher or statistician.

By formulating a clear and testable null hypothesis, researchers can design their studies in a way that allows them to make meaningful inferences about the relationships between variables. Whether the null hypothesis is ultimately rejected or not, it plays a crucial role in advancing scientific knowledge and understanding.


13.1 Understanding Null Hypothesis Testing

Learning objectives.

  • Explain the purpose of null hypothesis testing, including the role of sampling error.
  • Describe the basic logic of null hypothesis testing.
  • Describe the role of relationship strength and sample size in determining statistical significance and make reasonable judgments about statistical significance based on these two factors.

The Purpose of Null Hypothesis Testing

As we have seen, psychological research typically involves measuring one or more variables for a sample and computing descriptive statistics for that sample. In general, however, the researcher’s goal is not to draw conclusions about that sample but to draw conclusions about the population that the sample was selected from. Thus researchers must use sample statistics to draw conclusions about the corresponding values in the population. These corresponding values in the population are called parameters . Imagine, for example, that a researcher measures the number of depressive symptoms exhibited by each of 50 clinically depressed adults and computes the mean number of symptoms. The researcher probably wants to use this sample statistic (the mean number of symptoms for the sample) to draw conclusions about the corresponding population parameter (the mean number of symptoms for clinically depressed adults).

Unfortunately, sample statistics are not perfect estimates of their corresponding population parameters. This is because there is a certain amount of random variability in any statistic from sample to sample. The mean number of depressive symptoms might be 8.73 in one sample of clinically depressed adults, 6.45 in a second sample, and 9.44 in a third—even though these samples are selected randomly from the same population. Similarly, the correlation (Pearson’s r ) between two variables might be +.24 in one sample, −.04 in a second sample, and +.15 in a third—again, even though these samples are selected randomly from the same population. This random variability in a statistic from sample to sample is called sampling error . (Note that the term error here refers to random variability and does not imply that anyone has made a mistake. No one “commits a sampling error.”)
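Sampling error is easy to see in a short simulation. The sketch below assumes a hypothetical population of 10,000 symptom scores with a mean near 8 (these numbers are invented for illustration and are not the data behind the figures quoted above). It draws three random samples of 50 from that same population and prints their means.

```python
import random
from statistics import mean

rng = random.Random(42)

# Hypothetical population of symptom scores (mean about 8, SD about 3).
population = [rng.gauss(8, 3) for _ in range(10_000)]

# Three random samples of 50 drawn from the very same population.
sample_means = [mean(rng.sample(population, 50)) for _ in range(3)]

print("Population mean:", round(mean(population), 2))
print("Sample means:   ", [round(m, 2) for m in sample_means])
```

The three sample means differ from one another and from the population mean even though nothing about the population has changed. That variability is the sampling error.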

One implication of this is that when there is a statistical relationship in a sample, it is not always clear that there is a statistical relationship in the population. A small difference between two group means in a sample might indicate that there is a small difference between the two group means in the population. But it could also be that there is no difference between the means in the population and that the difference in the sample is just a matter of sampling error. Similarly, a Pearson’s r value of −.29 in a sample might mean that there is a negative relationship in the population. But it could also be that there is no relationship in the population and that the relationship in the sample is just a matter of sampling error.

In fact, any statistical relationship in a sample can be interpreted in two ways:

  • There is a relationship in the population, and the relationship in the sample reflects this.
  • There is no relationship in the population, and the relationship in the sample reflects only sampling error.

The purpose of null hypothesis testing is simply to help researchers decide between these two interpretations.

The Logic of Null Hypothesis Testing

Null hypothesis testing is a formal approach to deciding between two interpretations of a statistical relationship in a sample. One interpretation is called the null hypothesis (often symbolized H 0 and read as “H-naught”). This is the idea that there is no relationship in the population and that the relationship in the sample reflects only sampling error. Informally, the null hypothesis is that the sample relationship “occurred by chance.” The other interpretation is called the alternative hypothesis (often symbolized as H 1 ). This is the idea that there is a relationship in the population and that the relationship in the sample reflects this relationship in the population.

Again, every statistical relationship in a sample can be interpreted in either of these two ways: It might have occurred by chance, or it might reflect a relationship in the population. So researchers need a way to decide between them. Although there are many specific null hypothesis testing techniques, they are all based on the same general logic. The steps are as follows:

  • Assume for the moment that the null hypothesis is true. There is no relationship between the variables in the population.
  • Determine how likely the sample relationship would be if the null hypothesis were true.
  • If the sample relationship would be extremely unlikely, then reject the null hypothesis in favor of the alternative hypothesis. If it would not be extremely unlikely, then retain the null hypothesis .
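The three steps above can be sketched with a permutation test, which is one of many possible techniques and is used here only because it mirrors the logic directly. The two groups of scores are hypothetical. Shuffling the group labels simulates a world in which the null hypothesis is true, and the proportion of shuffles producing a difference at least as large as the observed one estimates how likely the sample result would be in that world.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Estimate how likely a mean difference this large is if H0 is true."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Step 1: assume H0 is true, so group labels are interchangeable.
        rng.shuffle(pooled)
        # Step 2: see what difference arises by chance alone.
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles

# Hypothetical scores for two groups.
group_a = [14, 17, 15, 19, 16, 18, 20, 17]
group_b = [11, 13, 12, 15, 10, 14, 12, 13]

p = permutation_p_value(group_a, group_b)
# Step 3: reject H0 if the sample result would be very unlikely under it.
decision = "reject H0" if p < 0.05 else "retain H0"
print(f"estimated p = {p:.4f} -> {decision}")
```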

Following this logic, we can begin to understand why Mehl and his colleagues concluded that there is no difference in talkativeness between women and men in the population. In essence, they asked the following question: “If there were no difference in the population, how likely is it that we would find a small difference of d = 0.06 in our sample?” Their answer to this question was that this sample relationship would be fairly likely if the null hypothesis were true. Therefore, they retained the null hypothesis—concluding that there is no evidence of a sex difference in the population. We can also see why Kanner and his colleagues concluded that there is a correlation between hassles and symptoms in the population. They asked, “If the null hypothesis were true, how likely is it that we would find a strong correlation of +.60 in our sample?” Their answer to this question was that this sample relationship would be fairly unlikely if the null hypothesis were true. Therefore, they rejected the null hypothesis in favor of the alternative hypothesis—concluding that there is a positive correlation between these variables in the population.

A crucial step in null hypothesis testing is finding the likelihood of the sample result if the null hypothesis were true. This probability is called the p value . A low p value means that the sample result would be unlikely if the null hypothesis were true and leads to the rejection of the null hypothesis. A high p value means that the sample result would be likely if the null hypothesis were true and leads to the retention of the null hypothesis. But how low must the p value be before the sample result is considered unlikely enough to reject the null hypothesis? In null hypothesis testing, this criterion is called α (alpha) and is almost always set to .05. If there is less than a 5% chance of a result as extreme as the sample result if the null hypothesis were true, then the null hypothesis is rejected. When this happens, the result is said to be statistically significant . If there is greater than a 5% chance of a result as extreme as the sample result when the null hypothesis is true, then the null hypothesis is retained. This does not necessarily mean that the researcher accepts the null hypothesis as true, only that there is not currently enough evidence to reject it. Researchers often use the expression “fail to reject the null hypothesis” rather than “retain the null hypothesis,” but they never use the expression “accept the null hypothesis.”

The Misunderstood p Value

The p value is one of the most misunderstood quantities in psychological research (Cohen, 1994). Even professional researchers misinterpret it, and it is not unusual for such misinterpretations to appear in statistics textbooks!

The most common misinterpretation is that the p value is the probability that the null hypothesis is true—that the sample result occurred by chance. For example, a misguided researcher might say that because the p value is .02, there is only a 2% chance that the result is due to chance and a 98% chance that it reflects a real relationship in the population. But this is incorrect . The p value is really the probability of a result at least as extreme as the sample result if the null hypothesis were true. So a p value of .02 means that if the null hypothesis were true, a sample result this extreme would occur only 2% of the time.

You can avoid this misunderstanding by remembering that the p value is not the probability that any particular hypothesis is true or false. Instead, it is the probability of obtaining a result at least as extreme as the sample result if the null hypothesis were true.
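A worked example may help fix the definition. The sketch below uses a hypothetical coin-flipping study (not an example from this chapter): the null hypothesis is that the coin is fair, and the sample result is 16 heads in 20 flips. The p value is computed exactly by adding up the probabilities, under a fair coin, of every outcome at least as far from 10 heads as the one observed.

```python
from math import comb

def two_sided_binomial_p(successes, n):
    """P(a count at least as far from n/2 as observed), assuming a fair coin."""
    distance = abs(successes - n / 2)
    # Every outcome at least as extreme as the observed one, in either direction.
    extreme = [k for k in range(n + 1) if abs(k - n / 2) >= distance]
    # Under H0 each of the 2**n flip sequences is equally likely.
    return sum(comb(n, k) for k in extreme) / 2 ** n

p = two_sided_binomial_p(16, 20)
print(f"p = {p:.4f}")  # prints p = 0.0118
```

Note what the number does and does not say. It says that a fair coin would produce a result this lopsided only about 1.2% of the time. It does not say that there is a 1.2% chance that the coin is fair.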

Role of Sample Size and Relationship Strength

Recall that null hypothesis testing involves answering the question, “If the null hypothesis were true, what is the probability of a sample result as extreme as this one?” In other words, “What is the p value?” It can be helpful to see that the answer to this question depends on just two considerations: the strength of the relationship and the size of the sample. Specifically, the stronger the sample relationship and the larger the sample, the less likely the result would be if the null hypothesis were true. That is, the lower the p value. This should make sense. Imagine a study in which a sample of 500 women is compared with a sample of 500 men in terms of some psychological characteristic, and Cohen’s d is a strong 0.50. If there were really no sex difference in the population, then a result this strong based on such a large sample should seem highly unlikely. Now imagine a similar study in which a sample of three women is compared with a sample of three men, and Cohen’s d is a weak 0.10. If there were no sex difference in the population, then a relationship this weak based on such a small sample should seem likely. And this is precisely why the null hypothesis would be rejected in the first example and retained in the second.
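The two examples in this paragraph can be checked with a rough calculation. The sketch below assumes two equal-sized groups and uses a normal approximation in which the test statistic is d multiplied by the square root of half the group size. This is a simplification (a t test would be the usual choice, especially for the tiny sample), so the p values are approximate, but the conclusions are the same.

```python
from math import sqrt
from statistics import NormalDist

def approx_p_from_d(d, n_per_group):
    """Approximate two-tailed p value for Cohen's d with equal group sizes."""
    # For two groups of size n: z = (M1 - M2) / (SD * sqrt(2 / n)) = d * sqrt(n / 2)
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

p_large_study = approx_p_from_d(0.50, 500)  # 500 women vs. 500 men
p_small_study = approx_p_from_d(0.10, 3)    # 3 women vs. 3 men

print(f"d = 0.50, n = 500 per group: p = {p_large_study:.2g}")
print(f"d = 0.10, n = 3 per group:   p = {p_small_study:.2f}")
```

The first p value is vanishingly small, so the null hypothesis would be rejected. The second is close to 1, so it would be retained.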

Of course, sometimes the result can be weak and the sample large, or the result can be strong and the sample small. In these cases, the two considerations trade off against each other so that a weak result can be statistically significant if the sample is large enough and a strong relationship can be statistically significant even if the sample is small. Table 13.1 “How Relationship Strength and Sample Size Combine to Determine Whether a Result Is Statistically Significant” shows roughly how relationship strength and sample size combine to determine whether a sample result is statistically significant. The columns of the table represent the three levels of relationship strength: weak, medium, and strong. The rows represent four sample sizes that can be considered small, medium, large, and extra large in the context of psychological research. Thus each cell in the table represents a combination of relationship strength and sample size. If a cell contains the word Yes , then this combination would be statistically significant for both Cohen’s d and Pearson’s r . If it contains the word No , then it would not be statistically significant for either. There is one cell where the decision for d and r would be different and another where it might be different depending on some additional considerations, which are discussed in Section 13.2 “Some Basic Null Hypothesis Tests.”

Table 13.1 How Relationship Strength and Sample Size Combine to Determine Whether a Result Is Statistically Significant

| Sample size \ Relationship strength | Weak | Medium | Strong |
|---|---|---|---|
| Small (N = 20) | No | No | d = Maybe; r = Yes |
| Medium (N = 50) | No | Yes | Yes |
| Large (N = 100) | d = Yes; r = No | Yes | Yes |
| Extra large (N = 500) | Yes | Yes | Yes |

Although Table 13.1 “How Relationship Strength and Sample Size Combine to Determine Whether a Result Is Statistically Significant” provides only a rough guideline, it shows very clearly that weak relationships based on medium or small samples are never statistically significant and that strong relationships based on medium or larger samples are always statistically significant. If you keep this in mind, you will often know whether a result is statistically significant based on the descriptive statistics alone. It is extremely useful to be able to develop this kind of intuitive judgment. One reason is that it allows you to develop expectations about how your formal null hypothesis tests are going to come out, which in turn allows you to detect problems in your analyses. For example, if your sample relationship is strong and your sample is medium, then you would expect to reject the null hypothesis. If for some reason your formal null hypothesis test indicates otherwise, then you need to double-check your computations and interpretations. A second reason is that the ability to make this kind of intuitive judgment is an indication that you understand the basic logic of this approach in addition to being able to do the computations.
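The kind of expectation described here can be checked numerically. The sketch below looks at the weak-relationship column of Table 13.1 for Pearson's r, assuming that "weak" means r = .10 (a common convention, but an assumption here) and using the Fisher z approximation rather than the exact test, so the p values are rough.

```python
from math import atanh, sqrt
from statistics import NormalDist

def approx_p_from_r(r, n):
    """Approximate two-tailed p value for Pearson's r (Fisher z approximation)."""
    # Under H0, atanh(r) is roughly normal with standard error 1 / sqrt(n - 3).
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Assumed weak correlation of r = .10 at the four sample sizes in the table.
for n in (20, 50, 100, 500):
    p = approx_p_from_r(0.10, n)
    verdict = "Yes" if p < 0.05 else "No"
    print(f"N = {n:3d}: p = {p:.3f} -> statistically significant? {verdict}")
```

Under these assumptions only the extra large sample yields a p value below .05, which agrees with the entries for r in the weak column of the table.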

Statistical Significance Versus Practical Significance

Table 13.1 “How Relationship Strength and Sample Size Combine to Determine Whether a Result Is Statistically Significant” illustrates another extremely important point. A statistically significant result is not necessarily a strong one. Even a very weak result can be statistically significant if it is based on a large enough sample. This is closely related to Janet Shibley Hyde’s argument about sex differences (Hyde, 2007). The differences between women and men in mathematical problem solving and leadership ability are statistically significant. But the word significant can cause people to interpret these differences as strong and important—perhaps even important enough to influence the college courses they take or even who they vote for. As we have seen, however, these statistically significant differences are actually quite weak—perhaps even “trivial.”

This is why it is important to distinguish between the statistical significance of a result and the practical significance of that result. Practical significance refers to the importance or usefulness of the result in some real-world context. Many sex differences are statistically significant—and may even be interesting for purely scientific reasons—but they are not practically significant. In clinical practice, this same concept is often referred to as “clinical significance.” For example, a study on a new treatment for social phobia might show that it produces a statistically significant positive effect. Yet this effect still might not be strong enough to justify the time, effort, and other costs of putting it into practice—especially if easier and cheaper treatments that work almost as well already exist. Although statistically significant, this result would be said to lack practical or clinical significance.

Key Takeaways

  • Null hypothesis testing is a formal approach to deciding whether a statistical relationship in a sample reflects a real relationship in the population or is just due to chance.
  • The logic of null hypothesis testing involves assuming that the null hypothesis is true, finding how likely the sample result would be if this assumption were correct, and then making a decision. If the sample result would be unlikely if the null hypothesis were true, then it is rejected in favor of the alternative hypothesis. If it would not be unlikely, then the null hypothesis is retained.
  • The probability of obtaining the sample result if the null hypothesis were true (the p value) is based on two considerations: relationship strength and sample size. Reasonable judgments about whether a sample relationship is statistically significant can often be made by quickly considering these two factors.
  • Statistical significance is not the same as relationship strength or importance. Even weak relationships can be statistically significant if the sample size is large enough. It is important to consider relationship strength and the practical significance of a result in addition to its statistical significance.

Discussion: Imagine a study showing that people who eat more broccoli tend to be happier. Explain for someone who knows nothing about statistics why the researchers would conduct a null hypothesis test.

Practice: Use Table 13.1 “How Relationship Strength and Sample Size Combine to Determine Whether a Result Is Statistically Significant” to decide whether each of the following results is statistically significant.

  • The correlation between two variables is r = −.78 based on a sample size of 137.
  • The mean score on a psychological characteristic for women is 25 ( SD = 5) and the mean score for men is 24 ( SD = 5). There were 12 women and 10 men in this study.
  • In a memory experiment, the mean number of items recalled by the 40 participants in Condition A was 0.50 standard deviations greater than the mean number recalled by the 40 participants in Condition B.
  • In another memory experiment, the mean scores for participants in Condition A and Condition B came out exactly the same!
  • A student finds a correlation of r = .04 between the number of units the students in his research methods class are taking and the students’ level of stress.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.

Hyde, J. S. (2007). New directions in the study of gender similarities and differences. Current Directions in Psychological Science, 16, 259–263.

Research Methods in Psychology Copyright © 2016 by University of Minnesota is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.

13.1 Understanding Null Hypothesis Testing

Learning objectives.

  • Explain the purpose of null hypothesis testing, including the role of sampling error.
  • Describe the basic logic of null hypothesis testing.
  • Describe the role of relationship strength and sample size in determining statistical significance and make reasonable judgments about statistical significance based on these two factors.

  The Purpose of Null Hypothesis Testing

As we have seen, psychological research typically involves measuring one or more variables in a sample and computing descriptive statistics for that sample. In general, however, the researcher’s goal is not to draw conclusions about that sample but to draw conclusions about the population that the sample was selected from. Thus researchers must use sample statistics to draw conclusions about the corresponding values in the population. These corresponding values in the population are called  parameters . Imagine, for example, that a researcher measures the number of depressive symptoms exhibited by each of 50 adults with clinical depression and computes the mean number of symptoms. The researcher probably wants to use this sample statistic (the mean number of symptoms for the sample) to draw conclusions about the corresponding population parameter (the mean number of symptoms for adults with clinical depression).

Unfortunately, sample statistics are not perfect estimates of their corresponding population parameters. This is because there is a certain amount of random variability in any statistic from sample to sample. The mean number of depressive symptoms might be 8.73 in one sample of adults with clinical depression, 6.45 in a second sample, and 9.44 in a third—even though these samples are selected randomly from the same population. Similarly, the correlation (Pearson’s  r ) between two variables might be +.24 in one sample, −.04 in a second sample, and +.15 in a third—again, even though these samples are selected randomly from the same population. This random variability in a statistic from sample to sample is called  sampling error . (Note that the term error  here refers to random variability and does not imply that anyone has made a mistake. No one “commits a sampling error.”)

One implication of this is that when there is a statistical relationship in a sample, it is not always clear that there is a statistical relationship in the population. A small difference between two group means in a sample might indicate that there is a small difference between the two group means in the population. But it could also be that there is no difference between the means in the population and that the difference in the sample is just a matter of sampling error. Similarly, a Pearson’s  r  value of −.29 in a sample might mean that there is a negative relationship in the population. But it could also be that there is no relationship in the population and that the relationship in the sample is just a matter of sampling error.

In fact, any statistical relationship in a sample can be interpreted in two ways:

  • There is a relationship in the population, and the relationship in the sample reflects this.
  • There is no relationship in the population, and the relationship in the sample reflects only sampling error.

The purpose of null hypothesis testing is simply to help researchers decide between these two interpretations.

The Logic of Null Hypothesis Testing

Null hypothesis testing  is a formal approach to deciding between two interpretations of a statistical relationship in a sample. One interpretation is called the  null hypothesis  (often symbolized  H 0  and read as “H-naught”). This is the idea that there is no relationship in the population and that the relationship in the sample reflects only sampling error. Informally, the null hypothesis is that the sample relationship “occurred by chance.” The other interpretation is called the  alternative hypothesis  (often symbolized as  H 1 ). This is the idea that there is a relationship in the population and that the relationship in the sample reflects this relationship in the population.

Again, every statistical relationship in a sample can be interpreted in either of these two ways: It might have occurred by chance, or it might reflect a relationship in the population. So researchers need a way to decide between them. Although there are many specific null hypothesis testing techniques, they are all based on the same general logic. The steps are as follows:

  • Assume for the moment that the null hypothesis is true. There is no relationship between the variables in the population.
  • Determine how likely the sample relationship would be if the null hypothesis were true.
  • If the sample relationship would be extremely unlikely, then reject the null hypothesis  in favor of the alternative hypothesis. If it would not be extremely unlikely, then  retain the null hypothesis .
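The three steps above can be sketched with a permutation test, which is only one of many null hypothesis testing techniques and is used here because it makes the logic visible. The scores, the function name `permutation_p_value`, and the number of shuffles are all assumptions made up for illustration.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Estimate how often a mean difference at least as large as the
    observed one arises when group labels are assigned at random."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Step 1: act as if the null hypothesis were true, so labels are arbitrary.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        # Step 2: count shuffles that match or exceed the sample relationship.
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles

# Hypothetical scores for two groups (invented for this sketch).
group_a = [12, 15, 14, 16, 13, 17, 15, 14]
group_b = [10, 11, 13, 9, 12, 10, 11, 12]

p = permutation_p_value(group_a, group_b)
# Step 3: reject the null hypothesis only if the result would be very unlikely.
decision = "reject" if p <= 0.05 else "retain"
print(f"p = {p:.4f}; {decision} the null hypothesis")
```

Shuffling the pooled scores is what "assume the null hypothesis is true" looks like in practice: if group membership made no difference, any relabeling of the scores would be just as plausible as the one observed.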

Following this logic, we can begin to understand why Mehl and his colleagues concluded that there is no difference in talkativeness between women and men in the population. In essence, they asked the following question: “If there were no difference in the population, how likely is it that we would find a small difference of  d  = 0.06 in our sample?” Their answer to this question was that this sample relationship would be fairly likely if the null hypothesis were true. Therefore, they retained the null hypothesis—concluding that there is no evidence of a sex difference in the population. We can also see why Kanner and his colleagues concluded that there is a correlation between hassles and symptoms in the population. They asked, “If the null hypothesis were true, how likely is it that we would find a strong correlation of +.60 in our sample?” Their answer to this question was that this sample relationship would be fairly unlikely if the null hypothesis were true. Therefore, they rejected the null hypothesis in favor of the alternative hypothesis—concluding that there is a positive correlation between these variables in the population.

A crucial step in null hypothesis testing is finding the likelihood of the sample result if the null hypothesis were true. This probability is called the  p value . A low  p  value means that the sample result would be unlikely if the null hypothesis were true and leads to the rejection of the null hypothesis. A p  value that is not low means that the sample result would be likely if the null hypothesis were true and leads to the retention of the null hypothesis. But how low must the  p  value be before the sample result is considered unlikely enough to reject the null hypothesis? In null hypothesis testing, this criterion is called  α (alpha)  and is almost always set to .05. If there is a 5% chance or less of a result as extreme as the sample result if the null hypothesis were true, then the null hypothesis is rejected. When this happens, the result is said to be  statistically significant . If there is greater than a 5% chance of a result as extreme as the sample result when the null hypothesis is true, then the null hypothesis is retained. This does not necessarily mean that the researcher accepts the null hypothesis as true—only that there is not currently enough evidence to reject it. Researchers often use the expression “fail to reject the null hypothesis” rather than “retain the null hypothesis,” but they never use the expression “accept the null hypothesis.”
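As a rough illustration of comparing a p value with α, consider a hypothetical study in which 13 of 15 participants score higher after a treatment than before it. If the null hypothesis were true, each participant would be equally likely to score higher or lower, so the p value can be computed exactly from the binomial distribution. The scenario and the function names are assumptions introduced for this sketch.

```python
from math import comb

ALPHA = 0.05  # the conventional criterion described above

def two_sided_binomial_p(successes, n):
    """Probability, if each outcome were a 50/50 chance, of a count at least
    as far from n/2 as the observed count."""
    distance = abs(successes - n / 2)
    extreme_counts = [k for k in range(n + 1) if abs(k - n / 2) >= distance]
    return sum(comb(n, k) for k in extreme_counts) / 2 ** n

def decide(p_value, alpha=ALPHA):
    """Apply the decision rule: reject when p is at or below alpha."""
    if p_value <= alpha:
        return "reject the null hypothesis (statistically significant)"
    return "retain the null hypothesis"

for improved in (13, 9):
    p_value = two_sided_binomial_p(improved, 15)
    print(f"{improved} of 15 improved: p = {p_value:.4f}, {decide(p_value)}")
```

Note that the function sums over every outcome at least as extreme as the observed one, not just the observed outcome itself, which is what "a result as extreme as the sample result" means.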

The Misunderstood  p  Value

The  p  value is one of the most misunderstood quantities in psychological research (Cohen, 1994) [1] . Even professional researchers misinterpret it, and it is not unusual for such misinterpretations to appear in statistics textbooks!

The most common misinterpretation is that the p value is the probability that the null hypothesis is true, that is, the probability that the sample result occurred by chance. For example, a misguided researcher might say that because the p value is .02, there is only a 2% chance that the result is due to chance and a 98% chance that it reflects a real relationship in the population. But this is incorrect. The p value is really the probability of a result at least as extreme as the sample result if the null hypothesis were true. So a p value of .02 means that if the null hypothesis were true, a sample result this extreme would occur only 2% of the time.

You can avoid this misunderstanding by remembering that the  p  value is not the probability that any particular  hypothesis  is true or false. Instead, it is the probability of obtaining the  sample result  if the null hypothesis were true.
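One way to see why the p value cannot be the probability that the null hypothesis is true is to note that this second probability depends on quantities the p value never uses. The sketch below applies Bayes' rule to significant results in general (not to any one p value). Its inputs are assumptions for illustration: `prior_null` is the share of tested hypotheses for which the null is actually true, and `power` is the chance of a significant result when a relationship does exist.

```python
def share_of_significant_results_with_true_null(prior_null, alpha, power):
    """Among results with p <= alpha, the proportion for which the null
    hypothesis is nevertheless true (Bayes' rule)."""
    false_positives = prior_null * alpha
    true_positives = (1 - prior_null) * power
    return false_positives / (false_positives + true_positives)

# Same alpha and power, different (hypothetical) shares of true nulls.
for prior_null in (0.5, 0.9):
    share = share_of_significant_results_with_true_null(prior_null, alpha=0.05, power=0.80)
    print(f"If {prior_null:.0%} of nulls are true, {share:.1%} of significant results are false positives")
```

With α fixed at .05, the answer changes as the assumed share of true nulls changes, so no single p value or α level can be read as "the probability that the result is due to chance."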

image

“Null Hypothesis” retrieved from http://imgs.xkcd.com/comics/null_hypothesis.png (CC-BY-NC 2.5)

Role of Sample Size and Relationship Strength

Recall that null hypothesis testing involves answering the question, “If the null hypothesis were true, what is the probability of a sample result as extreme as this one?” In other words, “What is the  p  value?” It can be helpful to see that the answer to this question depends on just two considerations: the strength of the relationship and the size of the sample. Specifically, the stronger the sample relationship and the larger the sample, the less likely the result would be if the null hypothesis were true. That is, the lower the  p  value. This should make sense. Imagine a study in which a sample of 500 women is compared with a sample of 500 men in terms of some psychological characteristic, and Cohen’s  d  is a strong 0.50. If there were really no sex difference in the population, then a result this strong based on such a large sample should seem highly unlikely. Now imagine a similar study in which a sample of three women is compared with a sample of three men, and Cohen’s  d  is a weak 0.10. If there were no sex difference in the population, then a relationship this weak based on such a small sample should seem likely. And this is precisely why the null hypothesis would be rejected in the first example and retained in the second.
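The two imagined studies can be explored with a small simulation. The sketch below repeatedly draws two groups from the same population, so the null hypothesis is true by construction, and counts how often Cohen's d comes out at least as large as the observed value. It assumes normally distributed scores with equal variances, and the function names and number of repetitions are choices made for illustration.

```python
import random

def cohens_d(a, b):
    """Difference between group means divided by the pooled standard deviation."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_sd = ((ss_a + ss_b) / (len(a) + len(b) - 2)) ** 0.5
    return (mean_a - mean_b) / pooled_sd

def simulated_p(observed_d, n_per_group, reps=1000, seed=1):
    """Share of null-true samples whose |d| is at least as large as |observed_d|."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if abs(cohens_d(a, b)) >= abs(observed_d):
            hits += 1
    return hits / reps

p_large = simulated_p(0.50, n_per_group=500)  # strong result, large sample
p_small = simulated_p(0.10, n_per_group=3)    # weak result, tiny sample
print(f"d = 0.50, 500 per group: simulated p = {p_large:.3f}")
print(f"d = 0.10, 3 per group:   simulated p = {p_small:.3f}")
```

With 500 participants per group, chance alone almost never produces a d as large as 0.50, whereas with three per group a d of 0.10 or more turns up in most samples.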

Of course, sometimes the result can be weak and the sample large, or the result can be strong and the sample small. In these cases, the two considerations trade off against each other so that a weak result can be statistically significant if the sample is large enough and a strong relationship can be statistically significant even if the sample is small. Table 13.1 shows roughly how relationship strength and sample size combine to determine whether a sample result is statistically significant. The columns of the table represent the three levels of relationship strength: weak, medium, and strong. The rows represent four sample sizes that can be considered small, medium, large, and extra large in the context of psychological research. Thus each cell in the table represents a combination of relationship strength and sample size. If a cell contains the word Yes, then this combination would be statistically significant for both Cohen’s d and Pearson’s r. If it contains the word No, then it would not be statistically significant for either. There is one cell where the decision for d and r would be different and another where it might be different depending on some additional considerations, which are discussed in Section 13.2 “Some Basic Null Hypothesis Tests.”

Sample Size           | Weak            | Medium | Strong
Small (N = 20)        | No              | No     | d = Maybe, r = Yes
Medium (N = 50)       | No              | Yes    | Yes
Large (N = 100)       | d = Yes, r = No | Yes    | Yes
Extra large (N = 500) | Yes             | Yes    | Yes

Although Table 13.1 provides only a rough guideline, it shows very clearly that weak relationships based on medium or small samples are never statistically significant and that strong relationships based on medium or larger samples are always statistically significant. If you keep this lesson in mind, you will often know whether a result is statistically significant based on the descriptive statistics alone. It is extremely useful to be able to develop this kind of intuitive judgment. One reason is that it allows you to develop expectations about how your formal null hypothesis tests are going to come out, which in turn allows you to detect problems in your analyses. For example, if your sample relationship is strong and your sample is medium, then you would expect to reject the null hypothesis. If for some reason your formal null hypothesis test indicates otherwise, then you need to double-check your computations and interpretations. A second reason is that the ability to make this kind of intuitive judgment is an indication that you understand the basic logic of this approach in addition to being able to do the computations.
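The kind of rough check described here can also be automated for Pearson's r. The sketch below assumes the conventional values of .10, .30, and .50 for weak, medium, and strong correlations (these values are an assumption, since the table itself does not state them) and uses the Fisher z approximation rather than an exact test, so borderline cases could differ from a formal analysis.

```python
from math import atanh, sqrt
from statistics import NormalDist

def approx_p_for_r(r, n):
    """Approximate two-sided p value for Pearson's r using Fisher's z transformation."""
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Assumed conventional values for weak, medium, and strong correlations.
strengths = {"weak": 0.10, "medium": 0.30, "strong": 0.50}
sample_sizes = [20, 50, 100, 500]

table = {
    n: {label: approx_p_for_r(r, n) <= 0.05 for label, r in strengths.items()}
    for n in sample_sizes
}
for n, row in table.items():
    cells = "  ".join(f"{label}: {'Yes' if significant else 'No'}" for label, significant in row.items())
    print(f"N = {n:>3}  {cells}")
```

The output can be compared with the Pearson's r entries in Table 13.1. A mismatch between this kind of quick check and a formal test is a signal to recheck the computations, as the paragraph above suggests.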

Statistical Significance Versus Practical Significance

Table 13.1 illustrates another extremely important point. A statistically significant result is not necessarily a strong one. Even a very weak result can be statistically significant if it is based on a large enough sample. This is closely related to Janet Shibley Hyde’s argument about sex differences (Hyde, 2007) [2] . The differences between women and men in mathematical problem solving and leadership ability are statistically significant. But the word  significant  can cause people to interpret these differences as strong and important—perhaps even important enough to influence the college courses they take or even who they vote for. As we have seen, however, these statistically significant differences are actually quite weak—perhaps even “trivial.”

This is why it is important to distinguish between the  statistical  significance of a result and the  practical  significance of that result.  Practical significance refers to the importance or usefulness of the result in some real-world context. Many sex differences are statistically significant—and may even be interesting for purely scientific reasons—but they are not practically significant. In clinical practice, this same concept is often referred to as “clinical significance.” For example, a study on a new treatment for social phobia might show that it produces a statistically significant positive effect. Yet this effect still might not be strong enough to justify the time, effort, and other costs of putting it into practice—especially if easier and cheaper treatments that work almost as well already exist. Although statistically significant, this result would be said to lack practical or clinical significance.
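To make the distinction concrete, the sketch below holds a very weak effect fixed at d = 0.05 (a hypothetical value) and varies only the sample size. It uses a large-sample normal approximation in which the test statistic is d multiplied by the square root of half the per-group sample size, so the p values are approximate.

```python
from math import sqrt
from statistics import NormalDist

def approx_p_for_d(d, n_per_group):
    """Approximate two-sided p value for Cohen's d with equal group sizes."""
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

tiny_effect = 0.05  # hypothetical, far below what is usually called a weak effect
for n_per_group in (50, 5_000, 50_000):
    p_value = approx_p_for_d(tiny_effect, n_per_group)
    verdict = "significant" if p_value <= 0.05 else "not significant"
    print(f"n per group = {n_per_group:>6}: p = {p_value:.4f} ({verdict})")
```

The effect is equally small in every row. Only the verdict about statistical significance changes, which is why statistical significance alone says little about practical significance.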

image

“Conditional Risk” retrieved from http://imgs.xkcd.com/comics/conditional_risk.png (CC-BY-NC 2.5)

Key Takeaways

  • Null hypothesis testing is a formal approach to deciding whether a statistical relationship in a sample reflects a real relationship in the population or is just due to chance.
  • The logic of null hypothesis testing involves assuming that the null hypothesis is true, finding how likely the sample result would be if this assumption were correct, and then making a decision. If the sample result would be unlikely if the null hypothesis were true, then it is rejected in favor of the alternative hypothesis. If it would not be unlikely, then the null hypothesis is retained.
  • The probability of obtaining the sample result if the null hypothesis were true (the  p  value) is based on two considerations: relationship strength and sample size. Reasonable judgments about whether a sample relationship is statistically significant can often be made by quickly considering these two factors.
  • Statistical significance is not the same as relationship strength or importance. Even weak relationships can be statistically significant if the sample size is large enough. It is important to consider relationship strength and the practical significance of a result in addition to its statistical significance.

Exercises

  • Discussion: Imagine a study showing that people who eat more broccoli tend to be happier. Explain for someone who knows nothing about statistics why the researchers would conduct a null hypothesis test.
  • Practice: Decide whether each of the following results is likely to be statistically significant, using Table 13.1 as a rough guide.
  • The correlation between two variables is r = −.78 based on a sample size of 137.
  • The mean score on a psychological characteristic for women is 25 (SD = 5) and the mean score for men is 24 (SD = 5). There were 12 women and 10 men in this study.
  • In a memory experiment, the mean number of items recalled by the 40 participants in Condition A was 0.50 standard deviations greater than the mean number recalled by the 40 participants in Condition B.
  • In another memory experiment, the mean scores for participants in Condition A and Condition B came out exactly the same!
  • A student finds a correlation of r = .04 between the number of units the students in his research methods class are taking and the students’ level of stress.

[1] Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
[2] Hyde, J. S. (2007). New directions in the study of gender similarities and differences. Current Directions in Psychological Science, 16, 259–263.

An Easy Introduction to Statistical Significance (With Examples)

Published on January 7, 2021 by Pritha Bhandari. Revised on June 22, 2023.

If a result is statistically significant, that means it’s unlikely to be explained solely by chance or random factors. In other words, a statistically significant result has a very low chance of occurring if there were no true effect in a research study.

The p value, or probability value, tells you the statistical significance of a finding. In most studies, a p value of 0.05 or less is considered statistically significant, but this threshold can also be set higher or lower.

Table of contents

  • How do you test for statistical significance?
  • What is a significance level?
  • Problems with relying on statistical significance
  • Other types of significance in research
  • Other interesting articles
  • Frequently asked questions about statistical significance

How do you test for statistical significance?

In quantitative research, data are analyzed through null hypothesis significance testing, or hypothesis testing. This is a formal procedure for assessing whether a relationship between variables or a difference between groups is statistically significant.

Null and alternative hypotheses

To begin, research predictions are rephrased into two main hypotheses: the null and alternative hypothesis.

  • A null hypothesis (H0) always predicts no true effect, no relationship between variables, or no difference between groups.
  • An alternative hypothesis (Ha or H1) states your main prediction of a true effect, a relationship between variables, or a difference between groups.

Hypothesis testing always starts with the assumption that the null hypothesis is true. Using this procedure, you can assess the likelihood (probability) of obtaining your results under this assumption. Based on the outcome of the test, you can reject or retain the null hypothesis.

  • H0: There is no difference in happiness between actively smiling and not smiling.
  • Ha: Actively smiling leads to more happiness than not smiling.

Test statistics and p values

Every statistical test produces:

  • A test statistic that indicates how closely your data match the null hypothesis.
  • A corresponding p value that tells you the probability of obtaining a result at least this extreme if the null hypothesis is true.

The p value determines statistical significance. An extremely low p value indicates high statistical significance, while a high p value means low or no statistical significance.

Next, you perform a t test to see whether actively smiling leads to more happiness. Using the difference in average happiness between the two groups, you calculate:

  • a t value (the test statistic) that tells you how much the sample data differs from the null hypothesis,
  • a p value showing the likelihood of finding a result at least this extreme if the null hypothesis is true.
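A t test of this kind might be sketched as follows. The happiness ratings are invented for illustration, the test shown is a pooled-variance independent-samples t test, and the p value is two-sided even though the alternative hypothesis above is directional. The p value is obtained by numerically integrating the t density with Simpson's rule rather than by calling a statistics package, so it is an approximation.

```python
from math import gamma, pi, sqrt
from statistics import mean, variance

def t_pdf(t, df):
    """Density of Student's t distribution (suitable for moderate df)."""
    return gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2)) * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p value: 1 minus the area under the t density between -|t| and |t|."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return max(0.0, 1 - 2 * area_zero_to_t)

def independent_t_test(a, b):
    """Return the t value, degrees of freedom, and two-sided p value."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    df = n_a + n_b - 2
    return t, df, two_sided_p(t, df)

# Hypothetical happiness ratings (1 to 10) for the two groups.
smiling = [7, 8, 6, 9, 7, 8, 8, 7]
not_smiling = [6, 5, 7, 6, 5, 6, 7, 5]

t_value, df, p_value = independent_t_test(smiling, not_smiling)
print(f"t({df}) = {t_value:.2f}, p = {p_value:.4f}")
```

In practice, statistical software reports both numbers directly. The sketch only shows where they come from: the t value scales the mean difference by its standard error, and the p value is the tail area beyond that t value in the null distribution.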

What is a significance level?

The significance level, or alpha (α), is a value that the researcher sets in advance as the threshold for statistical significance. It is the maximum risk of making a false positive conclusion (Type I error) that you are willing to accept.

In a hypothesis test, the p value is compared to the significance level to decide whether to reject the null hypothesis.

  • If the p value is higher than the significance level, the null hypothesis is not refuted, and the results are not statistically significant.
  • If the p value is lower than the significance level, the results are interpreted as refuting the null hypothesis and reported as statistically significant.

Usually, the significance level is set to 0.05 or 5%. That means your results must have a 5% or lower chance of occurring under the null hypothesis to be considered statistically significant.
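The link between the significance level and the false positive risk can be checked by simulation. The sketch below runs many hypothetical studies in which the null hypothesis is true by construction (a population mean of 0 with a known standard deviation of 1, tested with a z test) and records how often the result is nonetheless significant. The study design, sample size, and function name are assumptions chosen for simplicity.

```python
import random
from math import sqrt
from statistics import NormalDist

def false_positive_rate(alpha=0.05, n=30, studies=5000, seed=2):
    """Share of null-true studies that end with p <= alpha."""
    rng = random.Random(seed)
    standard_normal = NormalDist()
    rejections = 0
    for _ in range(studies):
        sample = [rng.gauss(0, 1) for _ in range(n)]  # the null hypothesis is true here
        z = (sum(sample) / n) * sqrt(n)               # z test with known SD of 1
        p_value = 2 * (1 - standard_normal.cdf(abs(z)))
        if p_value <= alpha:
            rejections += 1
    return rejections / studies

rate_05 = false_positive_rate(alpha=0.05)
rate_01 = false_positive_rate(alpha=0.01)
print(f"alpha = .05: {rate_05:.3f} of null-true studies were significant")
print(f"alpha = .01: {rate_01:.3f} of null-true studies were significant")
```

The observed rates should land close to the chosen significance levels, which is the sense in which α is the accepted risk of a Type I error.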

The significance level can be lowered for a more conservative test. That means an effect has to be larger to be considered statistically significant.

The significance level may also be set higher for significance testing in non-academic marketing or business contexts. This makes the study less rigorous and increases the probability of finding a statistically significant result.

As best practice, you should set a significance level before you begin your study. Otherwise, you can easily manipulate your results to match your research predictions.

It’s important to note that hypothesis testing can only show you whether or not to reject the null hypothesis in favor of the alternative hypothesis. It can never “prove” the null hypothesis, because the lack of a statistically significant effect doesn’t mean that absolutely no effect exists.

When reporting statistical significance, include relevant descriptive statistics about your data (e.g., means and standard deviations ) as well as the test statistic and p value.

Problems with relying on statistical significance

There are various critiques of the concept of statistical significance and how it is used in research.

Researchers classify results as statistically significant or non-significant using a conventional threshold that lacks any theoretical or practical basis. This means that even a tiny 0.001 decrease in a p value can convert a research finding from statistically non-significant to significant with almost no real change in the effect.

On its own, statistical significance may also be misleading because it’s affected by sample size. In extremely large samples, you’re more likely to obtain statistically significant results, even if the effect is actually small or negligible in the real world. This means that small effects are often exaggerated if they meet the significance threshold, while interesting results are ignored when they fall short of meeting the threshold.

The strong emphasis on statistical significance has led to a serious publication bias and replication crisis in the social sciences and medicine over the last few decades. Results are usually only published in academic journals if they show statistically significant results—but statistically significant results often can’t be reproduced in high quality replication studies.

As a result, many scientists call for retiring statistical significance as a decision-making tool in favor of more nuanced approaches to interpreting results.

That’s why APA guidelines advise reporting not only p values but also  effect sizes and confidence intervals wherever possible to show the real world implications of a research outcome.
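As a rough sketch of what reporting an effect size and a confidence interval alongside a p value might involve, the code below computes Cohen's d and an approximate 95% confidence interval for a difference between two group means. The data are invented, and the interval uses a normal critical value (about 1.96), which is only an approximation for samples this small. A t-based interval would be somewhat wider.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((len(a) - 1) * stdev(a) ** 2 + (len(b) - 1) * stdev(b) ** 2) / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

def mean_difference_ci(a, b, confidence=0.95):
    """Approximate confidence interval for the difference between two means."""
    diff = mean(a) - mean(b)
    standard_error = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff - z * standard_error, diff + z * standard_error

# Hypothetical outcome scores for a treatment group and a control group.
treatment = [7, 8, 6, 9, 7, 8, 8, 7]
control = [6, 5, 7, 6, 5, 6, 7, 5]

d = cohens_d(treatment, control)
low, high = mean_difference_ci(treatment, control)
print(f"Cohen's d = {d:.2f}; 95% CI for the mean difference: [{low:.2f}, {high:.2f}]")
```

Unlike a bare "significant" or "not significant" verdict, these two quantities show how large the effect appears to be and how precisely it has been estimated.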

Other types of significance in research

Aside from statistical significance, clinical significance and practical significance are also important research outcomes.

Practical significance shows you whether the research outcome is important enough to be meaningful in the real world. It’s indicated by the effect size of the study.

Clinical significance is relevant for intervention and treatment studies. A treatment is considered clinically significant when it tangibly or substantially improves the lives of patients.

Other interesting articles

If you want to know more about statistics, methodology, or research bias, make sure to check out some of our other articles with explanations and examples.

Statistics

  • Normal distribution
  • Descriptive statistics
  • Measures of central tendency
  • Correlation coefficient

Methodology

  • Cluster sampling
  • Stratified sampling
  • Types of interviews
  • Cohort study
  • Thematic analysis

Research bias

  • Implicit bias
  • Cognitive bias
  • Survivorship bias
  • Availability heuristic
  • Nonresponse bias
  • Regression to the mean

Frequently asked questions about statistical significance

Statistical significance is a term used by researchers to state that it is unlikely their observations could have occurred under the null hypothesis of a statistical test. Significance is usually denoted by a p-value, or probability value.

Statistical significance is arbitrary: it depends on the threshold, or alpha value, chosen by the researcher. The most common threshold is p < 0.05, which means that the data is likely to occur less than 5% of the time under the null hypothesis.

When the p-value falls below the chosen alpha value, then we say the result of the test is statistically significant.

A p-value, or probability value, is a number describing how likely it is that your data would have occurred under the null hypothesis of your statistical test.

P-values are usually automatically calculated by the program you use to perform your statistical test. They can also be estimated using p-value tables for the relevant test statistic.

P-values are calculated from the null distribution of the test statistic. They tell you how often a test statistic at least this extreme is expected to occur under the null hypothesis of the statistical test, based on where it falls in the null distribution.

If the test statistic is far from the mean of the null distribution, then the p-value will be small, showing that the test statistic is not likely to have occurred under the null hypothesis.

A p-value does not tell you whether your alternative hypothesis is true. It only tells you how likely the data you have observed is to have occurred under the null hypothesis.

If the p-value is below your threshold of significance (typically p < 0.05), then you can reject the null hypothesis, but this does not necessarily mean that your alternative hypothesis is true.

Bhandari, P. (2023, June 22). An Easy Introduction to Statistical Significance (With Examples). Scribbr. Retrieved June 24, 2024, from https://www.scribbr.com/statistics/statistical-significance/

New Guidelines for Null Hypothesis Significance Testing in Hypothetico-Deductive IS Research

First Online: 15 October 2023

Willem Mertens (ORCID: orcid.org/0000-0002-1635-3041) & Jan Recker (ORCID: orcid.org/0000-0002-2072-5792)

Part of the book series: Technology, Work and Globalization (TWG)

We are concerned about the design, analysis, reporting and reviewing of quantitative IS studies that draw on null hypothesis significance testing (NHST). We observe that debates about misinterpretations, abuse, and issues with NHST, while having persisted for about half a century, remain largely absent in IS. We find this an untenable position for a discipline with a proud quantitative tradition. We discuss traditional and emergent threats associated with the application of NHST and examine how they manifest in recent IS scholarship. To encourage the development of new standards for NHST in hypothetico-deductive IS research, we develop a balanced account of possible actions that are implementable short-term or long-term and that incentivize or penalize specific practices. To promote an immediate push for change, we also develop two sets of guidelines that IS scholars can adopt right away.

That is, the entire IS scholarly ecosystem of authors, reviewers, editors/publishers, and educators/supervisors.

We will also discuss some of the problems inherent to NHST, but our clear focus is on our own fallibilities and how they could be mitigated.

Remarkably, contrary to several fields, the experiences at the AIS Transactions on Replication Research after three years of publishing replication research indicate that a meaningful proportion of research replications have produced results that are essentially the same as the original study (Dennis et al., 2018 ).

This trend is evidenced, for example, in the emergent number of IS research articles on these topics in our own journals (e.g., Berente et al., 2019 ; Howison et al., 2011 ; Levy & Germonprez, 2017 ; Lukyanenko et al., 2019 ).

To illustrate the magnitude of the conversation, in June 2019, The American Statistician published a special issue on null hypothesis significance testing that contains 43 articles on the topic (Wasserstein et al., 2019 ).

An analogous, more detailed example using the relationship between mammograms and the likelihood of breast cancer is provided by Gigerenzer et al. ( 2008 ).

See Lin et al. ( 2013 ) for several examples.

To illustrate, consider this tweet from June 3, 2019: “Discussion on the #statisticalSignificance has reached ISR. “Null hypothesis significance testing in quantitative IS research: a call to reconsider our practices [submission to a second AIS Senior Scholar Basket of 8 Journal, received Major Revisions]” a new paper by @janrecker” ( https://twitter.com/AgloAnivel/status/1135466967354290176 )

Our query terms were: [ Management Information Systems Quarterly OR MIS Quarterly OR MISQ], [ European Journal of Information Systems OR EJIS], [ Information Systems Journal OR IS Journal OR ISJ], [ Information Systems Research OR ISR], [ Journal of the Association for Information Systems OR Journal of the AIS OR JAIS], [ Journal of Information Technology OR Journal of IT OR JIT], [ Journal of Management Information Systems OR Journal of MIS OR JMIS], [ Journal of Strategic Information Systems OR Journal of SIS OR JSIS]. We checked for and excluded inaccurate results, such as papers from MISQ Executive , European Journal of Interdisciplinary Studies (EJIS), etc.

We used the definitions by Creswell ( 2009 , p. 148): random sampling means each unit in the population has an equal probability of being selected, systematic sampling means that specific characteristics are used to stratify the sample such that the true proportion of units in the studied population is reflected, and convenience sampling means that a nonprobability sample of available or accessible units is used.

Amrhein, V., Greenland, S., & McShane, B. (2019). Scientists rise up against statistical significance. Nature, 567 , 305–307.

Bagozzi, R. P. (2011). Measurement and meaning in information systems and organizational research: Methodological and philosophical foundations. MIS Quarterly, 35 (2), 261–292.

Baker, M. (2016). Statisticians issue warning over misuse of p values. Nature, 531 (7593), 151.

Baroudi, J. J., & Orlikowski, W. J. (1989). The problem of statistical power in MIS research. MIS Quarterly, 13 (1), 87–106.

Bedeian, A. G., Taylor, S. G., & Miller, A. N. (2010). Management science on the credibility bubble: Cardinal sins and various misdemeanors. Academy of Management Learning & Education, 9 (4), 715–725.

Begg, C., Cho, M., Eastwood, S., Horton, R., Moher, D., Olkin, I., et al. (1996). Improving the quality of reporting of randomized controlled trials: The consort statement. Journal of the American Medical Association, 276 (8), 637–639.

Berente, N., Seidel, S., & Safadi, H. (2019). Data-driven computationally-intensive theory development. Information Systems Research, 30 (1), 50–64.

Bettis, R. A. (2012). The search for asterisks: Compromised statistical tests and flawed theories. Strategic Management Journal, 33 (1), 108–113.

Bettis, R. A., Ethiraj, S., Gambardella, A., Helfat, C., & Mitchell, W. (2016). Creating repeatable cumulative knowledge in strategic management. Strategic Management Journal, 37 (2), 257–261.

Branch, M. (2014). Malignant side effects of null-hypothesis significance testing. Theory & Psychology, 24 (2), 256–277.

Bruns, S. B., & Ioannidis, J. P. A. (2016). P-curve and p-hacking in observational research. PLoS One, 11 (2), e0149144.

Burmeister, O. K. (2016). A post publication review of “A review and comparative analysis of security risks and safety measures of mobile health apps.” Australasian Journal of Information Systems, 20 , 1–4.

Burtch, G., Ghose, A., & Wattal, S. (2013). An empirical examination of the antecedents and consequences of contribution patterns in crowd-funded markets. Information Systems Research, 24 (3), 499–519.

Burton-Jones, A., & Lee, A. S. (2017). Thinking about measures and measurement in positivist research: A proposal for refocusing on fundamentals. Information Systems Research, 28 (3), 451–467.

Burton-Jones, A., Recker, J., Indulska, M., Green, P., & Weber, R. (2017). Assessing representation theory with a framework for pursuing success and failure. MIS Quarterly, 41 (4), 1307–1333.

Button, K. S., Bal, L., Clark, A., & Shipley, T. (2016). Preventing the ends from justifying the means: Withholding results to address publication bias in peer-review. BMC Psychology, 4 , 59.

Chen, H., Chiang, R., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impacts. MIS Quarterly, 36 (4), 1165–1188.

Christensen, R. (2005). Testing Fisher, Neyman, Pearson, and Bayes. The American Statistician, 59 (2), 121–126.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49 (12), 997–1003.

Creswell, J. W. (2009). Research design: Qualitative, quantitative, and mixed methods approaches (3rd ed.). SAGE.

David, P. A. (2004). Understanding the emergence of “open science” institutions: Functionalist economics in historical context. Industrial and Corporate Change, 13 (4), 571–589.

Dennis, A. R., Brown, S. A., Wells, T., & Rai, A. (2018). Information systems replication project. https://aisel.aisnet.org/trr/aimsandscope.html

Dennis, A. R., & Valacich, J. S. (2015). A replication manifesto. AIS Transactions on Replication Research, 1 (1), 1–4.

Dennis, A. R., Valacich, J. S., Fuller, M. A., & Schneider, C. (2006). Research standards for promotion and tenure in information systems. MIS Quarterly, 30 (1), 1–12.

Dewan, S., & Ramaprasad, J. (2014). Social media, traditional media, and music sales. MIS Quarterly, 38 (1), 101–121.

Dixon, P. (2003). The p-value fallacy and how to avoid it. Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expérimentale, 57 (3), 189–202.

Edwards, J. R., & Berry, J. W. (2010). The presence of something or the absence of nothing: Increasing theoretical precision in management research. Organizational Research Methods, 13 (4), 668–689.

Emerson, G. B., Warme, W. J., Wolf, F. M., Heckman, J. D., Brand, R. A., & Leopold, S. S. (2010). Testing for the presence of positive-outcome bias in peer review: A randomized controlled trial. Archives of Internal Medicine, 170 (21), 1934–1939.

Falk, R., & Greenbaum, C. W. (1995). Significance tests die hard: The amazing persistence of a probabilistic misconception. Theory & Psychology, 5 (1), 75–98.

Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39 (2), 175–191.

Field, A. (2013). Discovering statistics using IBM SPSS statistics . SAGE.

Fisher, R. A. (1935a). The design of experiments . Oliver & Boyd.

Fisher, R. A. (1935b). The logic of inductive inference. Journal of the Royal Statistical Society, 98 (1), 39–82.

Fisher, R. A. (1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society. Series B (Methodological), 17 (1), 69–78.

Freelon, D. (2014). On the interpretation of digital trace data in communication and social computing research. Journal of Broadcasting & Electronic Media, 58 (1), 59–75.

Gefen, D., Rigdon, E. E., & Straub, D. W. (2011). An update and extension to SEM guidelines for administrative and social science research. MIS Quarterly, 35 (2), iii–xiv.

Gelman, A. (2013). P values and statistical practice. Epidemiology, 24 (1), 69–72.

Gelman, A. (2015). Statistics and research integrity. European Science Editing, 41, 13–14.

Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. The American Statistician, 60 (4), 328–331.

George, G., Haas, M. R., & Pentland, A. (2014). From the editors: Big data and management. Academy of Management Journal, 57 (2), 321–326.

Gerow, J. E., Grover, V., Roberts, N., & Thatcher, J. B. (2010). The diffusion of second-generation statistical techniques in information systems research from 1990-2008. Journal of Information Technology Theory and Application, 11 (4), 5–28.

Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33 (5), 587–606.

Gigerenzer, G., Gaissmaier, W., Kurz-Milcke, E., Schwartz, L. M., & Woloshin, S. (2008). Helping doctors and patients make sense of health statistics. Psychological Science in the Public Interest, 8 (2), 53–96.

Godfrey-Smith, P. (2003). Theory and reality: An introduction to the philosophy of science. University of Chicago Press.

Goldfarb, B., & King, A. A. (2016). Scientific apophenia in strategic management research: Significance tests & mistaken inference. Strategic Management Journal, 37 (1), 167–176.

Goodhue, D. L., Lewis, W., & Thompson, R. L. (2007). Statistical power in analyzing interaction effects: Questioning the advantage of PLS with product indicators. Information Systems Research, 18 (2), 211–227.

Gray, P. H., & Cooper, W. H. (2010). Pursuing failure. Organizational Research Methods, 13 (4), 620–643.

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, p values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31 (4), 337–350.

Gregor, S. (2006). The nature of theory in information systems. MIS Quarterly, 30 (3), 611–642.

Gregor, S., & Klein, G. (2014). Eight obstacles to overcome in the theory testing genre. Journal of the Association for Information Systems, 15 (11), i–xix.

Greve, W., Bröder, A., & Erdfelder, E. (2013). Result-blind peer reviews and editorial decisions: A missing pillar of scientific culture. European Psychologist, 18 (4), 286–294.

Grover, V., & Lyytinen, K. (2015). New state of play in information systems research: The push to the edges. MIS Quarterly, 39 (2), 271–296.

Grover, V., Straub, D. W., & Galluch, P. (2009). Editor’s comments: Turning the corner: The influence of positive thinking on the information systems field. MIS Quarterly, 33 (1), iii–viii.

Guide, V. D. R., Jr., & Ketokivi, M. (2015). Notes from the editors: Redefining some methodological criteria for the journal. Journal of Operations Management, 37, v–viii.

Hair, J. F., Sarstedt, M., Ringle, C. M., & Mena, J. A. (2012). An assessment of the use of partial least squares structural equation modeling in marketing research. Journal of the Academy of Marketing Science, 40 (3), 414–433.

Haller, H., & Kraus, S. (2002). Misinterpretations of significance: A problem students share with their teachers? Methods of Psychological Research, 7 (1), 1–20.

Harrison, J. S., Banks, G. C., Pollack, J. M., O’Boyle, E. H., & Short, J. (2014). Publication bias in strategic management research. Journal of Management, 43 (2), 400–425.

Harzing, A.-W. (2010). The publish or perish book: Your guide to effective and responsible citation analysis. Tarma Software Research.

Howison, J., Wiggins, A., & Crowston, K. (2011). Validity issues in the use of social network analysis with digital trace data. Journal of the Association for Information Systems, 12 (12), 767–797.

Hubbard, R. (2004). Alphabet soup: Blurring the distinctions between p’s and α’s in psychological research. Theory & Psychology, 14 (3), 295–327.

Ioannidis, J. P. A., Fanelli, D., Dunne, D. D., & Goodman, S. N. (2015). Meta-research: Evaluation and improvement of research methods and practices. PLoS Biology, 13 (10), e1002264.

Johnson, V. E., Payne, R. D., Wang, T., Asher, A., & Mandal, S. (2017). On the reproducibility of psychological science. Journal of the American Statistical Association, 112 (517), 1–10.

Kaplan, A. (1998/1964). The conduct of inquiry: Methodology for behavioral science. Transaction Publishers.

Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2 (3), 196–217.

Lang, J. M., Rothman, K. J., & Cann, C. I. (1998). That confounded p-value. Epidemiology, 9 (1), 7–8.

Lazer, D., Pentland, A. P., Adamic, L. A., Aral, S., Barabási, A.-L., Brewer, D., et al. (2009). Computational social science. Science, 323 (5915), 721–723.

Leahey, E. (2005). Alphas and asterisks: The development of statistical significance testing standards in sociology. Social Forces, 84 (1), 1–24.

Lee, A. S., & Baskerville, R. (2003). Generalizing generalizability in information systems research. Information Systems Research, 14 (3), 221–243.

Lee, A. S., & Hubona, G. S. (2009). A scientific basis for rigor in information systems research. MIS Quarterly, 33 (2), 237–262.

Lee, A. S., Mohajeri, K., & Hubona, G. S. (2017). Three roles for statistical significance and the validity frontier in theory testing. Paper presented at the 50th Hawaii International Conference on System Sciences.

Lehmann, E. L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association, 88 (424), 1242–1249.

Lenzer, J., Hoffman, J. R., Furberg, C. D., & Ioannidis, J. P. A. (2013). Ensuring the integrity of clinical practice guidelines: A tool for protecting patients. British Medical Journal, 347, f5535.

Levy, M., & Germonprez, M. (2017). The potential for citizen science in information systems research. Communications of the Association for Information Systems, 40 (2), 22–39.

Lin, M., Lucas, H. C., Jr., & Shmueli, G. (2013). Too big to fail: Large samples and the p-value problem. Information Systems Research, 24 (4), 906–917.

Locascio, J. J. (2019). The impact of results blind science publishing on statistical consultation and collaboration. The American Statistician, 73 (Sup 1), 346–351.

Lu, X., Ba, S., Huang, L., & Feng, Y. (2013). Promotional marketing or word-of-mouth? Evidence from online restaurant reviews. Information Systems Research, 24 (3), 596–612.

Lukyanenko, R., Parsons, J., Wiersma, Y. F., & Maddah, M. (2019). Expecting the unexpected: Effects of data collection design choices on the quality of crowdsourced user-generated content. MIS Quarterly, 43 (2), 623–647.

Lyytinen, K., Baskerville, R., Iivari, J., & Te’eni, D. (2007). Why the old world cannot publish? Overcoming challenges in publishing high-impact IS research. European Journal of Information Systems, 16 (4), 317–326.

MacKenzie, S. B., Podsakoff, P. M., & Podsakoff, N. P. (2011). Construct measurement and validation procedures in MIS and behavioral research: Integrating new and existing techniques. MIS Quarterly, 35 (2), 293–334.

Madden, L. V., Shah, D. A., & Esker, P. D. (2015). Does the p value have a future in plant pathology? Phytopathology, 105 (11), 1400–1407.

Matthews, R. A. J. (2019). Moving towards the post p < 0.05 era via the analysis of credibility. The American Statistician, 73 (Sup 1), 202–212.

McNutt, M. (2016). Taking up TOP. Science, 352 (6290), 1147.

McShane, B. B., & Gal, D. (2017). Blinding us to the obvious? The effect of statistical training on the evaluation of evidence. Management Science, 62 (6), 1707–1718.

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34 (2), 103–115.

Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834.

Mertens, W., Pugliese, A., & Recker, J. (2017). Quantitative data analysis: A companion for accounting and information systems research. Springer.

Miller, J. (2009). What is the probability of replicating a statistically significant effect? Psychonomic Bulletin & Review, 16 (4), 617–640.

Mithas, S., Tafti, A., & Mitchell, W. (2013). How a firm's competitive environment and digital strategic posture influence digital business strategy. MIS Quarterly, 37 (2), 511.

Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., & The PRISMA Group. (2009). Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. PLoS Medicine, 6 (7), e1000100.

Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., du Sert, N. P., et al. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1 (0021), 1–9.

Nakagawa, S., & Cuthill, I. C. (2007). Effect size, confidence interval and statistical significance: A practical guide for biologists. Biological Reviews, 82 (4), 591–605.

NCBI Insights. (2018). Pubmed commons to be discontinued. https://ncbiinsights.ncbi.nlm.nih.gov/2018/02/01/pubmed-commons-to-be-discontinued/

Nelson, L. D., Simmons, J. P., & Simonsohn, U. (2018). Psychology’s renaissance. Annual Review of Psychology, 69, 511–534.

Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, 20A (1/2), 175–240.

Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231, 289–337.

Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5 (2), 241–301.

Nielsen, M. (2011). Reinventing discovery: The new era of networked science. Princeton University Press.

Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., et al. (2015). Promoting an open research culture. Science, 348 (6242), 1422–1425.

Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115 (11), 2600–2606.

Nuzzo, R. (2014). Statistical errors: P values, the “gold standard” of statistical validity, are not as reliable as many scientists assume. Nature, 506 (150), 150–152.

O’Boyle, E. H., Banks, G. C., & Gonzalez-Mulé, E. (2017). The chrysalis effect: How ugly initial results metamorphosize into beautiful articles. Journal of Management, 43 (2), 376–399.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349 (6251), 943.

Pernet, C. (2016). Null hypothesis significance testing: A guide to commonly misunderstood concepts and recommendations for good practice [version 5; peer review: 2 approved, 2 not approved]. F1000Research, 4 (621). https://doi.org/10.12688/f1000research.6963.5

publons. (2017). 5 steps to writing a winning post-publication peer review. https://publons.com/blog/5-steps-to-writing-a-winning-post-publication-peer-review/

Reinhart, A. (2015). Statistics done wrong: The woefully complete guide. No Starch Press.

Ringle, C. M., Sarstedt, M., & Straub, D. W. (2012). Editor’s comments: A critical look at the use of PLS-SEM in MIS Quarterly. MIS Quarterly, 36 (1), iii–xiv.

Rishika, R., Kumar, A., Janakiraman, R., & Bezawada, R. (2013). The effect of customers’ social media participation on customer visit frequency and profitability: An empirical investigation. Information Systems Research, 24 (1), 108–127.

Rönkkö, M., & Evermann, J. (2013). A critical examination of common beliefs about partial least squares path modeling. Organizational Research Methods, 16 (3), 425–448.

Rönkkö, M., McIntosh, C. N., Antonakis, J., & Edwards, J. R. (2016). Partial least squares path modeling: Time for some serious second thoughts. Journal of Operations Management, 47–48, 9–27.

Saunders, C. (2005). Editor’s comments: Looking for diamond cutters. MIS Quarterly, 29 (1), iii–viii.

Saunders, C., Brown, S. A., Bygstad, B., Dennis, A. R., Ferran, C., Galletta, D. F., et al. (2017). Goals, values, and expectations of the AIS family of journals. Journal of the Association for Information Systems, 18 (9), 633–647.

Schönbrodt, F. D. (2018). P-checker: One-for-all p-value analyzer. http://shinyapps.org/apps/p-checker/

Schwab, A., Abrahamson, E., Starbuck, W. H., & Fidler, F. (2011). Perspective: Researchers should make thoughtful assessments instead of null-hypothesis significance tests. Organization Science, 22 (4), 1105–1120.

Shaw, J. D., & Ertug, G. (2017). From the editors: The suitability of simulations and meta-analyses for submissions to Academy of Management Journal. Academy of Management Journal, 60 (6), 2045–2049.

Siegfried, T. (2014). To make science better, watch out for statistical flaws. ScienceNews Context Blog, 2019, February 7, 2014. https://www.sciencenews.org/blog/context/make-science-better-watch-out-statistical-flaws

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143 (2), 534–547.

Sivo, S. A., Saunders, C., Chang, Q., & Jiang, J. J. (2006). How low should you go? Low response rates and the validity of inference in IS questionnaire research. Journal of the Association for Information Systems, 7 (6), 351–414.

Smith, S. M., Fahey, T., & Smucny, J. (2014). Antibiotics for acute bronchitis. Journal of the American Medical Association, 312 (24), 2678–2679.

Starbuck, W. H. (2013). Why and where do academics publish? Management, 16 (5), 707–718.

Starbuck, W. H. (2016). 60th anniversary essay: How journals could improve research practices in social science. Administrative Science Quarterly, 61 (2), 165–183.

Straub, D. W. (1989). Validating instruments in MIS research. MIS Quarterly, 13 (2), 147–169.

Straub, D. W. (2008). Editor’s comments: Type II reviewing errors and the search for exciting papers. MIS Quarterly, 32 (2), v–x.

Straub, D. W., Boudreau, M.-C., & Gefen, D. (2004). Validation guidelines for IS positivist research. Communications of the Association for Information Systems, 13 (24), 380–427.

Szucs, D., & Ioannidis, J. P. A. (2017). When null hypothesis significance testing is unsuitable for research: A reassessment. Frontiers in Human Neuroscience, 11 (390), 1–21.

Tams, S., & Straub, D. W. (2010). The effect of an IS article’s structure on its impact. Communications of the Association for Information Systems, 27 (10), 149–172.

The Economist. (2013). Trouble at the lab. The Economist. http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble

Trafimow, D., & Marks, M. (2015). Editorial. Basic and Applied Social Psychology, 37 (1), 1–2.

Tryon, W. W., Patelis, T., Chajewski, M., & Lewis, C. (2017). Theory construction and data analysis. Theory & Psychology, 27 (1), 126–134.

Tsang, E. W. K., & Williams, J. N. (2012). Generalization and induction: Misconceptions, clarifications, and a classification of induction. MIS Quarterly, 36 (3), 729–748.

Twa, M. D. (2016). Transparency in biomedical research: An argument against tests of statistical significance. Optometry & Vision Science, 93 (5), 457–458.

Venkatesh, V., Brown, S. A., & Bala, H. (2013). Bridging the qualitative-quantitative divide: Guidelines for conducting mixed methods research in information systems. MIS Quarterly, 37 (1), 21–54.

Vodanovich, S., Sundaram, D., & Myers, M. D. (2010). Research commentary: Digital natives and ubiquitous information systems. Information Systems Research, 21 (4), 711–723.

Walsh, E., Rooney, M., Appleby, L., & Wilkinson, G. (2000). Open peer review: A randomised controlled trial. The British Journal of Psychiatry, 176 (1), 47–51.

Warren, M. (2018). First analysis of “preregistered” studies shows sharp rise in null findings. Nature News, October 24, 2018. https://www.nature.com/articles/d41586-018-07118

Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: Context, process, and purpose. The American Statistician, 70 (2), 129–133.

Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond “p < 0.05.”. The American Statistician, 73 (Sup 1), 1–19.

Xu, H., Zhang, N., & Zhou, L. (2019). Validity concerns in research using organic data. Journal of Management, 46, 1257. https://doi.org/10.1177/0149206319862027

Yong, E. (2012). Nobel laureate challenges psychologists to clean up their act. Nature News, October 3, 2012. https://www.nature.com/news/nobel-laureate-challenges-psychologists-to-clean-up-their-act-1.11535

Yoo, Y. (2010). Computing in everyday life: A call for research on experiential computing. MIS Quarterly, 34 (2), 213–231.

Zeng, X., & Wei, L. (2013). Social ties and user content generation: Evidence from Flickr. Information Systems Research, 24 (1), 71–87.

Acknowledgments

We are indebted to the senior editor at JAIS, Allen Lee, and two anonymous reviewers for constructive and developmental feedback that helped us improve the original chapter. We thank participants at seminars at Queensland University of Technology and University of Cologne for providing feedback on our work. We also thank Christian Hovestadt for his help in coding papers. All faults remain ours.

Author information

Authors and affiliations.

Colruyt Group, Halle, Belgium

Willem Mertens

Universität Hamburg, Faculty of Business Administration, Information Systems and Digital Innovation, Hamburg, Germany

Corresponding author

Correspondence to Jan Recker.

Editor information

Editors and affiliations.

Department of Management, London School of Economics and Political Science, London, UK

Leslie P. Willcocks

Labovitz School of Business and Economics, University of Minnesota Duluth, Duluth, MN, USA

Nik R. Hassan

HEC Montréal, Montreal, QC, Canada

Suzanne Rivard

Appendix A: Literature Review Procedures

Identification of papers.

To demonstrate “open science” practices (Locascio, 2019; Nosek et al., 2018; Warren, 2018), we preregistered our research procedures using the Open Science Framework “Registries” (doi:10.17605/OSF.IO/2GKCS).

We proceeded as follows: We identified the 100 top-cited papers (per year) between 2013 and 2016 in the AIS Senior Scholars’ basket of 8 IS journals using Harzing’s Publish or Perish version 6 (Harzing, 2010). We ran the queries separately on February 7, 2017, and then aggregated the results to identify the 100 most cited papers (based on citations per year) across the basket of eight journals (Footnote 9). The raw data (together with the coded data) is available at an open data repository hosted by Queensland University of Technology (doi:10.25912/5cede0024b1e1).

We identified from this set of papers those that followed the hypothetico-deductive model. First, we excluded 48 papers that did not involve empirical data: 31 papers that offered purely theoretical contributions, 11 that were commentaries in the form of forewords, introductions to special issues or editorials, 5 methodological essays, and 1 design science paper. Second, we identified from the remaining 52 papers those that reported on collection and analysis of quantitative data. We found 46 such papers; of these, 39 were traditional quantitative research articles, 3 were essays on methodological aspects of quantitative research, 2 were studies that employed mixed-method designs involving quantitative empirical data, and 2 were design science papers that involved quantitative data. Third, we eliminated from this set the three methodological essays, as the focus of these papers was not on developing and testing new theory to explain and predict IS phenomena. This resulted in a final sample of 43 papers, including 2 design science and 2 mixed-method studies.
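The selection funnel described above can be checked with a few lines of arithmetic. The following is only a sketch: the counts are copied from the paragraph, while the variable names and dictionary keys are illustrative and not part of the original coding scheme.

```python
# Counts copied from the paragraph above; all names are illustrative.
top_cited = 100

# Step 1: papers excluded because they did not involve empirical data.
excluded_non_empirical = {
    "purely theoretical": 31,
    "commentary": 11,
    "methodological essay": 5,
    "design science": 1,
}
empirical = top_cited - sum(excluded_non_empirical.values())

# Step 2: empirical papers that collected and analyzed quantitative data.
quantitative = {
    "traditional quantitative": 39,
    "methodological essay": 3,
    "mixed-method": 2,
    "design science": 2,
}
n_quantitative = sum(quantitative.values())

# Step 3: drop the methodological essays to obtain the final sample.
final_sample = n_quantitative - quantitative["methodological essay"]

print(empirical, n_quantitative, final_sample)
```

The three printed values correspond to the 52 empirical papers, the 46 quantitative papers, and the final sample of 43 reported in the text.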

Coding of Papers

We developed a coding scheme in an Excel repository to code the studies. The repository is available in our Open Science Framework (OSF) registry. We used the following criteria. Where applicable, we refer to literature that defined the variables we used during coding.

What is the main method of data collection and analysis (e.g., experiment, meta-analysis, panel, social network analysis, survey, text mining, economic modeling, multiple)?

Are testable hypotheses or propositions proposed (yes/in graphical form only/no)?

How precisely are the hypotheses formulated (using the classification of Edwards & Berry, 2010)?

Is null hypothesis significance testing used (yes/no)?

Are exact p-values reported (yes/all/some/not at all)?

Are effect sizes reported and, if so, which ones primarily (e.g., R², standardized means difference scores, f², partial η²)?

Are results declared as “statistically significant” (yes/sometimes/not at all)?

How many hypotheses are reported as supported (%)?

Are p-values used to argue the absence of an effect (yes/no)?

Are confidence intervals for test statistics reported (yes/selectively/no)?

What sampling method is used (i.e., convenient/random/systematic sampling, entire population)? (Footnote 10)

Is statistical power discussed and if so, where and how (e.g., sample size estimation, ex-post power analysis)?

Are competing theories tested explicitly (Gray & Cooper, 2010)?

Are corrections made to adjust for multiple hypothesis testing, where applicable (e.g., Bonferroni, alpha-inflation, variance inflation)?

Are post hoc analyses reported for unexpected results?

We also extracted quotes that in our interpretation illuminated the view taken on NHST in the chapter. This was important for us to demonstrate the imbuement of practices in our research routines and the language used in using key NHST phrases such as “statistical significance” or “p-value” (Gelman & Stern, 2006).

To be as unbiased as possible, we hired a research assistant to perform the coding of papers. Before he commenced coding, we explained the coding scheme to him during several meetings. We then conducted a pilot test to evaluate the quality of his coding: the research assistant coded five random papers from the set of papers and we met to review the coding by comparing our different individual understandings of the papers. Where inconsistencies arose, we clarified the coding scheme with him until we were confident that he understood it thoroughly. During the coding, the research assistant highlighted particular problematic or ambiguous coding elements and we met and resolved these ambiguities to arrive at a shared agreement. The coding process took three months to complete. The results of our coding are openly accessible at doi:10.25912/5cede0024b1e1. Appendix B provides some summary statistics about our sample.

Selected Descriptive Statistics from 43 Frequently Cited IS Papers from 2013 to 2016

| Category | Item | Count |
| --- | --- | --- |
| Main method for data collection and analysis | Experiment | 5 |
| | Meta-analysis | 2 |
| | Panel | 5 |
| | Social network analysis | 4 |
| | Survey | 15 |
| | Text mining | 5 |
| | Economic modeling | 1 |
| | Multiple | 6 |
| Empirical data | Newly collected or analyzed primary data | 40 |
| | Re-analyzed or secondary data | 3 |
| Hypotheses | Testable hypotheses or propositions proposed | 38 |
| | No testable hypotheses or propositions proposed | 5 |
| | Average percentage of hypotheses per study that were supported by the data | 82% |
| Statement of hypotheses | As relations | 0 |
| | As upper/lower limits | 0 |
| | As directions | 13 |
| | In non-nil form | 0 |
| | In functional form | 0 |
| | In contingent form | 2 |
| | As comparisons | 6 |
| | In multiple ways | 15 |
| | Not formulated | 2 |
| | Not applicable | 5 |
| NHST | Uses NHST techniques or terminology | 42 |
| | Does not use NHST techniques or terminology | 1 |
| Exact p-values | Reports exact p-values | 3 |
| | Reports exact p-values selectively | 8 |
| | Reports indicators for different levels of statistical significance | 28 |
| | Does not report p-values | 3 |
| Inverse use of p-values | Uses p-values to point at the absence of an effect or accept the null hypothesis | 11 |
| | Does not use p-values to point at the absence of effect or accept the null hypothesis | 29 |
| | Not applicable | 3 |
| “Statistical” significance | Does not explicitly refer to “statistical significance” | 23 |
| | Consistently refers to “statistical significance” | 3 |
| | Selectively refers to “statistical significance” | 16 |
| | Not applicable | 1 |
| Effect sizes | Reports R² measures | 26 |
| | Reports mean difference score measures | 2 |
| | Reports multiple effect size measures | 4 |
| | Does not report effect size measures | 10 |
| | Not applicable | 1 |
| Confidence intervals | Reports confidence intervals consistently | 3 |
| | Reports confidence intervals selectively | 2 |
| | Reports confidence intervals for bootstrapping results (no p-value available) | 3 |
| | Does not report confidence intervals | 34 |
| | Not applicable | 1 |
| Sampling | Convenient | 22 |
| | Systematic | 6 |
| | Random | 4 |
| | Entire population | 8 |
| | Not applicable | 3 |
| Competing theories | Tested explicitly | 7 |
| | Not tested | 35 |
| | Not applicable | 1 |
| A posteriori analyses | Provided | 11 |
| | Not provided | 31 |
| | Not applicable | 1 |

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Mertens, W., Recker, J. (2023). New Guidelines for Null Hypothesis Significance Testing in Hypothetico-Deductive IS Research. In: Willcocks, L.P., Hassan, N.R., Rivard, S. (eds) Advancing Information Systems Theories, Volume II. Technology, Work and Globalization. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-031-38719-7_13

DOI : https://doi.org/10.1007/978-3-031-38719-7_13

Published : 15 October 2023

Publisher Name : Palgrave Macmillan, Cham

Print ISBN : 978-3-031-38718-0

Online ISBN : 978-3-031-38719-7

eBook Packages : Business and Management Business and Management (R0)

Understanding Null Hypothesis Testing

Rajiv S. Jhangiani; I-Chant A. Chiang; Carrie Cuttler; and Dana C. Leighton

Learning Objectives

  • Explain the purpose of null hypothesis testing, including the role of sampling error.
  • Describe the basic logic of null hypothesis testing.
  • Describe the role of relationship strength and sample size in determining statistical significance and make reasonable judgments about statistical significance based on these two factors.

The Purpose of Null Hypothesis Testing

As we have seen, psychological research typically involves measuring one or more variables in a sample and computing descriptive summary data (e.g., means, correlation coefficients) for those variables. These descriptive data for the sample are called statistics. In general, however, the researcher’s goal is not to draw conclusions about that sample but to draw conclusions about the population that the sample was selected from. Thus researchers must use sample statistics to draw conclusions about the corresponding values in the population. These corresponding values in the population are called parameters. Imagine, for example, that a researcher measures the number of depressive symptoms exhibited by each of 50 adults with clinical depression and computes the mean number of symptoms. The researcher probably wants to use this sample statistic (the mean number of symptoms for the sample) to draw conclusions about the corresponding population parameter (the mean number of symptoms for adults with clinical depression).

Unfortunately, sample statistics are not perfect estimates of their corresponding population parameters. This is because there is a certain amount of random variability in any statistic from sample to sample. The mean number of depressive symptoms might be 8.73 in one sample of adults with clinical depression, 6.45 in a second sample, and 9.44 in a third, even though these samples are selected randomly from the same population. Similarly, the correlation (Pearson’s r) between two variables might be +.24 in one sample, −.04 in a second sample, and +.15 in a third, again even though these samples are selected randomly from the same population. This random variability in a statistic from sample to sample is called sampling error. (Note that the term error here refers to random variability and does not imply that anyone has made a mistake. No one “commits a sampling error.”)
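Sampling error is easy to see in a small simulation. The sketch below is illustrative only: the population of symptom counts is invented (roughly normal with a mean near 8 and a standard deviation near 3, both assumptions), and the function name `sample_means` is hypothetical. It draws three random samples of 50 from the same population and prints each sample mean next to the population mean.

```python
import random
import statistics

def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]

# Hypothetical population of symptom counts (invented for illustration).
pop_rng = random.Random(42)
population = [max(0, round(pop_rng.gauss(8, 3))) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Three samples of 50 people each, all drawn from the same population.
means = sample_means(population, sample_size=50, n_samples=3)

print("population mean:", population_mean)
print("sample means:   ", means)
```

Each sample mean should land near the population mean without matching it exactly, and the three sample means should differ from one another. That spread is the sampling error described above.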

One implication of this is that when there is a statistical relationship in a sample, it is not always clear that there is a statistical relationship in the population. A small difference between two group means in a sample might indicate that there is a small difference between the two group means in the population. But it could also be that there is no difference between the means in the population and that the difference in the sample is just a matter of sampling error. Similarly, a Pearson’s  r  value of −.29 in a sample might mean that there is a negative relationship in the population. But it could also be that there is no relationship in the population and that the relationship in the sample is just a matter of sampling error.

In fact, any statistical relationship in a sample can be interpreted in two ways:

  • There is a relationship in the population, and the relationship in the sample reflects this.
  • There is no relationship in the population, and the relationship in the sample reflects only sampling error.

The purpose of null hypothesis testing is simply to help researchers decide between these two interpretations.

The Logic of Null Hypothesis Testing

Null hypothesis testing (often called null hypothesis significance testing or NHST) is a formal approach to deciding between two interpretations of a statistical relationship in a sample. One interpretation is called the null hypothesis (often symbolized H₀ and read as “H-zero”). This is the idea that there is no relationship in the population and that the relationship in the sample reflects only sampling error. Informally, the null hypothesis is that the sample relationship “occurred by chance.” The other interpretation is called the alternative hypothesis (often symbolized as H₁). This is the idea that there is a relationship in the population and that the relationship in the sample reflects this relationship in the population.

Again, every statistical relationship in a sample can be interpreted in either of these two ways: It might have occurred by chance, or it might reflect a relationship in the population. So researchers need a way to decide between them. Although there are many specific null hypothesis testing techniques, they are all based on the same general logic. The steps are as follows:

  • Assume for the moment that the null hypothesis is true. There is no relationship between the variables in the population.
  • Determine how likely the sample relationship would be if the null hypothesis were true.
  • If the sample relationship would be extremely unlikely, then reject the null hypothesis in favor of the alternative hypothesis. If it would not be extremely unlikely, then retain the null hypothesis.
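The three steps above can be sketched with a permutation test, which is one of many null hypothesis testing techniques and is used here only because it mirrors the logic directly. The function name `permutation_p_value` and the two small data sets are invented for illustration. Assuming the null hypothesis is true, group labels are interchangeable, so the code shuffles the pooled scores many times and counts how often a difference between group means at least as large as the observed one arises by chance.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate how likely a mean difference at least as large as the
    observed one would be if group membership made no difference."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Step 1: assume the null hypothesis, so labels can be shuffled.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        # Step 2: count shuffles at least as extreme as the sample result
        # (the small tolerance guards against floating-point rounding).
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    # Adding 1 to both counts treats the observed data as one permutation.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Made-up scores: two clearly separated groups, then two identical groups.
p_separated = permutation_p_value([1, 2, 3, 4, 5, 6, 7, 8],
                                  [11, 12, 13, 14, 15, 16, 17, 18])
p_identical = permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4])

# Step 3: a very small value suggests rejecting the null hypothesis;
# a large value suggests retaining it.
print(p_separated, p_identical)
```

With the clearly separated groups, almost no shuffle reproduces so large a difference, so the estimated probability is very small. With identical groups the observed difference is zero, every shuffle is at least as extreme, and the estimate is 1.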

Following this logic, we can begin to understand why Mehl and his colleagues concluded that there is no difference in talkativeness between women and men in the population. In essence, they asked the following question: “If there were no difference in the population, how likely is it that we would find a small difference of  d  = 0.06 in our sample?” Their answer to this question was that this sample relationship would be fairly likely if the null hypothesis were true. Therefore, they retained the null hypothesis—concluding that there is no evidence of a sex difference in the population. We can also see why Kanner and his colleagues concluded that there is a correlation between hassles and symptoms in the population. They asked, “If the null hypothesis were true, how likely is it that we would find a strong correlation of +.60 in our sample?” Their answer to this question was that this sample relationship would be fairly unlikely if the null hypothesis were true. Therefore, they rejected the null hypothesis in favor of the alternative hypothesis—concluding that there is a positive correlation between these variables in the population.

A crucial step in null hypothesis testing is finding the probability of the sample result or a more extreme result if the null hypothesis were true (Lakens, 2017). [1] This probability is called the p value . A low  p value means that the sample or more extreme result would be unlikely if the null hypothesis were true and leads to the rejection of the null hypothesis. A p value that is not low means that the sample or more extreme result would be likely if the null hypothesis were true and leads to the retention of the null hypothesis. But how low must the p value criterion be before the sample result is considered unlikely enough to reject the null hypothesis? In null hypothesis testing, this criterion is called α (alpha) and is almost always set to .05. If there is a 5% chance or less of a result at least as extreme as the sample result if the null hypothesis were true, then the null hypothesis is rejected. When this happens, the result is said to be statistically significant . If there is greater than a 5% chance of a result as extreme as the sample result when the null hypothesis is true, then the null hypothesis is retained. This does not necessarily mean that the researcher accepts the null hypothesis as true—only that there is not currently enough evidence to reject it. Researchers often use the expression “fail to reject the null hypothesis” rather than “retain the null hypothesis,” but they never use the expression “accept the null hypothesis.”

The Misunderstood  p  Value

The  p  value is one of the most misunderstood quantities in psychological research (Cohen, 1994) [2] . Even professional researchers misinterpret it, and it is not unusual for such misinterpretations to appear in statistics textbooks!

The most common misinterpretation is that the  p  value is the probability that the null hypothesis is true—that the sample result occurred by chance. For example, a misguided researcher might say that because the  p  value is .02, there is only a 2% chance that the result is due to chance and a 98% chance that it reflects a real relationship in the population. But this is incorrect . The  p  value is really the probability of a result at least as extreme as the sample result  if  the null hypothesis  were  true. So a  p  value of .02 means that if the null hypothesis were true, a sample result this extreme would occur only 2% of the time.

You can avoid this misunderstanding by remembering that the p value is not the probability that any particular hypothesis is true or false. Instead, it is the probability of obtaining a result at least as extreme as the sample result if the null hypothesis were true.
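
A small simulation can make this distinction concrete. The sketch below is an illustration under stated assumptions, not an analysis from this chapter: it imagines many two-group studies, some with a true null hypothesis and some with a real effect, and asks what share of the statistically significant results came from studies where the null was actually true. The share of true nulls, the effect size, the group size, and all function names are hypothetical, and each study is summarized by a z statistic that assumes a known standard deviation of 1.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p value for a z statistic under a standard normal null distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


def share_of_true_nulls_among_significant(
    share_null, true_d=0.5, n_per_group=20, n_studies=20000, alpha=0.05, seed=1
):
    rng = random.Random(seed)
    # Assumption: with a known standard deviation of 1, the z statistic of a
    # two-group study is normal with mean d * sqrt(n / 2) and standard deviation 1.
    shift = true_d * math.sqrt(n_per_group / 2)
    significant = 0
    significant_and_null = 0
    for _ in range(n_studies):
        null_is_true = rng.random() < share_null
        z = rng.gauss(0.0 if null_is_true else shift, 1.0)
        if two_sided_p(z) <= alpha:
            significant += 1
            if null_is_true:
                significant_and_null += 1
    return significant_and_null / significant


# Same alpha in both runs; only the (hypothetical) share of true nulls changes.
print(share_of_true_nulls_among_significant(share_null=0.5))
print(share_of_true_nulls_among_significant(share_null=0.9))
```

Under these assumptions the two printed shares differ considerably even though the significance criterion is identical. The probability that the null hypothesis is true after a significant result depends on information the p value does not contain, which is why the p value cannot be read as that probability.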

Null Hypothesis. Image description available.

Role of Sample Size and Relationship Strength

Recall that null hypothesis testing involves answering the question, “If the null hypothesis were true, what is the probability of a sample result as extreme as this one?” In other words, “What is the  p  value?” It can be helpful to see that the answer to this question depends on just two considerations: the strength of the relationship and the size of the sample. Specifically, the stronger the sample relationship and the larger the sample, the less likely the result would be if the null hypothesis were true. That is, the lower the  p  value. This should make sense. Imagine a study in which a sample of 500 women is compared with a sample of 500 men in terms of some psychological characteristic, and Cohen’s  d  is a strong 0.50. If there were really no sex difference in the population, then a result this strong based on such a large sample should seem highly unlikely. Now imagine a similar study in which a sample of three women is compared with a sample of three men, and Cohen’s  d  is a weak 0.10. If there were no sex difference in the population, then a relationship this weak based on such a small sample should seem likely. And this is precisely why the null hypothesis would be rejected in the first example and retained in the second.
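
The two examples can be checked with a short calculation. This is a sketch under assumptions the text does not state: two independent groups of equal size n, a pooled standard deviation s_p, and an independent-samples t test. Under those assumptions, Cohen's d converts to a t statistic as follows:

```latex
t = \frac{M_1 - M_2}{s_p \sqrt{2/n}} = d \sqrt{\frac{n}{2}}, \qquad df = 2n - 2

\text{First example: } t = 0.50 \sqrt{500/2} \approx 7.91, \quad df = 998

\text{Second example: } t = 0.10 \sqrt{3/2} \approx 0.12, \quad df = 4
```

A t value near 8 would be extremely unlikely if the null hypothesis were true, whereas a t value near 0.12 would be entirely ordinary, which matches the decisions described above.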

Of course, sometimes the result can be weak and the sample large, or the result can be strong and the sample small. In these cases, the two considerations trade off against each other so that a weak result can be statistically significant if the sample is large enough and a strong relationship can be statistically significant even if the sample is small. Table 13.1 shows roughly how relationship strength and sample size combine to determine whether a sample result is statistically significant. The columns of the table represent the three levels of relationship strength: weak, medium, and strong. The rows represent four sample sizes that can be considered small, medium, large, and extra large in the context of psychological research. Thus each cell in the table represents a combination of relationship strength and sample size. If a cell contains the word Yes, then this combination would be statistically significant for both Cohen’s d and Pearson’s r. If it contains the word No, then it would not be statistically significant for either. There is one cell where the decision for d and r would be different and another where it might be different depending on some additional considerations, which are discussed in Section 13.2 “Some Basic Null Hypothesis Tests.”

Table 13.1

Sample Size           | Weak             | Medium | Strong
Small (N = 20)        | No               | No     | d = Maybe; r = Yes
Medium (N = 50)       | No               | Yes    | Yes
Large (N = 100)       | d = Yes; r = No  | Yes    | Yes
Extra large (N = 500) | Yes              | Yes    | Yes

Although Table 13.1 provides only a rough guideline, it shows very clearly that weak relationships based on medium or small samples are never statistically significant and that strong relationships based on medium or larger samples are always statistically significant. If you keep this lesson in mind, you will often know whether a result is statistically significant based on the descriptive statistics alone. It is extremely useful to be able to develop this kind of intuitive judgment. One reason is that it allows you to develop expectations about how your formal null hypothesis tests are going to come out, which in turn allows you to detect problems in your analyses. For example, if your sample relationship is strong and your sample is medium, then you would expect to reject the null hypothesis. If for some reason your formal null hypothesis test indicates otherwise, then you need to double-check your computations and interpretations. A second reason is that the ability to make this kind of intuitive judgment is an indication that you understand the basic logic of this approach in addition to being able to do the computations.

Statistical Significance Versus Practical Significance

Table 13.1 illustrates another extremely important point. A statistically significant result is not necessarily a strong one. Even a very weak result can be statistically significant if it is based on a large enough sample. This is closely related to Janet Shibley Hyde’s argument about sex differences (Hyde, 2007) [3] . The differences between women and men in mathematical problem solving and leadership ability are statistically significant. But the word  significant  can cause people to interpret these differences as strong and important—perhaps even important enough to influence the college courses they take or even who they vote for. As we have seen, however, these statistically significant differences are actually quite weak—perhaps even “trivial.”

This is why it is important to distinguish between the  statistical  significance of a result and the  practical  significance of that result.  Practical significance refers to the importance or usefulness of the result in some real-world context. Many sex differences are statistically significant—and may even be interesting for purely scientific reasons—but they are not practically significant. In clinical practice, this same concept is often referred to as “clinical significance.” For example, a study on a new treatment for social phobia might show that it produces a statistically significant positive effect. Yet this effect still might not be strong enough to justify the time, effort, and other costs of putting it into practice—especially if easier and cheaper treatments that work almost as well already exist. Although statistically significant, this result would be said to lack practical or clinical significance.
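
The gap between statistical and practical significance can be seen numerically. The sketch below is a rough illustration rather than a method from this chapter: it holds a very weak effect fixed (a hypothetical Cohen's d of 0.05) and approximates the two-sided p value for two equal groups using a normal approximation to the t statistic, which is an assumption that is only reasonable for large samples. The function name and the sample sizes are hypothetical.

```python
import math


def approx_two_sided_p(d, n_per_group):
    """Approximate two-sided p value for Cohen's d with two equal groups.

    Assumption: the test statistic d * sqrt(n / 2) is treated as a z score,
    which is a large-sample approximation to the independent-samples t test.
    """
    z = d * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))


# The same trivial effect size at increasing (hypothetical) sample sizes.
for n in (50, 500, 5000, 50000):
    print(n, approx_two_sided_p(0.05, n))
```

The effect size never changes in this loop, only the sample size does, yet the approximate p value eventually falls below .05. Reaching statistical significance in this way says nothing about whether a difference of that size matters in a real-world context.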

Conditional Risk. Image description available.

Image Description

“Null Hypothesis” long description: A comic depicting a man and a woman talking in the foreground. In the background is a child working at a desk. The man says to the woman, “I can’t believe schools are still teaching kids about the null hypothesis. I remember reading a big study that conclusively disproved it years ago.”

“Conditional Risk” long description: A comic depicting two hikers beside a tree during a thunderstorm. A bolt of lightning goes “crack” in the dark sky as thunder booms. One of the hikers says, “Whoa! We should get inside!” The other hiker says, “It’s okay! Lightning only kills about 45 Americans a year, so the chances of dying are only one in 7,000,000. Let’s go on!” The comic’s caption says, “The annual death rate among people who know that statistic is one in six.”

Media Attributions

  • Null Hypothesis  by XKCD  CC BY-NC (Attribution NonCommercial)
  • Conditional Risk  by XKCD  CC BY-NC (Attribution NonCommercial)

  • [1] Lakens, D. (2017, December 25). About p-values: Understanding common misconceptions. [Blog post] Retrieved from https://correlaid.org/en/blog/understand-p-values/
  • [2] Cohen, J. (1994). The world is round: p < .05. American Psychologist, 49, 997–1003.
  • [3] Hyde, J. S. (2007). New directions in the study of gender similarities and differences. Current Directions in Psychological Science, 16, 259–263.

Statistics: Descriptive data that involves measuring one or more variables in a sample and computing descriptive summary data (e.g., means, correlation coefficients) for those variables.

Parameters: Corresponding values in the population.

Sampling error: The random variability in a statistic from sample to sample.

Null hypothesis testing: A formal approach to deciding between two interpretations of a statistical relationship in a sample.

Null hypothesis: The idea that there is no relationship in the population and that the relationship in the sample reflects only sampling error (often symbolized H0 and read as “H-zero”).

Alternative hypothesis: An alternative to the null hypothesis (often symbolized as H1), this hypothesis proposes that there is a relationship in the population and that the relationship in the sample reflects this relationship in the population.

Reject the null hypothesis: A decision made by researchers using null hypothesis testing which occurs when the sample relationship would be extremely unlikely if the null hypothesis were true.

Retain the null hypothesis: A decision made by researchers in null hypothesis testing which occurs when the sample relationship would not be extremely unlikely if the null hypothesis were true.

p value: The probability of obtaining the sample result or a more extreme result if the null hypothesis were true.

α (alpha): The criterion that shows how low a p value should be before the sample result is considered unlikely enough to reject the null hypothesis (usually set to .05).

Statistically significant: An effect that is unlikely to be due to random chance and therefore likely represents a real effect in the population.

Practical significance: Refers to the importance or usefulness of the result in some real-world context.

Understanding Null Hypothesis Testing Copyright © by Rajiv S. Jhangiani; I-Chant A. Chiang; Carrie Cuttler; and Dana C. Leighton is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.

PLOS ONE

Why we habitually engage in null-hypothesis significance testing: A qualitative study

Jonah Stunt

1 Department of Health Sciences, Section of Methodology and Applied Statistics, Vrije Universiteit, Amsterdam, The Netherlands

2 Department of Radiation Oncology, Erasmus Medical Center, Rotterdam, The Netherlands

Leonie van Grootel

3 Rathenau Institute, The Hague, The Netherlands

4 Department of Philosophy, Vrije Universiteit, Amsterdam, The Netherlands

5 Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands

David Trafimow

6 Psychology Department, New Mexico State University, Las Cruces, New Mexico, United States of America

Trynke Hoekstra

Michiel de Boer

7 Department of General Practice and Elderly Care, University Medical Center Groningen, Groningen, The Netherlands

Null Hypothesis Significance Testing (NHST) is the most familiar statistical procedure for making inferences about population effects. Important problems associated with this method have been addressed and various alternatives that overcome these problems have been developed. Despite its many well-documented drawbacks, NHST remains the prevailing method for drawing conclusions from data. Reasons for this have been insufficiently investigated. Therefore, the aim of our study was to explore the perceived barriers and facilitators related to the use of NHST and alternative statistical procedures among relevant stakeholders in the scientific system.

Individual semi-structured interviews and focus groups were conducted with junior and senior researchers, lecturers in statistics, editors of scientific journals and program leaders of funding agencies. During the focus groups, important themes that emerged from the interviews were discussed. Data analysis was performed using the constant comparison method, allowing emerging (sub)themes to be fully explored. A theory substantiating the prevailing use of NHST was developed based on the main themes and subthemes we identified.

Twenty-nine interviews and six focus groups were conducted. Several interrelated facilitators and barriers associated with the use of NHST and alternative statistical procedures were identified. These factors were subsumed under three main themes: the scientific climate, scientific duty, and reactivity. As a result of the factors, most participants feel dependent in their actions upon others, have become reactive, and await action and initiatives from others. This may explain why NHST is still the standard and ubiquitously used by almost everyone involved.

Our findings demonstrate how perceived barriers to shift away from NHST set a high threshold for actual behavioral change and create a circle of interdependency between stakeholders. By taking small steps it should be possible to decrease the scientific community’s strong dependence on NHST and p-values.

Introduction

Empirical studies often start from the idea that there might be an association between a specific factor and a certain outcome within a population. This idea is referred to as the alternative hypothesis (H1). Its complement, the null hypothesis (H0), typically assumes no association or effect (although it is possible to test other effect sizes than no effect with the null hypothesis). At the stage of data-analysis, the probability of obtaining the observed, or a more extreme, association is calculated under the assumption of no effect in the population (H0) and a number of inferential assumptions [ 1 ]. The probability of obtaining the observed, or more extreme, data is known as ‘the p-value’. The p-value demonstrates the compatibility between the observed data and the expected data under the null hypothesis, where 0 is complete incompatibility and 1 is perfect compatibility [ 2 ]. When the p-value is smaller than a prespecified value (labelled as alpha, usually set at 5% (0.05)), results are generally declared to be statistically significant. At this point, researchers commonly reject the null hypothesis and accept the alternative hypothesis [ 2 ]. Assessing statistical significance by means of contrasting the data with the null hypothesis is called Null Hypothesis Significance Testing (NHST). NHST is the best known and most widely used statistical procedure for making inferences about population effects. The procedure has become the prevailing paradigm in empirical science [ 3 ], and reaching and being able to report statistically significant results has become the ultimate goal for many researchers.

Despite their widespread use, NHST and the p-value have been criticized since their inception. Numerous publications have addressed problems associated with NHST and p-values. Arguably the most important drawback is the fact that NHST is a form of indirect or inverse inference: researchers usually want to know if the null or alternative hypothesis can be accepted and use NHST to conclude either way. But with NHST, the probability of a finding, or more extreme findings, given the null hypothesis is calculated [ 4 ]. Ergo, NHST doesn’t tell us what we want to know. In fact, p-values were never meant to serve as a basis to draw conclusions, but as a continuous measure of incompatibility between empirical findings and a statistical model [ 2 ]. Moreover, the procedure promotes a dichotomous way of thinking, by using the outcome of a significance test as a dichotomous indicator for an effect (p<0.05: effect, p>0.05: no effect). Reducing empirical findings to two categories also results in a great loss of information. Further, a significant outcome is often unjustly interpreted as relevant, but a p-value does not convey any information about the strength or importance of the association. Worse yet, the p-values on which NHST is based confound effect size and sample size. A trivial effect size may nevertheless result in statistical significance given a sufficiently large sample size. Or an important effect size may fail to result in statistical significance if the sample size is too small. P-values do not validly index the size, relevance, or precision of an effect [ 5 ]. Furthermore, statistical models include not only null hypotheses, but additional assumptions, some of which are wrong, such as the ubiquitous assumption of random and independent sampling from a defined population [ 1 ]. Therefore, although p-values validly index the incompatibility of data with models, p-values do not validly index incompatibility of data with hypotheses that are embedded in wrong models. These are important drawbacks rendering NHST unsuitable as the default procedure for drawing conclusions from empirical data [ 2 , 3 , 5 – 13 ].

A number of alternatives have been developed that overcome these pitfalls, such as Bayesian inference methods [ 7 , 11 , 14 , 15 ], informative hypothesis testing [ 9 , 16 ] and a priori inferential statistics [ 4 , 17 ]. These alternatives build on the idea that research usually starts with a more informed research-question than one merely assuming the null hypothesis of no effect. These methods overcome the problem of inverse inference, although the first two might still lead to dichotomous thinking with the use of thresholds. Despite the availability of alternatives, statistical behavior in the research community has hardly changed. Researchers have been slow to adopt alternative methods and NHST is still the prevailing paradigm for making inferences about population effects [ 3 ].

Until now, reasons for the continued and ubiquitous use of NHST and the p-value have scarcely been investigated. One explanation is that NHST provides a very simple means for drawing conclusions from empirical data, usually based on the 5% cut-off. Secondly, most researchers are unaware of the pitfalls of NHST; it has been shown that NHST and the p-value are often misunderstood and misinterpreted [ 2 , 3 , 8 , 11 , 18 , 19 ]. Thirdly, NHST has a central role in most methods and statistics courses in higher education. Courses on alternative methods are increasingly being offered but are usually not mandatory. To our knowledge, there is a lack of in-depth empirical research aimed at elucidating why NHST nevertheless remains the dominant approach, or what actions can be taken to shift the sciences away from NHST. Therefore, the aim of our study was to explore the perceived barriers and facilitators, as well as behavioral intentions related to the use of NHST and alternative statistical procedures, among all relevant stakeholders in the scientific system.

Theoretical framework

In designing our study, we used two theories. Firstly, we used the ‘diffusion of innovation theory’ of Rogers [ 20 ]. This theory describes the dissemination of an innovation as a process consisting of four elements: 1) an innovation is 2) communicated through certain channels 3) over time 4) among the members of a social system [ 20 ]. In the current study, the innovation consists of the idea that we should stop with the default use of NHST and instead consider using alternative methods for drawing conclusions from empirical data. The science system forms the social structure in which the innovation should take place. The most important members, and potential adopters of the innovation, we identified are researchers, lecturers, editors of scientific journals and representatives of funding agencies. Rogers describes phases in the adoption process, which coincide with characteristics of the (potential) adopters of the idea: 1) innovators, 2) early adopters, 3) early majority adopters, 4) late majority adopters and 5) laggards. Innovators are the first to adopt an innovation. There are few innovators but these few are very important for bringing in new ideas. Early adopters form the second group to adopt an innovation. This group includes opinion leaders and role models for other stakeholders. The largest group consists of the early and late majority who follow the early adopters, and then there is a smaller group of laggards who resist the innovation until they are certain the innovation will not fail. The process of innovation adoption by individuals is described as a normal distribution ( Fig 1 ). For these five groups, the adoption of a new idea is influenced by the following five characteristics of the innovative idea: 1) its relative advantage, 2) its compatibility with current experiences, 3) its complexity, 4) its flexibility, and 5) its visibility [ 20 ]. Members of all four stakeholder groups could play an important role in the diffusion of the innovation of replacing NHST by its alternatives.

Fig 1. The innovativeness dimension, measured by the time at which an individual from an adopter category adopts an innovation. Each category is one or more standard deviations removed from the average time of adoption [ 20 ].

Another important theory for our study is the ‘theory of planned behavior’, that was developed in the 1960s [ 21 ]. This theory describes how human behavior in a certain context can be predicted and explained. The theory was updated in 2010, under the name ‘the reasoned action approach’ [ 22 ]. A central factor in this theory is the intention to perform a certain behavior, in this case, to change the default use of NHST. According to the theory, people’s intentions determine their behaviors. An intention indexes to what extent someone is motivated to perform the behavior. Intentions are determined by three independent determinants: the person’s attitudes toward the behavior—the degree to which a person sees the behavior as favorable or unfavorable, perceived subjective norms regarding the behavior—the perceived social pressure to perform the behavior or not, and perceptions of control regarding the behavior—the perceived ease or difficulty of performing the behavior. Underlying (i.e. responsible for) these three constructs are corresponding behavioral, normative, and control beliefs [ 21 , 22 ] (see Fig 2 ).

Both theories have served as a lens for both data collection and analysis. We used sensitizing concepts [ 23 ] within the framework of the grounded theory approach [ 24 ] from both theories as a starting point for this qualitative study, and more specifically, for the topic list for the interviews and focus groups, providing direction and guidance for the data collection and data analysis.

Many of the concepts in Rogers’ theory and in Fishbein and Ajzen’s theory can be seen as facilitators of and barriers to embracing and implementing innovation in the scientific system.

A qualitative study among stakeholders using semi-structured interviews and focus groups was performed. Data collection and analysis were guided by the principle of constant comparison traditional to the grounded theory approach we followed [ 24 ]. The grounded theory is a methodology that uses inductive reasoning, and aims to construct a theory through the collection and analysis of data. Constant comparison is the iterative process whereby each part of the data that emerges from the data analysis is compared with other parts of the data to thoroughly explore and validate the data. Concepts that have been extracted from the data are tagged with codes that are grouped into categories. These categories constitute themes, which (may) become the basis for a new theory. Data collection and analysis were continued until no new information was gained and data saturation had likely occurred within the identified themes.

The target population consisted of stakeholders relevant to our topic: junior and senior researchers, lecturers in statistics, editors of scientific journals and program leaders of funding agencies (see Tables 1 and 2). We approached participants in the fields of medical sciences, health and life sciences, and psychology. In line with the grounded theory approach, theoretical sampling was used to identify and recruit eligible participants. Theoretical sampling is a form of purposive sampling. This means that we aimed to purposefully select participants, based on their characteristics that fit the parameters of the research questions [ 25 ]. Recruitment took place by approaching persons in our professional networks and/or the networks of the approached persons.

Tables 1 and 2. Participants by stakeholder group and workplace.

*Numbers in brackets represent the number of participants who were also interviewed.

Data collection

We conducted individual semi-structured interviews followed by focus groups. The aim of the interviews was to gain insight into the views of participants on the use of NHST and alternative methods and to examine potential barriers and facilitators related to these methods. The aim of the focus groups was to validate and further explore interview findings and to develop a comprehensive understanding of participants’ views and beliefs.

For the semi-structured interviews, we used a topic list (see Appendix 1 in S1 Appendix ). Questions addressed participants’ knowledge and beliefs about the concept of NHST, their familiarity with NHST, perceived attractiveness and drawbacks of the use of NHST, knowledge of the current NHST debate, knowledge of and views on alternative procedures and their views on the future of NHST. The topic list was slightly adjusted based on the interviews with editors and representatives from funding agencies (compared to the topic list for interviews with researchers and lecturers). Questions particularly focused on research and education were replaced by questions focused on policy (see Appendix 1 in S1 Appendix ).

The interviews were conducted between October 2017 and June 2018 by two researchers (L.v.G. and J.S.), both trained in qualitative research methods. Interviews lasted about one hour (range 31–86 minutes) and were voice-recorded. One interview was conducted by telephone; all others were face to face and took place at a location convenient for the participants, in most cases the participants’ work location.

Focus groups

During the focus groups, important themes that emerged from the interviews were discussed and explored. These include perceptions on NHST and alternatives and essential conditions to shift away from the default use of NHST.

Five focus groups included representatives from the different stakeholder groups. One focus group was homogeneous, including solely lecturers. The focus groups consisted of ‘old’ as well as ‘new’ participants; that is, some of the participants of the focus groups were also in the interview sample. We also selected persons who were open to contributing further to the NHST debate and were willing to help think about (implementing) alternatives to NHST.

The focus groups were conducted between September and December 2018 by three researchers (L.v.G., J.S. and A.d.K.), all trained in qualitative research methods. The focus groups lasted about one-and-a-half hours (range 86–100 minutes).

Data analysis

All interviews and focus groups were transcribed verbatim. Atlas.ti 8.0 software was used for data management and analysis. All transcripts were read thoroughly several times to identify meaningful and relevant text fragments and analyzed by two researchers (J.S. and L.v.G.). Deductive predefined themes and theoretical concepts were used to guide the development of the topic list for the semi-structured interviews and focus groups, and were used as sensitizing concepts [ 23 ] in data collection and data analysis. Inductive themes were identified during the interview process and analysis of the data [ 26 ].

Transcripts were open-, axial- and selectively coded by two researchers (J.S. and L.v.G.). Open coding is the first step in the data-analysis, whereby phenomena found in the text are identified and named (coded). With axial coding, connections between codes are drawn. Selective coding is the process of selecting one central category and relating all other categories to that category, capturing the essence of the research. The constant comparison method [ 27 ] was applied allowing emerging (sub)themes to be fully explored. First, the two researchers independently developed a set of initial codes. Subsequently, findings were discussed until consensus was reached. Codes were then grouped into categories that were covered under subthemes, belonging to main themes. Finally, a theory substantiating the prevailing use of NHST was developed based on the main themes and subthemes.

Ethical issues

This research was conducted in accordance with the Dutch “General Data Protection Regulation” and the “Netherlands Code of Conduct for Research Integrity”. The research protocol was submitted for review to, and approved by, the ethical review committee of the VU Faculty of Behavioral and Movement Sciences. In addition, the project was submitted to the Medical Ethics Committee (METC) of the Amsterdam University Medical Centre, which decided that the project is not subject to the Medical Research (Human Subjects) Act (WMO). At the start of data collection, all participants signed an informed consent form.

A full study protocol, including a detailed data analysis plan, was preregistered ( https://osf.io/4qg38/ ). At the start of this study, preregistration forms for qualitative studies had not yet been developed. Therefore, the preregistration for this study is based on an outdated form. A preregistration form for qualitative studies is now available [ 28 ]. Information about data collection, data management, data sharing and data storage is described in a Data Management Plan. Sensitive data is stored in Darkstor, an offline archive for storing sensitive information or data (e.g., information that involves privacy or copyright). As the recordings and transcripts of the interviews and focus groups contain privacy-sensitive data, these files are archived in Darkstor and can be accessed only on request by authorized individuals (i.e., the original researcher or a research coordinator). Data requests can be sent to ln.uv@mdr . Non-sensitive data is stored in DANS (Data Archiving and Networked Services; the Netherlands institute for permanent access to digital research resources) ( https://doi.org/10.17026/dans-2at-nzfs ).

Participant characteristics

Twenty-nine individual interviews and six focus groups were conducted. The focus groups included four to six participants per session. A total of 47 participants were included in the study (13 researchers, 15 lecturers, 11 editors of scientific journals and 8 representatives of funding agencies). Twenty-nine participants were interviewed. Twenty-seven participants took part in the focus groups. Nine of these twenty-seven participants were both interviewed and took part in the focus groups. Some participants had multiple roles (i.e., editor and researcher, editor and lecturer or lecturer and researcher) but were classified based on their primary role (assistant professors were classified as lecturers). The lecturers in statistics in our sample were not statisticians themselves. Although they had all received training in statistics, they were primarily trained as psychologists, medical doctors, or health scientists. Some lecturers in our sample taught an applied subject, with statistics as part of it. Other lecturers taught Methodology and Statistics courses. Statistical skills and knowledge among lecturers varied from modest to quite advanced. Statistical skills and knowledge among participants from the other stakeholder groups varied from poor to quite advanced. All participants were working in the Netherlands. A general overview of the participants is presented in Table 1 . Participant characteristics split up by interviews and focus groups are presented in Table 2 .
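The participant counts reported above can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the study itself; the variable and function names are illustrative assumptions. It confirms that the breakdown by stakeholder group and the breakdown by data-collection activity (allowing for participants who took part in both) arrive at the same total.

```python
# Counts copied from the participant description; names are hypothetical.
roles = {
    "researchers": 13,
    "lecturers": 15,
    "editors": 11,
    "funding_agency_representatives": 8,
}
interviewed = 29       # participants who were interviewed
in_focus_groups = 27   # participants who took part in a focus group
in_both = 9            # participants who did both


def unique_participants(n_a: int, n_b: int, n_both: int) -> int:
    """Inclusion-exclusion: size of the union of two overlapping groups."""
    return n_a + n_b - n_both


total_by_role = sum(roles.values())
total_by_activity = unique_participants(interviewed, in_focus_groups, in_both)

print(total_by_role, total_by_activity)
```

Both totals come to 47, which matches the reported sample size.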

Three main themes with sub-themes and categories emerged ( Fig 3 ): the green-colored compartments hold the three main themes: The scientific climate , The scientific duty and Reactivity . Each of these three main themes consists of subthemes, depicted by the yellow-colored compartments. In turn, some (but not all) of the 9 subthemes also have categories. These ‘lower level’ findings are not included in the figure but will be mentioned in the elaboration on the findings and are depicted in Appendix 2 in S1 Appendix . Fig 3 shows how the themes are related to each other. The blue arrows indicate that the themes are interrelated; factors influence each other. The scientific climate affects the way stakeholders perceive and fulfil their scientific duty, and the way stakeholders give substance to their scientific duty in turn shapes and maintains the scientific climate. The scientific duty and the scientific climate cause a state of reactivity. Many participants have adopted a ‘wait and see’ attitude regarding behavioral changes with respect to statistical methods. They feel dependent on someone else’s action. This leads to a reactive (instead of a proactive) attitude and a low sense of responsibility. ‘Reactivity’ is the core theme, explaining the most critical problem with respect to the continuous and ubiquitous use of NHST.

[Fig 3. Main themes and subthemes.]

Main themes and subthemes are numbered. Categories are mentioned in the body of the text in bold. ‘P’ stands for participant; ‘I’ stands for interviewer.

1. The scientific climate

The theme ‘the scientific climate’ represents (Dutch) researchers’ perceptions of the many written and unwritten rules they face in the research environment. This theme concerns the opportunities and challenges participants encounter when working in the science system. Dutch academics feel pressured to publish fast and regularly, and to follow the conventions and directions of those on whom they depend. They feel this comes at the expense of the quality of their work. Thus, the scientific climate in the Netherlands has a strong influence on the behavior of participants regarding how they set their priorities and control the quality of their work.

1.1 Quality control. Monitoring the quality of research is considered very important. Researchers, funding agencies and editors indicate they rely on their own knowledge, expertise, and insight, and those of their colleagues, to guarantee this quality. However, editors or funding agencies are often left with little choice when it comes to compiling an evaluation committee or a review panel. The choice is often like-knows-like-based. Given the limited choice, they are forced to trust the opinion of their consultants, but the question is whether this is justified.

I: “The ones who evaluate the statistics, do they have sufficient statistical knowledge?” P: “Ehhr, no, I don’t think so.” I: “Okay, interesting. So, there are manuscripts published of which you afterwards might think….” P: “Yes yes.” (Interview 18; Professor/editor, Medical Sciences)

1.2 Convention. The scientific system is built on mores and conventions, as this participant describes:

P: “There is science, and there is the sociology of science, that is, how we talk to each other, what we believe, how we connect. And at some point, it was agreed upon that we would talk to each other in this way.” (Interview 28, researcher, Medical Sciences)

And to these conventions, one (naturally) conforms. Stakeholders copy behavior and actions of others within their discipline, thereby causing particular behaviors and values to become conventional or normative. One of those conventions is the use of NHST and p-values. Everyone is trained with NHST and is used to applying this method. Another convention is the fact that significant results mean ‘success’, in the sense of successful research and being a successful researcher. Everyone is aware that ‘p is smaller than 0.05’ means the desired results are achieved and that publication and citation chances are increased.

P: “You want to find a significant result so badly. (…) Because people constantly think: I must find a significant result, otherwise my study is worthless.” (Focus group 4, lecturer, Medical Sciences)

Stakeholders rigidly hold on to the above-mentioned conventions and are not inclined to deviate from existing norms; they are, in other words, quite conservative . ‘We don’t know any better’ has been brought up as a valid argument by participants from various stakeholder groups to stick to current rules and conventions. Consequently, the status quo in the scientific system is being maintained.

P: “People hold on to….” I: “Everyone maintains the system?” P: “Yes, we kind of hang on to the conservative manner. This is what we know, what someone, everyone, accepts.” (Interview 17, researcher, Health Sciences)

Everyone is trained with NHST and considers it an accessible and easy to interpret method. The familiarity and perceived simplicity of NHST, user-friendly software such as SPSS and the clear cut-off value for significance are important facilitators for the use of NHST and at the same time barriers to start using alternative methods. Applied researchers stressed the importance of the accessibility of NHST as a method to test hypotheses and draw conclusions. This accessibility also justifies the use of NHST when researchers want to communicate their study results and messages in understandable ways to their readership.

P: “It is harder, also to explain, to use an alternative. So, I think, but maybe I’m overstepping, but if you want to go in that direction [alternative methods] it needs to be better facilitated for researchers. Because at the moment… I did some research, but, you know, there are those uncommon statistical packages.” (Interview 16, researcher/editor, Medical Sciences)

1.3 Publication pressure. Most researchers mentioned that they perceive publication pressure. This motivates them to use NHST and hope for significant results, as ‘significant p-values’ increase publication chances. They perceive a high workload and the way the scientific reward system is constructed as barriers for behavioral change pertaining to the use of statistical methods; potential negative consequences for publication and career chances prevent researchers from deviating from (un)written rules.

P: “I would like to learn it [alternative methods], but it might very well be that I will not be able to apply it, because I will not get my paper published. I find that quite tricky.” (Interview 1, Assistant Professor, Health Sciences)

2. The scientific duty

Throughout the interviews, participants reported a sense of duty in several variations. “What does it mean to be a scientific researcher?” seemed to be a question that was reflected upon during rather than prior to the interview, suggesting that many scientists had not really thought about the moral and professional obligations of being a scientist in general—let alone what that would mean for their use of NHST. Once they had given it some thought, the opinions concerning what constitutes the scientific duty varied to a large extent. Some participants attached great importance to issues such as reproducibility and transparency in scientific research and continuing education and training for researchers. For others, these topics seemed to play a less important role. A distinction was made between moral and professional obligations that participants described concerning their scientific duty.

2.1 Moral obligation. The moral obligation concerns issues such as doing research in a thorough and honest way, refraining from questionable research practices (QRPs) and investing in better research. It concerns tasks and activities that are not often rewarded or acknowledged.

Throughout the interviews and the focus groups, participants very frequently touched upon the responsibility they felt for doing ‘the right thing’ and making the right choice in doing research and, in particular, in using NHST. The extent to which they felt responsible varied among participants. When it comes to choices made while doing research (for example, drawing conclusions from data), participants felt a strong sense of responsibility to do this correctly. However, when it comes to innovation and new practices, and feeling responsible for your own research, let alone improving scientific practice in general, opinions differed. This quotation from one of the focus groups illustrates that:

P1: “If you people [statisticians, methodologists] want me to improve the statistics I use in my research, then you have to hand it to me. I am not going to make any effort to improve that myself.” P3: “No. It is your responsibility as an academic to keep growing and learning and so, also to start familiarizing yourself when you notice that your statistics might need improvement.” (Focus group 2, participant 1 (PhD researcher, Medical Sciences) and 3 (Associate Professor, Health Sciences))

The sense of responsibility for improving research practices regarding the use of NHST was strongly felt and emphasized by a small group of participants. They emphasized the responsibility of the researcher to think, interpret and be critical when interpreting the p -value in NHST. It was felt that you cannot leave that up to the reader. Moreover, scrutinizing and reflecting upon research results was considered a primary responsibility of a scientist, and failing to do so, as not living up to what your job demands you to do:

P: “Yes, and if I want to be very provocative (and I often want that, because then people tend to wake up and react), then I say that hiding behind alpha .05 is just scientific laziness. Actually, it is worse: it is scientific cowardice. I would even say it is ‘relieving yourself from your duty’, but that may sound a bit harsh…” (Interview 2, Professor, Health Sciences)

These participants were convinced that scientists have a duty to keep scientific practice in general at the highest level possible.

The avoidance of questionable research practices (QRPs) was considered a way to keep scientific practice at a high level and was often touched upon during the interviews and focus groups as being part of the scientific duty. Statisticians saw NHST as directly facilitating QRPs and provided ample examples of how the use of NHST leads to QRPs, whereas most applied researchers perceived NHST as the common way of doing research and were not aware of the risks related to QRPs. Participants did mention the violation of assumptions underlying NHST as being a QRP. Participants also considered overinterpreting results, including exaggerating the degree of significance, to be a QRP. Although participants stated they were careful about interpreting and reporting p-values, they ‘admitted’ that statistical significance was a starting point for them. Most researchers indicated they search for information that could get their study published, which usually includes a low p-value (this also relates to the theme ‘Scientific climate’).

P: “We all know that a lot of weight is given to the p-value. So, if it is not significant, then that’s the end of it. If it ís significant, it just begins.” (Interview 5, lecturer, Psychology)

The term ‘sloppy science’ was mentioned in relation to efforts by researchers to reduce the p -value (a.k.a. p-hacking, data-dredging, and HARKing). HARKing is an acronym that refers to the questionable research practice of Hypothesizing After the Results are Known. It occurs when researchers formulate a hypothesis after the data have been collected and analyzed, but present it as if it were an a priori hypothesis [ 29 ]. Preregistration and replication were mentioned as being promising solutions for some of the problems caused by NHST.
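To illustrate why data-dredging is regarded as a questionable practice, the sketch below computes the chance of obtaining at least one ‘significant’ result when several tests are run and none of the tested effects is real. It is an illustration only, not an analysis from this study, and it rests on simplifying assumptions: the tests are independent, every null hypothesis is true, and each test uses the conventional alpha of 0.05. The function name is hypothetical.

```python
def familywise_false_positive_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one p < alpha among n_tests independent
    tests when every null hypothesis is true (an assumed, idealized case)."""
    # Each test avoids a false positive with probability (1 - alpha);
    # all n_tests avoid one with probability (1 - alpha) ** n_tests.
    return 1.0 - (1.0 - alpha) ** n_tests


for n_tests in (1, 5, 20):
    rate = familywise_false_positive_rate(n_tests)
    print(f"{n_tests:>2} tests: chance of at least one false positive = {rate:.3f}")
```

Under these assumptions the chance rises from 5% for a single test to roughly 23% for five tests and roughly 64% for twenty tests. This is why trying many analyses and reporting only the significant one is misleading.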

2.2 Professional obligation. The theme professional obligation reflects participants’ expressions about what methodological knowledge scientists should have about NHST. In contrast to moral obligations, there appeared to be some consensus about scientists’ professional obligations. Participants considered critical evaluation of research results a core professional obligation. Also, within all the stakeholder groups, participants agreed that sufficient statistical knowledge is required for using NHST, but they varied in their insight into the principles, potential and limitations of NHST. This also applied to the extent to which participants were aware of the current debate about NHST.

Participants considered critical thinking as a requirement for fulfilling their professional obligation. It specifically refers to the process of interpreting outcomes and taking all relevant contextual information into consideration. Critical thinking was not only literally referred to by participants, but also emerged by interpreting text fragments on the emphasis within their research. Researchers differed quite strongly in where the emphasis of their research outcomes should be put and what kind of information is required when reporting study results. Participants mentioned the proven effectiveness of a particular treatment, giving a summary of the research results, effect sizes, clinical relevance, p-values, or whether you have made a considerable contribution to science or society.

P: “I come back to the point where I said that people find it arbitrary to state that two points difference on a particular scale is relevant. They prefer to hide behind an alpha of 0.05, as if it is a God given truth, that it counts for one and for all. But it is just as well an invented concept and an invented guideline, an invented cut-off value, that isn’t more objective than other methods?” (Interview 2, Professor, Health Sciences)

For some participants, especially those representing funding agencies, critical thinking was primarily seen as a prerequisite for the utility of the research. The focus, when formulating the research question and interpreting the results, should be on practical relevance and the contribution the research makes to society.

The term ‘ignorance’ arose in the context of the participants’ concern regarding the level of statistical knowledge scientists and other stakeholders have versus what knowledge they should have to adequately apply statistical analysis in their research. The more statistically competent respondents in the sample felt quite strongly about how problematic the lack of knowledge about NHST is among those who regularly use it in their research, let alone the lack of knowledge about alternative methods. They felt that regularly retraining yourself in research methods is an essential part of the professional obligation one has. Applied researchers in the sample agreed that a certain level of background knowledge on NHST was required to apply it properly to research and acknowledged their own ignorance. However, they had different opinions about what level of knowledge is required. Moreover, not all of them regarded it as part of their scientific duty to be informed about all ins and outs of NHST. Some saw it as the responsibility of statisticians to actively inform them (see also the subtheme ‘Periphery’). Some participants were not aware of their ignorance or stated that some of their colleagues are not aware of their ignorance, i.e., that they are unconsciously incompetent and, without realizing it, poorly understand what the p-value and associated outcome measures actually mean.

P: “The worst, and I honestly think that this is the most common, is unconsciously incompetent, people don’t even understand that…” I: “Ignorance.” P: “Yes, but worse, ignorant and not even knowing you are ignorant.” (Interview 2, Professor, Health Sciences)

The lack of proper knowledge about statistical procedures was especially prevalent in the medical sciences. Participants working in or with the medical sciences all confirmed that there is little room for proper statistical training for medical students and that the level of knowledge is fairly low. NHST is often used because of its simplicity. It is especially attractive for medical PhD students because they need their PhD to get ahead in their medical career instead of pursuing a scientific career.

P: “I am not familiar with other ways of doing research. I would really like to learn, but I do not know where I could go. And I do not know whether there are better ways. So sometimes I do read studies of which I think: ‘this is something I could investigate with a completely different test. Apparently, this is also possible, but I don’t know how.’ Yes, there are courses, but I do not know what they are. And here in the medical center, a lot of research is done by medical doctors and these people have hardly been taught any statistics. Maybe they will get one or two statistics courses, they know how to do a t-test and that is about it. (…) And the courses have a very low level of statistics, so to say.” (Interview 1, Assistant Professor, Health Sciences)

Also, the term ‘awareness’ arose. Firstly, it refers to being conscious about the limitations of NHST. Secondly, it refers to the awareness of the ongoing discussions about NHST and more broadly, about the replication crisis. The statisticians in the sample emphasized the importance of knowing that NHST has limitations and that it cannot be considered the holy grail of data analysis. They also emphasized the importance of being aware of the debate. A certain level of awareness was considered a necessary requirement for critical thinking. There was variation in that awareness. Some participants were quite informed and were also fairly engaged in the discussion whereas others were very new to the discussion and larger contextual factors, such as the replication crisis.

I: “Are you aware of the debate going on in academia on this topic [NHST]?” P: “No, I occasionally see some article sent by a colleague passing by. I have the idea that something is going on, but I do not know how the debate is conducted and how advanced it is.” (Interview 6, lecturer, Psychology)

With respect to the theme ‘the scientific duty’, participants differed in the extent to which they felt responsible for better and open science, for pioneering, for reviewing, and for growing and learning as a scientist. Participants had one commonality: although they strived to adhere to the norms of good research, the prevailing feeling was that this is very difficult, due to the scientific climate. Consequently, participants perceive an internal conflict : a discrepancy between what they want or believe , and what they do . Participants often found themselves struggling with the responsibility they felt they had. Making the scientifically most solid choice was often difficult due to feasibility, time constraints, or certain expectations from supervisors (this is also directly related to the themes ‘Scientific climate’ and ‘Reactivity’). Thus, the scientific climate strongly influences the behavior of scientists regarding how they set their priorities and fulfill their scientific duties. The strong sense of scientific duty was perceived by some participants as a facilitator and by others as a barrier for the use of alternative methods.

3. Reactivity

A consequence of the foregoing factors is that most stakeholders have adopted a reactive attitude and behave accordingly. People are disinclined to take responsibility and await external signals and initiatives of others. This might explain why NHST is being continuously used and remains the default procedure to make inferences about population effects.

The core theme ‘reactivity’ can be explained by the following subthemes and categories:

3.1 Periphery. The NHST-problem resides in the periphery in several ways. First, it is a subject that is not given much priority. Secondly, some applied researchers and editors believe that methodological knowledge, as it is not their field of expertise, should not be part of their job requirement. This also applies to the NHST debate. Thirdly, and partly related to the second point, there is a lack of cooperation within and between disciplines.

The term ‘priority’ was mentioned often when participants were asked to what extent the topic of NHST was a subject of discussion in their working environment. Participants indicated that (too) little priority is given to statistics and the problems related to the subject. There is simply a lot going on in their research field and daily work, so there are always more important or urgent issues on the agenda.

P: “Discussions take place in the periphery; many people find it complicated. Or are just a little too busy.” (Interview 5, lecturer, Psychology)

As the NHST debate is not prioritized, initiatives with respect to this issue are not forthcoming. Moreover, researchers and lecturers claim there is neither time nor money available for training in statistics in general or acquiring more insight and skills with respect to (the use of) alternative methods. Busy working schedules were mentioned as an important barrier for improving statistical knowledge and skills.

P: “Well you can use your time once, so it is an issue low on the priority list.” (Focus group 5, researcher, Medical Sciences)

The NHST debate is perceived as the domain of statisticians and methodologists. Also, cooperation between different domains and domain-specific experts is perceived as complicated, as different perceptions and ways of thinking can clash. Therefore, some participants feel that separate worlds should be kept separate; put another way: stick to what you know!

P: “This part is not our job. The editorial staff, we have the assignment to ensure that it is properly written down. But the discussion about that [alternatives], that is outside our territory.” (Interview 26, editor, Medical Sciences)

Within disciplines, individuals tend to act on their own, not being aware that others are working on the same subject and that it would be worthwhile to join forces. The interviews and focus groups exposed that a modest number of participants actively try to change the current situation, but in doing that, feel like lone voices in the wilderness.

P1: “I mean, you become a lone voice in the wilderness.” P2: “Indeed, you don’t want that.” P1: “I get it, but no one listens. There is no audience.” (Focus Group 3, P1: MD, lecturer, Medical Sciences, P2: editor, Medical Sciences)

To succeed at positive change, participants emphasized that it is essential that people (interdisciplinary) cooperate and join forces, rather than operate on individual levels, focusing solely on their own working environment.

The caution people show with respect to taking initiative is reinforced by the fear of encountering resistance from their working environment when one voices that change regarding the use of NHST is needed. A condition that was mentioned as essential to bring about change was tactical implementation , that is, taking very small steps. As everyone is still using NHST, taking big steps brings the risk of losing especially the more conservative people along the way. Also, the adjustment of policy, guidelines and educational programs are processes for which we need to provide time and scope.

P: “Everyone still uses it, so I think we have to be more critical, and I think we have to look at some kind of culture change, that means that we are going to let go of it (NHST) more and we will also use other tests, that in the long term will overthrow NHST.” I: “And what about alternatives?” P: “I think you should never be too fanatic in those discussions, because then you will provoke resistance. (…) That is not how it works in communication. You will touch them on a sore spot, and they will think: ‘and who are you?’” I: “And what works?” P: “Well, gradualness. Tell them to use NHST, do not burn it to the ground, you do not want to touch people’s work, because it is close to their hearts. Instead, you say: ‘try to do another test next to NHST’. Be a pioneer yourself.” (Interview 5, lecturer, Psychology)

3.2 Efficacy. Most participants stated they feel they are not in the position to initiate change. On the one hand, this feeling is related to their hierarchical positions within their working environments. On the other hand, the feeling is caused by the fact that statistics is perceived as a very complex field of expertise and people feel they lack sufficient knowledge and skills, especially about alternative methods.

Many participants stated they felt little sense of empowerment, or self-efficacy. The academic system is perceived as hierarchical, having an unequal balance of power. Most participants believe that it is not in their power to take a lead in innovative actions or to stand up against the establishment, and think that this responsibility lies with other stakeholders who have more status.

P: “Ideally, there would be a kind of an emergency letter from several people whose names open up doors, in which they indicate that in the medical sciences we are throwing away money because research is not being interpreted properly. Well, if these people that we listen to send such an emergency letter to the board of The Netherlands Organization for Health Research and Development [the largest Dutch funding agency for innovation and research in healthcare], I can imagine that this will initiate a discussion.” (…) I: “And with a big name you mean someone from within the science system?” P: “Well, you know, ideally a chairman, or chairmen of the academic medical center. At that level. If they would put a letter together. Yes, that of course would have way more impact. Or some prominent medical doctors, yes, that would have more impact, than if some other person would send a letter, yes.” (Interview 19, representative from funding agency, Physical Sciences)

Some participants indicated that they did try to make a difference but encountered too much resistance and therefore gave up their efforts. PhD students feel they have insufficient power to choose their own directions and make their own choices.

P: “I am dependent on funding agencies and professors. In the end, I will write a grant application in that direction that gives me the greatest chance of eventually receiving that grant. Not primarily research that I think is the most optimal (…) If I know that reviewers believe the p-value is very important, well, of course I write down a method in which the p-value is central.” (Focus group 2, PhD-student, Medical Sciences)

With a sense of imperturbability, most participants accept that they cannot really change anything.

Lastly, the complexity of the subject is an obstacle for behavioral change. Statistics is perceived as a difficult subject. Participants indicate that they have a lack of knowledge and skills and that they are unsure about their own abilities. This applies to the ‘standard’ statistical methods (NHST), but to a greater extent to alternative methods. Many participants feel that they do not have the capacity to pursue a true understanding of (alternative) statistical methods.

P: “Statistics is just very hard. Time and again, research demonstrates that scientists, even the smartest, have a hard time with statistics.” (Focus group 3, PhD researcher, Psychology)

3.3 Interdependency. As mentioned, participants feel they are not in a sufficiently strong position to take initiative or to behave in an anti-establishment manner. Therefore, they await external signals from people within the scientific system with more status, power, or knowledge. These can be people within their own stakeholder group, or from other stakeholder groups. As a consequence of this attitude, a situation arises in which people’s actions largely depend on others. That is, a complex state of interdependency evolves: scientists argue that if the reward system does not change, they are not able to alter their statistical behavior. According to researchers, editors and funding agencies are still very much focused on NHST and especially (significant) p-values, and thus, scientists wait for editors and funders to adjust their policy regarding statistics:

P: “I wrote an article and submitted it to an internal medicine journal. I only mentioned confidence intervals. Then I was asked to also write down the p-values. So, I had to do that. This is how they [editors] can use their power. They decide.” (Interview 1, Assistant Professor, Health Sciences)

Editors and funders in their turn claim they do not maintain a strict policy. Their main position is that scientists should reach consensus about the best statistical procedure, and they will then adjust their policy and guidelines.

P: “We actually believe that the research field itself should direct the quality of its research, and thus, also the discussions.” (Interview 22, representative from funding agency, Neurosciences)

Lecturers, for their part, argue that they cannot revise their educational programs because the academic system and university policies are geared to NHST and p-values.

As most participants seem not to be aware of this process, a circle of interdependency arises that is difficult to break.

P: “Yes, the stupid thing about this perpetual circle is that you are educating people, let’s say in the department of cardiology. They must of course grow, and so they need to publish. If you want to publish you must meet the norms and values of the cardiology journals, so they will write down all those p-values. These people are trained and in twenty years they are on the editorial board of those journals, and then you never get rid of it [the p-value].” (Interview 18, Professor, editor, Medical Sciences)

3.4. Degree of eagerness. Whether people exhibit a certain behavior or change it is (partly) determined by the extent to which they want to employ that behavior, that is, by their behavioral intention [ 22 ]. Some participants indicated they are willing to change their behavior regarding the use of statistical methods, but only if it is absolutely necessary, if it is imposed, or if they think that the current conventions have too many negative consequences. Thus, true, intrinsic will-power to change behavior is lacking among these participants. Instead, they have a rather opportunistic attitude, meaning that their behavior is mostly driven by circumstances, not by principles.

P: “If tomorrow an alternative is offered by people that make that call, then I will move along. But I am not the one calling the shots on this issue.” (Interview 26, editor, Medical Sciences)

In addition, pragmatism often outweighs the perceived urgency to change. Participants argue they ‘just want to do their jobs’ and mainly consider the practical consequences of their actions. This attitude creates a certain degree of inertia. Although participants claim they are willing to change their behavior, doing so would involve much more than ‘doing their jobs’, and thus, in the end, the NHST-debate is relegated to ‘coffee talk’. People are open to discussion, but when it comes to taking action (and motivating others to do so), no one steps forward.

P: “The endless analysis of your data to get something with a p-value less than 0.05… There are people that are more critical about that, and there are people that are less critical. But that is a subject for during the coffee break.” (Interview 18, professor, editor, Medical Sciences)

The goal of our study was to acquire in-depth insight into reasons why so many stakeholders from the scientific system keep using NHST as the default method to draw conclusions, despite its many well-documented drawbacks. Furthermore, we wanted to gain insight into the reasons for their reluctance to apply alternative methods. Using a theoretical framework [ 20 , 21 ], several interrelated facilitators and barriers associated with the use of NHST and alternative methods were identified. The identified factors are subsumed under three main themes: the scientific climate, the scientific duty and reactivity. The scientific climate is dominated by conventions, behavioral rules, and beliefs, of which the use of NHST and p-values is part. At the same time, stakeholders feel they have a (moral or professional) duty. For many participants, these two sides of the same coin are incompatible, leading to internal conflicts. There is a discrepancy between what participants want and what they do. As a result of these factors, the majority feel dependent on others and have thereby become reactive. Most participants are not inclined to take responsibility themselves but await action and initiatives from others. This may explain why NHST is still the standard and used by almost everyone involved.

The current study is closely related to the longstanding debate regarding NHST, which has recently intensified to a level not seen before. In 2015, the editors of the journal ‘Basic and Applied Social Psychology’ (BASP) prohibited the use of NHST (and p-values and confidence intervals) [ 30 ]. Subsequently, in 2016, the American Statistical Association published the so-called ‘Statement on p-values’ in the American Statistician. This statement consists of critical standpoints regarding the use of NHST and p-values and warns against the abuse of the procedure. In 2019, the American Statistician devoted an entire edition to the implementation of reforms regarding the use of NHST; in more than forty articles, scientists debated statistical significance, advocated embracing uncertainty, and suggested alternatives such as the use of s-values, False Positive Risks, reporting results as effect sizes and confidence intervals and more holistic approaches to p-values and outcome measures [ 31 ]. In addition, in the same year, several articles appeared in which an appeal was made to stop using statistical significance testing [ 32 , 33 ]. A number of counter-reactions were published [ 34 – 36 ], stating, among other things, that banning statistical significance and, with that, abandoning clear rules for statistical analyses may create new problems with regard to statistical interpretation, study interpretations and objectivity. Also, some methodologists expressed the view that under certain circumstances the use of NHST and p-values is not problematic and can in fact provide useful answers [ 37 ]. Until recently, the NHST-debate was mainly limited to methodologists and statisticians. However, a growing number of scientists are getting involved in this lively debate and believe that a paradigm shift is desirable or even necessary.

The aforementioned publications have constructively contributed to this debate. In fact, since the publication of the special edition of the American Statistician, numerous scientific journals have published editorials or revised, to a greater or lesser extent, their author guidelines [ 38 – 45 ]. Furthermore, following the American Statistical Association (ASA), the National Institute of Statistical Sciences (NISS) in the United States has also taken up the reform issue. However, real changes are still barely visible. It takes a long time before these kinds of initiatives translate into behavioral changes, and widespread adoption by most of the scientific community is still far from accomplished. Debate alone will not lead to real changes, and therefore, our efforts to elucidate behavioral barriers and facilitators could provide a framework for potentially effective initiatives that could be taken to reduce the default use of NHST. In fact, the debate could counteract behavioral change. If there is no consensus among statisticians and methodologists (the innovators), changing behavior cannot be expected from stakeholders with less statistical and methodological expertise. In other words, without agreement among innovators, early adopters might be reluctant to adopt the innovation.

Research has recently been conducted to explore the potential of behavioral change to improve Open Science behaviors. The adoption of open science behavior has increased in recent years, but uptake has been slow due to firm barriers such as a lack of awareness about the subject, concerns about constraints on the creative process, worries about being “scooped”, and holding on to existing working practices [ 46 ]. Developments in open science practices, and the parallels between that line of research and the current study, might help to support behavioral change regarding the use of statistical methods.

The described obstacles to behavioral change are related to features of both the ‘innovative idea’ and the potential adopters of the idea. First, there are characteristics of ‘the innovation’ that form barriers. The first barrier is the complexity of the innovation: most participants perceive alternative methods as difficult to understand and to use. A second barrier concerns the feasibility of trying the innovation; most people do not feel flexible about trying out or experimenting with the new idea. There is a lack of time and monetary resources to get acquainted with alternative methods (for example, by following a course). Also, the possible negative consequences of the use of alternatives (lower publication chances, the chance that the statistical method and message are too complicated for one’s readership) hold people back from experimenting with these alternatives. And lastly, it is unclear to most participants how visible the results of the new idea are. Up until now, the debate has mainly taken place among a small group of statisticians and methodologists. Many researchers are still not aware of the NHST debate and the idea to shift away from NHST and use alternative methods instead. Therefore, the question is how easily the benefits of the innovation can be made visible for a larger part of the scientific community. Thus, our study shows that, although the innovation is largely compatible with existing values (participants are critical about (the use of) NHST and the p-value and believe that there are better alternatives to NHST), important attributes of the innovative idea negatively affect the rate of adoption and consequently the diffusion of the innovation.

Due to the barriers mentioned above, most stakeholders do not have the intention to change their behavior and adopt the innovative idea. From the theory of planned behavior [ 21 ], it is known that behavioral intentions directly relate to performances of behaviors. The strength of the intention is shaped by attitudes, subjective norms, and perceived power. If people evaluate the suggested behavior as positive (attitude), and if they think others want them to perform the behavior (subjective norm), this leads to a stronger intention to perform that behavior. When an individual also perceives they have enough control over the behavior, they are likely to perform it. Although most participants have a positive attitude towards the behavior, or the innovative idea at stake, many participants think that others in their working environment believe that they should not perform the behavior—i.e., they do not approve of the use of alternative methods (social normative pressure). This is expressed, for example, in lower publication chances, negative judgements by supervisors or failing the requirements that are imposed by funding agencies. Thus, the perception about a particular behavior—the use of alternative methods—is negatively influenced by the (perceived) judgment of others. Moreover, we found that many participants have a low self-efficacy, meaning that there is a perceived lack of behavioral control, i.e., their perceived ability to engage in the behavior at issue is low. Also, participants feel a lack of authority (in the sense of knowledge and skills, but also power) to initiate behavioral change. The existing subjective norms and perceived behavioral control, and the negative attitudes towards performing the behavior, lead to a lower behavioral intention, and, ultimately, a lower chance of the performance of the actual behavior.

Several participants mentioned there is a need for people of stature (belonging to the group of early adopters) to take the lead and break down perceived barriers. Early adopters serve as role models and have opinion leadership, and form the next group (after the innovators, in this case statisticians and methodologists) to adopt an innovative idea [ 20 ] ( Fig 2 ). If early adopters stood up, conveying a positive attitude towards the innovation, breaking down the described perceived barriers and facilitating the use of alternatives (for example by adjusting policy, guidelines and educational programs and making available financial resources for further training), this could positively affect the perceived social norms and self-efficacy of the early and late majority and eventually the laggards, which could ultimately lead to behavioral change among all stakeholders within the scientific community.

A strength of our study is that it is the first empirical study on views on the use of NHST, its alternatives and reasons for the prevailing use of NHST. Another strength is the method of coding, which corresponds to the thematic approach from Braun & Clarke [ 47 ] and allows the researcher to move beyond just categorizing and coding the data to also analyze how the codes are related to each other. It provides a rich description of what is studied, linked to theory, while also generating new hypotheses. Moreover, two independent researchers coded all transcripts, which adds to the credibility of the study. All findings and the coding scheme were discussed by the two researchers until consensus was reached. Also, interview results were further explored, enriched and validated by means of (mixed) focus groups. Important themes that emanated from the interviews, such as interdependency, perceptions on the scientific duty, perceived disadvantages of alternatives or the consequences of the current scientific climate, served as starting points and main subjects of the focus groups. This set-up provided more data, more insight into the data, and validation of the data. Lastly, the use of a theoretical framework [ 20 , 21 ] to develop the topic list, guide the interviews and focus groups, and guide their analysis is a strength as it provides structure to the analysis and substantiation of the results.

A limitation of this study is its sampling method. Because we recruited through the network of members of the project group, and because a relatively high proportion of those invited to participate declined on the grounds that they knew too little about the subject to be able to contribute, our sample was biased towards participants that are (somewhat) aware of the NHST debate. Our sample may also consist of people that are relatively critical towards the use of NHST, compared to the total population of researchers. It was not easy to include participants who were indifferent about or who were pro-NHST, as those were presumably less willing to make time and participate in this study. Even in our sample we found that the majority of our participants solely used NHST and perceived it as difficult if not impossible to change their behavior. These perceptions are thus probably even stronger in the target population. Another limitation, one that is inherent to qualitative research, is the risk of interviewer bias. Respondents may be unable, unwilling, or afraid to answer questions in good conscience, and may instead provide socially desirable answers. In the context of our research, people are aware that, especially as a scientist, it does not look good to be conservative, complacent, or ignorant, or not to be open to innovation and new ideas. Therefore, some participants might have given an overly favorable view of themselves. Interviewer bias can also work in the other direction, when the values and expectations of the interviewer consciously or unconsciously influence the answers of the respondents. Although we have tried to be as neutral and objective as possible in asking questions and interpreting answers, we cannot rule out the chance that our views and opinions on the use of NHST have at times steered the respondents somewhat, potentially leading to the socially desirable answers described above.

Generalizability is a topic that is often debated in qualitative research methodology. Many researchers do not consider generalizability the purpose of qualitative research, but rather finding in-depth insights and explanations. However, this is an unjustified simplification, as generalizing findings from qualitative research is possible. Three types of generalization in qualitative research are described: representational generalization (whether what is found in a sample can be generalized to the parent population of the sample), inferential generalization (whether findings from the study can be generalized to other settings), and theoretical generalization (where one draws theoretical statements from the findings of the study for more general application) [ 48 ]. The extent to which our results are generalizable is uncertain, as we used a theoretical sampling method, and our study was conducted exclusively in the Netherlands. We expect that the generic themes (reactivity, the scientific duty and the scientific climate) are applicable to academia in many countries across the world (inferential generalization). However, some elements, such as the Dutch educational system, will differ to a greater or lesser extent from other countries (and thus can only be representationally generalized). In the Netherlands there is, for example, only one educational route after secondary school that has an academic orientation (scientific education, equivalent to the US university level education). This route consists of a bachelor’s program (typically 3 years), and a master’s program (typically 1, 2 or 3 years). Not every study program contains (compulsory) statistical courses, and statistical courses differ in depth and difficulty levels depending on the study program. Thus, not all the results will hold for other parts of the world, and further investigation is required.

Our findings demonstrate how perceived barriers to shifting away from NHST set a high threshold for behavioral change and create a circle of interdependency. Behavioral change is a complex process. As ‘the stronger the intention to engage in a behavior, the more likely should be its performance’ [ 21 ], further research on this subject should focus on how to influence the intention of behavior; i.e., which perceived barriers to the use of alternatives are most promising to break down in order to increase the intention for behavioral change. The present study shows that negative normative beliefs and a lack of perceived behavioral control regarding the innovation among individuals in the scientific system are a substantial problem. When social norms change in favor of the innovation, and control over the behavior increases, then the behavioral intention becomes a sufficient predictor of behavior [ 49 ]. An important follow-up question will therefore be: how can people be enthused and empowered to ultimately take up the use of alternative methods instead of NHST? Answering this question can, in the long run, lead to the diffusion of the innovation through the scientific system as a whole.

NHST has been the leading paradigm for many decades and is deeply rooted in our science system, despite longstanding criticism. The aim of this study was to gain insight as to why we continue to use NHST. Our findings have demonstrated how perceived barriers to make a shift away from NHST set a high threshold for actual behavioral change and create a circle of interdependency between stakeholders in the scientific system. Consequently, people find themselves in a state of reactivity, which limits behavioral change with respect to the use of NHST. The next step would be to get more insight into ways to effectively remove barriers and thereby increase the intention to take a step back from NHST. A paradigm shift within a couple of years is not realistic. However, we believe that by taking small steps, one at a time, it is possible to decrease the scientific community’s strong dependence on NHST and p-values.

Supporting information

S1 Appendix.

Acknowledgments

The authors are grateful to Anja de Kruif for her contribution to the design of the study and for moderating one of the focus groups.

Funding Statement

This research was funded by the NWO (Nederlandse Organisatie voor Wetenschappelijk Onderzoek; Dutch Organization for Scientific Research) ( https://www.nwo.nl/ ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

6.4 - Practical Significance

In the last lesson, you learned how to identify statistically significant differences using hypothesis testing methods. If the p value is less than the \(\alpha\) level (typically 0.05), then the results are  statistically significant . Results are said to be statistically significant when the difference between the hypothesized population parameter and observed sample statistic is large enough to conclude that it is unlikely to have occurred by chance. 

Practical significance  refers to the magnitude of the difference, which is known as the  effect size . Results are practically significant when the difference is large enough to be meaningful in real life. What is meaningful may be subjective and may depend on the context.

Note that statistical significance is directly impacted by sample size. Recall that there is an inverse relationship between sample size and the standard error (i.e., standard deviation of the sampling distribution). Very small differences will be statistically significant with a very large sample size. Thus, when results are statistically significant it is important to also examine practical significance. Practical significance is not directly influenced by sample size.
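The dependence on sample size can be sketched numerically. The snippet below is only an illustration under stated assumptions: the difference of 1 unit and standard deviation of 20 are hypothetical values, the helper name `one_sided_p` is made up for this example, and a normal (z) approximation is used for the p value. The effect size stays fixed while the p value shrinks as the sample grows.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p(diff, sd, n):
    """Approximate one-sided p value for a mean difference (z approximation)."""
    standard_error = sd / sqrt(n)   # the standard error shrinks as n grows
    z = diff / standard_error
    return 1 - NormalDist().cdf(z)


# Hypothetical values chosen for illustration only
diff, sd = 1.0, 20.0
effect_size = diff / sd             # does not depend on n

p_values = {n: one_sided_p(diff, sd, n) for n in (25, 400, 10_000)}
for n, p in p_values.items():
    print(f"n = {n:>6}   effect size = {effect_size:.2f}   p = {p:.4f}")
```

With these assumed numbers the same small difference is not significant at the 0.05 level for the two smaller samples but is for the largest one, which is why the effect size should be reported alongside the p value.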

Example: Weight-Loss Program

Researchers are studying a new weight-loss program. Using a large sample they construct a 95% confidence interval for the mean amount of weight loss after six months on the program to be [0.12, 0.20]. All measurements were taken in pounds. Note that this confidence interval does not contain 0, so we know that their results were statistically significant at a 0.05 alpha level. However, most people would say that the results are not practically significant because a six month weight-loss program should yield a mean weight loss much greater than the one observed in this study. 

Effect Size

For some tests there are commonly used measures of effect size. For example, when comparing the difference in two means we often compute Cohen's \(d\) which is the difference between the two observed sample means in standard deviation units:

\[d=\frac{\overline x_1 - \overline x_2}{s_p}\]

Where \(s_p\) is the pooled standard deviation

\[s_p= \sqrt{\frac{(n_1-1)s_1^2 + (n_2 -1)s_2^2}{n_1+n_2-2}}\]

Below are commonly used standards when interpreting Cohen's \(d\):

Cohen's \(d\)    Interpretation
0 - 0.2          Little or no effect
0.2 - 0.5        Small effect size
0.5 - 0.8        Medium effect size
0.8 or more      Large effect size

For a single mean, you can compute the difference between the observed mean and hypothesized mean in standard deviation units: \[d=\frac{\overline x - \mu_0}{s}\]
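As a rough sketch, the one-sample formula and the interpretation table above could be expressed as two small helpers. The function names are hypothetical. Two choices here are assumptions rather than part of the lesson: values that fall exactly on a boundary (0.2, 0.5, 0.8) are assigned to the higher category, because the table's ranges overlap at their endpoints, and the absolute value of \(d\) is used so that negative differences are labeled by magnitude.

```python
def cohens_d_one_sample(sample_mean, hypothesized_mean, sd):
    """Difference between observed and hypothesized mean in standard deviation units."""
    return (sample_mean - hypothesized_mean) / sd


def interpret_d(d):
    """Map Cohen's d to the labels in the table above.

    Assumption: boundary values go to the higher category, and the sign
    of d is ignored.
    """
    size = abs(d)
    if size < 0.2:
        return "little or no effect"
    if size < 0.5:
        return "small effect size"
    if size < 0.8:
        return "medium effect size"
    return "large effect size"


# Hypothetical inputs for illustration
d = cohens_d_one_sample(sample_mean=52.0, hypothesized_mean=50.0, sd=8.0)
print(d, interpret_d(d))
```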

For correlation and regression we can compute \(r^2\) which is known as the coefficient of determination. This is the proportion of shared variation. We will learn more about \(r^2\) when we study simple linear regression and correlation at the end of this course.

Example: SAT-Math Scores

Research question: Are SAT-Math scores at one college greater than the known population mean of 500?

\(H_0\colon \mu = 500\)

\(H_a\colon \mu >500\)

Data are collected from a random sample of 1,200 students at that college. In that sample, \(\overline{x}=506\) and the sample standard deviation was 100. A one-sample mean test was performed and the resulting p-value was 0.0188. Because \(p \leq \alpha\), the null hypothesis should be rejected. These results are statistically significant. There is evidence that the population mean is greater than 500.

But let's also consider practical significance. The difference between an SAT-Math score of 500 and an SAT-Math score of 506 is very small. With a standard deviation of 100, this difference is only \(\frac{506-500}{100}=0.06\) standard deviations. In most cases, this would not be considered practically significant. 
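The numbers in this example can be checked with a short script. This is a sketch under one stated assumption: the lesson does not say how the p value was obtained, so the code uses a normal approximation to the \(t\) distribution, which should be close with a sample of 1,200. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Values taken from the example above
n = 1200
sample_mean = 506
sd = 100
mu_0 = 500

standard_error = sd / sqrt(n)
test_statistic = (sample_mean - mu_0) / standard_error

# Right-tailed p value; normal approximation to the t distribution (assumption)
p_value = 1 - NormalDist().cdf(test_statistic)

# Effect size: difference in standard deviation units
effect_size = (sample_mean - mu_0) / sd

print(f"test statistic = {test_statistic:.3f}")
print(f"p value        = {p_value:.4f}")
print(f"effect size    = {effect_size:.2f}")
```

The p value lands close to the 0.0188 reported above, while the effect size of 0.06 falls in the "little or no effect" range of the interpretation table.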

Example: Commute Times

Research question:  Are the mean commute times different in Atlanta and St. Louis?

Descriptive Statistics: Commute Time
City         N     Mean     StDev
Atlanta      500   29.110   20.718
St. Louis    500   21.970   14.232

Using the dataset built in to StatKey, a two-tailed randomization test was conducted, resulting in a p value < 0.001. Because the p value is below the \(\alpha\) level (typically 0.05), the null hypothesis is rejected and the results are said to be statistically significant.

Practical significance can be examined by computing Cohen's d. We'll use the equations from above:

First, we compute the pooled standard deviation:

\[s_p= \sqrt{\frac{(500-1)20.718^2 + (500-1)14.232^2}{500+500-2}}\]

\[s_p= \sqrt{\frac{(499)(429.236)+ (499)(202.550)}{998}}\]

\[s_p= \sqrt{\frac{214188.526+ 101072.362}{998}}\]

\[s_p= \sqrt{\frac{315260.889}{998}}\]

\[s_p= \sqrt{315.893}\]

\[s_p= 17.773\]

Note: The pooled standard deviation should always be between the two sample standard deviations.

Next, we can compute Cohen's d:

\[d=\frac{29.110-21.970}{17.773}\]

\[d=\frac{7.14}{17.773}\]

\[d= 0.402\]

The mean commute time in Atlanta was 0.402 standard deviations greater than the mean commute time in St. Louis. Using the guidelines for interpreting Cohen's d in the table above, this is a small effect size. 
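The hand calculation above can be reproduced in a few lines. This is a minimal sketch that applies the pooled standard deviation and Cohen's \(d\) formulas from the Effect Size section to the summary statistics in the table; the function names are illustrative, not from StatKey or any library.

```python
from math import sqrt


def pooled_sd(n1, s1, n2, s2):
    """Pooled standard deviation of two samples."""
    return sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))


def cohens_d(mean1, mean2, s_p):
    """Difference between two sample means in pooled standard deviation units."""
    return (mean1 - mean2) / s_p


# Summary statistics from the commute time table (Atlanta first, St. Louis second)
s_p = pooled_sd(500, 20.718, 500, 14.232)
d = cohens_d(29.110, 21.970, s_p)

print(f"pooled SD = {s_p:.3f}")
print(f"Cohen's d = {d:.3f}")
```

Working from the unrounded intermediate values gives the same results as the hand calculation when rounded to three decimal places.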

IMAGES

  1. 15 Null Hypothesis Examples (2024)

    null hypothesis in practical research

  2. statistics

    null hypothesis in practical research

  3. Difference between Null and Alternative Hypothesis

    null hypothesis in practical research

  4. Hypothesis Examples

    null hypothesis in practical research

  5. Testing Hypotheses: Null Vs. Alternative, The Key To Hypothesis Testing

    null hypothesis in practical research

  6. Examples

    null hypothesis in practical research

VIDEO

  1. What is Null Hypothesis Testing Mean? || Academic Research || Ettienne-Murphy

  2. How To Formulate The Hypothesis/What is Hypothesis?

  3. Research understanding

  4. Selecting the Appropriate Hypothesis Test [FIL]

  5. Difference between null and alternative hypothesis |research methodology in tamil #sscomputerstudies

  6. Null hypothesis vs alternative hypothesis

COMMENTS

  1. Understanding Null Hypothesis Testing

    A crucial step in null hypothesis testing is finding the likelihood of the sample result if the null hypothesis were true. This probability is called the p value. A low p value means that the sample result would be unlikely if the null hypothesis were true and leads to the rejection of the null hypothesis. A high p value means that the sample ...

  2. Null & Alternative Hypotheses

    The null and alternative hypotheses offer competing answers to your research question. When the research question asks "Does the independent variable affect the dependent variable?": The null hypothesis ( H0) answers "No, there's no effect in the population.". The alternative hypothesis ( Ha) answers "Yes, there is an effect in the ...

  3. 10.2: Understanding Null Hypothesis Testing

    The Logic of Null Hypothesis Testing. Null hypothesis testing (often called null hypothesis significance testing or NHST) is a formal approach to deciding between two interpretations of a statistical relationship in a sample. One interpretation is called the null hypothesis (often symbolized H0 and read as "H-zero").

  4. Null hypothesis

    In scientific research, the null hypothesis (often denoted H 0) is the claim that the effect being studied does not exist. The null hypothesis can also be described as the hypothesis in which no relationship exists between two sets of data or variables being analyzed. If the null hypothesis is true, any experimentally observed effect is due to ...

  5. How to Write a Strong Hypothesis

    6. Write a null hypothesis. If your research involves statistical hypothesis testing, you will also have to write a null hypothesis. The null hypothesis is the default position that there is no association between the variables. The null hypothesis is written as H 0, while the alternative hypothesis is H 1 or H a.

  6. What Is The Null Hypothesis & When To Reject It

    A tutorial on a practical Bayesian alternative to null-hypothesis significance testing. Behavior research methods, 43, 679-690. Nickerson, R. S. (2000). Null hypothesis significance testing: a review of an old and continuing controversy. Psychological methods, 5(2), 241. Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test.

  7. How to Formulate a Null Hypothesis (With Examples)

    To distinguish it from other hypotheses, the null hypothesis is written as H 0 (which is read as "H-nought," "H-null," or "H-zero"). A significance test is used to determine the likelihood that the results supporting the null hypothesis are not due to chance. A confidence level of 95% or 99% is common. Keep in mind, even if the confidence level is high, there is still a small chance the ...

  8. Null hypothesis significance testing: a short tutorial

    Abstract: "null hypothesis significance testing is the statistical method of choice in biological, biomedical and social sciences to investigate if an effect is likely". No, NHST is the method to test the hypothesis of no effect. I agree - yet people use it to investigate (not test) if an effect is likely.

  9. An Introduction to Statistics: Understanding Hypothesis Testing and

    HYPOTHESIS TESTING. A clinical trial begins with an assumption or belief, and then proceeds to either prove or disprove this assumption. In statistical terms, this belief or assumption is known as a hypothesis. Counterintuitively, what the researcher believes in (or is trying to prove) is called the "alternate" hypothesis, and the opposite ...

  10. Understanding Null Hypothesis Testing

    The Logic of Null Hypothesis Testing. Null hypothesis testing (often called null hypothesis significance testing or NHST) is a formal approach to deciding between two interpretations of a statistical relationship in a sample. One interpretation is called the null hypothesis (often symbolized H0 and read as "H-zero").

  11. A Practical Guide to Writing Quantitative and Qualitative Research

    A Practical Guide to Writing Quantitative and Qualitative Research Questions and Hypotheses in Scholarly Articles. ... a research hypothesis is an educated statement of an expected outcome. ... ,4 3) state a negative relationship between two variables (null hypothesis),4,11,15 4) replace the working hypothesis if rejected (alternative ...

  12. Null Hypothesis Significance Testing and p Values

    Savvy professionals turn to the research literature for practical guidance, but critically evaluating a study and interpreting its findings is a daunting task. ... Goodman (1993) explained why the p statistic alone is insufficient reason to reject the null hypothesis. Researchers and research consumers should examine other features of the study ...

  13. Importance of Null Hypothesis in Research

    The null hypothesis often denoted as H0, is a statement in statistical inference that suggests no statistical significance exists in a set of observed data. In other words, it assumes that any kind of difference or importance you see in a set of data is due to chance. The null hypothesis is the initial claim that researchers set out to test.

  14. 13.1 Understanding Null Hypothesis Testing

    A crucial step in null hypothesis testing is finding the likelihood of the sample result if the null hypothesis were true. This probability is called the p value. A low p value means that the sample result would be unlikely if the null hypothesis were true and leads to the rejection of the null hypothesis. A high p value means that the sample ...

  15. 7.3: The Research Hypothesis and the Null Hypothesis

    This null hypothesis can be written as: H0: X̄ = μ. For most of this textbook, the null hypothesis is that the means of the two groups are similar. Much later, the null hypothesis will be that there is no relationship between the two groups. Either way, remember that a null hypothesis is always saying that nothing is different.

  16. 13.1 Understanding Null Hypothesis Testing

    A crucial step in null hypothesis testing is finding the likelihood of the sample result if the null hypothesis were true. This probability is called the p value. A low p value means that the sample result would be unlikely if the null hypothesis were true and leads to the rejection of the null hypothesis. A p value that is not low means that ...

  17. An Easy Introduction to Statistical Significance (With Examples)

    In quantitative research, data are analyzed through null hypothesis significance testing, or hypothesis testing. This is a formal procedure for assessing whether a relationship between variables or a difference between groups is statistically significant. ... Aside from statistical significance, clinical significance and practical significance ...

  18. 13.2: Understanding Null Hypothesis Testing

    The Logic of Null Hypothesis Testing. Null hypothesis testing (often called null hypothesis significance testing or NHST) is a formal approach to deciding between two interpretations of a statistical relationship in a sample. One interpretation is called the null hypothesis (often symbolized H0 and read as "H-zero").

  19. New Guidelines for Null Hypothesis Significance Testing in ...

    Neyman and Pearson (1928, p. 205) did passingly use a probability of 5% in one of their examples and as one of multiple arguments for why the tested hypothesis may best be rejected. Fisher also argued at some point that results with higher than a 5% or even a 1% probability should not be seen as "unexpected" and should therefore be simply ignored.

  20. 13.1: Understanding Null Hypothesis Testing

    A crucial step in null hypothesis testing is finding the likelihood of the sample result if the null hypothesis were true. This probability is called the p value. A low p value means that the sample result would be unlikely if the null hypothesis were true and leads to the rejection of the null hypothesis. A p value that is not low means that ...

  21. Understanding Null Hypothesis Testing

    The Logic of Null Hypothesis Testing. Null hypothesis testing (often called null hypothesis significance testing or NHST) is a formal approach to deciding between two interpretations of a statistical relationship in a sample. One interpretation is called the null hypothesis (often symbolized H0 and read as "H-zero").

  22. Why we habitually engage in null-hypothesis significance testing: A

    Assessing statistical significance by means of contrasting the data with the null hypothesis is called Null Hypothesis Significance Testing (NHST). NHST is the best known and most widely used statistical procedure for making inferences about population effects. The procedure has become the prevailing paradigm in empirical science [3], and ...

  23. 6.4

    6.4 - Practical Significance. In the last lesson, you learned how to identify statistically significant differences using hypothesis testing methods. If the p value is less than the α level (typically 0.05), then the results are statistically significant. Results are said to be statistically significant when the difference between the ...
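
Several of the excerpts above describe the same decision rule: find the probability of the sample result if the null hypothesis were true (the p value), then compare it with the α level (typically 0.05). The sketch below illustrates that rule with a simple permutation test on the difference between two group means. The permutation approach, the function names `permutation_p_value` and `is_significant`, and the sample data are all assumptions made for illustration; none of the listed sources prescribes this particular method.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p value for the difference between two means.

    Hypothetical helper: under the null hypothesis the group labels are
    arbitrary, so we shuffle the pooled scores and count how often a
    relabeled difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


def is_significant(p_value, alpha=0.05):
    """Apply the decision rule: reject the null hypothesis when p < alpha."""
    return p_value < alpha


if __name__ == "__main__":
    # Made-up scores for two groups, purely for demonstration.
    treatment = [12, 15, 14, 16, 13, 17, 15, 14]
    control = [10, 11, 9, 12, 10, 11, 13, 10]
    p = permutation_p_value(treatment, control)
    print(f"p = {p:.4f}, reject null at alpha = 0.05: {is_significant(p)}")
```

A low p value here means that a difference as large as the observed one rarely arises when group labels are assigned at random, which is the sense in which the sample result "would be unlikely if the null hypothesis were true."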