P-Value And Statistical Significance: What It Is & Why It Matters

Saul McLeod, PhD

Editor-in-Chief for Simply Psychology

BSc (Hons) Psychology, MRes, PhD, University of Manchester

Saul McLeod, PhD, is a qualified psychology teacher with over 18 years of experience in further and higher education. He has been published in peer-reviewed journals, including the Journal of Clinical Psychology.


Olivia Guy-Evans, MSc

Associate Editor for Simply Psychology

BSc (Hons) Psychology, MSc Psychology of Education

Olivia Guy-Evans is a writer and associate editor for Simply Psychology. She has previously worked in healthcare and educational sectors.


The p-value in statistics quantifies the evidence against a null hypothesis. A low p-value suggests data is inconsistent with the null, potentially favoring an alternative hypothesis. Common significance thresholds are 0.05 or 0.01.

P-Value Explained in Normal Distribution

Hypothesis testing

When you perform a statistical test, a p-value helps you determine the significance of your results in relation to the null hypothesis.

The null hypothesis (H0) states no relationship exists between the two variables being studied (one variable does not affect the other). It states the results are due to chance and are not significant in supporting the idea being investigated. Thus, the null hypothesis assumes that whatever you try to prove did not happen.

The alternative hypothesis (Ha or H1) is the one you would believe if the null hypothesis is concluded to be untrue.

The alternative hypothesis states that the independent variable affected the dependent variable, and the results are significant in supporting the theory being investigated (i.e., the results are not due to random chance).

What a p-value tells you

A p-value, or probability value, is a number describing the probability of obtaining data at least as extreme as yours purely by random chance — that is, if the null hypothesis were true.

The level of statistical significance is often expressed as a p-value between 0 and 1.

The smaller the p-value, the less likely the results occurred by random chance, and the stronger the evidence that you should reject the null hypothesis.

Remember, a p-value doesn’t tell you if the null hypothesis is true or false. It just tells you how likely you’d see the data you observed (or more extreme data) if the null hypothesis was true. It’s a piece of evidence, not a definitive proof.

Example: Test Statistic and p-Value

Suppose you're conducting a study to determine whether a new drug has an effect on pain relief compared to a placebo. If the new drug has no impact, your test statistic will be close to the one predicted by the null hypothesis (no difference between the drug and placebo groups), and the resulting p-value will be close to 1. It may not be precisely 1 because real-world variation always exists.

Conversely, if the new drug indeed reduces pain, your test statistic will diverge further from what's expected under the null hypothesis, and the p-value will decrease. The p-value will never reach zero, because there's always a slim possibility, however improbable, that the observed results occurred by random chance.
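To make this concrete, here is a minimal sketch in Python (the pain scores are made up, and scipy is assumed available) showing how a clear group difference yields a large test statistic and a small p-value:

```python
# A minimal sketch with hypothetical data: comparing pain scores for a new
# drug vs. a placebo using a two-sample t-test from scipy.
from scipy import stats

drug = [3.1, 4.0, 2.8, 3.6, 3.9, 3.2, 3.5, 4.1, 2.9, 3.4]      # hypothetical pain scores (lower = more relief)
placebo = [5.0, 5.6, 4.8, 5.4, 5.9, 5.1, 4.9, 5.7, 5.3, 5.2]   # hypothetical placebo scores

t_stat, p = stats.ttest_ind(drug, placebo)
print(t_stat, p)  # a large |t| and a tiny p: the data are very unlikely under the null
```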

P-value interpretation

The significance level (alpha) is a set probability threshold (often 0.05), while the p-value is the probability you calculate based on your study or analysis.

A p-value less than or equal to your predetermined significance level (commonly 0.05 or 0.01) is considered statistically significant, meaning the observed data provide strong evidence against the null hypothesis.

This suggests the effect under study likely represents a real relationship rather than just random chance.

For instance, if you set α = 0.05, you would reject the null hypothesis if your p-value ≤ 0.05.

This indicates strong evidence against the null hypothesis, as there is less than a 5% probability of obtaining results at least this extreme if the null hypothesis were true.

Therefore, we reject the null hypothesis in favor of the alternative hypothesis.

Example: Statistical Significance

Upon analyzing the pain relief effects of the new drug compared to the placebo, the computed p-value is less than 0.01, which falls well below the predetermined alpha value of 0.05. Consequently, you conclude that there is a statistically significant difference in pain relief between the new drug and the placebo.

What does a p-value of 0.001 mean?

A p-value of 0.001 is highly statistically significant beyond the commonly used 0.05 threshold. It indicates strong evidence of a real effect or difference, rather than just random variation.

Specifically, a p-value of 0.001 means there is only a 0.1% chance of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is correct.

Such a small p-value provides strong evidence against the null hypothesis, leading to rejecting the null in favor of the alternative hypothesis.

A p-value greater than the significance level (typically p > 0.05) is not statistically significant; it indicates that the data do not provide sufficient evidence against the null hypothesis. It is not, by itself, strong evidence that the null hypothesis is true.

This means we fail to reject the null hypothesis. Note that you cannot accept the null hypothesis; you can only reject it or fail to reject it.

Note: even when the p-value falls below your threshold of significance, it does not mean that there is a 95% probability that the alternative hypothesis is true.

One-Tailed Test

[Figure: in a one-tailed test, the critical region lies in a single tail of the distribution.]

Two-Tailed Test

[Figure: in a two-tailed test, the critical region is split between both tails of the distribution.]

How do you calculate the p-value?

Most statistical software packages like R, SPSS, and others automatically calculate your p-value. This is the easiest and most common way.

Online resources and tables are available to estimate the p-value based on your test statistic and degrees of freedom.

These tables help you understand how often you would expect to see your test statistic under the null hypothesis.
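If you start from a test statistic, one line of code reproduces what the tables do. A sketch, assuming Python with scipy and a hypothetical t statistic of 2.1 with 30 degrees of freedom:

```python
from scipy import stats

t_stat, df = 2.1, 30                              # hypothetical values
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)    # sf(x) = 1 - cdf(x), the upper-tail area
print(p_two_tailed)                               # ~ 0.044
```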

Understanding the Statistical Test

Different statistical tests are designed to answer specific research questions or hypotheses. Each test has its own underlying assumptions and characteristics.

For example, you might use a t-test to compare means, a chi-squared test for categorical data, or a correlation test to measure the strength of a relationship between variables.

Be aware that the number of independent variables you include in your analysis can influence the magnitude of the test statistic needed to produce the same p-value.

This factor is particularly important to consider when comparing results across different analyses.

Example: Choosing a Statistical Test

If you're comparing the effectiveness of just two different drugs in pain relief, a two-sample t-test is a suitable choice for comparing these two groups. However, when you're examining the impact of three or more drugs, it's more appropriate to use an Analysis of Variance (ANOVA). Running multiple pairwise t-tests in such cases inflates the family-wise Type I error rate, overstating the significance of differences between the drug groups. A sketch of this choice appears below.
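A minimal sketch of that choice in Python (all scores are hypothetical; scipy's `ttest_ind` and `f_oneway` are assumed available):

```python
from scipy import stats

# hypothetical pain-relief scores for three drugs
a = [3.1, 3.5, 2.9, 3.8, 3.3]
b = [4.0, 4.4, 3.9, 4.6, 4.2]
c = [5.1, 5.5, 4.9, 5.8, 5.2]

t_stat, p_ab = stats.ttest_ind(a, b)       # fine for exactly two groups
f_stat, p_all = stats.f_oneway(a, b, c)    # one-way ANOVA for three or more groups
print(p_ab, p_all)
```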

How to report

A statistically significant result cannot prove that a research hypothesis is correct (which implies 100% certainty).

Instead, we may state our results “provide support for” or “give evidence for” our research hypothesis (as there is still a slight probability that the results occurred by chance and the null hypothesis was correct – e.g., less than 5%).

Example: Reporting the results

In our comparison of the pain relief effects of the new drug and the placebo, we observed that participants in the drug group experienced a significant reduction in pain (M = 3.5, SD = 0.8) compared to those in the placebo group (M = 5.2, SD = 0.7), an average difference of 1.7 points on the pain scale (t(98) = -9.36, p < .001).

The 6th edition of the APA style manual (American Psychological Association, 2010) states the following on the topic of reporting p-values:

“When reporting p values, report exact p values (e.g., p = .031) to two or three decimal places. However, report p values less than .001 as p < .001.

The tradition of reporting p values in the form p < .10, p < .05, p < .01, and so forth, was appropriate in a time when only limited tables of critical values were available.” (p. 114)

  • Do not use a 0 before the decimal point for the statistical value p, as it cannot be greater than 1. In other words, write p = .001 instead of p = 0.001.
  • Pay attention to italics (p is always italicized) and spacing (a space on either side of the = sign).
  • p = .000 (as outputted by some statistical packages such as SPSS) is impossible and should be written as p < .001.
  • The opposite of significant is “nonsignificant,” not “insignificant.”

Why is the p-value not enough?

A lower p-value is sometimes interpreted as meaning there is a stronger relationship between two variables.

However, statistical significance only means that the observed data would be unlikely (e.g., less than a 5% chance) if the null hypothesis were true; it says nothing about the size of the effect.

To understand the strength of the difference between the two groups (control vs. experimental), a researcher needs to calculate the effect size.

When do you reject the null hypothesis?

In statistical hypothesis testing, you reject the null hypothesis when the p-value is less than or equal to the significance level (α) you set before conducting your test. The significance level is the probability of rejecting the null hypothesis when it is true. Commonly used significance levels are 0.01, 0.05, and 0.10.

Remember, rejecting the null hypothesis doesn’t prove the alternative hypothesis; it just suggests that the alternative hypothesis may be plausible given the observed data.

The p-value is conditional upon the null hypothesis being true but is unrelated to the truth or falsity of the alternative hypothesis.

What does p-value of 0.05 mean?

If your p-value is less than or equal to 0.05 (the significance level), you would conclude that your result is statistically significant. This means the evidence is strong enough to reject the null hypothesis in favor of the alternative hypothesis.

Are all p-values below 0.05 considered statistically significant?

No, not automatically. The threshold of 0.05 is commonly used, but it's just a convention: if you pre-set a stricter significance level (such as 0.01), a p-value of 0.03 would not count as statistically significant. Whether a significant result is meaningful also depends on factors like the study design, sample size, and the magnitude of the observed effect.

A p-value below 0.05 means there is evidence against the null hypothesis, suggesting a real effect. However, it’s essential to consider the context and other factors when interpreting results.

Researchers also look at effect size and confidence intervals to determine the practical significance and reliability of findings.

How does sample size affect the interpretation of p-values?

Sample size can impact the interpretation of p-values. A larger sample size provides more reliable and precise estimates of the population, leading to narrower confidence intervals.

With a larger sample, even small differences between groups or effects can become statistically significant, yielding lower p-values. In contrast, smaller sample sizes may not have enough statistical power to detect smaller effects, resulting in higher p-values.

Therefore, a larger sample size increases the chances of finding statistically significant results when there is a genuine effect, making the findings more trustworthy and robust.

Can a non-significant p-value indicate that there is no effect or difference in the data?

No, a non-significant p-value does not necessarily indicate that there is no effect or difference in the data. It means that the observed data do not provide strong enough evidence to reject the null hypothesis.

There could still be a real effect or difference, but it might be smaller or more variable than the study was able to detect.

Other factors like sample size, study design, and measurement precision can influence the p-value. It’s important to consider the entire body of evidence and not rely solely on p-values when interpreting research findings.

Can P values be exactly zero?

While a p-value can be extremely small, it cannot be exactly zero. When a p-value is reported as p = 0.000, the actual p-value is simply too small for the software to display; this is strong evidence against the null hypothesis. Report p-values less than 0.001 as p < .001.

Further Information

  • P Value Calculator From T Score
  • P-Value Calculator For Chi-Square
  • P-values and significance tests (Khan Academy)
  • Hypothesis testing and p-values (Khan Academy)
  • Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond "p < 0.05". The American Statistician, 73(sup1), 1–19.
  • Criticism of using the "p < 0.05" threshold
  • Publication manual of the American Psychological Association
  • Statistics for Psychology Book Download



Statistics By Jim

Making statistics intuitive

How to Find the P value: Process and Calculations

By Jim Frost

P values are everywhere in statistics. They're in all types of hypothesis tests. But how do you calculate a p-value? Unsurprisingly, the precise calculations depend on the test. However, there is a general process that applies to finding a p value.

In this post, you’ll learn how to find the p value. I’ll start by showing you the general process for all hypothesis tests. Then I’ll move on to a step-by-step example showing the calculations for a p value. This post includes a calculator so you can apply what you learn.

General Process for How to Find the P value

To find the p value for your sample, do the following:

  • Identify the correct test statistic.
  • Calculate the test statistic using the relevant properties of your sample.
  • Specify the characteristics of the test statistic’s sampling distribution.
  • Place your test statistic in the sampling distribution to find the p value.

Before moving on to the calculations example, I’ll summarize the purpose for each step. This part tells you the “why.” In the example calculations section, I show the “how.”

Identify the Correct Test Statistic

All hypothesis tests boil your sample data down to a single number known as a test statistic. T-tests use t-values. F-tests use F-values. Chi-square tests use chi-square values. Choosing the correct one depends on the type of data you have and how you want to analyze it. Before you can find the p value, you must determine which hypothesis test and test statistic you’ll use.

Test statistics assess how consistent your sample data are with the null hypothesis. As a test statistic becomes more extreme, it indicates a larger difference between your sample data and the null hypothesis.

Calculate the Test Statistic

How you calculate the test statistic depends on which test you're using, so to find the p value for any test, you'll need to know the correct test statistic formula.

To learn more about test statistics and how to calculate them for other tests, read my article, Test Statistics .

Specify the Properties of the Test Statistic’s Sampling Distribution

Test statistics are unitless, making them tricky to interpret on their own. You need to place them in a larger context to understand how extreme they are.

The sampling distribution for the test statistic provides that context. Sampling distributions are a type of probability distribution. Consequently, they allow you to calculate probabilities related to your test statistic’s extremeness, which lets us find the p value!

[Figure: probability distribution plot displaying a t-distribution.]

Like any distribution, the same sampling distribution (e.g., the t-distribution) can have a variety of shapes depending upon its parameters . For this step, you need to determine the characteristics of the sampling distribution that fit your design and data.

That usually entails specifying the degrees of freedom (changes its shape) and whether the test is one- or two-tailed (affects the directions the test can detect effects). In essence, you’re taking the general sampling distribution and tailoring it to your study so it provides the correct probabilities for finding the p value.

Each test statistic’s sampling distribution has unique properties you need to specify. At the end of this post, I provide links for several.

Learn more about degrees of freedom and one-tailed vs. two-tailed tests .

Placing Your Test Statistic in its Sampling Distribution to Find the P value

Finally, it’s time to find the p value because we have everything in place. We have calculated our test statistic and determined the correct properties for its sampling distribution. Now, we need to find the probability of values more extreme than our observed test statistic.

In this context, more extreme means further away from the null value in both directions for a two-tailed test or in one direction for a one-tailed test.

At this point, there are two ways to use the test statistic and distribution to calculate the p value. The formulas for probability distributions are relatively complex. Consequently, you won’t calculate it directly. Instead, you’ll use either an online calculator or a statistical table for the test statistic. I’ll show you both approaches in the step-by-step example.

In summary, calculating a p-value involves identifying and calculating your test statistic and then placing it in its sampling distribution to find the probability of more extreme values!

Let’s see this whole process in action with an example!

Step-by-Step Example of How to Find the P value for a T-test

For this example, assume we’re tasked with determining whether a sample mean is different from a hypothesized value. We’re given the sample statistics below and need to find the p value.

  • Mean: 330.6
  • Standard deviation: 154.2
  • Sample size: 25
  • Null hypothesis value: 260

Let’s work through the step-by-step process of how to calculate a p-value.

First, we need to identify the correct test statistic. Because we’re comparing one mean to a null value, we need to use a 1-sample t-test. Hence, the t-value is our test statistic, and the t-distribution is our sampling distribution.

Second, we’ll calculate the test statistic. The t-value formula for a 1-sample t-test is the following:

t = (x̄ − µ₀) / (s / √n)

  • x̄ is the sample mean.
  • µ₀ is the null hypothesis value.
  • s is the sample standard deviation.
  • n is the sample size.
  • Collectively, the denominator (s / √n) is the standard error of the mean.

Let’s input our sample values into the equation to calculate the t-value.

t = (330.6 − 260) / (154.2 / √25) = 70.6 / 30.84 ≈ 2.289

Third, we need to specify the properties of the sampling distribution to find the p value. We’ll need the degrees of freedom.

The degrees of freedom for a 1-sample t-test is n – 1. Our sample size is 25. Hence, we have 24 DF. We’ll use a two-tailed test, which is the standard.

Now we’ve got all the necessary information to calculate the p-value. I’ll show you two ways to take the final step!

P-value Calculator

One method is to use an online p-value calculator, like the one I include below.

Enter the following in the calculator for our t-test example.

  • In What do you want?, choose Two-tailed p-value (the default).
  • In What do you have?, choose t-score.
  • In Degrees of freedom (d), enter 24.
  • In Your t-score, enter 2.289.

The calculator displays a result of 0.031178.

There you go! Using the standard significance level of 0.05, our results are statistically significant!

Using a Statistical Table to Find the P Value

The other common method is using a statistical table. In this case, we’ll need to use a t-table. For this example, I’ll truncate the rows. You can find my full table here: T-Table .

This method won’t find the exact p value, but you’ll find a range and know whether your results are statistically significant.

[Table: excerpt of the t-table for finding the p value.]

Start by looking in the row for 24 degrees of freedom, highlighted in light green. We need to find where our t-score of 2.289 fits in. I highlight the two table values that our t-value fits between, 2.064 and 2.492. Then we look at the two-tailed row at the top to find the corresponding p values for the two t-values.

In this case, our t-value of 2.289 produces a p value between 0.02 and 0.05 for a two-tailed test. Our results are statistically significant, and they are consistent with the calculator’s more precise results.
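If you'd rather verify these numbers programmatically, a quick sketch (assuming Python with scipy) reproduces both the one-tail area and the two-tailed p value:

```python
from scipy import stats

t_value, df = 2.289, 24
one_tail = stats.t.sf(t_value, df)   # upper-tail area, about 0.01559
p_value = 2 * one_tail               # two-tailed p, about 0.031178, matching the calculator
print(one_tail, p_value)
```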

Displaying the P value in a Chart

In the example above, you saw how to calculate a p-value starting with the sample statistics. We calculated the t-value and placed it in the applicable t-distribution. I find that the calculations and numbers are dry by themselves. I love graphing things whenever possible, so I’ll use a probability distribution plot to illustrate the example.

Using statistical software, I’ll create the graphical equivalent of calculating the p-value above.

[Figure: t-distribution with the tail areas beyond ±2.289 shaded.]

This chart has two shaded regions because we performed a two-tailed test. Each region has a probability of 0.01559. When you sum them, you obtain the p-value of 0.03118. In other words, the likelihood of a t-value falling in either shaded region when the null hypothesis is true is 0.03118.

I showed you how to find the p value for a t-test. Click the links below to see how it works for other hypothesis tests:

  • One-Way ANOVA F-test
  • Chi-square Test of Independence

Now that we’ve found the p value, how do you interpret it precisely? If you’re going beyond the significant/not significant decision and really want to understand what it means, read my posts, Interpreting P Values  and Statistical Significance: Definition & Meaning .

If you’re learning about hypothesis testing and like the approach I use in my blog, check out my Hypothesis Testing book! You can find it at Amazon and other retailers.


Reader Interactions

January 9, 2024 at 9:58 am

How did you get the 0.01559? Is it from the t-table or somewhere else? Please put me through.

January 9, 2024 at 3:13 pm

The value of 0.01559 comes from the t-distribution. It’s the probability of each red shaded region in the graph I show. These regions are based on the t-value. Typically, you’ll use either statistical software or a t-distribution calculator to find probabilities associated with t-values. Or use a t-table. I used my statistical software. You don’t calculate those probabilities yourself because the calculations are complex.

I hope that helps!



Hypothesis Testing | A Step-by-Step Guide with Easy Examples

Published on November 8, 2019 by Rebecca Bevans . Revised on June 22, 2023.

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics . It is most often used by scientists to test specific predictions, called hypotheses, that arise from theories.

There are 5 main steps in hypothesis testing:

  • State your research hypothesis as a null hypothesis (H₀) and an alternate hypothesis (Hₐ or H₁).
  • Collect data in a way designed to test the hypothesis.
  • Perform an appropriate statistical test .
  • Decide whether to reject or fail to reject your null hypothesis.
  • Present the findings in your results and discussion section.

Though the specific details might vary, the procedure you will use when testing a hypothesis will always follow some version of these steps.


Step 1: State your null and alternate hypothesis

After developing your initial research hypothesis (the prediction that you want to investigate), it is important to restate it as a null (H₀) and alternate (Hₐ) hypothesis so that you can test it mathematically.

The alternate hypothesis is usually your initial hypothesis that predicts a relationship between variables. The null hypothesis is a prediction of no relationship between the variables you are interested in.

  • H₀: Men are, on average, not taller than women.
  • Hₐ: Men are, on average, taller than women.


Step 2: Collect data

For a statistical test to be valid, it is important to perform sampling and collect data in a way that is designed to test your hypothesis. If your data are not representative, then you cannot make statistical inferences about the population you are interested in.

Step 3: Perform a statistical test

There are a variety of statistical tests available, but they are all based on the comparison of within-group variance (how spread out the data is within a category) versus between-group variance (how different the categories are from one another).

If the between-group variance is large enough that there is little or no overlap between groups, then your statistical test will reflect that by showing a low p -value . This means it is unlikely that the differences between these groups came about by chance.

Alternatively, if there is high within-group variance and low between-group variance, then your statistical test will reflect that with a high p -value. This means it is likely that any difference you measure between groups is due to chance.

Your choice of statistical test will be based on the type of variables and the level of measurement of your collected data .

For example, a t-test comparing the two groups' heights will give you (see the sketch below):

  • an estimate of the difference in average height between the two groups.
  • a p-value showing how likely you are to see this difference if the null hypothesis of no difference is true.
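A sketch of that output, using made-up height data (all numbers hypothetical; numpy and scipy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
men = rng.normal(178, 7, size=50)      # hypothetical heights in cm
women = rng.normal(165, 7, size=50)

estimate = men.mean() - women.mean()       # estimated difference in average height
t_stat, p = stats.ttest_ind(men, women)    # p-value under the null of no difference
print(estimate, p)
```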

Step 4: Decide whether to reject or fail to reject your null hypothesis

Based on the outcome of your statistical test, you will have to decide whether to reject or fail to reject your null hypothesis.

In most cases you will use the p -value generated by your statistical test to guide your decision. And in most cases, your predetermined level of significance for rejecting the null hypothesis will be 0.05 – that is, when there is a less than 5% chance that you would see these results if the null hypothesis were true.

In some cases, researchers choose a more conservative level of significance, such as 0.01 (1%). This minimizes the risk of incorrectly rejecting the null hypothesis ( Type I error ).

Step 5: Present your findings

The results of hypothesis testing will be presented in the results and discussion sections of your research paper, dissertation or thesis.

In the results section you should give a brief summary of the data and a summary of the results of your statistical test (for example, the estimated difference between group means and associated p -value). In the discussion , you can discuss whether your initial hypothesis was supported by your results or not.

In the formal language of hypothesis testing, we talk about rejecting or failing to reject the null hypothesis. You will probably be asked to do this in your statistics assignments.

However, when presenting research results in academic papers we rarely talk this way. Instead, we go back to our alternate hypothesis (in this case, the hypothesis that men are on average taller than women) and state whether the result of our test did or did not support the alternate hypothesis.

If your null hypothesis was rejected, this result is interpreted as “supported the alternate hypothesis.”

These are superficial differences; you can see that they mean the same thing.

You might notice that we don’t say that we reject or fail to reject the alternate hypothesis . This is because hypothesis testing is not designed to prove or disprove anything. It is only designed to test whether a pattern we measure could have arisen spuriously, or by chance.

If we reject the null hypothesis based on our research (i.e., we find that it is unlikely that the pattern arose by chance), then we can say our test lends support to our hypothesis . But if the pattern does not pass our decision rule, meaning that it could have arisen by chance, then we say the test is inconsistent with our hypothesis .

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Descriptive statistics
  • Measures of central tendency
  • Correlation coefficient

Methodology

  • Cluster sampling
  • Stratified sampling
  • Types of interviews
  • Cohort study
  • Thematic analysis

Research bias

  • Implicit bias
  • Cognitive bias
  • Survivorship bias
  • Availability heuristic
  • Nonresponse bias
  • Regression to the mean

Frequently asked questions about hypothesis testing

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses, by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.

A hypothesis states your predictions about what your research will find. It is a tentative answer to your research question that has not yet been tested. For some research projects, you might have to write several hypotheses that address different aspects of your research question.

A hypothesis is not just a guess — it should be based on existing theories and knowledge. It also has to be testable, which means you can support or refute it through scientific research methods (such as experiments, observations and statistical analysis of data).

Null and alternative hypotheses are used in statistical hypothesis testing . The null hypothesis of a test always predicts no effect or no relationship between variables, while the alternative hypothesis states your research prediction of an effect or relationship.

Cite this Scribbr article

Bevans, R. (2023, June 22). Hypothesis Testing | A Step-by-Step Guide with Easy Examples. Scribbr. Retrieved September 3, 2024, from https://www.scribbr.com/statistics/hypothesis-testing/



Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics

Topics: Hypothesis Testing, Statistics

What do significance levels and P values mean in hypothesis tests? What is statistical significance anyway? In this post, I’ll continue to focus on concepts and graphs to help you gain a more intuitive understanding of how hypothesis tests work in statistics.

To bring it to life, I’ll add the significance level and P value to the graph in my previous post in order to perform a graphical version of the 1 sample t-test. It’s easier to understand when you can see what statistical significance truly means!

Here’s where we left off in my last post . We want to determine whether our sample mean (330.6) indicates that this year's average energy cost is significantly different from last year’s average energy cost of $260.

[Figure: descriptive statistics for the energy cost example.]

The probability distribution plot above shows the distribution of sample means we’d obtain under the assumption that the null hypothesis is true (population mean = 260) and we repeatedly drew a large number of random samples.

I left you with a question: where do we draw the line for statistical significance on the graph? Now we'll add in the significance level and the P value, which are the decision-making tools we'll need.

We'll use these tools to test the following hypotheses:

  • Null hypothesis: The population mean equals the hypothesized mean (260).
  • Alternative hypothesis: The population mean differs from the hypothesized mean (260).

What Is the Significance Level (Alpha)?

The significance level, also denoted as alpha or α, is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.

These types of definitions can be hard to understand because of their technical nature. A picture makes the concepts much easier to comprehend!

The significance level determines how far out from the null hypothesis value we'll draw that line on the graph. To graph a significance level of 0.05, we need to shade the 5% of the distribution that is furthest away from the null hypothesis.

[Figure: probability plot showing the critical regions for a significance level of 0.05.]

In the graph above, the two shaded areas are equidistant from the null hypothesis value and each area has a probability of 0.025, for a total of 0.05. In statistics, we call these shaded areas the critical region for a two-tailed test. If the population mean is 260, we’d expect to obtain a sample mean that falls in the critical region 5% of the time. The critical region defines how far away our sample statistic must be from the null hypothesis value before we can say it is unusual enough to reject the null hypothesis.

Our sample mean (330.6) falls within the critical region, which indicates it is statistically significant at the 0.05 level.

We can also see if it is statistically significant using the other common significance level of 0.01.

[Figure: probability plot showing the critical regions for a significance level of 0.01.]

The two shaded areas each have a probability of 0.005, which adds up to a total probability of 0.01. This time our sample mean does not fall within the critical region and we fail to reject the null hypothesis. This comparison shows why you need to choose your significance level before you begin your study. It protects you from choosing a significance level because it conveniently gives you significant results!
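To see where those cutoffs fall numerically, here is a sketch (assuming Python with scipy, and the same one-sample t setup as the companion post: null mean 260, s = 154.2, n = 25):

```python
import math
from scipy import stats

null_mean, s, n = 260, 154.2, 25     # values from the companion post
se = s / math.sqrt(n)                # standard error of the mean

for alpha in (0.05, 0.01):
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)     # two-tailed critical t-value
    lo, hi = null_mean - t_crit * se, null_mean + t_crit * se
    print(alpha, (round(lo, 1), round(hi, 1)), not lo <= 330.6 <= hi)
# alpha = 0.05: critical region beyond roughly (196.4, 323.6) -> 330.6 is significant
# alpha = 0.01: critical region beyond roughly (173.7, 346.3) -> 330.6 is not
```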

Thanks to the graph, we were able to determine that our results are statistically significant at the 0.05 level without using a P value. However, when you use the numeric output produced by statistical software , you’ll need to compare the P value to your significance level to make this determination.


What Are P values?

P-values are the probability of obtaining an effect at least as extreme as the one in your sample data, assuming the truth of the null hypothesis.

This definition of P values, while technically correct, is a bit convoluted. It’s easier to understand with a graph!

To graph the P value for our example data set, we need to determine the distance between the sample mean and the null hypothesis value (330.6 − 260 = 70.6). Next, we can graph the probability of obtaining a sample mean that is at least as extreme in both tails of the distribution (260 ± 70.6).

[Figure: probability plot showing the p-value for our sample mean.]

In the graph above, the two shaded areas each have a probability of 0.01556, for a total probability of 0.03112. This probability represents the likelihood of obtaining a sample mean that is at least as extreme as our sample mean in both tails of the distribution if the population mean is 260. That's our P value!

When a P value is less than or equal to the significance level, you reject the null hypothesis. If we take the P value for our example and compare it to the common significance levels, it matches the previous graphical results. The P value of 0.03112 is statistically significant at an alpha level of 0.05, but not at the 0.01 level.

If we stick to a significance level of 0.05, we can conclude that the average energy cost for the population differs from 260; because our sample mean is higher, we conclude it is greater than 260.

A common mistake is to interpret the P-value as the probability that the null hypothesis is true. To understand why this interpretation is incorrect, please read my blog post  How to Correctly Interpret P Values .

Discussion about Statistically Significant Results

A hypothesis test evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data. A test result is statistically significant when the sample statistic is unusual enough relative to the null hypothesis that we can reject the null hypothesis for the entire population. “Unusual enough” in a hypothesis test is defined by:

  • The assumption that the null hypothesis is true—the graphs are centered on the null hypothesis value.
  • The significance level—how far out do we draw the line for the critical region?
  • Our sample statistic—does it fall in the critical region?

Keep in mind that there is no magic significance level that distinguishes between the studies that have a true effect and those that don’t with 100% accuracy. The common alpha values of 0.05 and 0.01 are simply based on tradition. For a significance level of 0.05, expect to obtain sample means in the critical region 5% of the time when the null hypothesis is true . In these cases, you won’t know that the null hypothesis is true but you’ll reject it because the sample mean falls in the critical region. That’s why the significance level is also referred to as an error rate!

This type of error doesn’t imply that the experimenter did anything wrong or require any other unusual explanation. The graphs show that when the null hypothesis is true, it is possible to obtain these unusual sample means for no reason other than random sampling error. It’s just luck of the draw.

Significance levels and P values are important tools that help you quantify and control this type of error in a hypothesis test. Using these tools to decide when to reject the null hypothesis increases your chance of making the correct decision.

If you like this post, you might want to read the other posts in this series that use the same graphical framework:

  • Previous: Why We Need to Use Hypothesis Tests
  • Next: Confidence Intervals and Confidence Levels

If you'd like to see how I made these graphs, please read: How to Create a Graphical Version of the 1-sample t-Test .


P-Value in Statistical Hypothesis Tests: What is it?

P Value Definition

A p value is used in hypothesis testing to help you support or reject the null hypothesis. The p value is the evidence against a null hypothesis. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.

P values are expressed as decimals, although it may be easier to understand what they are if you convert them to a percentage. For example, a p value of 0.0254 is 2.54%. This means there is a 2.54% chance of seeing results at least as extreme as yours if only random chance were at work (i.e., if the null hypothesis were true). That's pretty tiny. On the other hand, a large p-value of .9 (90%) means results like yours would be entirely unsurprising under random chance alone. Therefore, the smaller the p-value, the more important ("significant") your results.

When you run a hypothesis test, you compare the p value from your test to the alpha level you selected when you ran the test. Alpha levels can also be written as percentages.


P Value vs Alpha level

Alpha levels are controlled by the researcher and are related to confidence levels. You get an alpha level by subtracting your confidence level from 100%. For example, if you want to be 98 percent confident in your research, the alpha level would be 2% (100% − 98%). When you run the hypothesis test, the test will give you a value for p. Compare that value to your chosen alpha level. For example, let's say you chose an alpha level of 5% (0.05). If the results from the test give you:

  • A small p (≤ 0.05) means you reject the null hypothesis. This is strong evidence against the null hypothesis.
  • A large p (> 0.05) means the evidence against the null hypothesis is weak, so you do not reject the null.


What if I Don’t Have an Alpha Level?

In an ideal world, you’ll have an alpha level. But if you do not, you can still use the following rough guidelines in deciding whether to support or reject the null hypothesis:

  • If p > .10 → “not significant”
  • If p ≤ .10 → “marginally significant”
  • If p ≤ .05 → “significant”
  • If p ≤ .01 → “highly significant.”

How to Calculate a P Value on the TI 83

Example question: The average wait time to see an E.R. doctor is said to be 150 minutes. You think the wait time is actually less. You take a random sample of 30 people and find their average wait is 148 minutes with a standard deviation of 5 minutes. Assume the distribution is normal. Find the p value for this test.

  • Press STAT then arrow over to TESTS.
  • Press ENTER for Z-Test .
  • Arrow over to Stats. Press ENTER.
  • Arrow down to μ0 and type 150. This is our null hypothesis mean.
  • Arrow down to σ. Type in your std dev: 5.
  • Arrow down to xbar. Type in your sample mean : 148.
  • Arrow down to n. Type in your sample size : 30.
  • Arrow to <μ0 for a left tail test . Press ENTER.
  • Arrow down to Calculate. Press ENTER. P is given as .014, or about 1.4%.

The probability of getting a sample mean of 148 minutes or less, if the true average wait time were 150 minutes, is tiny, so you should reject the null hypothesis.

Note: If you don't want to run a test, you could also use the TI 83 NormCDF function to get the area (which is the same thing as the probability value).
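For comparison, here is a sketch of the same left-tailed z-test in Python (scipy assumed):

```python
import math
from scipy.stats import norm

z = (148 - 150) / (5 / math.sqrt(30))  # test statistic, about -2.19
p = norm.cdf(z)                        # left-tailed p-value, about 0.014
print(z, p)                            # matches the TI 83 output
```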


p-value Calculator


Welcome to our p-value calculator! You will never again have to wonder how to find the p-value, as here you can determine the one-sided and two-sided p-values from test statistics, following all the most popular distributions: normal, t-Student, chi-squared, and Snedecor's F.

P-values appear all over science, yet many people find the concept a bit intimidating. Don't worry – in this article, we will explain not only what the p-value is but also how to interpret p-values correctly . Have you ever been curious about how to calculate the p-value by hand? We provide you with all the necessary formulae as well!

🙋 If you want to revise some basics from statistics, our normal distribution calculator is an excellent place to start.

What is p-value?

Formally, the p-value is the probability that the test statistic will produce values at least as extreme as the value it produced for your sample. It is crucial to remember that this probability is calculated under the assumption that the null hypothesis H₀ is true!

More intuitively, p-value answers the question:

Assuming that I live in a world where the null hypothesis holds, how probable is it that, for another sample, the test I'm performing will generate a value at least as extreme as the one I observed for the sample I already have?

It is the alternative hypothesis that determines what "extreme" actually means, so the p-value depends on the alternative hypothesis that you state: left-tailed, right-tailed, or two-tailed. In the formulas below, S stands for the test statistic, x for the value it produced for your sample, and Pr(event | H₀) for the probability of an event, calculated under the assumption that H₀ is true:

Left-tailed test: p-value = Pr(S ≤ x | H₀)

Right-tailed test: p-value = Pr(S ≥ x | H₀)

Two-tailed test: p-value = 2 × min{Pr(S ≤ x | H₀), Pr(S ≥ x | H₀)}

(By min{a, b}, we denote the smaller of a and b.)

If the distribution of the test statistic under H₀ is symmetric about 0, then: p-value = 2 × Pr(S ≥ |x| | H₀)

or, equivalently: p-value = 2 × Pr(S ≤ −|x| | H₀)

As a picture is worth a thousand words, let us illustrate these definitions. Here, we use the fact that the probability can be neatly depicted as the area under the density curve for a given distribution. We give two sets of pictures: one for a symmetric distribution and the other for a skewed (non-symmetric) distribution.

  • Symmetric case: normal distribution:

[Figure: p-values for a symmetric distribution: left-tailed, right-tailed, and two-tailed tests.]

  • Non-symmetric case: chi-squared distribution:

[Figure: p-values for a non-symmetric distribution: left-tailed, right-tailed, and two-tailed tests.]

In the last picture (two-tailed p-value for skewed distribution), the area of the left-hand side is equal to the area of the right-hand side.

How do I calculate p-value from test statistic?

To determine the p-value, you need to know the distribution of your test statistic under the assumption that the null hypothesis is true. Then, with the help of the cumulative distribution function (cdf) of this distribution, we can express the probability of the test statistic being at least as extreme as its value x for the sample:

Left-tailed test:

p-value = cdf(x)

Right-tailed test:

p-value = 1 − cdf(x)

Two-tailed test:

p-value = 2 × min{cdf(x), 1 − cdf(x)}

If the distribution of the test statistic under H₀ is symmetric about 0, then a two-sided p-value can be simplified to p-value = 2 × cdf(−|x|), or, equivalently, p-value = 2 − 2 × cdf(|x|).
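These three cases translate directly into code. A minimal sketch (assuming Python with scipy; `dist` can be any frozen distribution whose cdf you can evaluate):

```python
from scipy import stats

def p_value(dist, x, tail="two-tailed"):
    """p-value for test statistic x under the null distribution `dist`."""
    cdf = dist.cdf(x)
    if tail == "left-tailed":
        return cdf
    if tail == "right-tailed":
        return 1 - cdf
    return 2 * min(cdf, 1 - cdf)  # two-tailed

# example: t statistic of 2.289 with 24 degrees of freedom
print(p_value(stats.t(df=24), 2.289))  # about 0.0312
```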

The probability distributions that are most widespread in hypothesis testing tend to have complicated cdf formulae, and finding the p-value by hand may not be possible. You'll likely need to resort to a computer or to a statistical table, where people have gathered approximate cdf values.

Well, you now know how to calculate the p-value, but… why do you need to calculate this number in the first place? In hypothesis testing, the p-value approach is an alternative to the critical value approach . Recall that the latter requires researchers to pre-set the significance level, α, which is the probability of rejecting the null hypothesis when it is true (so of type I error ). Once you have your p-value, you just need to compare it with any given α to quickly decide whether or not to reject the null hypothesis at that significance level, α. For details, check the next section, where we explain how to interpret p-values.

How to interpret p-value

As we have mentioned above, the p-value is the answer to the following question:

Assuming that I live in a world where the null hypothesis holds, how probable is it that, for another sample, the test I'm performing will generate a value at least as extreme as the one I observed for the sample I already have?

What does that mean for you? Well, you've got two options:

  • A high p-value means that your data is highly compatible with the null hypothesis; and
  • A small p-value provides evidence against the null hypothesis , as it means that your result would be very improbable if the null hypothesis were true.

However, it may happen that the null hypothesis is true, but your sample is highly unusual! For example, imagine we studied the effect of a new drug and got a p-value of 0.03 . This means that in 3% of similar studies, random chance alone would still be able to produce the value of the test statistic that we obtained, or a value even more extreme, even if the drug had no effect at all!

The question "what is p-value" can also be answered as follows: p-value is the smallest level of significance at which the null hypothesis would be rejected. So, if you now want to make a decision on the null hypothesis at some significance level α , just compare your p-value with α :

  • If p-value ≤ α, then you reject the null hypothesis and accept the alternative hypothesis; and
  • If p-value > α, then you don't have enough evidence to reject the null hypothesis.

Obviously, the fate of the null hypothesis depends on α . For instance, if the p-value was 0.03 , we would reject the null hypothesis at a significance level of 0.05 , but not at a level of 0.01 . That's why the significance level should be stated in advance and not adapted conveniently after the p-value has been established! A significance level of 0.05 is the most common value, but there's nothing magical about it. Here, you can see what too strong a faith in the 0.05 threshold can lead to. It's always best to report the p-value, and allow the reader to make their own conclusions.

Also, bear in mind that subject-area expertise (and common sense) is crucial. Otherwise, by mindlessly applying statistical principles, you can easily arrive at a statistically significant result whose conclusion is 100% untrue.

How to use the p-value calculator to find p-value from test statistic

As our p-value calculator is here at your service, you no longer need to wonder how to find p-value from all those complicated test statistics! Here are the steps you need to follow:

  • Pick the alternative hypothesis: two-tailed, right-tailed, or left-tailed.
  • Tell us the distribution of your test statistic under the null hypothesis: is it N(0,1), t-Student, chi-squared, or Snedecor's F? If you are unsure, check the sections below, as they are devoted to these distributions.
  • If needed, specify the degrees of freedom of the test statistic's distribution.
  • Enter the value of the test statistic computed for your data sample.
  • By default, the calculator uses the significance level of 0.05.

Our calculator determines the p-value from the test statistic and provides the decision to be made about the null hypothesis.

How do I find p-value from z-score?

In terms of the cumulative distribution function (cdf) of the standard normal distribution, which is traditionally denoted by Φ , the p-value is given by the following formulae:

Left-tailed z-test:

p-value = Φ(z)

Right-tailed z-test:

p-value = 1 − Φ(z)

Two-tailed z-test:

p-value = 2 × Φ(−|z|)

or, equivalently: p-value = 2 − 2 × Φ(|z|)

(where z is your Z-score)

🙋 To learn more about Z-tests, head to Omni's Z-test calculator .

We use the Z-score if the test statistic approximately follows the standard normal distribution N(0,1) . Thanks to the central limit theorem, you can count on the approximation if you have a large sample (say at least 50 data points) and treat your distribution as normal.

A Z-test most often refers to testing the population mean , or the difference between two population means, in particular between two proportions. You can also find Z-tests in maximum likelihood estimations.
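In code, Φ is the standard normal cdf, so the formulas above become one-liners. A sketch, assuming scipy:

```python
from scipy.stats import norm

z = 1.96                         # example Z-score
p_left = norm.cdf(z)             # left-tailed: about 0.975
p_right = norm.sf(z)             # right-tailed: 1 - cdf, about 0.025
p_two = 2 * norm.sf(abs(z))      # two-tailed: about 0.05
print(p_left, p_right, p_two)
```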

How do I find p-value from t?

The p-value from the t-score is given by the following formulae, in which cdf_t,d stands for the cumulative distribution function of the t-Student distribution with d degrees of freedom, and t for your t-score:

Left-tailed t-test:

p-value = cdf_t,d(t)

Right-tailed t-test:

p-value = 1 − cdf_t,d(t)

Two-tailed t-test:

p-value = 2 × cdf_t,d(−|t|)

or, equivalently: p-value = 2 − 2 × cdf_t,d(|t|)

Use the t-score option if your test statistic follows the t-Student distribution . This distribution has a shape similar to N(0,1) (bell-shaped and symmetric) but has heavier tails – the exact shape depends on the parameter called the degrees of freedom . If the number of degrees of freedom is large (>30), which generically happens for large samples, the t-Student distribution is practically indistinguishable from the normal distribution N(0,1).

The most common t-tests are those for population means with an unknown population standard deviation, or for the difference between means of two populations , with either equal or unequal yet unknown population standard deviations. There's also a t-test for paired (dependent) samples .
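The same recipe with the t-Student cdf, as a sketch (scipy assumed):

```python
from scipy.stats import t

t_score, d = 2.289, 24               # example values
p_right = t.sf(t_score, d)           # right-tailed: 1 - cdf
p_two = 2 * t.sf(abs(t_score), d)    # two-tailed: about 0.0312
print(p_right, p_two)
```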

🙋 To get more insights into t-statistics, we recommend using our t-test calculator .

p-value from chi-square score (χ² score)

Use the χ²-score option when performing a test in which the test statistic follows the χ²-distribution .

This distribution arises if, for example, you take the sum of squared variables, each following the normal distribution N(0,1). Remember to check the number of degrees of freedom of the χ²-distribution of your test statistic!

How do you find the p-value from the chi-square score? You can do it with the help of the following formulae, in which cdf_χ²,d denotes the cumulative distribution function of the χ²-distribution with d degrees of freedom, and χ² your χ²-score:

Left-tailed χ²-test:

p-value = cdf_χ²,d(χ²)

Right-tailed χ²-test:

p-value = 1 − cdf_χ²,d(χ²)

Remember that χ²-tests for goodness-of-fit and independence are right-tailed tests! (see below)

Two-tailed χ²-test:

p-value = 2 × min{cdf_χ²,d(χ²), 1 − cdf_χ²,d(χ²)}

(By min{a, b}, we denote the smaller of a and b.)

The most popular tests which lead to a χ²-score are the following:

Testing whether the variance of normally distributed data has some pre-determined value. In this case, the test statistic has the χ²-distribution with n - 1 degrees of freedom, where n is the sample size. This can be a one-tailed or two-tailed test .

Goodness-of-fit test checks whether the empirical (sample) distribution agrees with some expected probability distribution. In this case, the test statistic follows the χ²-distribution with k - 1 degrees of freedom, where k is the number of classes into which the sample is divided. This is a right-tailed test .

Independence test is used to determine if there is a statistically significant relationship between two variables. In this case, its test statistic is based on the contingency table and follows the χ²-distribution with (r - 1)(c - 1) degrees of freedom, where r is the number of rows, and c is the number of columns in this contingency table. This also is a right-tailed test .
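As a sketch in Python (scipy assumed), the right-tailed case used by goodness-of-fit and independence tests is a single survival-function call:

```python
from scipy.stats import chi2

score, d = 3.84, 1                 # example: the familiar 0.05 cutoff for df = 1
p_right = chi2.sf(score, d)        # right-tailed: about 0.05
p_two = 2 * min(chi2.cdf(score, d), chi2.sf(score, d))  # two-tailed variant
print(p_right, p_two)
```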

p-value from F-score

Finally, the F-score option should be used when you perform a test in which the test statistic follows the F-distribution , also known as the Fisher–Snedecor distribution. The exact shape of an F-distribution depends on two degrees of freedom .

To see where those degrees of freedom come from, consider the independent random variables X and Y, which follow the χ²-distributions with d₁ and d₂ degrees of freedom, respectively. In that case, the ratio (X/d₁)/(Y/d₂) follows the F-distribution with (d₁, d₂) degrees of freedom. For this reason, d₁ and d₂ are also called the numerator and denominator degrees of freedom.

The p-value from the F-score is given by the following formulae, where cdf_F,d1,d2 denotes the cumulative distribution function of the F-distribution with (d₁, d₂) degrees of freedom, and F your F-score:

Left-tailed F-test:

p-value = cdf_F,d1,d2(F)

Right-tailed F-test:

p-value = 1 − cdf_F,d1,d2(F)

Two-tailed F-test:

p-value = 2 × min{cdf_F,d1,d2(F), 1 − cdf_F,d1,d2(F)}
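And the F-distribution version, as a sketch (scipy assumed; recall that an F(1, d) statistic is the square of a t(d) statistic, which gives a handy sanity check):

```python
from scipy.stats import f

f_score, d1, d2 = 4.26, 1, 24      # 4.26 is roughly 2.064 squared, the t critical value for df = 24
p_right = f.sf(f_score, d1, d2)    # right-tailed: about 0.05
p_two = 2 * min(f.cdf(f_score, d1, d2), f.sf(f_score, d1, d2))
print(p_right, p_two)
```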

Below we list the most important tests that produce F-scores. All of them are right-tailed tests .

A test for the equality of variances in two normally distributed populations . Its test statistic follows the F-distribution with (n - 1, m - 1) -degrees of freedom, where n and m are the respective sample sizes.

ANOVA is used to test the equality of means in three or more groups that come from normally distributed populations with equal variances. We arrive at the F-distribution with (k - 1, n - k) -degrees of freedom, where k is the number of groups, and n is the total sample size (in all groups together).

A test for the overall significance of a regression analysis. The test statistic has an F-distribution with (k - 1, n - k) degrees of freedom, where n is the sample size, and k is the number of variables (including the intercept).

Once the above test has established the presence of a linear relationship in your data sample, you can calculate the coefficient of determination, R², which indicates the strength of this relationship. You can do it by hand or use our coefficient of determination calculator.

A test to compare two nested regression models. The test statistic follows the F-distribution with (k₂ - k₁, n - k₂) degrees of freedom, where k₁ and k₂ are the numbers of variables in the smaller and bigger models, respectively, and n is the sample size.

You may notice that the F-test of overall significance is a particular case of the F-test for comparing two nested models: it tests whether our model does significantly better than the model with no predictors (i.e., the intercept-only model).
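You can check this equivalence yourself in R. The sketch below uses simulated data (not from the text): the F-statistic reported by summary() for the fitted regression matches the F-statistic from an explicit comparison against the intercept-only model:

```r
# The overall-significance F-test equals the nested-model F-test
# against the intercept-only model (illustrative simulated data)
set.seed(1)
x <- rnorm(40)
y <- 1 + 0.5 * x + rnorm(40)   # invented linear relationship plus noise
fit1 <- lm(y ~ x)              # model with one predictor
fit0 <- lm(y ~ 1)              # intercept-only model (no predictors)
summary(fit1)$fstatistic       # overall F-test: value, numerator df, denominator df
anova(fit0, fit1)              # nested comparison: same F value, same p-value
```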

Can a p-value be negative?

No, the p-value cannot be negative. This is because the p-value is itself a probability (the probability that the test statistic turns out at least as extreme as the value observed), and probabilities cannot be negative.

What does a high p-value mean?

A high p-value means that under the null hypothesis, there's a high probability that for another sample, the test statistic will generate a value at least as extreme as the one observed in the sample you already have. A high p-value doesn't allow you to reject the null hypothesis.

What does a low p-value mean?

A low p-value means that under the null hypothesis, there's little probability that for another sample, the test statistic will generate a value at least as extreme as the one observed for the sample you already have. A low p-value is evidence in favor of the alternative hypothesis: it allows you to reject the null hypothesis whenever it falls below your significance level.


Many p-values are equal to 1 after Bonferroni correction; is this normal?

I have behavioral data for 10 animals that I studied at two different times of day, AM and PM. I want to see whether there is a significant difference in 8 different behaviors between AM and PM. I have two sites (LL and WK), so I wanted to look at the differences per site as well as in the combined data. I therefore performed 24 t-tests and got reasonable p-values, mostly indicating no significant difference between the two times.

I then applied the Bonferroni correction to adjust for the repeated t-tests. However, 22 of the 24 adjusted p-values now equal 1, which seems odd to me. Any ideas about what went wrong, or whether these adjusted values are correct?

Here is the code I used to calculate the Bonferroni correction factors:

[The code was posted as a screenshot, which is not reproduced here; a hypothetical reconstruction follows.]
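Based on the later answers (which mention p.adjust() and a vector called raw.p), the lost code presumably looked something like the following hypothetical reconstruction; the p-values here are invented:

```r
# Hypothetical reconstruction of the asker's code (the original was an image)
raw.p <- c(0.04, 0.10, 0.35, 0.52, 0.61, 0.77, 0.83, 0.90)  # invented raw p-values
# n = 24 mirrors the 24 tests described in the question
adjusted.p <- p.adjust(raw.p, method = "bonferroni", n = 24)
# p.adjust multiplies each p-value by n and caps the result at 1,
# which is why most adjusted values come out exactly equal to 1
cbind(raw.p, adjusted.p)
```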

  • hypothesis-testing
  • multiple-comparisons


  • $\begingroup$ maybe this can help you: stats.stackexchange.com/questions/164181/… $\endgroup$ –  user83346 Commented Jun 24, 2017 at 6:58
  • $\begingroup$ In R, p.adjust(x, method = "bonferroni") multiplies the individual p-values by the number of tests - in your case 24 - but then truncates to 1.0 any result greater than 1. $\endgroup$ –  Florin Andrei Commented Oct 26, 2023 at 20:15

4 Answers

Nothing went wrong. The adjusted p-values are correct. Adjusted $p=1$ simply means no evidence at all for rejecting the null hypothesis.

Holm's method is always better than the Bonferroni adjustment. Holm's method, which is a step-down Bonferroni adjustment, gives the same error rate control as Bonferroni but is more powerful (smaller adjusted p-values). As the help page for ?p.adjust says:

There seems no reason to use the unmodified Bonferroni correction because it is dominated by Holm's method, which is also valid under arbitrary assumptions.

For your specific experiment, there is so little evidence of real effects that you won't get any significant results even with Holm's method.
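A quick sketch (with invented p-values, not the asker's data) shows that Holm-adjusted p-values are never larger than the Bonferroni-adjusted ones:

```r
# Holm vs. Bonferroni on the same (invented) raw p-values
p <- c(0.001, 0.004, 0.012, 0.03, 0.2, 0.7)
cbind(bonferroni = p.adjust(p, method = "bonferroni"),
      holm       = p.adjust(p, method = "holm"))
# Every Holm value is <= the corresponding Bonferroni value
```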


  • 1 $\begingroup$ Not exactly. An adjusted p-value cannot be interpreted quite the same way as a p-value. To say there is "no evidence at all for rejecting the null hypothesis" doesn't make sense. Also, to say Holm's method is "always better" than Bonferroni isn't quite true either, as other discussions have addressed (Holm can't do confidence intervals for example). $\endgroup$ –  Bonferroni Commented Jun 14, 2017 at 20:21
  • 1 $\begingroup$ @Bonferroni. Yes, exactly. Holm adjusted p-values are probability bounds so it is correct to call them p-values. $p=1$ has the meaning I attributed to it. Confidence intervals are irrelevant to OP's problem but, nevertheless, one could in principle create CIs to match Holm p-values. $\endgroup$ –  Gordon Smyth Commented Jun 23, 2017 at 3:30
  • $\begingroup$ Gordon, I notice you still haven't answered the question of what adjusted p-values are the probability of. That's because they aren't the probability of anything. Indeed, saying adjusted p-values have the same interpretation as p-values is simply not true. Adjusted p-values are only "on the same scale as probabilities" because values greater than 1 are arbitrarily "fudged" down to 1. The guy who (literally) wrote the book on adjusted p-values is Peter Westfall, who says the following of Bonferroni-adjusted p-values: "The adjusted p values are not probabilities per se; rather, they are simply c $\endgroup$ –  Bonferroni Commented Jun 23, 2017 at 22:55
  • $\begingroup$ I have already answered your question, although IMO the unconstructive nature of your posts did not deserve a response. The key term in Peter Westfall's paper is "per se". I am already quite familiar with Prof Westfall's work, and the 1997 paper in particular, and I would be happy to split semantic hairs with him. However the semantic war you have tried to provoke is irrelevant to the statistical issues raised here. All previous answers on this page are perfectly compatible with the writings of Prof Westfall and others. $\endgroup$ –  Gordon Smyth Commented Jun 25, 2017 at 5:50

Thanks for reading my 1997 JASA paper!

If I had a do-over, I would rephrase my comment that a (single-step) Bonferroni adjusted p-value is not a probability "per se." (And I would no longer use the dreaded "per se." Yecchhhh!)

The Bonferroni adjusted p is in fact an upper bound on the probability that the smallest (random) p-value is smaller than (smaller than or equal to in the discrete case) the given (fixed) p-value, assuming the complete null model describes the randomness. And certainly, 1.0 is an upper bound on any probability.

But the bigger and more important point of my paper is that you can find these adjusted p-values exactly in such a way that accounts for the correlations between the multiple test statistics, assuming the classical linear model. These exact adjusted p-values are in fact probabilities when calculated in single-step fashion; see p. 302 of my JASA paper for the math. (To get the single-step p-values, you need to modify the expression somewhat; see my 1993 Wiley-Interscience book and my SAS book). While I used an enhanced Monte Carlo method to approximate this exact probability, better methods have been developed since; please see Hothorn, T., Bretz, F., and Westfall, P. (2008). Simultaneous Inference in General Parametric Models, Biometrical Journal 50(3), 346–363.

So, single-step adjusted p-values, when computed exactly, are bona fide probabilities.

But, except for the smallest one, step-down adjusted p-values are not bona fide probabilities. They are constructed from bona fide probabilities, but they are not probabilities.

Hope this helps!


  • 5 $\begingroup$ It's always good to see the authors being cited in the discussion giving their inputs here at CV (+1) $\endgroup$ –  Firebug Commented Aug 29, 2017 at 18:39

Just to add to @gordon-smyth's and @student-t's answers: another way to look at it is to adjust the $\alpha$-level yourself instead of adjusting the $p$-values via the p.adjust() function. For the Bonferroni correction, this is easy enough. If your $\alpha$-level is $0.05$, you divide it by the number of tests, which gives your new Bonferroni-adjusted $\alpha$-level.

In your case, $0.05/24 \approx 0.00208$. As you can see, none of your raw.p values make that cut-off, so your output makes sense.

If you want to use the Holm–Bonferroni method, you can also do this quickly by hand, as sketched below.
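For example, here is what the step-down procedure looks like when done by hand, on invented p-values (a sketch, not the asker's data): sort the p-values, compare the i-th smallest with α/(m - i + 1), and stop rejecting at the first failure.

```r
# Holm-Bonferroni "by hand": sort, compare with alpha / (m - i + 1),
# and stop rejecting at the first p-value that exceeds its threshold
alpha <- 0.05
p <- sort(c(0.001, 0.004, 0.012, 0.03, 0.2, 0.7))   # invented p-values
m <- length(p)
thresholds <- alpha / (m - seq_along(p) + 1)        # alpha/m, alpha/(m-1), ..., alpha
reject <- cumprod(p <= thresholds) == 1             # reject until the first failure
data.frame(p, thresholds, reject)
```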


Everything works as expected: the Bonferroni method can give you an adjusted p-value greater than one. The R function truncates it to one because a probability over one makes no sense. This is an example of how the Bonferroni correction reduces statistical power.

You may want to try other multiple comparison methods or adjust the significance level.


  • 2 $\begingroup$ Adjusted p-values aren't probabilities. Reducing them to 1 is for aesthetic purposes. $\endgroup$ –  Bonferroni Commented Jun 15, 2017 at 0:31
  • 2 $\begingroup$ @Bonferroni. I think you may be getting confused by the different adjustment methods. Bonferroni and Holm adjusted p-values are still p-values (because they are still probabilities) while FDR adjusted p-values are not (because they are rates). Limiting adjusted p-values to 1 is certainly not just for "aesthetic purposes". Adjusted p=1 has an absolute meaning regardless of the adjustment method. $\endgroup$ –  Gordon Smyth Commented Jun 23, 2017 at 3:35
  • $\begingroup$ That's incorrect. Adjusted p-values aren't probabilities (see Westfall, 1997, p. 299), but that's a common misunderstanding. It's not surprising that one would think something called an "adjusted p-value" is a p-value. But what is it you think they are probabilities of, exactly? You'll find there is no sensible answer. A p-value is the probability of, loosely speaking, observing an effect as least as large as the given observed effect, under the null hypothesis. An adjusted p-value obviously cannot mean that--how could there be a 100% probability of observing some effect? $\endgroup$ –  Bonferroni Commented Jun 23, 2017 at 4:07
  • 1 $\begingroup$ @Bonferroni. Bonferroni and Holm adjusted p-values provide upper bounds for rejection probabilities under the null hypothesis. They are on the same scale as probabilities, so having a value larger than 1 is meaningless. They have the same interpretation as p-values. In particular adjusted p=1 means inability to reject the null at any significance level; in intuitive terms, this means "no evidence against the null". $\endgroup$ –  Gordon Smyth Commented Jun 23, 2017 at 9:13
  • 1 $\begingroup$ @Bonferroni. Many statistical tests applied to discrete data can give p-values exactly equal to 1, with or without adjustment for multiple testing. One sided tests provide another example where p-values exactly equal to 1 are common. $\endgroup$ –  Gordon Smyth Commented Jun 23, 2017 at 9:15



Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations

Sander Greenland

Department of Epidemiology and Department of Statistics, University of California, Los Angeles, CA USA

Stephen J. Senn

Competence Center for Methodology and Statistics, Luxembourg Institute of Health, Strassen, Luxembourg

Kenneth J. Rothman

RTI Health Solutions, Research Triangle Institute, Research Triangle Park, NC USA

John B. Carlin

Clinical Epidemiology and Biostatistics Unit, Murdoch Children’s Research Institute, School of Population Health, University of Melbourne, Melbourne, VIC Australia

Charles Poole

Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, NC USA

Steven N. Goodman

Meta-Research Innovation Center, Departments of Medicine and of Health Research and Policy, Stanford University School of Medicine, Stanford, CA USA

Douglas G. Altman

Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, UK

Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so—and yet these misinterpretations dominate much of the scientific literature. In light of this problem, we provide definitions and a discussion of basic statistics that are more general and critical than typically found in traditional introductory expositions. Our goal is to provide a resource for instructors, researchers, and consumers of statistics whose knowledge of statistical theory and technique may be limited but who wish to avoid and spot misinterpretations. We emphasize how violation of often unstated analysis protocols (such as selecting analyses for presentation based on the P values they produce) can lead to small P values even if the declared test hypothesis is correct, and can lead to large P values even if that hypothesis is incorrect. We then provide an explanatory list of 25 misinterpretations of P values, confidence intervals, and power. We conclude with guidelines for improving statistical interpretation and reporting.

Introduction

Misinterpretation and abuse of statistical tests has been decried for decades, yet remains so rampant that some scientific journals discourage use of “statistical significance” (classifying results as “significant” or not based on a P value) [1]. One journal now bans all statistical tests and mathematically related procedures such as confidence intervals [2], which has led to considerable discussion and debate about the merits of such bans [3, 4].

Despite such bans, we expect that the statistical methods at issue will be with us for many years to come. We thus think it imperative that basic teaching as well as general understanding of these methods be improved. Toward that end, we attempt to explain the meaning of significance tests, confidence intervals, and statistical power in a more general and critical way than is traditionally done, and then review 25 common misconceptions in light of our explanations. We also discuss a few more subtle but nonetheless pervasive problems, explaining why it is important to examine and synthesize all results relating to a scientific question, rather than focus on individual findings. We further explain why statistical tests should never constitute the sole input to inferences or decisions about associations or effects. Among the many reasons are that, in most scientific settings, the arbitrary classification of results into “significant” and “non-significant” is unnecessary for and often damaging to valid interpretation of data; and that estimation of the size of effects and the uncertainty surrounding our estimates will be far more important for scientific inference and sound judgment than any such classification.

More detailed discussion of the general issues can be found in many articles, chapters, and books on statistical methods and their interpretation [ 5 – 20 ]. Specific issues are covered at length in these sources and in the many peer-reviewed articles that critique common misinterpretations of null-hypothesis testing and “statistical significance” [ 1 , 12 , 21 – 74 ].

Statistical tests, P values, and confidence intervals: a caustic primer

Statistical models, hypotheses, and tests.

Every method of statistical inference depends on a complex web of assumptions about how data were collected and analyzed, and how the analysis results were selected for presentation. The full set of assumptions is embodied in a statistical model that underpins the method. This model is a mathematical representation of data variability, and thus ideally would capture accurately all sources of such variability. Many problems arise however because this statistical model often incorporates unrealistic or at best unjustified assumptions. This is true even for so-called “non-parametric” methods, which (like other methods) depend on assumptions of random sampling or randomization. These assumptions are often deceptively simple to write down mathematically, yet in practice are difficult to satisfy and verify, as they may depend on successful completion of a long sequence of actions (such as identifying, contacting, obtaining consent from, obtaining cooperation of, and following up subjects, as well as adherence to study protocols for treatment allocation, masking, and data analysis).

There is also a serious problem of defining the scope of a model, in that it should allow not only for a good representation of the observed data but also of hypothetical alternative data that might have been observed. The reference frame for data that “might have been observed” is often unclear, for example if multiple outcome measures or multiple predictive factors have been measured, and many decisions surrounding analysis choices have been made after the data were collected—as is invariably the case [ 33 ].

The difficulty of understanding and assessing underlying assumptions is exacerbated by the fact that the statistical model is usually presented in a highly compressed and abstract form—if presented at all. As a result, many assumptions go unremarked and are often unrecognized by users as well as consumers of statistics. Nonetheless, all statistical methods and interpretations are premised on the model assumptions; that is, on an assumption that the model provides a valid representation of the variation we would expect to see across data sets, faithfully reflecting the circumstances surrounding the study and phenomena occurring within it.

In most applications of statistical testing, one assumption in the model is a hypothesis that a particular effect has a specific size, and has been targeted for statistical analysis. (For simplicity, we use the word “effect” when “association or effect” would arguably be better in allowing for noncausal studies such as most surveys.) This targeted assumption is called the study hypothesis or test hypothesis, and the statistical methods used to evaluate it are called statistical hypothesis tests. Most often, the targeted effect size is a “null” value representing zero effect (e.g., that the study treatment makes no difference in average outcome), in which case the test hypothesis is called the null hypothesis. Nonetheless, it is also possible to test other effect sizes. We may also test hypotheses that the effect does or does not fall within a specific range; for example, we may test the hypothesis that the effect is no greater than a particular amount, in which case the hypothesis is said to be a one-sided or dividing hypothesis [ 7 , 8 ].

Much statistical teaching and practice has developed a strong (and unhealthy) focus on the idea that the main aim of a study should be to test null hypotheses. In fact most descriptions of statistical testing focus only on testing null hypotheses, and the entire topic has been called “Null Hypothesis Significance Testing” (NHST). This exclusive focus on null hypotheses contributes to misunderstanding of tests. Adding to the misunderstanding is that many authors (including R.A. Fisher) use “null hypothesis” to refer to any test hypothesis, even though this usage is at odds with other authors and with ordinary English definitions of “null”—as are statistical usages of “significance” and “confidence.”

Uncertainty, probability, and statistical significance

A more refined goal of statistical analysis is to provide an evaluation of certainty or uncertainty regarding the size of an effect. It is natural to express such certainty in terms of “probabilities” of hypotheses. In conventional statistical methods, however, “probability” refers not to hypotheses, but to quantities that are hypothetical frequencies of data patterns under an assumed statistical model. These methods are thus called frequentist methods, and the hypothetical frequencies they predict are called “frequency probabilities.” Despite considerable training to the contrary, many statistically educated scientists revert to the habit of misinterpreting these frequency probabilities as hypothesis probabilities. (Even more confusingly, the term “likelihood of a parameter value” is reserved by statisticians to refer to the probability of the observed data given the parameter value; it does not refer to a probability of the parameter taking on the given value.)

Nowhere are these problems more rampant than in applications of a hypothetical frequency called the P value, also known as the “observed significance level” for the test hypothesis. Statistical “significance tests” based on this concept have been a central part of statistical analyses for centuries [ 75 ]. The focus of traditional definitions of P values and statistical significance has been on null hypotheses, treating all other assumptions used to compute the P value as if they were known to be correct. Recognizing that these other assumptions are often questionable if not unwarranted, we will adopt a more general view of the P value as a statistical summary of the compatibility between the observed data and what we would predict or expect to see if we knew the entire statistical model ( all the assumptions used to compute the P value) were correct.

Specifically, the distance between the data and the model prediction is measured using a test statistic (such as a t-statistic or a Chi squared statistic). The P value is then the probability that the chosen test statistic would have been at least as large as its observed value if every model assumption were correct, including the test hypothesis. This definition embodies a crucial point lost in traditional definitions: In logical terms, the P value tests all the assumptions about how the data were generated (the entire model), not just the targeted hypothesis it is supposed to test (such as a null hypothesis). Furthermore, these assumptions include far more than what are traditionally presented as modeling or probability assumptions—they include assumptions about the conduct of the analysis, for example that intermediate analysis results were not used to determine which analyses would be presented.
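To make this definition concrete, here is a minimal Python sketch (ours, not the authors’; every number in it is invented) that computes a test statistic and its two-sided P value under a simple normal model with known standard deviation:

    from math import sqrt
    from scipy.stats import norm

    # Hypothetical study: n observations, known population SD, null mean of 5.0.
    sample_mean, mu_null, sigma, n = 5.4, 5.0, 1.2, 36

    # Test statistic: distance between the data and the model prediction,
    # in standard-error units.
    z = (sample_mean - mu_null) / (sigma / sqrt(n))

    # P value: probability of a test statistic at least this extreme if EVERY
    # assumption used in the computation (including the test hypothesis)
    # were correct.
    p = 2 * norm.sf(abs(z))
    print(f"z = {z:.2f}, P = {p:.3f}")  # z = 2.00, P = 0.046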

It is true that the smaller the P value, the more unusual the data would be if every single assumption were correct; but a very small P value does not tell us which assumption is incorrect. For example, the P value may be very small because the targeted hypothesis is false; but it may instead (or in addition) be very small because the study protocols were violated, or because it was selected for presentation based on its small size. Conversely, a large P value indicates only that the data are not unusual under the model, but does not imply that the model or any aspect of it (such as the targeted hypothesis) is correct; it may instead (or in addition) be large because (again) the study protocols were violated, or because it was selected for presentation based on its large size.

The general definition of a P value may help one to understand why statistical tests tell us much less than what many think they do: Not only does a P value not tell us whether the hypothesis targeted for testing is true or not; it says nothing specifically related to that hypothesis unless we can be completely assured that every other assumption used for its computation is correct—an assurance that is lacking in far too many studies.

Nonetheless, the P value can be viewed as a continuous measure of the compatibility between the data and the entire model used to compute it, ranging from 0 for complete incompatibility to 1 for perfect compatibility, and in this sense may be viewed as measuring the fit of the model to the data. Too often, however, the P value is degraded into a dichotomy in which results are declared “statistically significant” if P falls on or below a cut-off (usually 0.05) and declared “nonsignificant” otherwise. The terms “significance level” and “alpha level” (α) are often used to refer to the cut-off; however, the term “significance level” invites confusion of the cut-off with the P value itself. Their difference is profound: the cut-off value α is supposed to be fixed in advance and is thus part of the study design, unchanged in light of the data. In contrast, the P value is a number computed from the data and thus an analysis result, unknown until it is computed.

Moving from tests to estimates

We can vary the test hypothesis while leaving other assumptions unchanged, to see how the P value differs across competing test hypotheses. Usually, these test hypotheses specify different sizes for a targeted effect; for example, we may test the hypothesis that the average difference between two treatment groups is zero (the null hypothesis), or that it is 20 or −10 or any size of interest. The effect size whose test produced P  = 1 is the size most compatible with the data (in the sense of predicting what was in fact observed) if all the other assumptions used in the test (the statistical model) were correct, and provides a point estimate of the effect under those assumptions. The effect sizes whose test produced P  > 0.05 will typically define a range of sizes (e.g., from 11.0 to 19.5) that would be considered more compatible with the data (in the sense of the observations being closer to what the model predicted) than sizes outside the range—again, if the statistical model were correct. This range corresponds to a 1 − 0.05 = 0.95 or 95 % confidence interval , and provides a convenient way of summarizing the results of hypothesis tests for many effect sizes. Confidence intervals are examples of interval estimates .
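The test-inversion idea can be seen in a short Python sketch (our illustration; the estimate and standard error are invented, chosen so the resulting interval matches the 11.0–19.5 example above): scan candidate effect sizes, compute each one’s P value, and keep those with P > 0.05.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical summary data: observed effect estimate and standard error.
    estimate, se = 15.25, 2.17

    # Test every candidate effect size on a fine grid.
    candidates = np.linspace(estimate - 5 * se, estimate + 5 * se, 100001)
    p = 2 * norm.sf(np.abs((estimate - candidates) / se))

    compatible = candidates[p > 0.05]  # sizes more compatible with the data
    print(f"point estimate (P = 1): {candidates[np.argmax(p)]:.2f}")
    print(f"95% interval: ({compatible.min():.2f}, {compatible.max():.2f})")
    # Agrees with the usual estimate +/- 1.96*se: (11.00, 19.50)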

Neyman [ 76 ] proposed the construction of confidence intervals in this way because they have the following property: If one calculates, say, 95 % confidence intervals repeatedly in valid applications , 95 % of them, on average, will contain (i.e., include or cover) the true effect size. Hence, the specified confidence level is called the coverage probability. As Neyman stressed repeatedly, this coverage probability is a property of a long sequence of confidence intervals computed from valid models, rather than a property of any single confidence interval.
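Neyman’s long-run coverage property is easy to see by simulation; the following sketch (ours, with arbitrary settings) generates many studies from a correct normal model and counts how often the 95 % interval covers the true effect.

    import numpy as np

    rng = np.random.default_rng(0)
    true_effect, se, n_studies = 3.0, 1.0, 100_000

    # Each simulated study yields an estimate and a 95% confidence interval.
    estimates = rng.normal(true_effect, se, size=n_studies)
    lower, upper = estimates - 1.96 * se, estimates + 1.96 * se

    # Any single interval either covers the true effect or it does not;
    # the 95% describes the long run over valid replications.
    covered = (lower <= true_effect) & (true_effect <= upper)
    print(f"coverage: {covered.mean():.3f}")  # approximately 0.950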

Many journals now require confidence intervals, but most textbooks and studies discuss P values only for the null hypothesis of no effect. This exclusive focus on null hypotheses in testing not only contributes to misunderstanding of tests and underappreciation of estimation, but also obscures the close relationship between P values and confidence intervals, as well as the weaknesses they share.

What P values, confidence intervals, and power calculations don’t tell us

Much distortion arises from basic misunderstanding of what P values and their relatives (such as confidence intervals) do not tell us. Therefore, based on the articles in our reference list, we review prevalent P value misinterpretations as a way of moving toward defensible interpretations and presentations. We adopt the format of Goodman [ 40 ] in providing a list of misinterpretations that can be used to critically evaluate conclusions offered by research reports and reviews. Every one of the bolded statements in our list has contributed to statistical distortion of the scientific literature, and we add the emphatic “No!” to underscore statements that are not only fallacious but also not “true enough for practical purposes.”

Common misinterpretations of single P values

  • 1. The P value is the probability that the test hypothesis is true; for example, if a test of the null hypothesis gave P = 0.01, the null hypothesis has only a 1 % chance of being true; if instead it gave P = 0.40, the null hypothesis has a 40 % chance of being true. No! The P value assumes the test hypothesis is true—it is not a hypothesis probability and may be far from any reasonable probability for the test hypothesis. The P value simply indicates the degree to which the data conform to the pattern predicted by the test hypothesis and all the other assumptions used in the test (the underlying statistical model). Thus P = 0.01 would indicate that the data are not very close to what the statistical model (including the test hypothesis) predicted they should be, while P = 0.40 would indicate that the data are much closer to the model prediction, allowing for chance variation.

  • 2. The P value for the null hypothesis is the probability that chance alone produced the observed association; for example, if a test of the null hypothesis gave P = 0.08, there is an 8 % chance that chance alone produced the association. No! This is a common variation of the first fallacy, and it is just as false.

Note: One often sees “alone” dropped from this description (becoming “the P value for the null hypothesis is the probability that chance produced the observed association”), so that the statement is more ambiguous, but just as wrong.

  • 3. A significant test result (P ≤ 0.05) means that the test hypothesis is false or should be rejected. No! A small P value simply flags the data as being unusual if all the assumptions used to compute it (including the test hypothesis) were correct; it may be small because there was a large random error or because some assumption other than the test hypothesis was violated (for example, the assumption that this P value was not selected for presentation because it was below 0.05). P ≤ 0.05 only means that a discrepancy from the hypothesis prediction (e.g., no difference between treatment groups) would be as large or larger than that observed no more than 5 % of the time if only chance were creating the discrepancy (as opposed to a violation of the test hypothesis or a mistaken assumption).
  • 4. A nonsignificant test result (P > 0.05) means that the test hypothesis is true or should be accepted. No! A large P value only suggests that the data are not unusual if all the assumptions used to compute the P value (including the test hypothesis) were correct. The same data would also not be unusual under many other hypotheses. Furthermore, even if the test hypothesis is wrong, the P value may be large because it was inflated by a large random error or because of some other erroneous assumption (for example, the assumption that this P value was not selected for presentation because it was above 0.05). P > 0.05 only means that a discrepancy from the hypothesis prediction (e.g., no difference between treatment groups) would be as large or larger than that observed more than 5 % of the time if only chance were creating the discrepancy.
  • 5. A large P value is evidence in favor of the test hypothesis. No! In fact, any P value less than 1 implies that the test hypothesis is not the hypothesis most compatible with the data, because any other hypothesis with a larger P value would be even more compatible with the data. A P value cannot be said to favor the test hypothesis except in relation to those hypotheses with smaller P values. Furthermore, a large P value often indicates only that the data are incapable of discriminating among many competing hypotheses (as would be seen immediately by examining the range of the confidence interval). For example, many authors will misinterpret P = 0.70 from a test of the null hypothesis as evidence for no effect, when in fact it indicates that, even though the null hypothesis is compatible with the data under the assumptions used to compute the P value, it is not the hypothesis most compatible with the data—that honor would belong to a hypothesis with P = 1. But even if P = 1, there will be many other hypotheses that are highly consistent with the data, so that a definitive conclusion of “no association” cannot be deduced from a P value, no matter how large.
  • 6. A null-hypothesis P value greater than 0.05 means that no effect was observed, or that absence of an effect was shown or demonstrated. No! Observing P > 0.05 for the null hypothesis only means that the null is one among the many hypotheses that have P > 0.05. Thus, unless the point estimate (observed association) equals the null value exactly, it is a mistake to conclude from P > 0.05 that a study found “no association” or “no evidence” of an effect. If the null P value is less than 1 some association must be present in the data, and one must look at the point estimate to determine the effect size most compatible with the data under the assumed model.
  • 7. Statistical significance indicates a scientifically or substantively important relation has been detected. No! Especially when a study is large, very minor effects or small assumption violations can lead to statistically significant tests of the null hypothesis. Again, a small null P value simply flags the data as being unusual if all the assumptions used to compute it (including the null hypothesis) were correct; but the way the data are unusual might be of no clinical interest. One must look at the confidence interval to determine which effect sizes of scientific or other substantive (e.g., clinical) importance are relatively compatible with the data, given the model.
  • 8. Lack of statistical significance indicates that the effect size is small. No! Especially when a study is small, even large effects may be “drowned in noise” and thus fail to be detected as statistically significant by a statistical test. A large null P value simply flags the data as not being unusual if all the assumptions used to compute it (including the test hypothesis) were correct; but the same data will also not be unusual under many other models and hypotheses besides the null. Again, one must look at the confidence interval to determine whether it includes effect sizes of importance.
  • 9. The P value is the chance of our data occurring if the test hypothesis is true; for example, P = 0.05 means that the observed association would occur only 5 % of the time under the test hypothesis. No! The P value refers not only to what we observed, but also observations more extreme than what we observed (where “extremity” is measured in a particular way). And again, the P value refers to a data frequency when all the assumptions used to compute it are correct. In addition to the test hypothesis, these assumptions include randomness in sampling, treatment assignment, loss, and missingness, as well as an assumption that the P value was not selected for presentation based on its size or some other aspect of the results.
  • 10. If you reject the test hypothesis because P ≤ 0.05, the chance you are in error (the chance your “significant finding” is a false positive) is 5 %. No! To see why this description is false, suppose the test hypothesis is in fact true. Then, if you reject it, the chance you are in error is 100 %, not 5 %. The 5 % refers only to how often you would reject it, and therefore be in error, over very many uses of the test across different studies when the test hypothesis and all other assumptions used for the test are true. It does not refer to your single use of the test, which may have been thrown off by assumption violations as well as random errors. This is yet another version of misinterpretation #1. (A simulation sketch after this list illustrates the point.)
  • 11. P = 0.05 and P ≤ 0.05 mean the same thing. No! This is like saying reported height = 2 m and reported height ≤2 m are the same thing: “height = 2 m” would include few people and those people would be considered tall, whereas “height ≤2 m” would include most people including small children. Similarly, P = 0.05 would be considered a borderline result in terms of statistical significance, whereas P ≤ 0.05 lumps borderline results together with results very incompatible with the model (e.g., P = 0.0001) thus rendering its meaning vague, for no good purpose.
  • 12. P values are properly reported as inequalities (e.g., report “P < 0.02” when P = 0.015 or report “P > 0.05” when P = 0.06 or P = 0.70). No! This is bad practice because it makes it difficult or impossible for the reader to accurately interpret the statistical result. Only when the P value is very small (e.g., under 0.001) does an inequality become justifiable: There is little practical difference among very small P values when the assumptions used to compute P values are not known with enough certainty to justify such precision, and most methods for computing P values are not numerically accurate below a certain point.
  • 13. Statistical significance is a property of the phenomenon being studied, and thus statistical tests detect significance. No! This misinterpretation is promoted when researchers state that they have or have not found “evidence of” a statistically significant effect. The effect being tested either exists or does not exist. “Statistical significance” is a dichotomous description of a P value (that it is below the chosen cut-off) and thus is a property of a result of a statistical test; it is not a property of the effect or population being studied.
  • 14. One should always use two-sided P values. No! Two-sided P values are designed to test hypotheses that the targeted effect measure equals a specific value (e.g., zero), and is neither above nor below this value. When, however, the test hypothesis of scientific or practical interest is a one-sided (dividing) hypothesis, a one-sided P value is appropriate. For example, consider the practical question of whether a new drug is at least as good as the standard drug for increasing survival time. This question is one-sided, so testing this hypothesis calls for a one-sided P value. Nonetheless, because two-sided P values are the usual default, it will be important to note when and why a one-sided P value is being used instead.
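The following simulation sketch (ours; the settings are arbitrary) unpacks point 10: when the test hypothesis is true, about 5 % of studies yield P ≤ 0.05, yet every one of those rejections is an error, so the 5 % is a long-run rejection frequency, not the chance that a given “significant finding” is wrong.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    n_studies, n, sigma = 100_000, 25, 1.0

    # The test (null) hypothesis, a true mean of 0, holds in EVERY study.
    se = sigma / np.sqrt(n)
    z = rng.normal(0.0, se, size=n_studies) / se
    p = 2 * norm.sf(np.abs(z))

    print(f"fraction with P <= 0.05: {(p <= 0.05).mean():.3f}")  # ~0.050
    # By construction the null is true, so among these rejections the
    # chance of error is 100%, not 5%.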

There are other interpretations of P values that are controversial, in that whether a categorical “No!” is warranted depends on one’s philosophy of statistics and the precise meaning given to the terms involved. The disputed claims deserve recognition if one wishes to avoid such controversy.

For example, it has been argued that P values overstate evidence against test hypotheses, based on directly comparing P values against certain quantities (likelihood ratios and Bayes factors) that play a central role as evidence measures in Bayesian analysis [ 37 , 72 , 77 – 83 ]. Nonetheless, many other statisticians do not accept these quantities as gold standards, and instead point out that P values summarize crucial evidence needed to gauge the error rates of decisions based on statistical tests (even though they are far from sufficient for making those decisions). Thus, from this frequentist perspective, P values do not overstate evidence and may even be considered as measuring one aspect of evidence [ 7 , 8 , 84 – 87 ], with 1 −  P measuring evidence against the model used to compute the P value. See also Murtaugh [ 88 ] and its accompanying discussion.

Common misinterpretations of P value comparisons and predictions

Some of the most severe distortions of the scientific literature produced by statistical testing involve erroneous comparison and synthesis of results from different studies or study subgroups. Among the worst are:

  • 15. When the same hypothesis is tested in different studies and none or a minority of the tests are statistically significant (all P > 0.05), the overall evidence supports the hypothesis. No! This belief is often used to claim that a literature supports no effect when the opposite is the case. It reflects a tendency of researchers to “overestimate the power of most research” [ 89 ]. In reality, every study could fail to reach statistical significance and yet when combined show a statistically significant association and persuasive evidence of an effect. For example, if there were five studies each with P = 0.10, none would be significant at the 0.05 level; but when these P values are combined using the Fisher formula [ 9 ], the overall P value would be 0.01 (a calculation verified in the sketch after this list). There are many real examples of persuasive evidence for important effects when few studies or even no study reported “statistically significant” associations [ 90 , 91 ]. Thus, lack of statistical significance of individual studies should not be taken as implying that the totality of evidence supports no effect.
  • 16. When the same hypothesis is tested in two different populations and the resulting P values are on opposite sides of 0.05, the results are conflicting. No! Statistical tests are sensitive to many differences between study populations that are irrelevant to whether their results are in agreement, such as the sizes of compared groups in each population. As a consequence, two studies may provide very different P values for the same test hypothesis and yet be in perfect agreement (e.g., may show identical observed associations). For example, suppose we had two randomized trials A and B of a treatment, identical except that trial A had a known standard error of 2 for the mean difference between treatment groups whereas trial B had a known standard error of 1 for the difference. If both trials observed a difference between treatment groups of exactly 3, the usual normal test would produce P = 0.13 in A but P = 0.003 in B. Despite their difference in P values, the test of the hypothesis of no difference in effect across studies would have P = 1, reflecting the perfect agreement of the observed mean differences from the studies. Differences between results must be evaluated directly, for example by estimating and testing those differences to produce a confidence interval and a P value comparing the results (often called analysis of heterogeneity, interaction, or modification).
  • 17. When the same hypothesis is tested in two different populations and the same P values are obtained, the results are in agreement . No! Again, tests are sensitive to many differences between populations that are irrelevant to whether their results are in agreement. Two different studies may even exhibit identical P values for testing the same hypothesis yet also exhibit clearly different observed associations. For example, suppose randomized experiment A observed a mean difference between treatment groups of 3.00 with standard error 1.00, while B observed a mean difference of 12.00 with standard error 4.00. Then the standard normal test would produce P  = 0.003 in both; yet the test of the hypothesis of no difference in effect across studies gives P  = 0.03, reflecting the large difference (12.00 − 3.00 = 9.00) between the mean differences.
  • 18. If one observes a small P value, there is a good chance that the next study will produce a P value at least as small for the same hypothesis. No! This is false even under the ideal condition that both studies are independent and all assumptions including the test hypothesis are correct in both studies. In that case, if (say) one observes P  = 0.03, the chance that the new study will show P  ≤ 0.03 is only 3 %; thus the chance the new study will show a P value as small or smaller (the “replication probability”) is exactly the observed P value! If on the other hand the small P value arose solely because the true effect exactly equaled its observed estimate, there would be a 50 % chance that a repeat experiment of identical design would have a larger P value [ 37 ]. In general, the size of the new P value will be extremely sensitive to the study size and the extent to which the test hypothesis or other assumptions are violated in the new study [ 86 ]; in particular, P may be very small or very large depending on whether the study and the violations are large or small.
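The numbers in points 15–17 can be verified directly; in the sketch below (ours), scipy’s combine_pvalues applies the Fisher formula, and the small helper tests the difference between two trials instead of comparing their P values.

    import numpy as np
    from scipy.stats import combine_pvalues, norm

    # Point 15: five studies, each P = 0.10, combined by Fisher's method.
    stat, p_combined = combine_pvalues([0.10] * 5, method='fisher')
    print(f"combined P = {p_combined:.3f}")  # ~0.010

    def heterogeneity_p(est_a, se_a, est_b, se_b):
        """Two-sided P for the hypothesis of no difference between trials."""
        z = (est_b - est_a) / np.hypot(se_a, se_b)
        return 2 * norm.sf(abs(z))

    # Point 16: identical observed differences (3), different SEs (2 and 1).
    print(heterogeneity_p(3.0, 2.0, 3.0, 1.0))  # 1.0 -- perfect agreement

    # Point 17: identical P values but clearly different estimates.
    print(round(heterogeneity_p(3.0, 1.0, 12.0, 4.0), 3))  # ~0.029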

Finally, although it is (we hope obviously) wrong to do so, one sometimes sees the null hypothesis compared with another (alternative) hypothesis using a two-sided P value for the null and a one-sided P value for the alternative. This comparison is biased in favor of the null in that the two-sided test will falsely reject the null only half as often as the one-sided test will falsely reject the alternative (again, under all the assumptions used for testing).

Common misinterpretations of confidence intervals

Most of the above misinterpretations translate into an analogous misinterpretation for confidence intervals. For example, another misinterpretation of P  > 0.05 is that it means the test hypothesis has only a 5 % chance of being false, which in terms of a confidence interval becomes the common fallacy:

  • 19. The specific 95   % confidence interval presented by a study has a 95   % chance of containing the true effect size . No! A reported confidence interval is a range between two numbers. The frequency with which an observed interval (e.g., 0.72–2.88) contains the true effect is either 100 % if the true effect is within the interval or 0 % if not; the 95 % refers only to how often 95 % confidence intervals computed from very many studies would contain the true size if all the assumptions used to compute the intervals were correct . It is possible to compute an interval that can be interpreted as having 95 % probability of containing the true value; nonetheless, such computations require not only the assumptions used to compute the confidence interval, but also further assumptions about the size of effects in the model. These further assumptions are summarized in what is called a prior distribution , and the resulting intervals are usually called Bayesian posterior (or credible) intervals to distinguish them from confidence intervals [ 18 ].

Symmetrically, the misinterpretation of a small P value as disproving the test hypothesis could be translated into:

  • 20. An effect size outside the 95   % confidence interval has been refuted (or excluded) by the data . No! As with the P value, the confidence interval is computed from many assumptions, the violation of which may have led to the results. Thus it is the combination of the data with the assumptions, along with the arbitrary 95 % criterion, that are needed to declare an effect size outside the interval is in some way incompatible with the observations. Even then, judgements as extreme as saying the effect size has been refuted or excluded will require even stronger conditions.

As with P values, naïve comparison of confidence intervals can be highly misleading:

  • 21. If two confidence intervals overlap, the difference between two estimates or studies is not significant . No! The 95 % confidence intervals from two subgroups or studies may overlap substantially and yet the test for difference between them may still produce P  < 0.05. Suppose for example, two 95 % confidence intervals for means from normal populations with known variances are (1.04, 4.96) and (4.16, 19.84); these intervals overlap, yet the test of the hypothesis of no difference in effect across studies gives P  = 0.03. As with P values, comparison between groups requires statistics that directly test and estimate the differences across groups. It can, however, be noted that if the two 95 % confidence intervals fail to overlap, then when using the same assumptions used to compute the confidence intervals we will find P  < 0.05 for the difference; and if one of the 95 % intervals contains the point estimate from the other group or study, we will find P  > 0.05 for the difference.
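A quick check of the numbers in point 21 (our sketch): recover each group’s mean and standard error from its interval limits, then test the difference directly.

    from math import hypot
    from scipy.stats import norm

    def from_ci(lo, hi):
        """Recover (mean, SE) from a 95% normal confidence interval."""
        return (lo + hi) / 2, (hi - lo) / (2 * 1.96)

    m1, se1 = from_ci(1.04, 4.96)    # mean 3.0, SE 1.0
    m2, se2 = from_ci(4.16, 19.84)   # mean 12.0, SE 4.0

    z = (m2 - m1) / hypot(se1, se2)
    print(f"P = {2 * norm.sf(abs(z)):.3f}")
    # ~0.029: the intervals overlap, yet the difference gives P < 0.05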

Finally, as with P values, the replication properties of confidence intervals are usually misunderstood:

  • 22. An observed 95   % confidence interval predicts that 95   % of the estimates from future studies will fall inside the observed interval. No! This statement is wrong in several ways. Most importantly, under the model, 95 % is the frequency with which other unobserved intervals will contain the true effect , not how frequently the one interval being presented will contain future estimates. In fact, even under ideal conditions the chance that a future estimate will fall within the current interval will usually be much less than 95 %. For example, if two independent studies of the same quantity provide unbiased normal point estimates with the same standard errors, the chance that the 95 % confidence interval for the first study contains the point estimate from the second is 83 % (which is the chance that the difference between the two estimates is less than 1.96 standard errors). Again, an observed interval either does or does not contain the true effect; the 95 % refers only to how often 95 % confidence intervals computed from very many studies would contain the true effect if all the assumptions used to compute the intervals were correct.
  • 23. If one 95   % confidence interval includes the null value and another excludes that value, the interval excluding the null is the more precise one . No! When the model is correct, precision of statistical estimation is measured directly by confidence interval width (measured on the appropriate scale). It is not a matter of inclusion or exclusion of the null or any other value. Consider two 95 % confidence intervals for a difference in means, one with limits of 5 and 40, the other with limits of −5 and 10. The first interval excludes the null value of 0, but is 30 units wide. The second includes the null value, but is half as wide and therefore much more precise.
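The 83 % figure in point 22 follows from a one-line calculation (sketch below, assuming, as the point itself does, two independent, unbiased normal estimates with equal standard errors): the difference of the two estimates has standard deviation sqrt(2) times the common SE.

    from math import sqrt
    from scipy.stats import norm

    # The second estimate falls inside the first study's 95% CI exactly when
    # the standardized difference |Z| is below 1.96 / sqrt(2).
    chance = 2 * norm.cdf(1.96 / sqrt(2)) - 1
    print(f"{chance:.3f}")  # ~0.834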

In addition to the above misinterpretations, 95 % confidence intervals force the 0.05-level cutoff on the reader, lumping together all effect sizes with P  > 0.05, and in this way are as bad as presenting P values as dichotomies. Nonetheless, many authors agree that confidence intervals are superior to tests and P values because they allow one to shift focus away from the null hypothesis, toward the full range of effect sizes compatible with the data—a shift recommended by many authors and a growing number of journals. Another way to bring attention to non-null hypotheses is to present their P values; for example, one could provide or demand P values for those effect sizes that are recognized as scientifically reasonable alternatives to the null.

As with P values, further cautions are needed to avoid misinterpreting confidence intervals as providing sharp answers when none are warranted. The hypothesis which says the point estimate is the correct effect will have the largest P value ( P  = 1 in most cases), and hypotheses inside a confidence interval will have higher P values than hypotheses outside the interval. The P values will vary greatly, however, among hypotheses inside the interval, as well as among hypotheses on the outside. Also, two hypotheses may have nearly equal P values even though one of the hypotheses is inside the interval and the other is outside. Thus, if we use P values to measure compatibility of hypotheses with data and wish to compare hypotheses with this measure, we need to examine their P values directly, not simply ask whether the hypotheses are inside or outside the interval. This need is particularly acute when (as usual) one of the hypotheses under scrutiny is a null hypothesis.

Common misinterpretations of power

The power of a test to detect a correct alternative hypothesis is the pre-study probability that the test will reject the test hypothesis (e.g., the probability that P will not exceed a pre-specified cut-off such as 0.05). (The corresponding pre-study probability of failing to reject the test hypothesis when the alternative is correct is one minus the power, also known as the Type-II or beta error rate [ 84 ].) As with P values and confidence intervals, this probability is defined over repetitions of the same study design and so is a frequency probability. One source of reasonable alternative hypotheses is the set of effect sizes that were used to compute power in the study proposal. Pre-study power calculations do not, however, measure the compatibility of these alternatives with the data actually observed, while power calculated from the observed data is a direct (if obscure) transformation of the null P value and so provides no test of the alternatives. Thus, presentation of power does not obviate the need to provide interval estimates and direct tests of the alternatives.

For these reasons, many authors have condemned use of power to interpret estimates and statistical tests [ 42 , 92 – 97 ], arguing that (in contrast to confidence intervals) it distracts attention from direct comparisons of hypotheses and introduces new misinterpretations, such as:

  • 24. If you accept the null hypothesis because the null P value exceeds 0.05 and the power of your test is 90   %, the chance you are in error (the chance that your finding is a false negative) is 10   % . No! If the null hypothesis is false and you accept it, the chance you are in error is 100 %, not 10 %. Conversely, if the null hypothesis is true and you accept it, the chance you are in error is 0 %. The 10 % refers only to how often you would be in error over very many uses of the test across different studies when the particular alternative used to compute power is correct and all other assumptions used for the test are correct in all the studies. It does not refer to your single use of the test or your error rate under any alternative effect size other than the one used to compute power.

It can be especially misleading to compare results for two hypotheses by presenting a test or P value for one and power for the other. For example, testing the null by seeing whether P  ≤ 0.05 with a power less than 1 − 0.05 = 0.95 for the alternative (as done routinely) will bias the comparison in favor of the null because it entails a lower probability of incorrectly rejecting the null (0.05) than of incorrectly accepting the null when the alternative is correct. Thus, claims about relative support or evidence need to be based on direct and comparable measures of support or evidence for both hypotheses, otherwise mistakes like the following will occur:

  • 25. If the null P value exceeds 0.05 and the power of this test is 90   % at an alternative, the results support the null over the alternative . This claim seems intuitive to many, but counterexamples are easy to construct in which the null P value is between 0.05 and 0.10, and yet there are alternatives whose own P value exceeds 0.10 and for which the power is 0.90. Parallel results ensue for other accepted measures of compatibility, evidence, and support, indicating that the data show lower compatibility with and more evidence against the null than the alternative, despite the fact that the null P value is “not significant” at the 0.05 alpha level and the power against the alternative is “very high” [ 42 ].

Despite its shortcomings for interpreting current data, power can be useful for designing studies and for understanding why replication of “statistical significance” will often fail even under ideal conditions. Studies are often designed or claimed to have 80 % power against a key alternative when using a 0.05 significance level, although in execution often have less power due to unanticipated problems such as low subject recruitment. Thus, if the alternative is correct and the actual power of two studies is 80 %, the chance that the studies will both show P  ≤ 0.05 will at best be only 0.80(0.80) = 64 %; furthermore, the chance that one study shows P  ≤ 0.05 and the other does not (and thus will be misinterpreted as showing conflicting results) is 2(0.80)0.20 = 32 % or about 1 chance in 3. Similar calculations taking account of typical problems suggest that one could anticipate a “replication crisis” even if there were no publication or reporting bias, simply because current design and testing conventions treat individual study results as dichotomous outputs of “significant”/“nonsignificant” or “reject”/“accept.”
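The replication figures in this paragraph are simple products of the assumed power; the sketch below (ours) makes the arithmetic explicit and can be re-run for any power value.

    power = 0.80  # assumed probability each study yields P <= 0.05

    both_significant = power * power       # both studies "replicate"
    conflicting = 2 * power * (1 - power)  # exactly one reaches P <= 0.05

    print(f"both significant: {both_significant:.0%}")   # 64%
    print(f"apparently conflicting: {conflicting:.0%}")  # 32%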

A statistical model is much more than an equation with Greek letters

The above list could be expanded by reviewing the research literature. We will however now turn to direct discussion of an issue that has been receiving more attention of late, yet is still widely overlooked or interpreted too narrowly in statistical teaching and presentations: the assumption that the statistical model used to obtain the results is correct.

Too often, the full statistical model is treated as a simple regression or structural equation in which effects are represented by parameters denoted by Greek letters. “Model checking” is then limited to tests of fit or testing additional terms for the model. Yet these tests of fit themselves make further assumptions that should be seen as part of the full model. For example, all common tests and confidence intervals depend on assumptions of random selection for observation or treatment and random loss or missingness within levels of controlled covariates. These assumptions have gradually come under scrutiny via sensitivity and bias analysis [ 98 ], but such methods remain far removed from the basic statistical training given to most researchers.

Less often stated is the even more crucial assumption that the analyses themselves were not guided toward finding nonsignificance or significance (analysis bias), and that the analysis results were not reported based on their nonsignificance or significance (reporting bias and publication bias). Selective reporting renders false even the limited ideal meanings of statistical significance, P values, and confidence intervals. Because author decisions to report and editorial decisions to publish results often depend on whether the P value is above or below 0.05, selective reporting has been identified as a major problem in large segments of the scientific literature [ 99 – 101 ].

Although this selection problem has also been subject to sensitivity analysis, there has been a bias in studies of reporting and publication bias: It is usually assumed that these biases favor significance. This assumption is of course correct when (as is often the case) researchers select results for presentation when P  ≤ 0.05, a practice that tends to exaggerate associations [ 101 – 105 ]. Nonetheless, bias in favor of reporting P  ≤ 0.05 is not always plausible let alone supported by evidence or common sense. For example, one might expect selection for P  > 0.05 in publications funded by those with stakes in acceptance of the null hypothesis (a practice which tends to understate associations); in accord with that expectation, some empirical studies have observed smaller estimates and “nonsignificance” more often in such publications than in other studies [ 101 , 106 , 107 ].

Addressing such problems would require far more political will and effort than addressing misinterpretation of statistics, such as enforcing registration of trials, along with open data and analysis code from all completed studies (as in the AllTrials initiative, http://www.alltrials.net/ ). In the meantime, readers are advised to consider the entire context in which research reports are produced and appear when interpreting the statistics and conclusions offered by the reports.

Conclusions

Upon realizing that statistical tests are usually misinterpreted, one may wonder what if anything these tests do for science. They were originally intended to account for random variability as a source of error, thereby sounding a note of caution against overinterpretation of observed associations as true effects or as stronger evidence against null hypotheses than was warranted. But before long that use was turned on its head to provide fallacious support for null hypotheses in the form of “failure to achieve” or “failure to attain” statistical significance.

We have no doubt that the founders of modern statistical testing would be horrified by common treatments of their invention. In their first paper describing their binary approach to statistical testing, Neyman and Pearson [ 108 ] wrote that “it is doubtful whether the knowledge that [a P value] was really 0.03 (or 0.06), rather than 0.05…would in fact ever modify our judgment” and that “The tests themselves give no final verdict, but as tools help the worker who is using them to form his final decision.” Pearson [ 109 ] later added, “No doubt we could more aptly have said, ‘his final or provisional decision.’” Fisher [ 110 ] went further, saying “No scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.” Yet fallacious and ritualistic use of tests continued to spread, including beliefs that whether P was above or below 0.05 was a universal arbiter of discovery. Thus by 1965, Hill [ 111 ] lamented that “too often we weaken our capacity to interpret data and to take reasonable decisions whatever the value of P . And far too often we deduce ‘no difference’ from ‘no significant difference.’”

In response, it has been argued that some misinterpretations are harmless in tightly controlled experiments on well-understood systems, where the test hypothesis may have special support from established theories (e.g., Mendelian genetics) and in which every other assumption (such as random allocation) is forced to hold by careful design and execution of the study. But it has long been asserted that the harms of statistical testing in more uncontrollable and amorphous research settings (such as social-science, health, and medical fields) have far outweighed its benefits, leading to calls for banning such tests in research reports—again with one journal banning P values as well as confidence intervals [ 2 ].

Given, however, the deep entrenchment of statistical testing, as well as the absence of generally accepted alternative methods, there have been many attempts to salvage P values by detaching them from their use in significance tests. One approach is to focus on P values as continuous measures of compatibility, as described earlier. Although this approach has its own limitations (as described in points 1, 2, 5, 9, 15, 18, 19), it avoids comparison of P values with arbitrary cutoffs such as 0.05, (as described in 3, 4, 6–8, 10–13, 15, 16, 21 and 23–25). Another approach is to teach and use correct relations of P values to hypothesis probabilities. For example, under common statistical models, one-sided P values can provide lower bounds on probabilities for hypotheses about effect directions [ 45 , 46 , 112 , 113 ]. Whether such reinterpretations can eventually replace common misinterpretations to good effect remains to be seen.

A shift in emphasis from hypothesis testing to estimation has been promoted as a simple and relatively safe way to improve practice [ 5 , 61 , 63 , 114 , 115 ], resulting in increasing use of confidence intervals and editorial demands for them; nonetheless, this shift has brought to the fore misinterpretations of intervals such as 19–23 above [ 116 ]. Other approaches combine tests of the null with further calculations involving both null and alternative hypotheses [ 117 , 118 ]; such calculations may, however, bring with them further misinterpretations similar to those described above for power, as well as greater complexity.

Meanwhile, in the hopes of minimizing harms of current practice, we can offer several guidelines for users and readers of statistics, and re-emphasize some key warnings from our list of misinterpretations:

  • Correct and careful interpretation of statistical tests demands examining the sizes of effect estimates and confidence limits, as well as precise P values (not just whether P values are above or below 0.05 or some other threshold).
  • Careful interpretation also demands critical examination of the assumptions and conventions used for the statistical analysis—not just the usual statistical assumptions, but also the hidden assumptions about how results were generated and chosen for presentation.
  • It is simply false to claim that statistically nonsignificant results support a test hypothesis, because the same results may be even more compatible with alternative hypotheses—even if the power of the test is high for those alternatives.
  • Interval estimates aid in evaluating whether the data are capable of discriminating among various hypotheses about effect sizes, or whether statistical results have been misrepresented as supporting one hypothesis when those results are better explained by other hypotheses (see points 4–6). We caution however that confidence intervals are often only a first step in these tasks. To compare hypotheses in light of the data and the statistical model it may be necessary to calculate the P value (or relative likelihood) of each hypothesis. We further caution that confidence intervals provide only a best-case measure of the uncertainty or ambiguity left by the data, insofar as they depend on an uncertain statistical model.
  • Correct statistical evaluation of multiple studies requires a pooled analysis or meta-analysis that deals correctly with study biases [ 68 , 119 – 125 ]. Even when this is done, however, all the earlier cautions apply. Furthermore, the outcome of any statistical procedure is but one of many considerations that must be evaluated when examining the totality of evidence. In particular, statistical significance is neither necessary nor sufficient for determining the scientific or practical significance of a set of observations. This view was affirmed unanimously by the U.S. Supreme Court (Matrixx Initiatives, Inc., et al. v. Siracusano et al. No. 09–1156. Argued January 10, 2011, Decided March 22, 2011), and can be seen in our earlier quotes from Neyman and Pearson.
  • Any opinion offered about the probability , likelihood , certainty , or similar property for a hypothesis cannot be derived from statistical methods alone. In particular, significance tests and confidence intervals do not by themselves provide a logically sound basis for concluding an effect is present or absent with certainty or a given probability. This point should be borne in mind whenever one sees a conclusion framed as a statement of probability, likelihood, or certainty about a hypothesis. Information about the hypothesis beyond that contained in the analyzed data and in conventional statistical models (which give only data probabilities) must be used to reach such a conclusion; that information should be explicitly acknowledged and described by those offering the conclusion. Bayesian statistics offers methods that attempt to incorporate the needed information directly into the statistical model; they have not, however, achieved the popularity of P values and confidence intervals, in part because of philosophical objections and in part because no conventions have become established for their use.
  • All statistical methods (whether frequentist or Bayesian, or for testing or estimation, or for inference or decision) make extensive assumptions about the sequence of events that led to the results presented—not only in the data generation, but in the analysis choices. Thus, to allow critical evaluation, research reports (including meta-analyses) should describe in detail the full sequence of events that led to the statistics presented, including the motivation for the study, its design, the original analysis plan, the criteria used to include and exclude subjects (or studies) and data, and a thorough description of all the analyses that were conducted.

In closing, we note that no statistical method is immune to misinterpretation and misuse, but prudent users of statistics will avoid approaches especially prone to serious abuse. In this regard, we join others in singling out the degradation of P values into “significant” and “nonsignificant” as an especially pernicious statistical practice [ 126 ].

Acknowledgments

SJS receives funding from the IDEAL project supported by the European Union’s Seventh Framework Programme for research, technological development and demonstration under Grant Agreement No. 602552. We thank Stuart Hurlbert, Deborah Mayo, Keith O’Rourke, and Andreas Stang for helpful comments, and Ron Wasserstein for his invaluable encouragement on this project.

Editor’s note

This article has been published online as supplementary material with an article of Wasserstein RL, Lazar NA. The ASA’s statement on p-values: context, process and purpose. The American Statistician 2016.

Albert Hofman, Editor-in-Chief EJE.

Contributor Information

Sander Greenland, Email: lesdomes@ucla.edu.

Stephen J. Senn, Email: [email protected] .

John B. Carlin, Email: [email protected] .

Charles Poole, Email: cpoole@unc.edu.

Steven N. Goodman, Email: [email protected] .

Douglas G. Altman, Email: [email protected] .


Week 5: Introduction to Hypothesis Testing Reading

An introduction to hypothesis testing.

What are you interested in learning about? Perhaps you’d like to know if there is a difference in average final grade between two different versions of a college class? Does the Fort Lewis women’s soccer team score more goals than the national Division II women’s average? Which outdoor sport do Fort Lewis students prefer the most?  Do the pine trees on campus differ in mean height from the aspen trees? For all of these questions, we can collect a sample, analyze the data, then make a statistical inference based on the analysis.  This means determining whether we have enough evidence to reject our null hypothesis (what was originally assumed to be true, until we prove otherwise). The process is called hypothesis testing .

A really good Khan Academy video to introduce the hypothesis test process: Khan Academy Hypothesis Testing. As you watch, please don’t get caught up in the calculations, as we will use SPSS to do these calculations.  We will also use SPSS p-values, instead of the referenced Z-table, to make statistical decisions.

The Six-Step Process

Hypothesis testing requires very specific, detailed steps.  Think of it as a mathematical lab report where you have to write out your work in a particular way.  There are six steps that we will follow for ALL of the hypothesis tests that we learn this semester.

Six Step Hypothesis Testing Process

1. Research Question

All hypothesis tests start with a research question.  This is literally a question that includes what you are trying to prove, like the examples earlier:  Which outdoor sport do Fort Lewis students prefer the most? Is there sufficient evidence to show that the Fort Lewis women’s soccer team scores more goals than the national Division II women’s average?

In this step, besides literally being a question, you’ll want to include:

  • mention of your variable(s)
  • wording specific to the type of test that you’ll be conducting (mean, mean difference, relationship, pattern)
  • specific wording that indicates directionality (are you looking for a ‘difference’, are you looking for something to be ‘more than’ or ‘less than’ something else, or are you comparing one pattern to another?)

Consider this research question: Do the pine trees on campus differ in mean height from the aspen trees?

  • The wording of this research question clearly mentions the variables being studied. The independent variable is the type of tree (pine or aspen), and these trees are having their heights compared, so the dependent variable is height.
  • ‘Mean’ is mentioned, so this indicates a test with a quantitative dependent variable.
  • The question also asks if the tree heights ‘differ’. This specific word indicates that the test being performed is a two-tailed (i.e. non-directional) test. More about the meaning of one/two-tailed will come later.

2. Statistical Hypotheses

A statistical hypothesis test has a null hypothesis, the status quo, what we assume to be true.  Notation is H 0, read as “H naught”.  The alternative hypothesis is what you are trying to prove (mentioned in your research question), H 1 or H A .  All hypothesis tests must include a null and an alternative hypothesis.  We also note which hypothesis test is being done in this step.

The notation for your statistical hypotheses will vary depending on the type of test you’re doing. Writing statistical hypotheses is NOT the same as writing most scientific hypotheses: you are not writing sentences explaining what you think will happen in the study. Here is an example of what statistical hypotheses look like, using the research question: Do the pine trees on campus differ in mean height from the aspen trees?

\(H_0: \mu_{\text{pine}} = \mu_{\text{aspen}}\)

\(H_1: \mu_{\text{pine}} \neq \mu_{\text{aspen}}\)

(a two-tailed test comparing two means: the null says the mean heights are equal, the alternative says they differ)
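To make the notation concrete, here is a minimal sketch in Python of how these two hypotheses could be tested (the course itself uses SPSS for this); the heights below are hypothetical values invented purely for illustration.

    from scipy import stats

    # Hypothetical tree heights in feet; real data would come from measuring campus trees.
    pine_heights = [42.1, 39.5, 45.0, 41.2, 43.8, 40.6, 44.3, 38.9]
    aspen_heights = [35.4, 38.2, 36.9, 40.1, 34.8, 37.5, 39.0, 36.2]

    # Two-tailed independent-samples t-test:
    #   H0: mean pine height = mean aspen height
    #   H1: mean pine height != mean aspen height
    t_stat, p_value = stats.ttest_ind(pine_heights, aspen_heights)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

SPSS reports the same two numbers, a t statistic and a two-tailed significance value, in its Independent Samples Test output.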

3. Decision Rule

In this step, you state which alpha value you will use and, when appropriate, the directionality, or tail, of the test. You also write a statement: “I will reject the null hypothesis if p < alpha” (inserting your actual alpha value). In this introductory class, alpha is the level of significance, the probability we are willing to accept of making the wrong statistical decision by rejecting a true null hypothesis, and it will be set to 0.05 or 0.01.

Example of a Decision Rule:

Let alpha=0.01, two-tailed. I will reject the null hypothesis if p<0.01.

4. Assumptions, Analysis and Calculations

Quite a bit goes on in this step. The assumptions of the particular hypothesis test must be checked. SPSS will be used to create appropriate graphs and test output tables. Where appropriate, the test’s effect size will also be calculated in this step.

All hypothesis tests have assumptions that we hope to meet. For example, tests with a quantitative dependent variable use a histogram to check whether the distribution is normal and whether there are any obvious outliers. Each hypothesis test has different assumptions, so it is important to pay attention to the specific test’s requirements.

Required SPSS output will also depend on the test.
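This course does the checking in SPSS, but as a rough sketch of the same idea, here is how the normality and outlier check might look in Python; the sample data are simulated, and the Shapiro-Wilk test is offered as a numeric companion to the visual check rather than as a course requirement.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(0)
    heights = rng.normal(loc=40, scale=4, size=30)  # hypothetical sample of tree heights

    # Visual check: is the histogram roughly bell-shaped, with no obvious outliers?
    plt.hist(heights, bins=8, edgecolor="black")
    plt.xlabel("Height (ft)")
    plt.title("Normality and outlier check")
    plt.show()

    # Numeric companion: Shapiro-Wilk test of normality (small p suggests non-normality).
    w, p = stats.shapiro(heights)
    print(f"Shapiro-Wilk W = {w:.3f}, p = {p:.3f}")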

5. Statistical Decision

It is in Step 5 that we determine if we have enough statistical evidence to reject our null hypothesis.  We will consult the SPSS p-value and compare to our chosen alpha (from Step 3: Decision Rule).

Put very simply, the p-value is the probability that, if the null hypothesis is true, the results from another randomly selected sample would be as extreme as or more extreme than the results obtained from the given sample. The p-value can also be loosely thought of as the probability that the results we are seeing in the sample are due solely to chance. This concept will be discussed in much further detail in the class notes.

Based on this numerical comparison between the p-value and alpha, we’ll either reject or retain our null hypothesis. Note: you may NEVER “accept” the null hypothesis, because it is impossible to prove a null hypothesis to be true.

Retaining the null means that you don’t have enough evidence to support your alternative hypothesis, so you fall back to the null. (You retain the null when p is greater than or equal to alpha.)

Rejecting the null means that you did find enough evidence to support your alternative hypothesis. (You reject the null when p is less than alpha.)

Example of a Statistical Decision:

Retain the null hypothesis, because p=0.12 > alpha=0.01.

The p-value will come from the SPSS output, and alpha will have already been set back in Step 3. Be very careful when you compare the decimal values of the p-value and alpha. If, for example, you mistakenly think that p=0.12 < alpha=0.01, you will make the incorrect statistical decision, which will likely lead to an incorrect interpretation of the study’s findings.
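Since the whole of Step 5 boils down to one decimal comparison, even a one-line check can guard against misreading p and alpha; a minimal sketch, using the values from the example above:

    # Values from the example above.
    p_value, alpha = 0.12, 0.01

    if p_value < alpha:
        print(f"Reject the null hypothesis, because p={p_value} < alpha={alpha}.")
    else:
        print(f"Retain the null hypothesis, because p={p_value} >= alpha={alpha}.")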

6. Interpretation

The interpretation is where you write up your findings. The specifics vary with the type of hypothesis test performed, but you will always include a plain-English, contextual conclusion of what your study found (i.e., what it means to reject or retain the null hypothesis in that particular study). You’ll quote statistics to support your decision, some of which will need to be written in APA style (the American Psychological Association citation style). For some hypothesis tests, you’ll also include an interpretation of the effect size.

Some hypothesis tests will also require an additional (non-parametric) test after the completion of your original test if the test’s assumptions have not been met. These tests are also called “post-hoc tests”.

As previously stated, hypothesis testing is a very detailed process. Do not be concerned if you have read through all of the steps above and have many questions (and are possibly very confused). It will take time and a lot of practice to learn and apply these steps!

This Reading is just meant as an overview of hypothesis testing. Much more information is forthcoming in the various sets of Notes about the specifics needed in each of these steps. The Hypothesis Test Checklist will be a critical resource for you to refer to during homeworks and tests.

Student Course Learning Objectives

4.  Choose, administer and interpret the correct tests based on the situation, including identification of appropriate sampling and potential errors

c. Choose the appropriate hypothesis test given a situation

d. Describe the meaning and uses of alpha and p-values

e. Write the appropriate null and alternative hypotheses, including whether the alternative should be one-sided or two-sided

f. Determine and calculate the appropriate test statistic (e.g. z-test, multiple t-tests, Chi-Square, ANOVA)

g. Determine and interpret effect sizes.

h. Interpret results of a hypothesis test

  • Use technology in the statistical analysis of data
  • Communicate in writing the results of statistical analyses of data

Attributions

Adapted from “Week 5 Introduction to Hypothesis Testing Reading” by Sherri Spriggs and Sandi Dang, licensed under CC BY-NC-SA 4.0.

Math 132 Introduction to Statistics Readings Copyright © by Sherri Spriggs is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.


5.1 - Hypothesis Testing Overview

Jin asked Carlos if he had taken statistics. Carlos said he had, but it was a long time ago and he did not remember a lot of it. Jin told Carlos that understanding hypothesis testing would help him understand what the judge had just said. In most research, a researcher has a “research hypothesis”, that is, what the researcher THINKS is going to occur because of some kind of intervention or treatment. In the courtroom, the prosecutor is the researcher, thinking the person on trial is guilty. This would be the research hypothesis: guilty. However, as most of us know, the U.S. legal system operates on the principle that a person is innocent until PROVEN guilty. In other words, we have to believe in innocence until there is enough evidence to change our mind that the person on trial is actually not innocent. In hypothesis testing, we refer to the presumption of innocence as the NULL HYPOTHESIS. So while the prosecutor has a research hypothesis, it must be shown that the presumption of innocence can be rejected.

Like the judge in the TV show, if we have enough evidence to conclude that the null is not true, we can reject the null. Jin explained that if the judge had had enough evidence to conclude that the person on trial was not innocent, she would have rejected innocence. The judge specifically stated that she did not have enough evidence to reject innocence (the null hypothesis).

When the judge acquits a defendant, as on the T.V. show, this does not mean that the judge accepts the defendant’s claim of innocence. It only says that innocence is plausible because guilt has not been established beyond a reasonable doubt.

On the other hand, if the judge returns a guilty verdict, she has concluded that innocence (the null) is not plausible given the evidence presented; she therefore rejects the null hypothesis of innocence and concludes the alternative hypothesis: guilty.

Let’s take a closer look at how this works.

Making a Decision

Taking a sample of 500 Penn State students, we asked them whether they like cold weather and observed a sample proportion of 0.556. Since these students go to school in Pennsylvania, it might generally be thought that the true proportion of students who like cold weather is 0.5. In other words, the NULL hypothesis is that the true population proportion is equal to 0.5.

In order to “test” what is generally thought about these students (that half of them like cold weather), we have to ask how the data from our sample relate to the hypothesized null value. In other words, is our observed sample proportion far enough away from 0.5 to suggest that there is evidence against the null? Translating this into statistical terms, we can think about the “how far” question in terms of standard deviations. How many standard deviations apart would we consider to be “meaningfully different”?

What if, instead of a cutoff standard deviation, we found a probability? With a null hypothesis that the proportion equals 0.5, the alternative hypothesis is that the proportion does not equal 0.5. To test this, we convert the distance between the observed value and the null value into a standardized statistic. We have worked with standardized scores in the form of z-scores, and we have learned the empirical rule. Combining these two concepts, we can begin to make decisions about how far apart the observed value and the null value need to be to count as “meaningfully different”.

To do this we calculate a z statistic, which is a standardized score of the difference:

\(z^{*}=\dfrac{\hat{p}-p_{0}}{\sqrt{\frac{p_{0}\left(1-p_{0}\right)}{n}}}\)

We can look at the results of calculating a z test (which we will do using software). A large test statistic indicates a large difference between the observed value and the null value, contributing greater evidence of a significant difference and casting doubt that the true population proportion is the null value.

Accompanying the magnitude of the test statistic, our software also yields a “probability.” Returning to the empirical rule, we know the percentiles under a standard normal curve, and we can apply these to determine the probability of getting an observed score at least this extreme IF the null hypothesis is indeed true (that is, if the null value is the mean of the sampling distribution). In this class we will not calculate these by hand, but we do need to understand what the “p-values” in the output mean. In our example, after calculating a z statistic, we determine that if the true proportion is 0.5, the probability of getting a sample proportion as extreme as 0.556 is 0.0061. This is a very small probability, as measured against the standard that defines “small” as a probability less than 0.05. In this case, we would reject the null hypothesis as a probable value for the population, based on the evidence from our sample.
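As a sketch of where those numbers come from (in this class they are read off the software output), the z statistic and its tail probability can be computed directly; note that 0.0061 is the one-tail area, so a strictly two-sided “not equal to 0.5” alternative would double it to about 0.012.

    from math import sqrt
    from scipy.stats import norm

    p_hat, p0, n = 0.556, 0.5, 500

    # Standardized distance between the observed and hypothesized proportions.
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

    one_sided = norm.sf(z)       # P(Z >= z) under the null hypothesis
    two_sided = 2 * one_sided    # for the "not equal to 0.5" alternative
    print(f"z = {z:.2f}, one-sided p = {one_sided:.4f}, two-sided p = {two_sided:.4f}")
    # Prints: z = 2.50, one-sided p = 0.0061, two-sided p = 0.0123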

While p-values are a standard in most statistics courses and textbooks, there have been recent conversations about their use.

  American Statistical Association Releases Statement on Statistical Significance and P-Values

The use of p-values is a common practice in statistical inference but is not without controversy. In March 2016, the American Statistical Association released a statement regarding p-values and their use in statistical studies and decision making.

You can review the full article: ASA Statement on p-Values: Context, Process, and Purpose

P-Values

Before we proceed any further, we need to step away from the jargon and understand exactly what the heck a p-value is. Simply put, a p-value is the probability of getting the observed sample statistic (or one more extreme), given that the null hypothesis is true. In our example, IF the true proportion of Penn State students who like the cold IS really 0.5 (as we state in the null hypothesis), what is the probability that we would get an observed sample statistic of 0.556?

When the probability is small, we have one of two options: we can conclude there is something wrong with our sample (unlikely, if we followed the good sampling techniques discussed earlier in the notes), OR we can conclude that the null is probably not the true population value.

To summarize the application of the p value:

  • If our p-value is less than or equal to \(\alpha \), then there is enough evidence to reject the null hypothesis (in most cases the alpha is going to be 0.05).
  • If our p-value is greater than \(\alpha \), there is not enough evidence to reject the null hypothesis.

One should be aware that \(\alpha \) is also called the level of significance, which makes for some confusion in terminology: \(\alpha \) is the preset level of significance, whereas the p-value is the observed level of significance. The p-value is, in fact, a summary statistic that translates the observed test statistic's value into a probability that is easy to interpret.

We can summarize the data by reporting the p-value and let the users decide to reject \(H_0 \) or not to reject \(H_0 \) for their subjectively chosen \(\alpha\) values.
