Hypothesis Testing: Multiple Means

10.3 - Multiple Comparisons

If the null hypothesis is rejected, we conclude that not all the means are equal: that is, at least one mean is different from the other means. The ANOVA test itself provides statistical evidence that a difference exists, but no evidence as to which mean or means differ.

For instance, using the previous example for tar content: if the ANOVA test results in a significant difference in average tar content between the cigarette brands, a follow-up analysis would be needed to determine which brand mean or means differ in tar content. We would also want to know whether one brand or multiple brands were better or worse than another brand in average tar content. To complete this analysis, we use a method called multiple comparisons.

Multiple comparisons analyzes all possible pairwise differences of means. For example, with three brands of cigarettes, A, B, and C, if the ANOVA test was significant, then multiple comparison methods would make the three possible pairwise comparisons:

  • Brand A to Brand B
  • Brand A to Brand C
  • Brand B to Brand C

These are essentially tests of two means, similar to what we learned previously in our lesson for comparing two means. However, the methods here use an adjustment to account for the number of comparisons taking place. Minitab provides three adjustment choices. We will use the Tukey adjustment, which adjusts the t-multiplier based on the number of comparisons.

Note! We do not go into the theory behind the Tukey method. Just note that we only use a multiple comparison technique in ANOVA when we have a significant result.

In the next section, we present an example to walk through the ANOVA results.
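For readers working outside Minitab, the same two-step workflow can be sketched in Python with scipy (version 1.8 or later for tukey_hsd). The three brand samples below are hypothetical numbers made up for illustration, not the course data.

```python
# Sketch: one-way ANOVA followed by Tukey's HSD (requires scipy >= 1.8).
# The three samples are hypothetical tar measurements for brands A, B, C.
from scipy import stats

brand_a = [10.1, 9.8, 10.3, 9.9, 10.0, 9.9]
brand_b = [11.2, 10.9, 11.1, 10.8, 11.0, 11.0]
brand_c = [12.1, 11.8, 12.2, 11.9, 12.0, 12.0]

# Step 1: the overall ANOVA tests H0: all brand means are equal.
f_stat, p_value = stats.f_oneway(brand_a, brand_b, brand_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")

# Step 2: only when the ANOVA is significant do we follow up with
# Tukey-adjusted pairwise comparisons to see which means differ.
if p_value < 0.05:
    print(stats.tukey_hsd(brand_a, brand_b, brand_c))
```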


Using Minitab to Perform One-Way ANOVA

Once the data are entered in Minitab, we use:

  • Stat > ANOVA > One-Way
  • If the responses are in one column and the factor levels are in another column, select the drop-down option 'Response data are in one column for all factor levels.'
  • If the responses are in their own column for each factor level, then select 'Response data are in a separate column for each factor level.'
  • In case we have a significant ANOVA result and want to conduct a multiple comparison analysis, click 'Comparisons', check the box for Tukey, and verify that the boxes for 'Interval plot for differences of means' and 'Grouping information' are also checked.
  • Click OK and OK again.

Example: Tar Content (ANOVA)

Test the hypothesis that the means are the same vs. at least one is different for both labs. Compare the two labs and comment.

Lab Precise

We are testing the following hypotheses:

\(H_0\colon \mu_1=\mu_2=\mu_3\) vs \(H_a\colon\text{ at least one mean is different}\)

The assumptions were discussed in the previous example.

The following is the output for one-way ANOVA for Lab Precise:

One-way ANOVA: Precise A, Precise B, Precise C

Null hypothesis: All means are equal
Alternative hypothesis: Not all means are equal
Significance level: \(\alpha = 0.05\)

Equal variances were assumed for the analysis.

Factor Information

Factor  Levels  Values
Factor  3       Precise A, Precise B, Precise C

Analysis of Variance

Source  DF  Adj SS  Adj MS   F-Value  P-Value
Factor   2  12.000  6.00000    65.46    0.000
Error   15   1.375  0.09165
Total   17  13.375

Model Summary

S R-sq R-sq(adj) R-sq(pred)
0.302743 89.72% 88.35% 85.20%

The p-value for this test is less than 0.0001. At any reasonable significance level, we would reject the null hypothesis and conclude there is enough evidence in the data to suggest at least one mean tar content is different.

But which ones are different? The next step is to examine the multiple comparisons. Minitab provides the following output:

Factor N Mean StDev 95% CI
Precise A 6 10.000 0.257 (9.737, 10.263)
Precise B 6 11.000 0.365 (10.737, 11.263)
Precise C 6 12.000 0.276 (11.737, 12.263)

Pooled StDev = 0.302743

Tukey Pairwise Comparisons

Grouping Information Using the Tukey Method and 95% Confidence

Factor N Mean Grouping
Precise C 6 12.000 A
Precise B 6 11.000 B
Precise A 6 10.000 C

Means that do not share a letter are significantly different.

The Tukey pairwise comparisons suggest that all the means are different. Therefore, Brand C has the highest tar content and Brand A has the lowest.

We are testing the same hypotheses for Lab Sloppy as Lab Precise, and the assumptions were checked. The ANOVA output for Lab Sloppy is:

One-way ANOVA: Sloppy A, Sloppy B, Sloppy C

Factor Information

Factor  Levels  Values
Factor  3       Sloppy A, Sloppy B, Sloppy C

Analysis of Variance

Source  DF  Adj SS  Adj MS  F-Value  P-Value
Factor   2   12.00   6.000     1.96    0.176
Error   15   45.98   3.065
Total   17   57.98

Model Summary

S        R-sq    R-sq(adj)  R-sq(pred)
1.75073  20.70%  10.12%     0.00%

The one-way ANOVA showed statistically significant results for Lab Precise but not for Lab Sloppy. Recall that ANOVA compares the within variation and the between variation. For Lab Precise, the within variation was small compared to the between variation. This resulted in a large F-statistic (65.46) and thus a small p-value. For Lab Sloppy, this ratio was small (1.96), resulting in a large p-value.
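The role of the within-group variation can be illustrated with a short simulation: the two synthetic data sets below share the same group means (10, 11, and 12, mirroring the tar example), but differ in within-group spread. The specific numbers and the seed are arbitrary choices for illustration, not the labs' actual data.

```python
# Same between-group differences, different within-group spread:
# only the low-noise "lab" yields a large F-statistic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
means = [10, 11, 12]

precise = [rng.normal(m, 0.3, size=6) for m in means]  # small within-group SD
sloppy = [rng.normal(m, 1.8, size=6) for m in means]   # large within-group SD

f_p, p_p = stats.f_oneway(*precise)
f_s, p_s = stats.f_oneway(*sloppy)
print(f"precise-like lab: F = {f_p:.1f}, p = {p_p:.4g}")  # large F, small p
print(f"sloppy-like lab:  F = {f_s:.1f}, p = {p_s:.4g}")  # small F, large p
```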

Try it!

Twenty young pigs are assigned at random among 4 experimental groups. Each group is fed a different diet. (This design is a completely randomized design.) The data are each pig's weight, in kilograms, after being raised on these diets for 10 months (pig_weights.txt). We wish to determine whether the mean pig weights are the same for all 4 diets.

First, we set up our hypothesis test:

\(H_0\colon \mu_1=\mu_2=\mu_3=\mu_4\)

\(H_a\colon \text { at least one mean weight is different}\)

Here are the data that were obtained from the four experimental groups, as well as their summary statistics:

Feed 1 Feed 2 Feed 3 Feed 4
60.8 68.3 102.6 87.9
57.1 67.7 102.2 84.7
65.0 74.0 100.5 83.2
58.7 66.3 97.5 85.8
61.8 69.9 98.9 90.3

Output from Minitab:

Descriptive Statistics: Feed 1, Feed 2, Feed 3, Feed 4

Variable  N  N*  Mean    StDev  Minimum  Maximum
Feed 1    5  0    60.68  3.03    57.10    65.00
Feed 2    5  0    69.24  2.96    66.30    74.00
Feed 3    5  0   100.34  2.16    97.50   102.60
Feed 4    5  0    86.38  2.78    83.20    90.30

The smallest standard deviation is 2.16, and the largest is 3.03. Since the rule of thumb is satisfied here (the largest standard deviation is less than twice the smallest), we can say the equal variance assumption is not violated. The description suggests that the samples are independent. There is nothing in the description to suggest the weights come from a normal distribution. The normal probability plots are:

[Normal probability plots for Feed 1–4, each showing the fitted line and 95% confidence interval lines.]

There are no obvious violations from the normal assumption, but we should proceed with caution as the sample sizes are very small.

The ANOVA output is:

One-way ANOVA: Feed 1, Feed 2, Feed 3, Feed 4

Factor Information

Factor  Levels  Values
Factor  4       Feed 1, Feed 2, Feed 3, Feed 4

Analysis of Variance

Source  DF  Adj SS  Adj MS   F-Value  P-Value
Factor   3  4703.2  1567.73   206.72    0.000
Error   16   121.3     7.58
Total   19  4824.5

Model Summary

S        R-sq    R-sq(adj)  R-sq(pred)
2.75386  97.48%  97.01%     96.07%

The p-value for the test is less than 0.001. With a significance level of 5%, we reject the null hypothesis. The data provide sufficient evidence to conclude that the mean weights of pigs from the four feeds are not all the same.
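As a cross-check on the Minitab output, the F-statistic can be reproduced from the data table above. Here is a minimal sketch in Python (one tool among many; the course itself uses Minitab):

```python
# Reproduce the one-way ANOVA for the pig-weight data; F should be about 206.7.
from scipy import stats

feed1 = [60.8, 57.1, 65.0, 58.7, 61.8]
feed2 = [68.3, 67.7, 74.0, 66.3, 69.9]
feed3 = [102.6, 102.2, 100.5, 97.5, 98.9]
feed4 = [87.9, 84.7, 83.2, 85.8, 90.3]

f_stat, p_value = stats.f_oneway(feed1, feed2, feed3, feed4)
print(f"F = {f_stat:.2f}, p = {p_value:.2e}")  # p is far below 0.001
```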

With a rejection of the null hypothesis leading us to conclude that not all the means are equal (i.e., at least the mean pig weight for one diet differs from the mean pig weights for the other diets), some follow-up questions are:

  • "Which diet type results in different average pig weights?", and
  • "Is there one particular diet type that produces the largest/smallest mean weight?"

To answer these questions we analyze the multiple comparison output (the grouping information) and the interval graph.

Factor N Mean StDev 95% CI
Feed 1 5 60.68 3.03 (58.07, 63.29)
Feed 2 5 69.24 2.96 (66.63, 71.85)
Feed 3 5 100.340 2.164 (97.729, 102.951)
Feed 4 5 86.38 2.78 (83.77, 88.99)

Pooled StDev = 2.75386

Factor N Mean Grouping
Feed 3 5 100.340 A      
Feed 4 5 86.38   B    
Feed 2 5 69.24     C  
Feed 1 5 60.68       D

Each of these factor levels is associated with a grouping letter. If any factor levels share a letter, then the multiple comparison method did not find a significant difference between their mean responses. For any factor levels that do not share a letter, a significant mean difference was identified. From the lettering we see that each diet type has a different letter, i.e., no two groups share a letter. Therefore, we can conclude that all four diets resulted in statistically significantly different mean pig weights. Furthermore, with the means also listed from highest to lowest, we can say that Feed 3 resulted in the highest mean weight, followed by Feed 4, then Feed 2, then Feed 1. This grouping result is supported by the graph of the intervals.

[Confidence interval plot of the comparisons between the feeds. None of the intervals covers zero, indicating the corresponding means are significantly different.]

In analyzing the intervals, we reflect back on our lesson in comparing two means: if an interval contained zero, we could not conclude a difference between the two means; if the interval did not contain zero, then a difference between the two means was supported. With four factor levels, there are six possible pairwise comparisons. (Remember the combinations formula for counting the number of possible pairs? In this case \({4\choose 2} = 6\).) In inspecting each of these six intervals, we find that all six do NOT include zero. Therefore, there is a statistical difference between all four group means; the four types of diet resulted in significantly different mean pig weights.
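The grouping table and the interval plot can likewise be reproduced outside Minitab. A sketch with scipy's tukey_hsd (scipy 1.8 or later), where all six intervals should exclude zero:

```python
# Tukey pairwise comparisons for the pig-weight data.
from scipy import stats

feed1 = [60.8, 57.1, 65.0, 58.7, 61.8]
feed2 = [68.3, 67.7, 74.0, 66.3, 69.9]
feed3 = [102.6, 102.2, 100.5, 97.5, 98.9]
feed4 = [87.9, 84.7, 83.2, 85.8, 90.3]

result = stats.tukey_hsd(feed1, feed2, feed3, feed4)
print(result)  # all six adjusted p-values are tiny

ci = result.confidence_interval(confidence_level=0.95)
print(ci.low)   # for each pair, the interval (low, high) excludes zero,
print(ci.high)  # matching the grouping table: every feed differs.
```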

Comparing More Than Two Means: One-Way ANOVA

Copyright © 2009–2024 by Stan Brown, BrownMath.com

When you have several means to compare, it’s not valid just to compare all possible pairs with t tests. Instead, you follow a two-stage process:

  • Are all the means equal? A computation called ANOVA (analysis of variance) answers this question.
  • If ANOVA shows that the means aren’t all equal, then which means are unequal, and by how much? There are many ways to answer this question (and they give different answers), but we’ll use a process called Tukey’s HSD (Honestly Significant Difference).

Terminology


The factor that varies between samples is called the factor . (Every once in a while things are easy.) The r different values or levels of the factor are called the treatments . Here the factor is the choice of fat and the treatments are the four fats, so r  = 4.

The computations to test the means for equality are called a 1-way ANOVA or 1-factor ANOVA .

Example 1: Fat for Frying Donuts

g Fat Absorbed in Batch          x̅     s
Fat 1   64  72  68  77  56  95   72   13.34
Fat 2   78  91  97  82  85  77   85    7.77
Fat 3   75  93  78  71  63  76   76    9.88
Fat 4   55  66  49  64  70  68   62    8.22

source: [full citation in “References”, below] pp 217–218

Hoping to produce a donut that could be marketed to health-conscious consumers, a company tried four different fats to see which one was least absorbed by the donuts during the deep frying process. Each fat was used for six batches of two dozen donuts each, and the table shows the grams of fat absorbed by each batch of donuts.

It looks like donuts absorb the most of Fat 2 and the least of Fat 4, with intermediate amounts of Fat 1 and Fat 3. But there’s a lot of overlap, too: for instance, even though the mean for Fat 2 is much higher than for Fat 1, one sample of Fat 1, 95 g, is higher than five of the six samples of Fat 2.

Nevertheless, the sample means do look different. But what about the population means? In other words, would the four fats be absorbed in different amounts if you made a whole lot of batches of donuts — do statistics justify choosing one fat over another? This is the basic question of a hypothesis test or significance test: is the difference great enough that you can rule out chance?

If Fats 2 and 4 were the only ones you had data for, you’d do a good old 2-sample t test. So why can’t you do that anyway? Because that would greatly increase your chances of a Type I error. The reasons are given in the Appendix.

By the way, though usually you are interested in the differences between population means with various treatments, you can also estimate the individual means. If you’re interested, see Estimating Individual Treatment Means in the Appendix.

Step 1: ANOVA Test for Equality of All Means

The ANOVA procedure tests these hypotheses:

H 0 : μ 1 = μ 2 = … = μ r , all the means are the same

H 1 : two or more means are different from the others

Let’s test these hypotheses at the α = 0.05 significance level.

You might wonder why you do analysis of variance to test means , but this actually makes sense. The question, remember, is whether the observed difference in means is too large to be the result of random selection. How do you decide whether the difference is too large? You look at the absolute difference of means between treatments (samples), but you also consider the variability within each treatment. Intuitively, if the difference between treatments is a lot bigger than the difference within treatments, you conclude that it’s not due to random chance and there is a real effect.

And this is just how ANOVA works: comparing the variation between groups to the variation within groups. Hence, analysis of variance .

Requirements for ANOVA

  • You need r simple random samples for the r treatments, and they need to be independent samples. The sample sizes need not be the same, though it’s best if they’re not very different.
  • The underlying populations should be normally distributed . However, the ANOVA test is robust and moderate departures from normality aren’t a problem, especially if sample sizes are large and equal or nearly equal ( Kuzma & Bohnenblust 2005 [full citation at https://BrownMath.com/swt/sources.htm#so_Kuzma2005] page 180).

Miller 1986 [full citation in “References”, below] (pages 90–91) is more cautious. When sample sizes are equal but standard deviations are not, the actual p-value will be slightly larger than what you find in the tables. But when sample sizes are unequal, and the smaller samples have the larger standard deviations, the actual p-value “can increase dramatically above” what the tables say, even “without too much disparity” in the standard deviations. “Falsely reporting significant results when the small samples have the larger variances is a serious worry. The lesson to be learned is to balance the experiment [equal sample sizes] if at all possible. ”

A 1-way ANOVA tests whether the means of all groups are equal for different levels of one factor, using some fairly lengthy calculations. You could do all the computations by hand as shown in the Appendix, but no one ever does. Here are some alternatives:

  • Excel’s Anova: Single Factor command is in the Tools  » Data Analysis menu in Excel 2003 and below, or the Data  » Analysis  » Data Analysis menu in Excel 2007. If you don’t see it there, follow instructions in Excel help to load the Analysis Toolpak.
  • On a TI-83 or TI-84, enter each sample in a statistics list, then press [ STAT ] [ ◄ ] [ ▲ ] to select ANOVA , and enter the list names separated by commas.
  • There are even Web-based ANOVA calculators, such as Lowry 2001b [full citation in “References”, below] .
  • There are many software packages for mathematics and statistics that include ANOVA calculations. One of them, R , is highly regarded and is open source.

When you use a calculator or computer program to do ANOVA, you get an ANOVA table that looks something like this:

Source                         SS      df  MS     F     p
Between groups (or “Factor”)   1636.5   3  545.5  5.41  0.0069
Within groups (or “Error”)     2018.0  20  100.9
Total                          3654.5  23

Note that the mean square between treatments, 545.5, is much larger than the mean square within treatments, 100.9. That ratio, between-groups mean square over within-groups mean square, is called an F statistic (F = MS B /MS W = 5.41 in this example). It tells you how much more variability there is between treatment groups than within treatment groups. The larger that ratio, the more confident you feel in rejecting the null hypothesis , which was that all means are equal and there is no treatment effect.

But what you care about is the p-value of 0.0069, obtained from the F distribution. The p-value has the usual interpretation: the probability of the between-treatments MS being ≥5.41 times the within-treatments MS, if the null hypothesis is true, is p = 0.0069.

The p-value is below your significance level of 0.05: it would be quite unlikely to have MS B /MS W this large if there were no real difference among the means. Therefore you reject H 0 and accept H 1 , concluding that the mean absorption of all the fats is not the same .

An interesting extra parameter can be derived from the ANOVA table; see η²: Strength of Association in the Appendix below.

Now that you know that it does make a difference which fat is used, you naturally want to know which fats are significantly different. This is post-hoc analysis . There are several different post-hoc analyses, and no one is superior on all points, but the most common choice is the Tukey HSD.

Step 2: Tukey HSD for Post-Hoc Analysis

If your ANOVA test shows that the means aren’t all equal, your next step is to determine which means are different, to your level of significance. You can’t just perform a series of t tests , because that would greatly increase your likelihood of a Type I error. So what do you do?

John Tukey gave one answer to this question, the HSD (Honestly Significant Difference) test. You compute something analogous to a t score for each pair of means, but you don’t compare it to the Student’s t distribution. Instead, you use a new distribution called the studentized range or q distribution .

Caution: Perform post-hoc analysis only if the ANOVA test shows a p-value less than your α. If p>α, you don’t know whether the means are all the same or not, and you can’t go fishing for unequal means.

You generally want to know not just which means differ, but by how much they differ (the effect size ). The easiest thing is to compute the confidence interval first, and then interpret it for a significant difference in means (or no significant difference). You’ve already seen this relationship between a test of significance at the α level and a 1−α confidence interval:

  • If the endpoints of the CI have the same sign (both positive or both negative), then 0 is not in the interval and you can conclude that the means are different.
  • If the endpoints of the CI have opposite signs, then 0 is in the interval and you can’t determine whether the means are equal or different .

You compute that confidence interval similarly to the confidence interval for the difference of two means, but using the q distribution, which avoids the problem of inflating α:

(x̅ i − x̅ j ) ± q(α, r, df W ) · √( (MS W /2)·(1/n i + 1/n j ) )

where x̅ i and x̅ j are the two sample means, n i and n j are the two sample sizes, MS W is the within-groups mean square from the ANOVA table , and q is the critical value of the studentized range for α, the number of treatments or samples r , and the within-groups degrees of freedom df W . The square-root term is called the standardized error (as opposed to standard error).

Using the studentized range, developed by Tukey, overcomes the problem of inflated significance level that I talked about earlier. If sample sizes are equal, the risk of a Type I error is exactly α, and if sample sizes are unequal it’s less than α: the procedure is conservative . In terms of confidence intervals, if the sample sizes are equal then the confidence level is the stated 1−α, but if the sample sizes are unequal then the actual confidence level is greater than 1−α ( NIST 2012 [full citation in “References”, below] section 7.4.7.1).

Usually the comparisons are presented in a table, like this one for the example with frying donuts :

                x̅ i − x̅ j   Critical q      Standardized   95% Conf Interval   Signif
                             q(α, r, df W )  error          for μ i − μ j       at 0.05?
Fat 1 − Fat 2   −13          3.9597          4.1008         (−29.2, 3.2)        no
Fat 1 − Fat 3    −4          3.9597          4.1008         (−20.2, 12.2)       no
Fat 1 − Fat 4    10          3.9597          4.1008         (−6.2, 26.2)        no
Fat 2 − Fat 3     9          3.9597          4.1008         (−7.2, 25.2)        no
Fat 2 − Fat 4    23          3.9597          4.1008         (6.8, 39.2)         YES
Fat 3 − Fat 4    14          3.9597          4.1008         (−2.2, 30.2)        no

How do you read the table, and how was it constructed? Look first at the rows. Each row compares one pair of treatments.

If you have r treatments, there will be r ( r −1)/2 pairs of means. The “/2” part comes because there’s no need to compare Fat 1 to Fat 2 and then Fat 2 to Fat 1. If Fat 1 is absorbed less than Fat 2, then Fat 2 is absorbed more than Fat 1 and by the same amount.

Now look at the columns. I’ll work through all the columns of the first row with you, and you can interpret the others in the same way.

  • The row heading tells you which treatments are being compared in this row , and the direction of comparison.
  • The next column gives the point estimate of difference , which is nothing more than the difference of the two sample means. The sample means of Fat 1 and Fat 2 were 72 and 85, so the difference is −13: the sample average of Fat 1 was 13 g less fat absorbed than the sample average of Fat 2.

For this experiment, we had four treatments and df W from the ANOVA table was 20, so we need q(0.05, 4, 20). Your textbook may have a table of critical values for the studentized range, or you can look up q in an online table such as the one at the end of Abdi and Williams 2010 [full citation in “References”, below] , or find it with an online calculator like Lowry 2001a [full citation in “References”, below] . (Most textbooks don’t have a table of q, and the TI calculators can’t compute it.)

Different sources give slightly different critical values of q, I suspect because q is extremely difficult to compute. One value I found was q(0.05,4,20) = 3.9597.

In an experiment with unequal sample sizes, the standardized error would vary for comparing different pairs of treatments. But in this experiment, every treatment has six data points, and so the standardized error is the same for every pair of means:

√( (MS W /2)·(1/6+1/6) ) = √( (100.9/2)·(2/6) ) = 4.1008

The confidence interval for the difference between Fat 1 and Fat 2 goes from a negative to a positive, so it does include zero. That means the two fats might have the same or different absorption, so you can’t say whether there’s a difference.

Caution : It’s generally best not to say that there is no significant difference. Even though that’s literally true, it’s easily misinterpreted to mean that the absorption of the two fats is the same, and you don’t know that. It might be, and it might not be. Stick to neutral language .

On the other hand, when the endpoints of the confidence interval are both positive or both negative, then 0 is not in the interval and we reject the null hypothesis of equality. In this table, only Fats 2 and 4 have a significant difference.

Interpretation : Fats 2 and 4 are not equally absorbed in frying donuts, and we’re 95% confident that a batch of 24 donuts absorbs 6.8 g to 30.2 g more of Fat 2 than Fat 4.
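If you want to verify these intervals yourself, the critical q is available in scipy as the studentized_range distribution. Here is a sketch for the Fat 2 − Fat 4 comparison, with MS W, df W, and the sample means taken from the tables above:

```python
# Recompute the Tukey interval for Fat 2 - Fat 4 from first principles.
from math import sqrt
from scipy.stats import studentized_range

r, df_w = 4, 20              # number of treatments, within-groups df
ms_w, n = 100.9, 6           # within-groups mean square, per-group sample size

q = studentized_range.ppf(0.95, r, df_w)   # critical q(0.05, 4, 20), ~3.96
se = sqrt((ms_w / 2) * (1 / n + 1 / n))    # standardized error, ~4.10

diff = 85 - 62                              # sample means of Fat 2 and Fat 4
low, high = diff - q * se, diff + q * se
print(f"CI: ({low:.1f}, {high:.1f})")       # about (6.8, 39.2): excludes zero
```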

It’s possible to make more complicated comparisons. For instance, with a control group and two treatments you might compare the mean of the control group to the average of the means of the two treatments. Any kind of linear comparison can be done using a procedure developed by Henry Scheffé. A good brief explanation of Scheffé’s method is at NIST 2012 [full citation in “References”, below] section 7.4.7.2.

Tukey’s method is best when you are simultaneously comparing all pairs of means. If you have pre-selected a subset of means to compare, the Bonferroni method ( NIST 2012 [full citation in “References”, below] section 7.4.7.3) may be better.

Example 2: Stock Market

5-year Rates of Return

     Financial  Energy  Utilities
     10.76      12.72   11.88
     15.05      13.91    5.86
     17.01       6.43   13.46
      5.07      11.19    9.90
     19.50      18.79    3.95
      8.16      20.73    3.44
     10.38       9.60    7.11
      6.75      17.40   15.70
x̅:  11.585     13.846    8.913
s:    5.124      4.867    4.530

source: morningstar.com via [full citation at https://BrownMath.com/swt/sources.htm#so_Sullivan2011] page C–30 (on CD)

A stock analyst randomly selected eight stocks in each of three industries and compiled the five-year rate of return for each stock. The analyst would like to know whether any of the industries have a different rate of return from the others, at the 0.05 significance level.

Solution : The hypotheses are

H 0 : μ F = μ E = μ U , all three industries have the same average rate of return

H 1 : the industries don’t all have the same average rate of return

You can use a normal probability plot to assess normality for each sample; see MATH200A Program part 4 . The standard deviations of the three samples are fairly close together, so the requirements are met.

Here is the ANOVA table:

Source                         SS        df  MS       F     p
Between groups (or “Factor”)    97.5931   2  48.7965  2.08  0.1502
Within groups (or “Error”)     493.2577  21  23.4885
Total                          590.8508  23

The F statistic is only 2.08, so the variation between groups is only about double the variation within groups. The high p-value makes you fail to reject H 0 and you cannot reach a conclusion about differences between average rates of returns for the three industries.

Since you failed to reject H 0 in the initial ANOVA test, you can’t do any sort of post-hoc analysis and look for differences between any particular pairs of means. (Well, you can , but you know in advance that all of the intervals will include zero, meaning that you don’t know whether any particular sector has a different return from any other sector or not.)

Example 3: CRT Lifetimes

        Lifetime, hr               x̅    s
Type A  407  411  409              409  2.0
Type B  404  406  408  405  402    405  2.2
Type C  410  408  406  408         408  1.6

source: [full citation in “References”, below], pp 378–379

A company makes three types of high-performance CRTs. A random sample finds lifetimes shown in the table at right. At the 0.05 level, is there a difference in the average lifetimes of the three types?

Solution : Your hypotheses are

H 0 : μ A = μ B = μ C , the three types have equal mean lifetime

H 1 : the three types don’t all have the same mean lifetime

Excel or the TI-83/84 gives you this ANOVA table:

Source                         SS  df  MS  F     p
Between groups (or “Factor”)   36   2  18  4.50  0.0442
Within groups (or “Error”)     36   9   4
Total                          72  11

p<α, so you reject H 0 and accept H 1 , concluding that the three types don’t all have the same mean lifetime.

Since you were able to reject the null hypothesis, you can proceed with post-hoc analysis to determine which means are different and the size of the difference. Here is the table:

                  x̅ i − x̅ j   Critical q      Standardized   95% Conf Interval   Signif
                               q(α, r, df W )  error          for μ i − μ j       at 0.05?
Type A − Type B    4           3.9508          1.0328         (−0.1, 8.1)         no
Type A − Type C    1           3.9508          1.0801         (−3.3, 5.3)         no
Type B − Type C   −3           3.9508          0.9487         (−6.7, 0.7)         no

This result might surprise you: although the three means aren’t all equal, you can’t say that any two of the means are unequal. But when you look more closely at the numbers, this doesn’t seem quite so unreasonable.

First, look at the p-value in the ANOVA table: 0.0442 is below 0.05, yes, but it’s not very far below. There’s almost a 4½% chance that we’re committing a Type I error in rejecting H 0 . Next, look at the confidence interval μ A −μ B . While the interval does include 0, it’s extremely lopsided and almost doesn’t include 0.

Though we’re used to thinking of significance as “either it is or it isn’t”, there are cases where the decision is a close one, and this is one of those cases. And the confidence intervals are computed by a different method than the significance test, using a different distribution. Here again, the decision is a close one. So what we have is two close decisions, based on different computations, one falling slightly on one side of the line and the other falling slightly on the other side of the line. It’s a good reminder that in statistics we’re dealing with probabilities, not certainties.

Appendix (The Hard Stuff)

The following sections are for students who want to know more than just the bare bones of how to do a 1-way ANOVA test.

Why Not Just Pick Two Means and Do a t Test?

Remember that you have to set up hypotheses before you know the data. Before you’ve actually fried the donuts, you have no reason to expect any particular outcome. Specifically, until you have the data you have no reason to think Fats 2 and 4 are any more different than Fats 1 and 4, or any other pair.

Why can’t you collect the data and then select your hypotheses? Because that can put significance on a chance event. For example, a golfer hits a ball and it lands on a particular tuft of grass. The probability of landing on that particular tuft is extremely small, so there’s something different about that particular tuft, right? Obviously not! It’s a logical fallacy to decide what to test after you already have the data.

So if you want to use 2-sample t tests to find differences among four fats, you would have to test every pair of fats: 1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4, 3 and 4. That’s six hypotheses in all.

Well, why not do a 0.05 significance test on each pair of means? Remember what a 0.05 significance level means: you’re willing to accept a 5% chance of a Type I error, rejecting H 0 when it’s actually true. But if you test six 0.05 hypotheses on the same set of data, you’re much more likely to commit a Type I error. How much more likely? Well, for each hypothesis there’s a 95% chance of escaping a Type I error, but the probability of escaping a Type I error six times in a row is 0.95⁶ = 0.7351. 1−0.7351 = 0.2649, so if you test all six pairs at the 0.05 level, you have more than a one-in-four chance of getting a false positive , finding a difference between two fats when there’s actually no difference.

Prob. of Type I Error

groups  pairs  α = 0.05  α = 0.01
3        3     0.1426    0.0297
4        6     0.2649    0.0585
5       10     0.4013    0.0956
6       15     0.5367    0.1399

In general, if you have r treatments, there are r(r−1)/2 pairs of means to compare. If you test each pair at significance level α, the overall probability of a Type I error is 1 − (1−α)^(r(r−1)/2). The table above shows the effective α for various numbers of treatments when the nominal α is 0.05 or 0.01. You can see that testing multiple hypotheses increases your α dramatically. Even with just three treatments, the effective α is almost three times the nominal α. This is clearly unacceptable.
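The table's entries come straight from that formula; a few lines of Python reproduce them:

```python
# Effective probability of at least one Type I error across all pairwise tests.
for alpha in (0.05, 0.01):
    for groups in (3, 4, 5, 6):
        pairs = groups * (groups - 1) // 2
        effective = 1 - (1 - alpha) ** pairs
        print(f"alpha = {alpha}: {groups} groups, {pairs} pairs -> {effective:.4f}")
```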

Why not just lower your alpha? Because as you lower your α you increase your β, the chance of a Type II error. β represents the probability of a false negative , failing to find a difference in fats when there actually is a difference. This, too, is unacceptable.

So you have to find a way to test all the pairs of means at the same time, in one test. The solution is an extension of the t test to multiple samples, and it’s called ANOVA. ( If you have only two treatments, ANOVA computes the same p-value as a two-sample t test, but at the cost of extra effort.)

How ANOVA Works

How does the ANOVA procedure compute a p-value? This section shows you the formulas and carries through the computations for the example with fat for frying donuts .

Remember, long ago in a galaxy called Descriptive Statistics, how the variance was defined: find the mean, then for each data point take the square of its difference from the mean. Add up all those squares, and you have SS( x ), the sum of squared deviations in x . The variance was SS( x ) divided by the degrees of freedom n −1, so it was a kind of average or mean squared deviation. You probably learned the shortcut computational formulas:

SS( x ) = ∑ x ² − (∑ x )²/ n or SS( x ) = ∑ x ² − n x̅ ²

s ² = MS( x ) = SS( x )/ df where df  =  n −1

In 1-way ANOVA, we extend those concepts a bit. First you partition SS( x ) into between-treatments and within-treatments parts, SS B and SS W . Then you compute the mean square deviations:

  • MS B is called the between-treatments mean square, between-groups variance, or factor MS . It measures the variability associated with the different treatment levels or different values of the factor.
  • MS W is called the within-treatments mean square, within-group variance, pooled variance, or error MS . It measures the variability that is not associated with the different treatments.

Finally you divide the two to obtain your test statistic, F = MS B /MS W , and you look up the p-value in a table of the F distribution.

(The F distribution is named after “the celebrated R.A. Fisher” ( Kuzma & Bohnenblust 2005 [full citation at https://BrownMath.com/swt/sources.htm#so_Kuzma2005] , 176). You may have already seen the F distribution in computing a different ratio of variances, as part of testing the variances of two populations for equality.)

There are several ways to compute the variability, but they all come up with the same answers and this method in Spiegel and Stephens 1999 [full citation in “References”, below] pages 367–368 is as easy as any:

Between groups (or “Factor”):  SS B = ∑ n j x̅ j ² − N x̅ ²,  df B = r − 1,  MS B = SS B / df B ,  F = MS B / MS W

Within groups (or “Error”)*:  SS W = SS tot − SS B ,  df W = N − r,  MS W = SS W / df W

Total*:  SS tot = ∑ x ² − N x̅ ²,  df tot = N − 1

* Or, if you know the standard deviations of the samples,
SS W = ∑ ( n j − 1) s j ²  and  SS tot = SS B + SS W

  • r is the number of treatments.
  • n j , x̅ j , s j for each treatment are the sample size, sample mean, and sample standard deviation.
  • N = ∑ n j is the total number of data points, and x̅ = ∑ n j x̅ j / N is the grand mean.

You begin with the treatment means x̅ j ={72, 85, 76, 62} and the overall mean x̅ =73.75, then compute

SS B = (6×72²+6×85²+6×76²+6×62²) − 24×73.75² = 1636.5

MS B = 1636.5 / 3 = 545.5

The next step depends on whether you know the standard deviations s j of the samples. If you don’t, then you jump to the third row of the table to compute the overall sum of squares:

∑ x ² = 64² + 72² + 68² + … + 70² + 68² = 134192

SS tot = ∑ x ² − N x̅ ² = 134192 − 24×73.75² = 3654.5

Then you find SS W by subtracting the “between” sum of squares SS B from the overall sum of squares SS tot :

SS W = SS tot −SS B = 3654.5−1636.5 = 2018.0

MS W = 2018.0 / 20 = 100.9

Now you’re almost there. You want to know whether the variability between treatments, MS B , is greater than the variability within treatments, MS W . If it’s enough greater, then you conclude that there is a real difference between at least some of the treatment means and therefore that the factor has a real effect. To determine this, divide

F = MS B /MS W  = 5.41

This is the F statistic. The F distribution is a one-tailed distribution that depends on both degrees of freedom, df B and df W .

At long last, you look up F=5.41 with 3 and 20 degrees of freedom, and you find a p-value of 0.0069. The interpretation is the usual one: there’s only a 0.0069 chance of getting an F statistic greater than 5.41 (or higher variability between treatments relative to the variability within treatments) if there is actually no difference between treatments. Since the p-value is less than α, you conclude that there is a difference.
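The same hand computation transcribes almost line for line into Python (data from the donut table; scipy is used only for the final p-value lookup):

```python
# Hand computation of the donut ANOVA, following the layout above.
from scipy.stats import f as f_dist

fats = [
    [64, 72, 68, 77, 56, 95],  # Fat 1
    [78, 91, 97, 82, 85, 77],  # Fat 2
    [75, 93, 78, 71, 63, 76],  # Fat 3
    [55, 66, 49, 64, 70, 68],  # Fat 4
]
r = len(fats)                               # 4 treatments
N = sum(len(g) for g in fats)               # 24 data points
grand_mean = sum(sum(g) for g in fats) / N  # 73.75

ss_b = sum(len(g) * (sum(g) / len(g)) ** 2 for g in fats) - N * grand_mean**2
ss_tot = sum(x * x for g in fats for x in g) - N * grand_mean**2
ss_w = ss_tot - ss_b                        # 3654.5 - 1636.5 = 2018.0

ms_b = ss_b / (r - 1)                       # 545.5
ms_w = ss_w / (N - r)                       # 100.9
f_stat = ms_b / ms_w                        # about 5.41

p_value = f_dist.sf(f_stat, r - 1, N - r)   # upper-tail area, about 0.0069
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```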

Estimating Individual Treatment Means

Usually you’re interested in the contrast between two treatments, but you can also estimate the population mean for an individual treatment. You do use a t interval, as you would when you have only one sample, but the standard error and degrees of freedom are different ( NIST 2012 [full citation in “References”, below] section 7.4.3.6).

To compute a confidence interval on an individual mean for the j th treatment, use

standard error = √( MS W / n j )

Therefore the margin of error, which is the half-width of the confidence interval, is

E = t(α/2, df W ) · √( MS W / n j )

Example : Refer back to the fats for frying donuts . Estimate the population mean for Fat 2 with 95% confidence. In other words, if you fried a great many batches of donuts in Fat 2, how much fat per batch would be absorbed, on average?

Solution : First, marshal your data:

sample mean for Fat 2: x̅ 2 = 85

sample size: n 2 = 6

degrees of freedom: df W = 20 (from the ANOVA table )

MS W = 100.9 (also from the table)

TI-83 or TI-84 users , please see an easy procedure below .

Computation by Hand

Begin by finding the critical t. Since 1−α = 0.95, α/2 = 0.025. You therefore need t(0.025,20). You can find this from a table:

t(0.025,20) = 2.0860

Next, find the standard error. This is

standard error = √( MS W / n j ) = √(100.9/6) = 4.1008

Now you’re ready to finish the confidence interval. The margin of error is

E = t(α/2, df W ) · √( MS W / n j ) = 2.0860 × 4.1008 = 8.5541

Therefore the confidence interval is

μ 2 = 85 ± 8.6 g (95% confidence)

76.4 g ≤ μ 2 ≤ 93.6 g (95% confidence)

Conclusion : You’re 95% confident that the true mean amount of fat absorbed by a batch of donuts fried in Fat 2 is between 76.4 g and 93.6 g.
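The hand computation above is easy to reproduce in Python (scipy supplies the critical t; all the other values come from the ANOVA table):

```python
# 95% confidence interval for the Fat 2 population mean.
from math import sqrt
from scipy.stats import t

xbar, n_j = 85, 6         # Fat 2 sample mean and sample size
ms_w, df_w = 100.9, 20    # within-groups MS and df from the ANOVA table

t_crit = t.ppf(1 - 0.05 / 2, df_w)   # t(0.025, 20) = 2.0860
se = sqrt(ms_w / n_j)                # standard error, about 4.1008
margin = t_crit * se                 # about 8.55

print(f"{xbar - margin:.1f} g to {xbar + margin:.1f} g")  # 76.4 g to 93.6 g
```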

TI-83/84 Procedure

Your TI calculator is set up to do the necessary calculations, but there’s one glitch because the degrees of freedom is not based on the size of the individual sample, as it is in a regular t interval. So you have to “spoof” the calculator as follows.

Press [ STAT ] [ ◄ ] [ 8 ] to bring up the TInterval screen. First I’ll tell you what to enter; then I’ll explain why.

  • x̅ : mean of the treatment sample, here 85
  • Sx: √( MS W ·( df W +1)/ n j ), here √(100.9×21/6)
  • n : df W +1, here 21
  • C-Level: as specified in the problem, here .95

Now, what’s up with n and Sx? Well, the calculator uses n to compute degrees of freedom for critical t as n −1. You want degrees of freedom to be df W , so you lie to the calculator and enter the value of n as df W +1 (20+1 = 21).

But that creates a new problem. The calculator also divides s by √ n to come up with the standard error. But you want it to use n j (6) and not your fake n (21). So you have to multiply MS W by df W +1 and divide by n j to trick the calculator into using the value you actually want.

By the way, why is MS W inside the square root sign? Because the calculator wants a standard deviation, but MS W is a variance. As you know, standard deviation is the square root of variance.

η²: Strength of Association

Lowry 1988 [full citation in “References”, below] chapter 14 part 2 mentions a measure that is usually neglected in ANOVA: η². (η is the Greek letter eta, which rhymes with beta.)

η² = SS B /SS tot , the ratio of sum of squares between groups to total sum of squares. For the donut-frying example ,

η² = SS B /SS tot = 1636.5 / 3654.5 = 0.45

What does this tell you? η² measures how much of the total variability in the dependent variable is associated with the variation in treatments. For the donut example, η² = 0.45 tells you that 45% of the variability in fat absorption among the batches is associated with the choice of fat.

What’s New

  • Updated links to references here and here .
  • 20 Oct 2020 : Improved rendering of square roots of formulas. Italicized variable names. Converted page from HTML 4.01 to HTML5.
  • (intervening changes suppressed)
  • 31 Jan 2009 : First publication.

Updates and new info: https://BrownMath.com/stat/


Module 10: Inference for Means

Hypothesis Test for a Difference in Two Population Means (1 of 2)

Learning Outcomes

  • Under appropriate conditions, conduct a hypothesis test about a difference between two population means. State a conclusion in context.

Using the Hypothesis Test for a Difference in Two Population Means

The general steps of this hypothesis test are the same as always. As expected, the details of the conditions for use of the test and the test statistic are unique to this test (but similar in many ways to what we have seen before.)

Step 1: Determine the hypotheses.

The hypotheses for a difference in two population means are similar to those for a difference in two population proportions. The null hypothesis, H 0 , is again a statement of “no effect” or “no difference.”

  • H 0 : μ 1 – μ 2 = 0, which is the same as H 0 : μ 1 = μ 2

The alternative hypothesis, H a , can be any one of the following.

  • H a : μ 1 – μ 2 < 0, which is the same as H a : μ 1 < μ 2
  • H a : μ 1 – μ 2 > 0, which is the same as H a : μ 1 > μ 2
  • H a : μ 1 – μ 2 ≠ 0, which is the same as H a : μ 1 ≠ μ 2

Step 2: Collect the data.

As usual, how we collect the data determines whether we can use it in the inference procedure. We have our usual two requirements for data collection.

  • Samples must be random to remove or minimize bias.
  • Samples must be representative of the populations in question.

We use this hypothesis test when the data meets the following conditions.

  • The two random samples are independent .
  • The variable is normally distributed in both populations . If this is not known, samples of more than 30 will have a difference in sample means that can be modeled adequately by the t-distribution. As we discussed in “Hypothesis Test for a Population Mean,” t-procedures are robust even when the variable is not normally distributed in the population. If checking normality in the populations is impossible, then we look at the distribution in the samples. If a histogram or dotplot of the data does not show extreme skew or outliers, we take it as a sign that the variable is not heavily skewed in the populations, and we use the inference procedure. (Note: This is the same condition we used for the one-sample t-test in “Hypothesis Test for a Population Mean.”)

Step 3: Assess the evidence.

If the conditions are met, then we calculate the t-test statistic. The t-test statistic has a familiar form.

[latex]T = \frac{(\text{Observed difference in sample means}) - (\text{Hypothesized difference in population means})}{\text{Standard error}}[/latex]

[latex]T = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}[/latex]

Since the null hypothesis assumes there is no difference in the population means, the expression (μ 1 – μ 2 ) is always zero.

As we learned in “Estimating a Population Mean,” the t-distribution depends on the degrees of freedom (df) . In the one-sample and matched-pair cases df = n – 1. For the two-sample t-test, determining the correct df is based on a complicated formula that we do not cover in this course. We will either give the df or use technology to find the df . With the t-test statistic and the degrees of freedom, we can use the appropriate t-model to find the P-value, just as we did in “Hypothesis Test for a Population Mean.” We can even use the same simulation.

Step 4: State a conclusion.

To state a conclusion, we follow what we have done with other hypothesis tests. We compare our P-value to a stated level of significance.

  • If the P-value ≤ α, we reject the null hypothesis in favor of the alternative hypothesis.
  • If the P-value > α, we fail to reject the null hypothesis. We do not have enough evidence to support the alternative hypothesis.

As always, we state our conclusion in context, usually by referring to the alternative hypothesis.

“Context and Calories”

Does the company you keep impact what you eat? This example comes from an article titled “Impact of Group Settings and Gender on Meals Purchased by College Students” (Allen-O’Donnell, M., T. C. Nowak, K. A. Snyder, and M. D. Cottingham, Journal of Applied Social Psychology 49(9), 2011, onlinelibrary.wiley.com/doi/10.1111/j.1559-1816.2011.00804.x/full) . In this study, researchers examined this issue in the context of gender-related theories in their field. For our purposes, we look at this research more narrowly.

Step 1: Stating the hypotheses.

In the article, the authors make the following hypothesis. “The attempt to appear feminine will be empirically demonstrated by the purchase of fewer calories by women in mixed-gender groups than by women in same-gender groups.” We translate this into a simpler and narrower research question: Do women purchase fewer calories when they eat with men compared to when they eat with women?

Here the two populations are “women eating with women” (population 1) and “women eating with men” (population 2). The variable is the calories in the meal. We test the following hypotheses at the 5% level of significance.

The null hypothesis is always H 0 : μ 1 – μ 2 = 0, which is the same as H 0 : μ 1 = μ 2 .

The alternative hypothesis H a : μ 1 – μ 2 > 0, which is the same as H a : μ 1 > μ 2 .

Here μ 1 represents the mean number of calories ordered by women when they were eating with other women, and μ 2 represents the mean number of calories ordered by women when they were eating with men.

Note: It does not matter which population we label as 1 or 2, but once we decide, we have to stay consistent throughout the hypothesis test. Since we expect the number of calories to be greater for the women eating with other women, the difference is positive if “women eating with women” is population 1. If you prefer to work with positive numbers, choose the group with the larger expected mean as population 1. This is a good general tip.

Step 2: Collect Data.

As usual, there are two major things to keep in mind when considering the collection of data.

  • Samples need to be representative of the population in question.
  • Samples need to be random in order to remove or minimize bias.

Representative Samples?

The researchers state their hypothesis in terms of “women.” We did the same. But the researchers gathered data by watching people eat at the HUB Rock Café II on the campus of Indiana University of Pennsylvania during the Spring semester of 2006. Almost all of the women in the data set were white undergraduates between the ages of 18 and 24, so there are some definite limitations on the scope of this study. These limitations will affect our conclusion (and the specific definition of the population means in our hypotheses.)

Random Samples?

The observations were collected on February 13, 2006, through February 22, 2006, between 11 a.m. and 7 p.m. We can see that the researchers included both lunch and dinner. They also made observations on all days of the week to ensure that weekly customer patterns did not confound their findings. The authors state that “since the time period for observations and the place where [they] observed students were limited, the sample was a convenience sample.” Despite these limitations, the researchers conducted inference procedures with the data, and the results were published in a reputable journal. We will also conduct inference with this data, but we also include a discussion of the limitations of the study with our conclusion. The authors did this, also.

Do the data meet the conditions for use of a t-test?

The researchers reported the following sample statistics.

  • In a sample of 45 women dining with other women, the average number of calories ordered was 850, and the standard deviation was 252.
  • In a sample of 27 women dining with men, the average number of calories ordered was 719, and the standard deviation was 322.

One of the samples has fewer than 30 women. We need to make sure the distribution of calories in this sample is not heavily skewed and has no outliers, but we do not have access to a spreadsheet of the actual data. Since the researchers conducted a t-test with this data, we will assume that the conditions are met. This includes the assumption that the samples are independent.

To compute the t-test statistic, make sure sample 1 corresponds to population 1. Here our population 1 is “women eating with other women,” so x̅ 1 = 850, s 1 = 252, n 1 = 45, and so on.

[latex]T = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{850 - 719}{\sqrt{\frac{252^2}{45} + \frac{322^2}{27}}} \approx \frac{131}{72.47} \approx 1.81[/latex]

Using technology, we determined that the degrees of freedom are about 45 for this data. To find the P-value, we use our familiar simulation of the t-distribution. Since the alternative hypothesis is a “greater than” statement, we look for the area to the right of T = 1.81. The P-value is 0.0385.
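As a check of this step, the statistic and P-value follow directly from the reported summary statistics. A sketch in Python, using df = 45 as given by technology:

```python
# T statistic and one-tailed P-value from the reported summary statistics.
from math import sqrt
from scipy.stats import t

x1, s1, n1 = 850, 252, 45   # women eating with other women
x2, s2, n2 = 719, 322, 27   # women eating with men

se = sqrt(s1**2 / n1 + s2**2 / n2)   # standard error, about 72.47
t_stat = (x1 - x2) / se              # about 1.81

p_value = t.sf(t_stat, 45)           # right-tail area with df = 45
print(f"T = {t_stat:.2f}, P = {p_value:.4f}")  # about 0.0385
```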

[Simulation plot: the area under the t-model to the left of T = 1.81 is 0.9615; the area to the right is 0.0385.]

Generic Conclusion

The hypotheses for this test are H 0 : μ 1 – μ 2 = 0 and H a : μ 1 – μ 2 > 0. Since the P-value is less than the significance level (0.0385 < 0.05), we reject H 0 and accept H a .

Conclusion in context

At Indiana University of Pennsylvania, the mean number of calories ordered by undergraduate women eating with other women is greater than the mean number of calories ordered by undergraduate women eating with men (P-value = 0.0385).

Comment about Conclusions

In the conclusion above, we did not generalize the findings to all women. Since the samples included only undergraduate women at one university, we included this information in our conclusion. But our conclusion is a cautious statement of the findings. The authors see the results more broadly in the context of theories in the field of social psychology. In the context of these theories, they write, “Our findings support the assertion that meal size is a tool for influencing the impressions of others. For traditional-age, predominantly White college women, diminished meal size appears to be an attempt to assert femininity in groups that include men.” This viewpoint is echoed in the following summary of the study for the general public on National Public Radio (npr.org).

  • Both men and women appear to choose larger portions when they eat with women, and both men and women choose smaller portions when they eat in the company of men, according to new research published in the Journal of Applied Social Psychology . The study, conducted among a sample of 127 college students, suggests that both men and women are influenced by unconscious scripts about how to behave in each other’s company. And these scripts change the way men and women eat when they eat together and when they eat apart.

Should we be concerned that the findings of this study are generalized in this way? Perhaps. But the authors of the article address this concern by including the following disclaimer with their findings: “While the results of our research are suggestive, they should be replicated with larger, representative samples. Studies should be done not only with primarily White, middle-class college students, but also with students who differ in terms of race/ethnicity, social class, age, sexual orientation, and so forth.” This is an example of good statistical practice. It is often very difficult to select truly random samples from the populations of interest. Researchers therefore discuss the limitations of their sampling design when they discuss their conclusions.

In the following activities, you will have the opportunity to practice parts of the hypothesis test for a difference in two population means. On the next page, the activities focus on the entire process and also incorporate technology.

Concepts in Statistics. Provided by: Open Learning Initiative. Located at: http://oli.cmu.edu. License: CC BY: Attribution.

Multiple Hypothesis Testing

Roger Higdon, Seattle Children’s Research Institute

Synonyms: Multiple comparisons; Multiple testing

The multiple hypothesis testing problem occurs when a number of individual hypothesis tests are considered simultaneously. In this case, the significance or the error rate of individual tests no longer represents the error rate of the combined set of tests. Multiple hypothesis testing methods correct error rates for this issue.

Characteristics

In conventional hypothesis testing, the level of significance or type I error rate (the probability of wrongly rejecting the null hypothesis) for a single test is less than the probability of making an error on at least one test in a multiple hypothesis testing situation. While this is typically not an issue when testing a small number of preplanned hypotheses, the likelihood of making false discoveries is greatly increased when there are large numbers of unplanned or exploratory tests conducted based on the significance level or type I error rate from a single test. Therefore, it is...

Higdon, R. (2013). Multiple Hypothesis Testing. In: Dubitzky, W., Wolkenhauer, O., Cho, K.-H., Yokota, H. (eds) Encyclopedia of Systems Biology. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-9863-7_1211


Inference for Comparing 2 Population Means (HT for 2 Means, independent samples)

More of the good stuff! We will need to know how to label the null and alternative hypothesis, calculate the test statistic, and then reach our conclusion using the critical value method or the p-value method.

The Test Statistic for a Test of 2 Means from Independent Samples:

[latex]t = \displaystyle \frac{(\bar{x_1} - \bar{x_2}) - (\mu_1 - \mu_2)}{\sqrt{\displaystyle \frac{s_1^2}{n_1} + \displaystyle \frac{s_2^2}{n_2}}}[/latex]

What the different symbols mean:

[latex]n_1[/latex] is the sample size for the first group

[latex]n_2[/latex] is the sample size for the second group

[latex]df[/latex], the degrees of freedom, is the smaller of [latex]n_1 - 1[/latex] and [latex]n_2 - 1[/latex]

[latex]\mu_1[/latex] is the population mean from the first group

[latex]\mu_2[/latex] is the population mean from the second group

[latex]\bar{x_1}[/latex] is the sample mean for the first group

[latex]\bar{x_2}[/latex] is the sample mean for the second group

[latex]s_1[/latex] is the sample standard deviation for the first group

[latex]s_2[/latex] is the sample standard deviation for the second group

[latex]\alpha[/latex] is the significance level , usually given within the problem, or if not given, we assume it to be 5% or 0.05

Assumptions when conducting a Test for 2 Means from Independent Samples:

  • We do not know the population standard deviations, and we do not assume they are equal
  • The two samples or groups are independent
  • Both samples are simple random samples
  • Both populations are Normally distributed OR both samples are large ([latex]n_1 > 30[/latex] and [latex]n_2 > 30[/latex])

Steps to conduct the Test for 2 Means from Independent Samples:

  • Identify all the symbols listed above (all the stuff that will go into the formulas). This includes [latex]n_1[/latex] and [latex]n_2[/latex], [latex]df[/latex], [latex]\mu_1[/latex] and [latex]\mu_2[/latex], [latex]\bar{x_1}[/latex] and [latex]\bar{x_2}[/latex], [latex]s_1[/latex] and [latex]s_2[/latex], and [latex]\alpha[/latex]
  • Identify the null and alternative hypotheses
  • Calculate the test statistic, [latex]t = \displaystyle \frac{(\bar{x_1} - \bar{x_2}) - (\mu_1 - \mu_2)}{\sqrt{\displaystyle \frac{s_1^2}{n_1} + \displaystyle \frac{s_2^2}{n_2}}}[/latex]
  • Find the critical value(s) OR the p-value OR both
  • Apply the Decision Rule
  • Write up a conclusion for the test

Example 1: Study on the effectiveness of stents for stroke patients [1]

In this study, researchers randomly assigned stroke patients to two groups: one received the current standard care (control) and the other received stent surgery in addition to the standard care (stent treatment). If the stents work, the treatment group should have a lower average disability score. Do the results give convincing statistical evidence that the stent treatment reduces the average disability from stroke?

  | Stent Treatment (Group 1) | Control (Group 2)
Mean Disability Score | 2.26 | 3.23
Standard Deviation of Disability Score | 1.78 | 1.78
Sample Size, n | 98 | 93

Since we are being asked for convincing statistical evidence, a hypothesis test should be conducted. In this case, we are dealing with averages from two samples or groups (the patients with stent treatment and patients receiving the standard care), so we will conduct a Test of 2 Means.

  • [latex]n_1 = 98[/latex] is the sample size for the first group
  • [latex]n_2 = 93[/latex] is the sample size for the second group
  • [latex]df[/latex], the degrees of freedom, is the smaller of [latex]98 - 1 = 97[/latex] and [latex]93 - 1 = 92[/latex], so [latex]df = 92[/latex]
  • [latex]\bar{x_1} = 2.26[/latex] is the sample mean for the first group
  • [latex]\bar{x_2} = 3.23[/latex] is the sample mean for the second group
  • [latex]s_1 = 1.78[/latex] is the sample standard deviation for the first group
  • [latex]s_2 = 1.78[/latex] is the sample standard deviation for the second group
  • [latex]\alpha = 0.05[/latex] (we were not told a specific value in the problem, so we are assuming it is 5%)
  • One additional assumption we extend from the null hypothesis is that [latex]\mu_1 - \mu_2 = 0[/latex]; this means that in our formula, those variables cancel out
  • [latex]H_{0}: \mu_1 = \mu_2[/latex]
  • [latex]H_{A}: \mu_1 < \mu_2[/latex]
  • [latex]t = \displaystyle \frac{(\bar{x_1} - \bar{x_2}) - (\mu_1 - \mu_2)}{\sqrt{\displaystyle \frac{s_1^2}{n_1} + \displaystyle \frac{s_2^2}{n_2}}} = \displaystyle \frac{(2.26 - 3.23) - 0}{\sqrt{\displaystyle \frac{1.78^2}{98} + \displaystyle \frac{1.78^2}{93}}} = -3.76[/latex]
  • StatDisk: We can conduct this test using StatDisk, which will also compute the test statistic. From the main menu, click Analysis, then Hypothesis Testing, then Mean Two Independent Samples. Enter the 0.05 significance level and choose the [latex]<[/latex] option for the alternative hypothesis. Enter the sample size, mean, and standard deviation for each group, and make sure that unequal variances is selected. Click Evaluate. The test statistic is reported in the Step 3 display, along with the P-Value of 0.00011.
  • Applying the Decision Rule: We now compare this to our significance level, which is 0.05. If the p-value is less than or equal to the alpha level, we have enough evidence for our claim; otherwise we do not. Here, [latex]p-value = 0.00011[/latex], which is definitely smaller than [latex]\alpha = 0.05[/latex], so we have enough evidence for the alternative hypothesis…but what does this mean?
  • Conclusion: Because our p-value of [latex]0.00011[/latex] is less than our [latex]\alpha[/latex] level of [latex]0.05[/latex], we reject [latex]H_{0}[/latex]. We have convincing statistical evidence that the stent treatment reduces the average disability from stroke.
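
If you would rather check these numbers in R than in StatDisk, here is a minimal sketch built from the summary statistics above (the variable names are ours). With the conservative [latex]df = 92[/latex] used in this section, pt() returns a one-sided p-value slightly larger than StatDisk's 0.00011, which is based on a different degrees-of-freedom formula; the conclusion is identical.

> # Example 1 from the summary statistics (conservative df = min(n1, n2) - 1)
> x1.bar <- 2.26; s1 <- 1.78; n1 <- 98   # stent treatment group
> x2.bar <- 3.23; s2 <- 1.78; n2 <- 93   # control group
> se <- sqrt(s1^2/n1 + s2^2/n2)          # standard error of the difference
> t.stat <- (x1.bar - x2.bar)/se         # hypothesized difference is 0; about -3.76
> pt(t.stat, df = min(n1, n2) - 1)       # one-sided p-value for the < alternative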

Example 2: Home Run Distances

In 1998, Sammy Sosa and Mark McGwire (2 players in Major League Baseball) were on pace to set a new home run record. At the end of the season McGwire ended up with 70 home runs, and Sosa ended up with 66. The home run distances were recorded and compared (sometimes a player’s home run distance is used to measure their “power”). Do the results give convincing statistical evidence that the home run distances are different from each other? Who would you say “hit the ball farther” in this comparison?

  | McGwire (Group 1) | Sosa (Group 2)
Mean Home Run Distance | 418.5 | 404.8
Standard Deviation of Home Run Distance | 45.5 | 35.7
Sample Size, n | 70 | 66

Since we are being asked for convincing statistical evidence, a hypothesis test should be conducted. In this case, we are dealing with averages from two samples or groups (the home run distances), so we will conduct a Test of 2 Means.

  • [latex]n_1 = 70[/latex] is the sample size for the first group
  • [latex]n_2 = 66[/latex] is the sample size for the second group
  • [latex]df[/latex], the degrees of freedom, is the smaller of [latex]70 - 1 = 69[/latex] and [latex]66 - 1 = 65[/latex], so [latex]df = 65[/latex]
  • [latex]\bar{x_1} = 418.5[/latex] is the sample mean for the first group
  • [latex]\bar{x_2} = 404.8[/latex] is the sample mean for the second group
  • [latex]s_1 = 45.5[/latex] is the sample standard deviation for the first group
  • [latex]s_2 = 35.7[/latex] is the sample standard deviation for the second group
  • [latex]H_{0}: \mu_1 = \mu_2[/latex]
  • [latex]H_{A}: \mu_1 \neq \mu_2[/latex]
  • [latex]t = \displaystyle \frac{(\bar{x_1} - \bar{x_2}) - (\mu_1 - \mu_2)}{\sqrt{\displaystyle \frac{s_1^2}{n_1} + \displaystyle \frac{s_2^2}{n_2}}} = \displaystyle \frac{(418.5 - 404.8) - 0}{\sqrt{\displaystyle \frac{45.5^2}{70} + \displaystyle \frac{35.7^2}{66}}} = 1.96[/latex]
  • StatDisk: We can conduct this test using StatDisk, which will also compute the test statistic. From the main menu, click Analysis, then Hypothesis Testing, then Mean Two Independent Samples. Enter the 0.05 significance level and choose the [latex]\neq[/latex] option for the alternative hypothesis. Enter the sample size, mean, and standard deviation for each group, and make sure that unequal variances is selected. Click Evaluate. The test statistic is reported in the Step 3 display, along with the P-Value of 0.05221.
  • Applying the Decision Rule: We now compare this to our significance level, which is 0.05. If the p-value is less than or equal to the alpha level, we have enough evidence for our claim; otherwise we do not. Here, [latex]p-value = 0.05221[/latex], which is larger than [latex]\alpha = 0.05[/latex], so we do not have enough evidence for the alternative hypothesis…but what does this mean?
  • Conclusion: Because our p-value of [latex]0.05221[/latex] is larger than our [latex]\alpha[/latex] level of [latex]0.05[/latex], we fail to reject [latex]H_{0}[/latex]. We do not have convincing statistical evidence that the home run distances are different.
  • Follow-up commentary: But what does this mean? There actually was a difference, right? If we take McGwire's average and subtract Sosa's average, we get a difference of 13.7. What this result indicates is that the difference is not statistically significant; it could be due to random chance rather than something meaningful. Sample size also matters: with larger samples, the same difference might have reached statistical significance.
  • Adapted from the Skew The Script curriculum ( skewthescript.org ), licensed under CC BY-NC-Sa 4.0 ↵

Basic Statistics Copyright © by Allyn Leon is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.


Perspect Clin Res 7(2), Apr–Jun 2016

Common pitfalls in statistical analysis: The perils of multiple testing

Priya Ranganathan

Department of Anaesthesiology, Tata Memorial Centre, Mumbai, Maharashtra, India

C. S. Pramesh

1 Department of Surgical Oncology, Division of Thoracic Surgery, Tata Memorial Centre, Mumbai, Maharashtra, India

2 International Drug Development Institute, San Francisco, California, USA

3 Department of Biostatistics, Hasselt University, Hasselt, Belgium

Multiple testing refers to situations where a dataset is subjected to statistical testing multiple times - either at multiple time-points or through multiple subgroups or for multiple end-points. This amplifies the probability of a false-positive finding. In this article, we look at the consequences of multiple testing and explore various methods to deal with this issue.

INTRODUCTION

In a previous article, we discussed the alpha error rate (or false-positive error rate), which is the probability of falsely rejecting the null hypothesis.[ 1 ] In any study, when two or more groups are compared, there is always a chance of finding a difference between them just by chance. This is known as a Type 1 error, in contrast to a Type 2 error, which consists of failing to detect a difference that truly exists. Conventionally, the alpha error is set at 5% or less, which ensures that when we do find a difference between the groups, we can be at least 95% confident that this is a true difference and not a chance finding.

The 5% limit for alpha, known as the significance level of the study, is set for a single comparison between groups. When we compare treatment groups multiple times, the probability of finding a difference just by chance increases with the number of times we perform the comparison. In many clinical trials, a number of interim analyses are planned to occur during the course of the trial, with the final analysis taking place when all patients have been accrued and followed up for a minimum period. If all these interim (and final) analyses were performed at the 5% significance level, the overall probability of a Type 1 error would exceed the prespecified limit of 5%. It can be calculated that if two groups are compared 5 times, the probability of a false-positive finding is as high as 23%; if they are compared 20 times, the probability of finding a significant difference just by chance increases to 64%.[ 2 , 3 ] Fortunately, much statistical research has been devoted to this problem, and “group sequential designs” have been proposed to control the Type 1 error rate when the data of a trial need to be analyzed multiple times.
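
A quick R check reproduces these figures under the simplifying assumption that the comparisons are independent, using the family-wise false-positive probability 1 − (1 − 0.05)^k for k tests:

> k <- c(1, 5, 20)
> 1 - (1 - 0.05)^k   # probability of at least one false positive in k tests
[1] 0.0500000 0.2262191 0.6415141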

Another, more challenging, type of multiple testing occurs when authors try to salvage a negative study. If the primary endpoint does not show statistical significance, looking at multiple other (less important) comparisons quite often produces a “positive” result, especially if there are many such comparisons. Investigators can try to analyze different endpoints, among different subsets of patients, using different statistical tests, and so on, so the opportunity for multiplicity can be substantial.[ 2 , 4 ] One case in point is subset analyses, in which the treatments are compared among subsets of patients defined using prognostic features such as their gender, age, tumor location, stage, histology, and grade. If there were only three such binary factors, 2^3 = 8 subsets could be formed. If we were to compare the treatments among these 8 subsets, we would have one chance in three (33% probability) of observing a statistically significant ( P ≤ 0.05) treatment effect in one of them even if there was no true difference between the treatments. Worse still, if there was an overall statistically significant benefit ( P ≤ 0.05) in favor of one of the treatments, we would have a nine in ten chance (90% probability) of observing a benefit in favor of the other treatment in one of the subsets!

It is to avoid these serious problems that all intended comparisons should be fully prespecified in the research protocol, with appropriate adjustments for multiple testing. However, for retrospective studies, it is difficult to ascertain with certainty whether the analyses performed were actually thought of when the research idea was conceived or whether the performed analyses were mere data dredging.

HOW ARE ADJUSTMENTS MADE FOR MULTIPLE TESTING?

Two main techniques have been described for controlling the overall alpha error:

  • The family-wise error rate: This approach attempts to control the overall false-positive rate for all comparisons. A “family” is defined as a set of tests related to the same hypothesis.[ 2 ] Approaches for correcting the alpha error include the Bonferroni, Tukey, Hochberg, and Holm step-down methods. The Bonferroni correction consists of simply dividing the overall alpha level by the number of comparisons. For example, if 20 comparisons are being made, then the alpha level for significance for each comparison would be 0.05/20 = 0.0025. While this is simple to do (and understand), it has been criticized as being far too conservative, especially when the various tests being performed are highly correlated[ 3 ]
  • The false discovery rate: This approach attempts to control the fraction of “false significant results” among the significant results only. The Benjamini and Hochberg procedure has been described for this approach.[ 5 ] (Both kinds of adjustment are illustrated in the short R sketch after this list.)
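
In R, both corrections are one call to the base function p.adjust. The p-values below are hypothetical, chosen only to illustrate the behavior of the two methods on five comparisons:

> p <- c(0.001, 0.012, 0.03, 0.04, 0.2)   # hypothetical p-values from 5 comparisons
> p.adjust(p, method = "bonferroni")      # family-wise error rate control
[1] 0.005 0.060 0.150 0.200 1.000
> p.adjust(p, method = "BH")              # Benjamini-Hochberg false discovery rate
[1] 0.005 0.030 0.050 0.050 0.200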

IS ADJUSTMENT OR COMMON SENSE NEEDED FOR MULTIPLE TESTING?

Many statisticians feel that alpha-adjustment for multiple comparisons reduces the significance value to very stringent levels and increases the chances of a Type 2 error (false negative error; falsely accepting the null hypothesis).[ 2 ] It has also been argued that an obsessive reliance on alpha-adjustment may be counterproductive.[ 6 ]

The following simple strategies have been suggested to handle multiple comparisons:[ 2 , 7 ]

  • Readers should evaluate the quality of the study and the actual effect size instead of focusing only on statistical significance
  • Results from single studies should not be used to make treatment decisions; instead, one should look for scientific plausibility and supporting data from other studies which can validate the results of the original study
  • Authors should try to limit comparisons between groups and identify a single primary endpoint; using a composite endpoint or global assessment tool is also an acceptable alternative to using multiple endpoints.


CONFLICTS OF INTEREST

There are no conflicts of interest.


One and Two Sample Tests and ANOVA


Tests for More Than Two Samples

Parametric Analysis of Variance (ANOVA)


In this section, we consider comparisons among more than two groups parametrically, using analysis of variance (ANOVA), as well as non-parametrically, using the Kruskal-Wallis test.  

To test if the means are equal for more than two groups, we perform an analysis of variance test. An ANOVA test determines whether the grouping variable explains a significant portion of the variability in the dependent variable. If so, we would expect the mean of the dependent variable to differ across the groups. The assumptions of an ANOVA test are as follows:

  • Independent observations
  • The dependent variable follows a normal distribution in each group
  • Equal variance of the dependent variable in each group

Here, we will use the Pima.tr dataset. According to the National Heart Lung and Blood Institute (NHLBI) website ( http://www.nhlbisupport.com/bmi/ ), BMI can be classified into 4 categories:

  • Underweight:  < 18.5
  • Normal weight: 18.5 ~ 24.9
  • Overweight: 25 ~ 29.9
  • Obesity: >= 30 

We will use these cut points to categorize the continuous BMI variable into four classes based on the definition shown above. Note that we have very few underweight individuals, so we collapse underweight and normal weight into "Normal/under weight."

An Aside

In this dataset the BMI is stored in numerical format, so we need to categorize BMI first, since we are interested in whether categorical BMI is associated with the plasma glucose concentration. In the Exercise, you can use an "if-else" statement to create the variable. Alternatively, we can use the cut() function, as shown below. Since we have very few individuals with BMI < 18.5, we will collapse the categories "Underweight" and "Normal weight" together.

 

> bmi.label <-  c("Underweight/Normalweight", "Overweight", "Obesity")

> summary(bmi)

> bmi.break <- c(18, 24.9, 29.9, 50)

> bmi.cat <- cut(bmi, breaks=bmi.break, labels = bmi.label)

> table(bmi.cat)

bmi.cat

Underweight/Normal weight         Overweight                   Obesity

                       25                 43                       132

> tapply(glu, bmi.cat, mean)

Normal/under weight          Overweight             Obesity

           108.4800           116.6977             129.2727  

 

Suppose we want to compare the means of plasma glucose concentration across our three BMI categories. We will conduct an analysis of variance using the bmi.cat variable as a factor.

> bmi.cat <- factor(bmi.cat)

> bmi.anova <- aov(glu ~ bmi.cat)

Before looking at the result, you may be interested in checking each category's average glucose concentration. One way to do this is with the tapply() function, as above; alternatively, we can request the group means from the fitted ANOVA object with model.tables():

> print(model.tables(bmi.anova, "means"))

Tables of means

      

 bmi.cat

   Underweight/Normal weight Overweight Obesity

                        108.5      116.7   129.3

rep                      25.0       43.0   132.0

The glucose level clearly varies across the categories. We can now request the ANOVA table for this analysis to check whether the hypothesis test matches our observation from the summary statistics.

> summary(bmi.anova)

             Df Sum Sq Mean Sq F value   Pr(>F)  

bmi.cat       2  11984    5992  6.2932 0.002242 **

Residuals   197 187575     952                     

  • H 0 : The mean glucose is equal for all levels of bmi categories.
  • H a : At least one of the bmi categories has a mean glucose that is not the same as the other bmi categories.

We see that we reject the null hypothesis that the mean glucose is equal for all levels of bmi categories (F 2,197 = 6.29, p-value = 0.002242). The plasma glucose concentration means in at least two categories are significantly different.

Performing many tests increases the probability of finding at least one of them significant just by chance; that is, our Type I error rate increases. A common adjustment method is the Bonferroni correction, which adjusts for multiple comparisons by changing the level of significance α for each test to α / (number of tests). Thus, if we were performing 10 tests and wanted to maintain an overall level of significance α of 0.05, we would adjust for multiple testing using the Bonferroni correction by using 0.05/10 = 0.005 as our new level of significance for each test.

A function called pairwise.t.test computes all possible two-group comparisons.

> pairwise.t.test(glu, bmi.cat, p.adj = "none")

        Pairwise comparisons using t tests with pooled SD

data:  glu and bmi.cat

           Underweight/Normalweight Overweight

Overweight 0.2910                    -        

Obesity    0.0023                    0.0213   

P value adjustment method: none

From this result we reject the null hypothesis that the mean glucose for those who are obese is equal to the mean glucose for those who are underweight/normal weight (p-value = 0.0023). We also reject the null hypothesis that the mean glucose for those who are obese is equal to the mean glucose for those who are overweight (p-value = 0.0213). We fail to reject the null hypothesis that the mean glucose for those who are overweight is equal to the mean glucose for those who are underweight/normal weight (p-value = 0.2910).

We can also make adjustments for multiple comparisons, like so:

> pairwise.t.test(glu, bmi.cat, p.adj = "bonferroni")

           Underweight/Normal weight Overweight

Overweight 0.8729                    -        

Obesity    0.0069                    0.0639  

P value adjustment method: bonferroni

However, the Bonferroni correction is very conservative. Here, we introduce an alternative multiple comparison approach using Tukey's procedure:

> TukeyHSD(bmi.anova)

  Tukey multiple comparisons of means

    95% family-wise confidence level

Fit: aov(formula = glu ~ bmi.cat)

$bmi.cat                                                                                

diff         lwr      upr     p adj

Overweight-Underweight/Normalweight   8.217674 -10.1099039 26.54525 0.5407576

Obesity-Underweight/Normal weight    20.792727   4.8981963 36.68726 0.0064679

Obesity-Overweight                   12.575053  -0.2203125 25.37042 0.0552495

From the pairwise comparison, what do we find regarding the plasma glucose in the different weight categories?

It is important to note that when testing the assumptions of an ANOVA, the var.test function can only be performed for two groups at a time. To look at the assumption of equal variance for more than two groups, we can use side-by-side boxplots:

> boxplot(glu~bmi.cat)

[Figure: side-by-side boxplots of glu by bmi.cat]

To determine whether or not the assumption of equal variance is met we look to see if the spread is equal for each of the groups.

We can also conduct a formal test for homogeneity of variances when we have more than two groups. This test is called Bartlett's Test , which assumes normality. The procedure is performed as follows:

> bartlett.test(glu~bmi.cat)

        Bartlett test of homogeneity of variances

data:  glu by bmi.cat

Bartlett's K-squared = 3.6105, df = 2, p-value = 0.1644

H 0 : The variability in glucose is equal for all bmi categories.

H a : The variability in glucose is not equal for all bmi categories.

We fail to reject the null hypothesis that the variability in glucose is equal for all bmi categories (Bartlett's K-squared = 3.6105, df = 2, p-value = 0.1644).


Creative Commons license Attribution Non-commercial

Module 10: Inference for Means

Hypothesis Test for a Difference in Two Population Means (1 of 2)

Learning Outcomes

  • Under appropriate conditions, conduct a hypothesis test about a difference between two population means. State a conclusion in context.

Using the Hypothesis Test for a Difference in Two Population Means

The general steps of this hypothesis test are the same as always. As expected, the details of the conditions for use of the test and the test statistic are unique to this test (but similar in many ways to what we have seen before.)

Step 1: Determine the hypotheses.

The hypotheses for a difference in two population means are similar to those for a difference in two population proportions. The null hypothesis, H 0 , is again a statement of “no effect” or “no difference.”

  • H 0 : μ 1 – μ 2 = 0, which is the same as H 0 : μ 1 = μ 2

The alternative hypothesis, H a , can be any one of the following.

  • H a : μ 1 – μ 2 < 0, which is the same as H a : μ 1 < μ 2
  • H a : μ 1 – μ 2 > 0, which is the same as H a : μ 1 > μ 2
  • H a : μ 1 – μ 2 ≠ 0, which is the same as H a : μ 1 ≠ μ 2

Step 2: Collect the data.

As usual, how we collect the data determines whether we can use it in the inference procedure. We have our usual two requirements for data collection.

  • Samples must be random to remove or minimize bias.
  • Samples must be representative of the populations in question.

We use this hypothesis test when the data meets the following conditions.

  • The two random samples are independent .
  • The variable is normally distributed in both populations . If this is not known, samples of more than 30 will have a difference in sample means that can be modeled adequately by the t-distribution. As we discussed in “Hypothesis Test for a Population Mean,” t-procedures are robust even when the variable is not normally distributed in the population. If checking normality in the populations is impossible, then we look at the distribution in the samples. If a histogram or dotplot of the data does not show extreme skew or outliers, we take it as a sign that the variable is not heavily skewed in the populations, and we use the inference procedure. (Note: This is the same condition we used for the one-sample t-test in “Hypothesis Test for a Population Mean.”)

Step 3: Assess the evidence.

If the conditions are met, then we calculate the t-test statistic. The t-test statistic has a familiar form.

[latex]T=\displaystyle \frac{\text{observed difference in sample means}-\text{hypothesized difference in population means}}{\text{standard error}}[/latex]

[latex]T=\displaystyle \frac{(\bar{x}_{1}-\bar{x}_{2})-(\mu_{1}-\mu_{2})}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}[/latex]

Since the null hypothesis assumes there is no difference in the population means, the expression (μ 1 – μ 2 ) is always zero.

As we learned in “Estimating a Population Mean,” the t-distribution depends on the degrees of freedom (df) . In the one-sample and matched-pair cases df = n – 1. For the two-sample t-test, determining the correct df is based on a complicated formula that we do not cover in this course. We will either give the df or use technology to find the df . With the t-test statistic and the degrees of freedom, we can use the appropriate t-model to find the P-value, just as we did in “Hypothesis Test for a Population Mean.” We can even use the same simulation.
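
For reference only (this course will not use it by hand), the "complicated formula" that technology typically applies is the Welch–Satterthwaite approximation:

[latex]df \approx \displaystyle \frac{\left(\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}\right)^{2}}{\frac{1}{n_{1}-1}\left(\frac{s_{1}^{2}}{n_{1}}\right)^{2}+\frac{1}{n_{2}-1}\left(\frac{s_{2}^{2}}{n_{2}}\right)^{2}}[/latex]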

Step 4: State a conclusion.

To state a conclusion, we follow what we have done with other hypothesis tests. We compare our P-value to a stated level of significance.

  • If the P-value ≤ α, we reject the null hypothesis in favor of the alternative hypothesis.
  • If the P-value > α, we fail to reject the null hypothesis. We do not have enough evidence to support the alternative hypothesis.

As always, we state our conclusion in context, usually by referring to the alternative hypothesis.

“Context and Calories”

Does the company you keep impact what you eat? This example comes from an article titled “Impact of Group Settings and Gender on Meals Purchased by College Students” (Allen-O’Donnell, M., T. C. Nowak, K. A. Snyder, and M. D. Cottingham, Journal of Applied Social Psychology 49(9), 2011, onlinelibrary.wiley.com/doi/10.1111/j.1559-1816.2011.00804.x/full) . In this study, researchers examined this issue in the context of gender-related theories in their field. For our purposes, we look at this research more narrowly.

Step 1: Stating the hypotheses.

In the article, the authors make the following hypothesis. “The attempt to appear feminine will be empirically demonstrated by the purchase of fewer calories by women in mixed-gender groups than by women in same-gender groups.” We translate this into a simpler and narrower research question: Do women purchase fewer calories when they eat with men compared to when they eat with women?

Here the two populations are “women eating with women” (population 1) and “women eating with men” (population 2). The variable is the calories in the meal. We test the following hypotheses at the 5% level of significance.

The null hypothesis is always H 0 : μ 1 – μ 2 = 0, which is the same as H 0 : μ 1 = μ 2 .

The alternative hypothesis H a : μ 1 – μ 2 > 0, which is the same as H a : μ 1 > μ 2 .

Here μ 1 represents the mean number of calories ordered by women when they were eating with other women, and μ 2 represents the mean number of calories ordered by women when they were eating with men.

Note: It does not matter which population we label as 1 or 2, but once we decide, we have to stay consistent throughout the hypothesis test. Since we expect the number of calories to be greater for the women eating with other women, the difference is positive if “women eating with women” is population 1. If you prefer to work with positive numbers, choose the group with the larger expected mean as population 1. This is a good general tip.

Step 2: Collect Data.

As usual, there are two major things to keep in mind when considering the collection of data.

  • Samples need to be representative of the population in question.
  • Samples need to be random in order to remove or minimize bias.

Representative Samples?

The researchers state their hypothesis in terms of “women.” We did the same. But the researchers gathered data by watching people eat at the HUB Rock Café II on the campus of Indiana University of Pennsylvania during the Spring semester of 2006. Almost all of the women in the data set were white undergraduates between the ages of 18 and 24, so there are some definite limitations on the scope of this study. These limitations will affect our conclusion (and the specific definition of the population means in our hypotheses.)

Random Samples?

The observations were collected on February 13, 2006, through February 22, 2006, between 11 a.m. and 7 p.m. We can see that the researchers included both lunch and dinner. They also made observations on all days of the week to ensure that weekly customer patterns did not confound their findings. The authors state that “since the time period for observations and the place where [they] observed students were limited, the sample was a convenience sample.” Despite these limitations, the researchers conducted inference procedures with the data, and the results were published in a reputable journal. We will also conduct inference with this data, but we also include a discussion of the limitations of the study with our conclusion. The authors did this, also.

Do the data meet the conditions for use of a t-test?

The researchers reported the following sample statistics.

  • In a sample of 45 women dining with other women, the average number of calories ordered was 850, and the standard deviation was 252.
  • In a sample of 27 women dining with men, the average number of calories ordered was 719, and the standard deviation was 322.

One of the samples has fewer than 30 women. We need to make sure the distribution of calories in this sample is not heavily skewed and has no outliers, but we do not have access to a spreadsheet of the actual data. Since the researchers conducted a t-test with this data, we will assume that the conditions are met. This includes the assumption that the samples are independent.

As noted previously, the researchers reported the following sample statistics.

To compute the t-test statistic, make sure sample 1 corresponds to population 1. Here our population 1 is “women eating with other women.” So x 1 = 850, s 1 = 252, n 1 =45, and so on.

[latex]T=\frac{\bar{x}_{1}-\bar{x}_{2}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}= \frac{850-719}{\sqrt{\frac{252^{2}}{45}+\frac{322^{2}}{27}}}\approx \frac{131}{72.47}\approx 1.81[/latex]

Using technology, we determined that the degrees of freedom are about 45 for this data. To find the P-value, we use our familiar simulation of the t-distribution. Since the alternative hypothesis is a “greater than” statement, we look for the area to the right of T = 1.81. The P-value is 0.0385.
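
The tail area is easy to verify in R (a one-line sketch using the approximate degrees of freedom reported above):

> 1 - pt(1.81, df = 45)   # upper-tail area; approximately 0.0385, matching the P-value above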

[Figure: t-model with the area to the left of T = 1.81 shaded (0.9615) and the area to the right shaded (0.0385).]

Generic Conclusion

The hypotheses for this test are H 0 : μ 1 – μ 2 = 0 and H a : μ 1 – μ 2 > 0. Since the P-value is less than the significance level (0.0385 < 0.05), we reject H 0 and accept H a .

Conclusion in context

At Indiana University of Pennsylvania, the mean number of calories ordered by undergraduate women eating with other women is greater than the mean number of calories ordered by undergraduate women eating with men (P-value = 0.0385).

Comment about Conclusions

In the conclusion above, we did not generalize the findings to all women. Since the samples included only undergraduate women at one university, we included this information in our conclusion. But our conclusion is a cautious statement of the findings. The authors see the results more broadly in the context of theories in the field of social psychology. In the context of these theories, they write, “Our findings support the assertion that meal size is a tool for influencing the impressions of others. For traditional-age, predominantly White college women, diminished meal size appears to be an attempt to assert femininity in groups that include men.” This viewpoint is echoed in the following summary of the study for the general public on National Public Radio (npr.org).

  • Both men and women appear to choose larger portions when they eat with women, and both men and women choose smaller portions when they eat in the company of men, according to new research published in the Journal of Applied Social Psychology . The study, conducted among a sample of 127 college students, suggests that both men and women are influenced by unconscious scripts about how to behave in each other’s company. And these scripts change the way men and women eat when they eat together and when they eat apart.

Should we be concerned that the findings of this study are generalized in this way? Perhaps. But the authors of the article address this concern by including the following disclaimer with their findings: “While the results of our research are suggestive, they should be replicated with larger, representative samples. Studies should be done not only with primarily White, middle-class college students, but also with students who differ in terms of race/ethnicity, social class, age, sexual orientation, and so forth.” This is an example of good statistical practice. It is often very difficult to select truly random samples from the populations of interest. Researchers therefore discuss the limitations of their sampling design when they discuss their conclusions.

In the following activities, you will have the opportunity to practice parts of the hypothesis test for a difference in two population means. On the next page, the activities focus on the entire process and also incorporate technology.


  • Concepts in Statistics. Provided by : Open Learning Initiative. Located at : http://oli.cmu.edu . License : CC BY: Attribution

Concepts in Statistics Copyright © 2023 by CUNY School of Professional Studies is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License , except where otherwise noted.


Choosing the Right Statistical Test | Types & Examples

Published on January 28, 2020 by Rebecca Bevans . Revised on June 22, 2023.

Statistical tests are used in hypothesis testing . They can be used to:

  • determine whether a predictor variable has a statistically significant relationship with an outcome variable.
  • estimate the difference between two or more groups.

Statistical tests assume a null hypothesis of no relationship or no difference between groups. Then they determine whether the observed data fall outside of the range of values predicted by the null hypothesis.

If you already know what types of variables you’re dealing with, you can use the flowchart to choose the right statistical test for your data.

Statistical tests flowchart


Statistical tests work by calculating a test statistic – a number that describes how much the relationship between variables in your test differs from the null hypothesis of no relationship.

It then calculates a p value (probability value). The p -value estimates how likely it is that you would see the difference described by the test statistic if the null hypothesis of no relationship were true.

If the value of the test statistic is more extreme than the statistic calculated from the null hypothesis, then you can infer a statistically significant relationship between the predictor and outcome variables.

If the value of the test statistic is less extreme than the one calculated from the null hypothesis, then you can infer no statistically significant relationship between the predictor and outcome variables.


You can perform statistical tests on data that have been collected in a statistically valid manner – either through an experiment , or through observations made using probability sampling methods .

For a statistical test to be valid , your sample size needs to be large enough to approximate the true distribution of the population being studied.

To determine which statistical test to use, you need to know:

  • whether your data meets certain assumptions.
  • the types of variables that you’re dealing with.

Statistical assumptions

Statistical tests make some common assumptions about the data they are testing:

  • Independence of observations (a.k.a. no autocorrelation): The observations/variables you include in your test are not related (for example, multiple measurements of a single test subject are not independent, while measurements of multiple different test subjects are independent).
  • Homogeneity of variance : the variance within each group being compared is similar among all groups. If one group has much more variation than others, it will limit the test’s effectiveness.
  • Normality of data : the data follows a normal distribution (a.k.a. a bell curve). This assumption applies only to quantitative data .

If your data do not meet the assumptions of normality or homogeneity of variance, you may be able to perform a nonparametric statistical test , which allows you to make comparisons without any assumptions about the data distribution.

If your data do not meet the assumption of independence of observations, you may be able to use a test that accounts for structure in your data (repeated-measures tests or tests that include blocking variables).

Types of variables

The types of variables you have usually determine what type of statistical test you can use.

Quantitative variables represent amounts of things (e.g. the number of trees in a forest). Types of quantitative variables include:

  • Continuous (aka ratio variables): represent measures and can usually be divided into units smaller than one (e.g. 0.75 grams).
  • Discrete (aka integer variables): represent counts and usually can’t be divided into units smaller than one (e.g. 1 tree).

Categorical variables represent groupings of things (e.g. the different tree species in a forest). Types of categorical variables include:

  • Ordinal : represent data with an order (e.g. rankings).
  • Nominal : represent group names (e.g. brands or species names).
  • Binary : represent data with a yes/no or 1/0 outcome (e.g. win or lose).

Choose the test that fits the types of predictor and outcome variables you have collected (if you are doing an experiment , these are the independent and dependent variables ). Consult the tables below to see which test best matches your variables.

Parametric tests usually have stricter requirements than nonparametric tests, and are able to make stronger inferences from the data. They can only be conducted with data that adheres to the common assumptions of statistical tests.

The most common types of parametric test include regression tests, comparison tests, and correlation tests.

Regression tests

Regression tests look for cause-and-effect relationships . They can be used to estimate the effect of one or more continuous variables on another variable.

Test | Predictor variable | Outcome variable | Research question example
Simple linear regression | Continuous | Continuous | What is the effect of income on longevity?
Multiple linear regression | Continuous (2 or more predictors) | Continuous | What is the effect of income and minutes of exercise per day on longevity?
Logistic regression | Continuous | Binary | What is the effect of drug dosage on the survival of a test subject?

Comparison tests

Comparison tests look for differences among group means . They can be used to test the effect of a categorical variable on the mean value of some other characteristic.

T-tests are used when comparing the means of precisely two groups (e.g., the average heights of men and women). ANOVA and MANOVA tests are used when comparing the means of more than two groups (e.g., the average heights of children, teenagers, and adults).

Test | Predictor variable | Outcome variable | Research question example
Paired t-test | Categorical (1 predictor) | Quantitative (groups come from the same population) | What is the effect of two different test prep programs on the average exam scores for students from the same class?
Independent t-test | Categorical (1 predictor) | Quantitative (groups come from different populations) | What is the difference in average exam scores for students from two different schools?
ANOVA | Categorical (1 or more predictors) | Quantitative (1 outcome) | What is the difference in average pain levels among post-surgical patients given three different painkillers?
MANOVA | Categorical (1 or more predictors) | Quantitative (2 or more outcomes) | What is the effect of flower species on petal length, petal width, and stem length?

Correlation tests

Correlation tests check whether variables are related without hypothesizing a cause-and-effect relationship.

These can be used to test whether two variables you want to use in (for example) a multiple regression test are autocorrelated.

Test | Variables | Research question example
Pearson's r | 2 continuous variables | How are latitude and temperature related?

Non-parametric tests don’t make as many assumptions about the data, and are useful when one or more of the common statistical assumptions are violated. However, the inferences they make aren’t as strong as with parametric tests.

Test | Use in place of…
Spearman's r | Pearson's r
Sign test | One-sample t-test
Kruskal–Wallis H | ANOVA
ANOSIM | MANOVA
Wilcoxon rank-sum test | Independent t-test
Wilcoxon signed-rank test | Paired t-test


This flowchart helps you choose among parametric tests. For nonparametric alternatives, check the table above.

Choosing the right statistical test


Statistical tests commonly assume that:

  • the data are normally distributed
  • the groups that are being compared have similar variance
  • the data are independent

If your data does not meet these assumptions, you might still be able to use a nonparametric statistical test , which has fewer requirements but also makes weaker inferences.

A test statistic is a number calculated by a  statistical test . It describes how far your observed data is from the  null hypothesis  of no relationship between  variables or no difference among sample groups.

The test statistic tells you how different two or more groups are from the overall population mean , or how different a linear slope is from the slope predicted by a null hypothesis . Different test statistics are used in different statistical tests.

Statistical significance is a term used by researchers to state that it is unlikely their observations could have occurred under the null hypothesis of a statistical test . Significance is usually denoted by a p -value , or probability value.

Statistical significance is arbitrary – it depends on the threshold, or alpha value, chosen by the researcher. The most common threshold is p < 0.05, which means that the data is likely to occur less than 5% of the time under the null hypothesis .

When the p -value falls below the chosen alpha value, then we say the result of the test is statistically significant.

Quantitative variables are any variables where the data represent amounts (e.g. height, weight, or age).

Categorical variables are any variables where the data represent groups. This includes rankings (e.g. finishing places in a race), classifications (e.g. brands of cereal), and binary outcomes (e.g. coin flips).

You need to know what type of variables you are working with to choose the right statistical test for your data and interpret your results .

Discrete and continuous variables are two types of quantitative variables :

  • Discrete variables represent counts (e.g. the number of objects in a collection).
  • Continuous variables represent measurable amounts (e.g. water volume or weight).


Bevans, R. (2023, June 22). Choosing the Right Statistical Test | Types & Examples. Scribbr. Retrieved June 24, 2024, from https://www.scribbr.com/statistics/statistical-tests/



8.4.3 Hypothesis Testing for the Mean

$\quad$ $H_0$: $\mu=\mu_0$, $\quad$ $H_1$: $\mu \neq \mu_0$.

$\quad$ $H_0$: $\mu \leq \mu_0$, $\quad$ $H_1$: $\mu > \mu_0$.

$\quad$ $H_0$: $\mu \geq \mu_0$, $\quad$ $H_1$: $\mu \lt \mu_0$.

Two-sided Tests for the Mean:

Therefore, we can suggest the following test. Choose a threshold, and call it $c$. If $|W| \leq c$, accept $H_0$, and if $|W|>c$, accept $H_1$. How do we choose $c$? If $\alpha$ is the required significance level, we must have $P(|W| > c \; | \; H_0) = \alpha$.

  • As discussed above, we let \begin{align}%\label{} W(X_1,X_2, \cdots,X_n)=\frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}}. \end{align} Note that, assuming $H_0$, $W \sim N(0,1)$. We will choose a threshold, $c$. If $|W| \leq c$, we accept $H_0$, and if $|W|>c$, accept $H_1$. To choose $c$, we let \begin{align} P(|W| > c \; | \; H_0) =\alpha. \end{align} Since the standard normal PDF is symmetric around $0$, we have \begin{align} P(|W| > c \; | \; H_0) = 2 P(W>c | \; H_0). \end{align} Thus, we conclude $P(W>c | \; H_0)=\frac{\alpha}{2}$. Therefore, \begin{align} c=z_{\frac{\alpha}{2}}. \end{align} Therefore, we accept $H_0$ if \begin{align} \left|\frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}} \right| \leq z_{\frac{\alpha}{2}}, \end{align} and reject it otherwise.
  • We have \begin{align} \beta (\mu) &=P(\textrm{type II error}) = P(\textrm{accept }H_0 \; | \; \mu) \\ &= P\left(\left|\frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}} \right| \lt z_{\frac{\alpha}{2}}\; | \; \mu \right). \end{align} If $X_i \sim N(\mu,\sigma^2)$, then $\overline{X} \sim N(\mu, \frac{\sigma^2}{n})$. Thus, \begin{align} \beta (\mu)&=P\left(\left|\frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}} \right| \lt z_{\frac{\alpha}{2}}\; | \; \mu \right)\\ &=P\left(\mu_0- z_{\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}} \leq \overline{X} \leq \mu_0+ z_{\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}}\right)\\ &=\Phi\left(z_{\frac{\alpha}{2}}+\frac{\mu_0-\mu}{\sigma / \sqrt{n}}\right)-\Phi\left(-z_{\frac{\alpha}{2}}+\frac{\mu_0-\mu}{\sigma / \sqrt{n}}\right). \end{align}
  • Let $S^2$ be the sample variance for this random sample. Then, the random variable $W$ defined as \begin{equation} W(X_1,X_2, \cdots, X_n)=\frac{\overline{X}-\mu_0}{S / \sqrt{n}} \end{equation} has a $t$-distribution with $n-1$ degrees of freedom, i.e., $W \sim T(n-1)$. Thus, we can repeat the analysis of Example 8.24 here. The only difference is that we need to replace $\sigma$ by $S$ and $z_{\frac{\alpha}{2}}$ by $t_{\frac{\alpha}{2},n-1}$. Therefore, we accept $H_0$ if \begin{align} |W| \leq t_{\frac{\alpha}{2},n-1}, \end{align} and reject it otherwise. Let us look at a numerical example of this case.

$\quad$ $H_0$: $\mu=170$, $\quad$ $H_1$: $\mu \neq 170$.

  • Let's first compute the sample mean and the sample standard deviation. The sample mean is \begin{align}%\label{} \overline{X}&=\frac{X_1+X_2+X_3+X_4+X_5+X_6+X_7+X_8+X_9}{9}\\ &=165.8 \end{align} The sample variance is given by \begin{align}%\label{} {S}^2=\frac{1}{9-1} \sum_{k=1}^9 (X_k-\overline{X})^2&=68.01 \end{align} The sample standard deviation is given by \begin{align}%\label{} S&= \sqrt{S^2}=8.25 \end{align} The following MATLAB code can be used to obtain these values: x=[176.2,157.9,160.1,180.9,165.1,167.2,162.9,155.7,166.2]; m=mean(x); v=var(x); s=std(x); Now, our test statistic is \begin{align} W(X_1,X_2, \cdots, X_9)&=\frac{\overline{X}-\mu_0}{S / \sqrt{n}}\\ &=\frac{165.8-170}{8.25 / 3}=-1.52 \end{align} Thus, $|W|=1.52$. Also, we have \begin{align} t_{\frac{\alpha}{2},n-1} = t_{0.025,8} \approx 2.31 \end{align} The above value can be obtained in MATLAB using the command $\mathtt{tinv(0.975,8)}$. Thus, we conclude \begin{align} |W| \leq t_{\frac{\alpha}{2},n-1}. \end{align} Therefore, we accept $H_0$. In other words, we do not have enough evidence to conclude that the average height in the city is different from the average height in the country.
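
For readers who prefer R to MATLAB, the built-in t.test function runs the entire test in one call on the same data vector and reproduces the statistic above:

> x <- c(176.2, 157.9, 160.1, 180.9, 165.1, 167.2, 162.9, 155.7, 166.2)
> t.test(x, mu = 170)   # two-sided one-sample t-test; t is about -1.52, so H0 is not rejected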

Let us summarize what we have obtained for the two-sided test for the mean.

Case | Test Statistic | Acceptance Region
$X_i \sim N(\mu, \sigma^2)$, $\sigma$ known | $W=\frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}}$ | $|W| \leq z_{\frac{\alpha}{2}}$
$n$ large, $X_i$ non-normal | $W=\frac{\overline{X}-\mu_0}{S / \sqrt{n}}$ | $|W| \leq z_{\frac{\alpha}{2}}$
$X_i \sim N(\mu, \sigma^2)$, $\sigma$ unknown | $W=\frac{\overline{X}-\mu_0}{S / \sqrt{n}}$ | $|W| \leq t_{\frac{\alpha}{2},n-1}$

One-sided Tests for the Mean:

  • As before, we define the test statistic as \begin{align}%\label{} W(X_1,X_2, \cdots,X_n)=\frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}}. \end{align} If $H_0$ is true (i.e., $\mu \leq \mu_0$), we expect $\overline{X}$ (and thus $W$) to be relatively small, while if $H_1$ is true, we expect $\overline{X}$ (and thus $W$) to be larger. This suggests the following test: Choose a threshold, and call it $c$. If $W \leq c$, accept $H_0$, and if $W>c$, accept $H_1$. How do we choose $c$? If $\alpha$ is the required significance level, we must have \begin{align} P(\textrm{type I error}) &= P(\textrm{Reject }H_0 \; | \; H_0) \\ &= P(W > c \; | \; \mu \leq \mu_0) \leq \alpha. \end{align} Here, the probability of type I error depends on $\mu$. More specifically, for any $\mu \leq \mu_0$, we can write \begin{align} P(\textrm{type I error} \; | \; \mu) &= P(\textrm{Reject }H_0 \; | \; \mu) \\ &= P(W > c \; | \; \mu)\\ &=P \left(\frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}}> c \; | \; \mu\right)\\ &=P \left(\frac{\overline{X}-\mu}{\sigma / \sqrt{n}}+\frac{\mu-\mu_0}{\sigma / \sqrt{n}}> c \; | \; \mu\right)\\ &=P \left(\frac{\overline{X}-\mu}{\sigma / \sqrt{n}}> c+\frac{\mu_0-\mu}{\sigma / \sqrt{n}} \; | \; \mu\right)\\ &\leq P \left(\frac{\overline{X}-\mu}{\sigma / \sqrt{n}}> c \; | \; \mu\right) \quad (\textrm{ since }\mu \leq \mu_0)\\ &=1-\Phi(c) \quad \big(\textrm{ since given }\mu, \frac{\overline{X}-\mu}{\sigma / \sqrt{n}} \sim N(0,1) \big). \end{align} Thus, we can choose $\alpha=1-\Phi(c)$, which results in \begin{align} c=z_{\alpha}. \end{align} Therefore, we accept $H_0$ if \begin{align} \frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}} \leq z_{\alpha}, \end{align} and reject it otherwise.
Case | Test Statistic | Acceptance Region
$X_i \sim N(\mu, \sigma^2)$, $\sigma$ known | $W=\frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}}$ | $W \leq z_{\alpha}$
$n$ large, $X_i$ non-normal | $W=\frac{\overline{X}-\mu_0}{S / \sqrt{n}}$ | $W \leq z_{\alpha}$
$X_i \sim N(\mu, \sigma^2)$, $\sigma$ unknown | $W=\frac{\overline{X}-\mu_0}{S / \sqrt{n}}$ | $W \leq t_{\alpha,n-1}$

$\quad$ $H_0$: $\mu \geq \mu_0$, $\quad$ $H_1$: $\mu \lt \mu_0$,

Case | Test Statistic | Acceptance Region
$X_i \sim N(\mu, \sigma^2)$, $\sigma$ known | $W=\frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}}$ | $W \geq -z_{\alpha}$
$n$ large, $X_i$ non-normal | $W=\frac{\overline{X}-\mu_0}{S / \sqrt{n}}$ | $W \geq -z_{\alpha}$
$X_i \sim N(\mu, \sigma^2)$, $\sigma$ unknown | $W=\frac{\overline{X}-\mu_0}{S / \sqrt{n}}$ | $W \geq -t_{\alpha,n-1}$




Which statistical test to use to test differences in multiple means (multiple populations)

I have 3 populations, let's call them cluster 1, cluster 2 and cluster 3. The data are continuous. I want to see if there's a difference between the means of the three clusters. I know that the t-test tests for the difference of means, but that is only for 2 samples. What test should I use for multiple samples, i.e. 3, in my case?

  • hypothesis-testing


If you want a multi-group analog of a t-test it sounds like you just want ANOVA (analysis of variance) or something similar to it. That's exactly what it's for - comparing group means.

Specifically, you seem to be asking for one-way analysis of variance .

Any decent statistics package does ANOVA.

If you don't want to assume normality (just as you would for a t-test), there are a variety of options that still allow a test of means (including permutation tests and GLMs), but if your samples are large, moderate non-normality won't impact things much.

There's also the issue of potential heteroskedasticity; in the normal case many packages offer an approximation via an adjustment to error degrees of freedom (Welch-Satterthwaite) that often performs quite well. If heteroskedasticity is related to mean, you may be better off looking at an ANOVA-like model fitted as a GLM.

However, if the clusters are generated by performing cluster analysis on the same data, the theory for t-tests, ANOVA, GLMs, permutation tests, etc. no longer holds, and none of the resulting p-values would be correct.
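
If your three clusters were defined independently of the response being tested, a minimal one-way ANOVA sketch in R looks like this (dat, y, and cluster are placeholder names for your data frame, continuous response, and grouping factor):

> fit <- aov(y ~ cluster, data = dat)    # classical one-way ANOVA
> summary(fit)                           # F-test for equality of the three means
> oneway.test(y ~ cluster, data = dat)   # Welch ANOVA; drops the equal-variance assumption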


Hypothesis Test for a Mean

This lesson explains how to conduct a hypothesis test of a mean, when the following conditions are met:

  • The sampling method is simple random sampling .
  • The sampling distribution is normal or nearly normal.

Generally, the sampling distribution will be approximately normally distributed if any of the following conditions apply.

  • The population distribution is normal.
  • The population distribution is symmetric , unimodal , without outliers , and the sample size is 15 or less.
  • The population distribution is moderately skewed , unimodal, without outliers, and the sample size is between 16 and 40.
  • The sample size is greater than 40, without outliers.

This approach consists of four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results.

State the Hypotheses

Every hypothesis test requires the analyst to state a null hypothesis and an alternative hypothesis . The hypotheses are stated in such a way that they are mutually exclusive. That is, if one is true, the other must be false; and vice versa.

The table below shows three sets of hypotheses. Each makes a statement about how the population mean μ is related to a specified value M. (In the table, the symbol ≠ means "not equal to".)

Set | Null hypothesis | Alternative hypothesis | Number of tails
--- | --- | --- | ---
1 | μ = M | μ ≠ M | 2
2 | μ ≥ M | μ < M | 1
3 | μ ≤ M | μ > M | 1

The first set of hypotheses (Set 1) is an example of a two-tailed test , since an extreme value on either side of the sampling distribution would cause a researcher to reject the null hypothesis. The other two sets of hypotheses (Sets 2 and 3) are one-tailed tests , since an extreme value on only one side of the sampling distribution would cause a researcher to reject the null hypothesis.

Formulate an Analysis Plan

The analysis plan describes how to use sample data to accept or reject the null hypothesis. It should specify the following elements.

  • Significance level. Often, researchers choose significance levels equal to 0.01, 0.05, or 0.10; but any value between 0 and 1 can be used.
  • Test method. Use the one-sample t-test to determine whether the hypothesized mean differs significantly from the observed sample mean.

Analyze Sample Data

Using sample data, conduct a one-sample t-test. This involves finding the standard error, degrees of freedom, test statistic, and the P-value associated with the test statistic.

  • Standard error. Compute the standard error (SE) of the sampling distribution.

SE = s * sqrt{ ( 1/n ) * [ ( N - n ) / ( N - 1 ) ] }

where s is the sample standard deviation, n is the sample size, and N is the population size. When the population size is much larger (at least 20 times larger) than the sample size, the standard error can be approximated by:

SE = s / sqrt( n )

  • Degrees of freedom. The degrees of freedom (DF) is equal to the sample size (n) minus one. Thus, DF = n - 1.

  • Test statistic. The test statistic is a t statistic (t) defined by the following equation:

t = ( x - μ ) / SE

where x is the sample mean, μ is the hypothesized population mean in the null hypothesis, and SE is the standard error.

  • P-value. The P-value is the probability of observing a sample statistic as extreme as the test statistic. Since the test statistic is a t statistic, use the t Distribution Calculator to assess the probability associated with the t statistic, given the degrees of freedom computed above. (See sample problems at the end of this lesson for examples of how this is done.)
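The four quantities above are straightforward to compute programmatically; here is a minimal Python sketch, where the sample summary (n, mean, standard deviation, hypothesized mean) is made up for illustration:

```python
import math
from scipy import stats

# Hypothetical sample summary, for illustration only.
n, xbar, s, mu0 = 36, 102.5, 12.0, 100.0

se = s / math.sqrt(n)                 # standard error of the mean
df = n - 1                            # degrees of freedom
t = (xbar - mu0) / se                 # t statistic

p_value = 2 * stats.t.sf(abs(t), df)  # two-tailed P-value
print(f"SE = {se:.3f}, DF = {df}, t = {t:.3f}, P-value = {p_value:.4f}")
```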


Interpret Results

If the sample findings are unlikely, given the null hypothesis, the researcher rejects the null hypothesis. Typically, this involves comparing the P-value to the significance level , and rejecting the null hypothesis when the P-value is less than the significance level.

Test Your Understanding

In this section, two sample problems illustrate how to conduct a hypothesis test of a mean score. The first problem involves a two-tailed test; the second problem, a one-tailed test.

Problem 1: Two-Tailed Test

An inventor has developed a new, energy-efficient lawn mower engine. He claims that the engine will run continuously for 5 hours (300 minutes) on a single gallon of regular gasoline. From his stock of 2000 engines, the inventor selects a simple random sample of 50 engines for testing. The engines run for an average of 295 minutes, with a standard deviation of 20 minutes. Test the null hypothesis that the mean run time is 300 minutes against the alternative hypothesis that the mean run time is not 300 minutes. Use a 0.05 level of significance. (Assume that run times for the population of engines are normally distributed.)

Solution: The solution to this problem takes four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results. We work through those steps below:

  • State the hypotheses. The first step is to state the null hypothesis and the alternative hypothesis.

Null hypothesis: μ = 300

Alternative hypothesis: μ ≠ 300

Note that these hypotheses constitute a two-tailed test: the null hypothesis will be rejected if the sample mean is too big or too small.

  • Formulate an analysis plan. For this analysis, the significance level is 0.05. The test method is a one-sample t-test.

  • Analyze sample data. Using sample data, we compute the standard error (SE), degrees of freedom (DF), and the test statistic (t).

SE = s / sqrt(n) = 20 / sqrt(50) = 20/7.07 = 2.83

DF = n - 1 = 50 - 1 = 49

t = ( x - μ) / SE = (295 - 300)/2.83 = -1.77

where s is the standard deviation of the sample, x is the sample mean, μ is the hypothesized population mean, and n is the sample size.

Since we have a two-tailed test, the P-value is the probability that a t statistic having 49 degrees of freedom is less than -1.77 or greater than 1.77. We use the t Distribution Calculator to find that P(t < -1.77) is about 0.04.

  • If you enter 1.77 into the t Distribution Calculator, you will find that P(t < 1.77) is about 0.96. Therefore, P(t > 1.77) is 1 minus 0.96, or 0.04. Thus, the P-value = 0.04 + 0.04 = 0.08.
  • Interpret results. Since the P-value (0.08) is greater than the significance level (0.05), we cannot reject the null hypothesis.
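For readers who prefer software to the t Distribution Calculator, the P-value computed above can be checked with a few lines of Python (SciPy):

```python
from scipy import stats

se = 20 / 50 ** 0.5            # SE ≈ 2.83
t = (295 - 300) / se           # t ≈ -1.77
p = 2 * stats.t.cdf(t, df=49)  # two-tailed P-value
print(f"t = {t:.2f}, P-value = {p:.3f}")  # P-value ≈ 0.083
```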

Note: If you use this approach on an exam, you may also want to mention why this approach is appropriate. Specifically, the approach is appropriate because the sampling method was simple random sampling, the population was normally distributed, and the sample size was small relative to the population size (less than 5%).

Problem 2: One-Tailed Test

Bon Air Elementary School has 1000 students. The principal of the school thinks that the average IQ of students at Bon Air is at least 110. To prove her point, she administers an IQ test to 20 randomly selected students. Among the sampled students, the average IQ is 108 with a standard deviation of 10. Based on these results, should the principal accept or reject her original hypothesis? Assume a significance level of 0.01. (Assume that test scores in the population of students are normally distributed.)

Solution: Again, the solution takes four steps: state the hypotheses, formulate an analysis plan, analyze sample data, and interpret results.

  • State the hypotheses. The first step is to state the null hypothesis and the alternative hypothesis.

Null hypothesis: μ ≥ 110

Alternative hypothesis: μ < 110

Note that these hypotheses constitute a one-tailed test: the null hypothesis will be rejected only if the sample mean is too small.

  • Formulate an analysis plan. For this analysis, the significance level is 0.01. The test method is a one-sample t-test.

  • Analyze sample data. Using sample data, we compute the standard error (SE), degrees of freedom (DF), and the test statistic (t).

SE = s / sqrt(n) = 10 / sqrt(20) = 10/4.472 = 2.236

DF = n - 1 = 20 - 1 = 19

t = ( x - μ) / SE = (108 - 110)/2.236 = -0.894

Here is the logic of the analysis: Given the alternative hypothesis (μ < 110), we want to know whether the observed sample mean is small enough to cause us to reject the null hypothesis.

The observed sample mean produced a test statistic of -0.894. We use the t Distribution Calculator to find that P(t < -0.894) is about 0.19.

  • This means we would expect to find a sample mean of 108 or smaller in 19 percent of our samples, if the true population IQ were 110. Thus the P-value in this analysis is 0.19.
  • Interpret results. Since the P-value (0.19) is greater than the significance level (0.01), we cannot reject the null hypothesis.
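As before, the one-tailed P-value can be verified in Python (SciPy):

```python
from scipy import stats

se = 10 / 20 ** 0.5        # SE ≈ 2.236
t = (108 - 110) / se       # t ≈ -0.894
p = stats.t.cdf(t, df=19)  # lower-tail P-value
print(f"t = {t:.3f}, P-value = {p:.3f}")  # P-value ≈ 0.191
```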
