Chi-Square Goodness of Fit Test | Formula, Guide & Examples

Published on May 24, 2022 by Shaun Turney . Revised on June 22, 2023.

A chi-square (Χ²) goodness of fit test is a type of Pearson's chi-square test. You can use it to test whether the observed distribution of a categorical variable differs from your expectations.

Example: You're testing three new dog food flavors for a dog food company. You recruit a random sample of 75 dogs and offer each dog a choice between the three flavors by placing bowls in front of them. You expect that the flavors will be equally popular among the dogs, with about 25 dogs choosing each flavor.

The chi-square goodness of fit test tells you how well a statistical model fits a set of observations. It's often used to analyze genetic crosses.

Table of contents

  • What is the chi-square goodness of fit test?
  • Chi-square goodness of fit test hypotheses
  • When to use the chi-square goodness of fit test
  • How to calculate the test statistic (formula)
  • How to perform the chi-square goodness of fit test
  • When to use a different test
  • Practice questions and examples
  • Other interesting articles
  • Frequently asked questions about the chi-square goodness of fit test

A chi-square (Χ²) goodness of fit test is a goodness of fit test for a categorical variable. Goodness of fit is a measure of how well a statistical model fits a set of observations.

  • When goodness of fit is high , the values expected based on the model are close to the observed values.
  • When goodness of fit is low , the values expected based on the model are far from the observed values.

The statistical models that are analyzed by chi-square goodness of fit tests are distributions . They can be any distribution, from as simple as equal probability for all groups, to as complex as a probability distribution with many parameters.

  • Hypothesis testing

The chi-square goodness of fit test is a hypothesis test . It allows you to draw conclusions about the distribution of a population based on a sample. Using the chi-square goodness of fit test, you can test whether the goodness of fit is “good enough” to conclude that the population follows the distribution.

With the chi-square goodness of fit test, you can ask questions such as: Was this sample drawn from a population that has…

  • Equal proportions of male and female turtles?
  • Equal proportions of red, blue, yellow, green, and purple jelly beans?
  • 90% right-handed and 10% left-handed people?
  • Offspring with an equal probability of inheriting all possible genotypic combinations (i.e., unlinked genes)?
  • A Poisson distribution of floods per year?
  • A normal distribution of bread prices?
Observed and expected frequencies of dogs' flavor choices:

Flavor             Observed  Expected
Garlic Blast       22        25
Blueberry Delight  30        25
Minty Munch        23        25

To help visualize the differences between your observed and expected frequencies, you also create a bar graph:


The president of the dog food company looks at your graph and declares that they should eliminate the Garlic Blast and Minty Munch flavors to focus on Blueberry Delight. “Not so fast!” you tell him.

You explain that your observations were a bit different from what you expected, but the differences aren’t dramatic. They could be the result of a real flavor preference or they could be due to chance.


Like all hypothesis tests, a chi-square goodness of fit test evaluates two hypotheses: the null and alternative hypotheses. They’re two competing answers to the question “Was the sample drawn from a population that follows the specified distribution?”

  • Null hypothesis (H0): The population follows the specified distribution.
  • Alternative hypothesis (Ha): The population does not follow the specified distribution.

These are general hypotheses that apply to all chi-square goodness of fit tests. You should make your hypotheses more specific by describing the “specified distribution.” You can name the probability distribution (e.g., Poisson distribution) or give the expected proportions of each group.

  • Null hypothesis (H0): The dog population chooses the three flavors in equal proportions (p1 = p2 = p3).
  • Alternative hypothesis (Ha): The dog population does not choose the three flavors in equal proportions.

The following conditions are necessary if you want to perform a chi-square goodness of fit test:

  • You want to test a hypothesis about the distribution of one categorical variable . If your variable is continuous , you can convert it to a categorical variable by separating the observations into intervals. This process is known as data binning.
  • The sample was randomly selected from the population .
  • There are a minimum of five observations expected in each group.
In the dog food example, these conditions are met:

  • You want to test a hypothesis about the distribution of one categorical variable. The categorical variable is the dog food flavors.
  • You recruited a random sample of 75 dogs.
  • There were a minimum of five observations expected in each group. For all three dog food flavors, you expected 25 observations of dogs choosing the flavor.

The test statistic for the chi-square (Χ²) goodness of fit test is Pearson's chi-square:

\begin{equation*}X^2 = \sum{\dfrac{(O-E)^2}{E}}\end{equation*}

where:

  • Χ² is the chi-square test statistic
  • Σ is the summation operator (it means "take the sum of")
  • O is the observed frequency
  • E is the expected frequency

The larger the difference between the observations and the expectations ( O − E in the equation), the bigger the chi-square will be.

To use the formula, follow these five steps:

Step 1: Create a table

Create a table with the observed and expected frequencies in two columns.

Flavor             Observed  Expected
Garlic Blast       22        25
Blueberry Delight  30        25
Minty Munch        23        25

Step 2: Calculate O − E

Add a new column called "O − E". Subtract the expected frequencies from the observed frequencies.

Flavor             Observed  Expected  O − E
Garlic Blast       22        25        22 − 25 = −3
Blueberry Delight  30        25        5
Minty Munch        23        25        −2

Step 3: Calculate (O − E)²

Add a new column called "(O − E)²". Square the values in the previous column.

Flavor             Observed  Expected  O − E  (O − E)²
Garlic Blast       22        25        −3     (−3)² = 9
Blueberry Delight  30        25        5      25
Minty Munch        23        25        −2     4

Step 4: Calculate (O − E)²/E

Add a final column called "(O − E)²/E". Divide the previous column by the expected frequencies.

Flavor             Observed  Expected  O − E  (O − E)²  (O − E)²/E
Garlic Blast       22        25        −3     9         9/25 = 0.36
Blueberry Delight  30        25        5      25        1
Minty Munch        23        25        −2     4         0.16

Step 5: Calculate Χ²

Add up the values of the previous column. This is the chi-square test statistic (Χ²).

Flavor             Observed  Expected  O − E  (O − E)²  (O − E)²/E
Garlic Blast       22        25        −3     9         9/25 = 0.36
Blueberry Delight  30        25        5      25        1
Minty Munch        23        25        −2     4         0.16

Χ² = 0.36 + 1 + 0.16 = 1.52



The chi-square statistic is a measure of goodness of fit, but on its own it doesn't tell you much. For example, is Χ² = 1.52 a low or high goodness of fit?

To interpret the chi-square goodness of fit, you need to compare it to something. That’s what a chi-square test is: comparing the chi-square value to the appropriate chi-square distribution to decide whether to reject the null hypothesis .

To perform a chi-square goodness of fit test, follow these five steps (the first two steps have already been completed for the dog food example):

Step 1: Calculate the expected frequencies

Sometimes, calculating the expected frequencies is the most difficult step. Think carefully about which expected values are most appropriate for your null hypothesis .

In general, you’ll need to multiply each group’s expected proportion by the total number of observations to get the expected frequencies.
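For the dog food example, that multiplication is straightforward (a sketch; under the equal-proportions null hypothesis, each flavor has probability 1/3):

```python
n = 75                              # total number of dogs sampled
null_proportions = [1/3, 1/3, 1/3]  # equal preference under the null hypothesis

expected = [p * n for p in null_proportions]
print(expected)  # expected frequency of 25 dogs per flavor
```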

Step 2: Calculate chi-square

Calculate the chi-square value from your observed and expected frequencies using the chi-square formula.

\begin{equation*}X^2 = \sum{\dfrac{(O-E)^2}{E}}\end{equation*}

Step 3: Find the critical chi-square value

Find the critical chi-square value in a chi-square critical value table or using statistical software. The critical value is calculated from a chi-square distribution. To find the critical chi-square value, you’ll need to know two things:

  • The degrees of freedom ( df ): For chi-square goodness of fit tests, the df is the number of groups minus one.
  • Significance level (α): By convention, the significance level is usually .05.

Step 4: Compare the chi-square value to the critical value

Compare the chi-square value to the critical value to determine which is larger.

Χ² = 1.52

Critical value = 5.99

The Χ² value is less than the critical value.

Step 5: Decide whether to reject the null hypothesis

  • If the Χ² value is greater than the critical value, the data allows you to reject the null hypothesis and provides support for the alternative hypothesis.
  • If the Χ² value is less than the critical value, the data doesn't allow you to reject the null hypothesis and doesn't provide support for the alternative hypothesis.
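The decision rule for the dog food example (Χ² = 1.52 against a critical value of 5.99) can be sketched as:

```python
chi_square = 1.52      # test statistic from the dog food example
critical_value = 5.99  # chi-square critical value for df = 2, alpha = .05

if chi_square > critical_value:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"
print(decision)  # fail to reject the null hypothesis
```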

Whether you use the chi-square goodness of fit test or a related test depends on what hypothesis you want to test and what type of variable you have.

When to use the chi-square test of independence

There’s another type of chi-square test, called the chi-square test of independence .

  • Use the chi-square goodness of fit test when you have one categorical variable and you want to test a hypothesis about its distribution .
  • Use the chi-square test of independence when you have two categorical variables and you want to test a hypothesis about their relationship .

When to use a different goodness of fit test

The Anderson–Darling and Kolmogorov–Smirnov goodness of fit tests are two other common goodness of fit tests for distributions.

  • Use the Anderson–Darling or the Kolmogorov–Smirnov goodness of fit test when you have a continuous variable (that you don’t want to bin).
  • Use the chi-square goodness of fit test when you have a categorical variable (or a continuous variable that you want to bin).

Do you want to test your knowledge about the chi-square goodness of fit test? Download our practice questions and examples with the buttons below.

Download Word doc Download Google doc

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Chi square test of independence
  • Statistical power
  • Descriptive statistics
  • Degrees of freedom
  • Pearson correlation
  • Null hypothesis

Methodology

  • Double-blind study
  • Case-control study
  • Research ethics
  • Data collection
  • Structured interviews

Research bias

  • Hawthorne effect
  • Unconscious bias
  • Recall bias
  • Halo effect
  • Self-serving bias
  • Information bias

You can use the CHISQ.TEST() function to perform a chi-square goodness of fit test in Excel. It takes two arguments, CHISQ.TEST(observed_range, expected_range), and returns the p value .

You can use the chisq.test() function to perform a chi-square goodness of fit test in R. Give the observed values in the “x” argument, give the expected values in the “p” argument, and set “rescale.p” to true. For example:

chisq.test(x = c(22,30,23), p = c(25,25,25), rescale.p = TRUE)
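If you want to see the arithmetic behind that call without R, here is a dependency-free Python sketch that reproduces the same statistic and p-value (for df = 2, the chi-square survival function reduces to e^(−x/2)):

```python
import math

observed = [22, 30, 23]
total = sum(observed)
# rescale.p = TRUE normalizes the weights, giving equal expected counts here.
expected = [total / 3] * 3

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Survival function of the chi-square distribution with df = 2.
p_value = math.exp(-chi_square / 2)
print(round(chi_square, 2), round(p_value, 4))  # 1.52 0.4677
```

The large p-value agrees with the article's conclusion: the flavor differences could easily be due to chance.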

Chi-square goodness of fit tests are often used in genetics. One common application is to check whether two genes are linked (i.e., whether they fail to assort independently). When genes are linked, the allele inherited for one gene affects the allele inherited for another gene.

Suppose that you want to know if the genes for pea texture (R = round, r = wrinkled) and color (Y = yellow, y = green) are linked. You perform a dihybrid cross between two heterozygous ( RY / ry ) pea plants. The hypotheses you’re testing with your experiment are:

  • Null hypothesis (H0): The offspring have an equal probability of inheriting all possible genotypic combinations. This would suggest that the genes are unlinked.
  • Alternative hypothesis (Ha): The offspring do not have an equal probability of inheriting all possible genotypic combinations. This would suggest that the genes are linked.

You observe 100 peas:

  • 78 round and yellow peas
  • 6 round and green peas
  • 4 wrinkled and yellow peas
  • 12 wrinkled and green peas

To calculate the expected values, you can make a Punnett square. If the two genes are unlinked, the probability of each genotypic combination is equal.

        RY      ry      Ry      rY
RY      RRYY    RrYy    RRYy    RrYY
ry      RrYy    rryy    Rryy    rrYy
Ry      RRYy    Rryy    RRyy    RrYy
rY      RrYY    rrYy    RrYy    rrYY

The expected phenotypic ratios are therefore 9 round and yellow: 3 round and green: 3 wrinkled and yellow: 1 wrinkled and green.

From this, you can calculate the expected phenotypic frequencies for 100 peas:

Phenotype            Observed  Expected
Round and yellow     78        100 × (9/16) = 56.25
Round and green      6         100 × (3/16) = 18.75
Wrinkled and yellow  4         100 × (3/16) = 18.75
Wrinkled and green   12        100 × (1/16) = 6.25

Phenotype            O   E      O − E   (O − E)²  (O − E)²/E
Round and yellow     78  56.25  21.75   473.06    8.41
Round and green      6   18.75  −12.75  162.56    8.67
Wrinkled and yellow  4   18.75  −14.75  217.56    11.60
Wrinkled and green   12  6.25   5.75    33.06     5.29

Χ² = 8.41 + 8.67 + 11.60 + 5.29 = 33.97
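As a check on this arithmetic (a Python sketch; note that 100 × (1/16) = 6.25, which makes the exact sum Χ² ≈ 33.97):

```python
observed = [78, 6, 4, 12]  # round-yellow, round-green, wrinkled-yellow, wrinkled-green
ratios = [9, 3, 3, 1]      # expected 9:3:3:1 phenotypic ratio for unlinked genes
total = sum(observed)      # 100 peas

expected = [total * r / 16 for r in ratios]  # [56.25, 18.75, 18.75, 6.25]
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 2))  # 33.97
```

Either way, the statistic is far above the critical value of 7.82, so the conclusion below is unchanged.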

Since there are four groups (round and yellow, round and green, wrinkled and yellow, wrinkled and green), there are three degrees of freedom .

For a test of significance at α = .05 and df = 3, the Χ 2 critical value is 7.82.

Χ² = 33.97

Critical value = 7.82

The Χ 2 value is greater than the critical value .

The Χ² value is greater than the critical value, so we reject the null hypothesis that the offspring have an equal probability of inheriting all possible genotypic combinations. There is a significant difference between the observed and expected genotypic frequencies (p < .05).

The data supports the alternative hypothesis that the offspring do not have an equal probability of inheriting all possible genotypic combinations, which suggests that the genes are linked.

The two main chi-square tests are the chi-square goodness of fit test and the chi-square test of independence .

A chi-square distribution is a continuous probability distribution . The shape of a chi-square distribution depends on its degrees of freedom , k . The mean of a chi-square distribution is equal to its degrees of freedom ( k ) and the variance is 2 k . The range is 0 to ∞.

Cite this Scribbr article


Turney, S. (2023, June 22). Chi-Square Goodness of Fit Test | Formula, Guide & Examples. Scribbr. Retrieved August 29, 2024, from https://www.scribbr.com/statistics/chi-square-goodness-of-fit/


11.2 Goodness-of-Fit Test

In this type of hypothesis test, you determine whether the data "fit" a particular distribution or not. For example, you may suspect your unknown data fit a binomial distribution. You use a chi-square test (meaning the distribution for the hypothesis test is chi-square) to determine if there is a fit or not. The null and the alternative hypotheses for this test may be written in sentences or may be stated as equations or inequalities.

The test statistic for a goodness-of-fit test is:

\begin{equation*}\sum_{k}\dfrac{(O-E)^2}{E}\end{equation*}

where:

  • O = observed values (data)
  • E = expected values (from theory)
  • k = the number of different data cells or categories

The observed values are the data values and the expected values are the values you would expect to get if the null hypothesis were true. There are k terms of the form (O − E)²/E.

The number of degrees of freedom is df = (number of categories – 1).

The goodness-of-fit test is almost always right-tailed. If the observed values and the corresponding expected values are not close to each other, then the test statistic can get very large and will be way out in the right tail of the chi-square curve.

The expected value for each cell needs to be at least five in order for you to use this test.

Example 11.1

Absenteeism of college students from math classes is a major concern to math instructors because missing class appears to increase the drop rate. Suppose that a study was done to determine if the actual student absenteeism rate follows faculty perception. The faculty expected that a group of 100 students would miss class according to Table 11.1 .

Number of absences per term Expected number of students
0–2 50
3–5 30
6–8 12
9–11 6
12+ 2

A random survey across all mathematics courses was then done to determine the actual number (observed) of absences in a course. The chart in Table 11.2 displays the results of that survey.

Number of absences per term Actual number of students
0–2 35
3–5 40
6–8 20
9–11 1
12+ 4

Determine the null and alternative hypotheses needed to conduct a goodness-of-fit test.

H 0 : Student absenteeism fits faculty perception.

The alternative hypothesis is the opposite of the null hypothesis.

H a : Student absenteeism does not fit faculty perception.

a. Can you use the information as it appears in the charts to conduct the goodness-of-fit test?

a. No. Notice that the expected number of absences for the "12+" entry is less than five (it is two). Combine that group with the "9–11" group to create new tables where the number of students for each entry are at least five. The new results are in Table 11.3 and Table 11.4 .

Number of absences per term Expected number of students
0–2 50
3–5 30
6–8 12
9+ 8
Number of absences per term Actual number of students
0–2 35
3–5 40
6–8 20
9+ 5

b. What is the number of degrees of freedom ( df )?

b. There are four "cells" or categories in each of the new tables.

df = number of cells – 1 = 4 – 1 = 3

Try It 11.1

A factory manager needs to understand how many products are defective versus how many are produced. The number of expected defects is listed in Table 11.5 .

Number produced Number defective
0–100 5
101–200 6
201–300 7
301–400 8
401–500 10

A random sample was taken to determine the actual number of defects. Table 11.6 shows the results of the survey.

Number produced Number defective
0–100 5
101–200 7
201–300 8
301–400 9
401–500 11

State the null and alternative hypotheses needed to conduct a goodness-of-fit test, and state the degrees of freedom.

Example 11.2

Employers want to know which days of the week employees are absent in a five-day work week. Most employers would like to believe that employees are absent equally during the week. Suppose a random sample of 60 managers were asked on which day of the week they had the highest number of employee absences. The results were distributed as in Table 11.7 . For the population of employees, do the days for the highest number of absences occur with equal frequencies during a five-day work week? Test at a 5% significance level.

Monday Tuesday Wednesday Thursday Friday
Number of Absences 15 12 9 9 15

The null and alternative hypotheses are:

  • H 0 : The absent days occur with equal frequencies, that is, they fit a uniform distribution.
  • H a : The absent days occur with unequal frequencies, that is, they do not fit a uniform distribution.

If the absent days occur with equal frequencies, then, out of 60 absent days (the total in the sample: 15 + 12 + 9 + 9 + 15 = 60), there would be 12 absences on Monday, 12 on Tuesday, 12 on Wednesday, 12 on Thursday, and 12 on Friday. These numbers are the expected ( E ) values. The values in the table are the observed ( O ) values or data.

This time, calculate the χ 2 test statistic by hand. Make a chart with the following headings and fill in the columns:

  • Expected (E) values (12, 12, 12, 12, 12)
  • Observed (O) values (15, 12, 9, 9, 15)
  • (O – E)²
  • (O – E)²/E

Now add (sum) the last column. The sum is three. This is the χ 2 test statistic.

To find the p -value, calculate P ( χ 2 > 3). This test is right-tailed. (Use a computer or calculator to find the p -value. You should get p -value = 0.5578.)

The degrees of freedom are the number of cells – 1 = 5 – 1 = 4.
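The same hand calculation in Python (a sketch; for even df the chi-square survival function has a closed form, and with df = 4 it is e^(−x/2)(1 + x/2)):

```python
import math

observed = [15, 12, 9, 9, 15]  # Monday through Friday
expected = [12] * 5            # 60 absences spread uniformly over 5 days

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Survival function of the chi-square distribution with df = 4.
p_value = math.exp(-chi_square / 2) * (1 + chi_square / 2)
print(chi_square, round(p_value, 4))  # 3.0 0.5578
```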

Using the TI-83, 83+, 84, 84+ Calculator

Press 2nd DISTR. Arrow down to χ²cdf. Press ENTER. Enter (3,10^99,4). Rounded to four decimal places, you should see 0.5578, which is the p-value.

Next, complete a graph like the following one with the proper labeling and shading. (You should shade the right tail.)

The decision is not to reject the null hypothesis.

Conclusion: At a 5% level of significance, from the sample data, there is not sufficient evidence to conclude that the absent days do not occur with equal frequencies.

TI-83+ and some TI-84 calculators do not have a special program for the test statistic for the goodness-of-fit test. The next example Example 11.3 has the calculator instructions. The newer TI-84 calculators have in STAT TESTS the test Chi2 GOF . To run the test, put the observed values (the data) into a first list and the expected values (the values you expect if the null hypothesis is true) into a second list. Press STAT TESTS and Chi2 GOF . Enter the list names for the Observed list and the Expected list. Enter the degrees of freedom and press calculate or draw . Make sure you clear any lists before you start. To Clear Lists in the calculators: Go into STAT EDIT and arrow up to the list name area of the particular list. Press CLEAR and then arrow down. The list will be cleared. Alternatively, you can press STAT and press 4 (for ClrList ). Enter the list name and press ENTER .

Try It 11.2

Teachers want to know which night each week their students are doing most of their homework. Most teachers think that students do homework equally throughout the week. Suppose a random sample of 56 students were asked on which night of the week they did the most homework. The results were distributed as in Table 11.8 .

Sunday Monday Tuesday Wednesday Thursday Friday Saturday
Number of Students 11 8 10 7 10 5 5

From the population of students, do the nights for the highest number of students doing the majority of their homework occur with equal frequencies during a week? What type of hypothesis test should you use?

Example 11.3

One study indicates that the number of streaming services that American families have is distributed (this is the given distribution for the American population) as in Table 11.9 .

Number of Streaming Services Percent
0 10
1 16
2 55
3 11
4+ 8

The table contains expected ( E ) percents.

A random sample of 600 families in the far western United States resulted in the data in Table 11.10 .

Number of Streaming Services Frequency
0 66
1 119
2 340
3 60
4+ 15
Total = 600

The table contains observed ( O ) frequency values.

At the 1% significance level, does it appear that the distribution "number of streaming services" of far western United States families is different from the distribution for the American population as a whole?

This problem asks you to test whether the far western United States families distribution fits the distribution of the American families. This test is always right-tailed.

The first table contains expected percentages. To get expected ( E ) frequencies, multiply the percentage by 600. The expected frequencies are shown in Table 11.11 .

Number of Streaming Services Percent Expected Frequency
0 10 (0.10)(600) = 60
1 16 (0.16)(600) = 96
2 55 (0.55)(600) = 330
3 11 (0.11)(600) = 66
4+ 8 (0.08)(600) = 48

Therefore, the expected frequencies are 60, 96, 330, 66, and 48. In the TI calculators, you can let the calculator do the math. For example, instead of 60, enter 0.10*600.

H 0 : The "number of streaming services" distribution of far western United States families is the same as the "number of streaming services" distribution of the American population.

H a : The "number of streaming services" distribution of far western United States families is different from the "number of streaming services" distribution of the American population.

Distribution for the test: χ²₄, where df = (the number of cells) – 1 = 5 – 1 = 4.

Note that df is based on the number of cells, not the sample size: df ≠ 600 – 1.

Calculate the test statistic: χ 2 = 29.65

Probability statement: p -value = P ( χ 2 > 29.65) = 0.000006

Compare α and the p-value:

  • α = 0.01
  • p-value = 0.000006

So, α > p-value.

Make a decision: Since α > p-value, reject H0.

This means you reject the belief that the distribution for the far western states is the same as that of the American population as a whole.

Conclusion: At the 1% significance level, from the data, there is sufficient evidence to conclude that the "number of streaming services" distribution for the far western United States is different from the "number of streaming services" distribution for the American population as a whole.
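The whole example can be verified with a short Python sketch (again using the df = 4 closed-form survival function e^(−x/2)(1 + x/2)):

```python
import math

observed = [66, 119, 340, 60, 15]
percents = [0.10, 0.16, 0.55, 0.11, 0.08]
n = 600

expected = [p * n for p in percents]  # 60, 96, 330, 66, 48
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Survival function of the chi-square distribution with df = 4.
p_value = math.exp(-chi_square / 2) * (1 + chi_square / 2)
print(round(chi_square, 2))  # 29.65
print(p_value < 0.01)        # True: reject the null at the 1% level
```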

Press STAT and ENTER . Make sure to clear lists L1 , L2 , and L3 if they have data in them (see the note at the end of Example 11.2 ). Into L1 , put the observed frequencies 66 , 119 , 340 , 60 , 15 . Into L2 , put the expected frequencies .10*600, .16*600 , .55*600 , .11*600 , .08*600 . Arrow over to list L3 and up to the name area "L3" . Enter (L1-L2)^2/L2 and ENTER . Press 2nd QUIT . Press 2nd LIST and arrow over to MATH . Press 5 . You should see "sum" (Enter L3) . Rounded to 2 decimal places, you should see 29.65 . Press 2nd DISTR . Press 7 or Arrow down to 7:χ2cdf and press ENTER . Enter (29.65,1E99,4) . Rounded to four places, you should see 5.77E-6 = .000006 (rounded to six decimal places), which is the p-value.

The newer TI-84 calculators have in STAT TESTS the test Chi2 GOF . To run the test, put the observed values (the data) into a first list and the expected values (the values you expect if the null hypothesis is true) into a second list. Press STAT TESTS and Chi2 GOF . Enter the list names for the Observed list and the Expected list. Enter the degrees of freedom and press calculate or draw . Make sure you clear any lists before you start.

Try It 11.3

The expected percentage of the number of pets students have in their homes is distributed (this is the given distribution for the student population of the United States) as in Table 11.12 .

Number of Pets Percent
0 18
1 25
2 30
3 18
4+ 9

A random sample of 1,000 students from the Eastern United States resulted in the data in Table 11.13 .

Number of Pets Frequency
0 210
1 240
2 320
3 140
4+ 90

At the 1% significance level, does it appear that the distribution “number of pets” of students in the Eastern United States is different from the distribution for the United States student population as a whole? What is the p -value?

Example 11.4

Suppose you flip two coins 100 times. The results are 20 HH , 27 HT , 30 TH , and 23 TT . Are the coins fair? Test at a 5% significance level.

This problem can be set up as a goodness-of-fit problem. The sample space for flipping two fair coins is { HH , HT , TH , TT }. Out of 100 flips, you would expect 25 HH , 25 HT , 25 TH , and 25 TT . This is the expected distribution. The question, "Are the coins fair?" is the same as saying, "Does the distribution of the coins (20 HH , 27 HT , 30 TH , 23 TT ) fit the expected distribution?"

Random Variable: Let X = the number of heads in one flip of the two coins. X takes on the values 0, 1, 2. (There are 0, 1, or 2 heads in the flip of two coins.) Therefore, the number of cells is three . Since X = the number of heads, the observed frequencies are 20 (for two heads), 57 (for one head), and 23 (for zero heads or both tails). The expected frequencies are 25 (for two heads), 50 (for one head), and 25 (for zero heads or both tails). This test is right-tailed.

H 0 : The coins are fair.

H a : The coins are not fair.

Distribution for the test: χ²₂, where df = 3 – 1 = 2.

Calculate the test statistic: χ 2 = 2.14

Probability statement: p -value = P ( χ 2 > 2.14) = 0.3430

  • α = 0.05
  • p-value = 0.3430

So, α < p-value.

Make a decision: Since α < p -value, do not reject H 0 .

Conclusion: There is insufficient evidence to conclude that the coins are not fair.
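The coin example follows the same pattern (a sketch; with df = 2 the survival function is simply e^(−x/2)):

```python
import math

observed = [23, 57, 20]  # 0 heads, 1 head, 2 heads in 100 flips of two coins
expected = [25, 50, 25]  # fair coins: probabilities 1/4, 1/2, 1/4

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Survival function of the chi-square distribution with df = 2.
p_value = math.exp(-chi_square / 2)
print(round(chi_square, 2), round(p_value, 4))  # 2.14 0.343
```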

Press STAT and ENTER. Make sure you clear lists L1, L2, and L3 if they have data in them. Into L1, put the observed frequencies 20, 57, 23. Into L2, put the expected frequencies 25, 50, 25. Arrow over to list L3 and up to the name area "L3". Enter (L1-L2)^2/L2 and ENTER. Press 2nd QUIT. Press 2nd LIST and arrow over to MATH. Press 5. You should see "sum". Enter L3. Rounded to two decimal places, you should see 2.14. Press 2nd DISTR. Arrow down to 7:χ2cdf (or press 7). Press ENTER. Enter (2.14,1E99,2). Rounded to four places, you should see .3430, which is the p-value.

Try It 11.4

Students in a social studies class hypothesize that the literacy rates across the world for every region are 82%. Table 11.14 shows the actual literacy rates across the world broken down by region. What are the test statistic and the degrees of freedom?

MDG Region Adult Literacy Rate (%)
Developed Regions 99.0
Commonwealth of Independent States 99.5
Northern Africa 67.3
Sub-Saharan Africa 62.5
Latin America and the Caribbean 91.0
Eastern Asia 93.8
Southern Asia 61.9
South-Eastern Asia 91.9
Western Asia 84.5
Oceania 66.4

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute OpenStax.

Access for free at https://openstax.org/books/introductory-statistics-2e/pages/1-introduction
  • Authors: Barbara Illowsky, Susan Dean
  • Publisher/website: OpenStax
  • Book title: Introductory Statistics 2e
  • Publication date: Dec 13, 2023
  • Location: Houston, Texas
  • Book URL: https://openstax.org/books/introductory-statistics-2e/pages/1-introduction
  • Section URL: https://openstax.org/books/introductory-statistics-2e/pages/11-2-goodness-of-fit-test

© Jul 18, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.

Statistics By Jim

Making statistics intuitive

Goodness of Fit: Definition & Tests

By Jim Frost

What is Goodness of Fit?

Goodness of fit evaluates how well observed data align with the expected values from a statistical model.


When diving into statistics, you'll often ask, "How well does my model fit the data?" A tight fit? Your model's excellent. A loose fit? Maybe reconsider that model. That's the essence of goodness of fit. More specifically:

  • A high goodness of fit indicates the observed values are close to the model’s expected values.
  • A low goodness of fit shows the observed values are relatively far from the expected values.

A model that fits the data well provides accurate predictions and deeper insights, while a poor fit can lead to misleading conclusions and predictions. Ensuring a good fit is crucial for reliable outcomes and informed actions.

A goodness of fit measure summarizes the size of the differences between the observed data and the model's expected values. A goodness of fit test determines whether those differences are statistically significant. Together, these measures and tests can guide you toward a model that better represents the data. The appropriate goodness of fit measure and test depend on the setting.

In this blog post, you’ll learn about the essence of goodness of fit in the crucial contexts of regression models and probability distributions. We’ll measure it in regression models and learn how to test sample data against distributions using goodness of fit tests.

Goodness of Fit in Regression Models

In regression models, understanding the goodness of fit is crucial for accurate predictions and meaningful insights. Here, we'll look at key metrics that measure how well a model aligns with the data.

A regression model fits the data well when the differences between the observed and predicted values are small and unbiased. Statisticians refer to these differences as residuals .


As the goodness of fit increases, the data points move closer to the model’s fitted line.

R-squared (R²)

R-squared is a goodness of fit statistic for linear regression models. It measures the percentage of the dependent variable variation the model explains using a convenient 0 – 100% scale.

R-squared evaluates the spread of the data around the fitted regression line. For a data set, higher R-squared values indicate smaller differences between the sample data and the fitted values.

Graph that illustrates a regression model with a low R-squared.

The model with the wider spread has an R-squared of 15% while the one with the narrower spread is 85%.

Think of R² as the percentage that explains the variation. Higher R²? Better fit.

  • High R²: Your model captures a lot of variation.
  • Low R²: The model doesn’t explain much of the variance.

Remember, it’s not the sole indicator. High R² doesn’t always mean a perfect model!

Learn more about How to Interpret R-squared and Independent vs Dependent Variables .
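As a quick sketch of the idea (with made-up data, not from this post), R² is simply one minus the ratio of the residual variation to the total variation in the dependent variable:

```python
import numpy as np

# Hypothetical data (not from this post), chosen to be nearly linear.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit a least-squares line y = slope*x + intercept.
slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept

ss_res = np.sum((y - predicted) ** 2)    # variation the model fails to explain
ss_tot = np.sum((y - np.mean(y)) ** 2)   # total variation in y
r_squared = 1 - ss_res / ss_tot          # close to 1 here: a tight fit
```

Because these points sit almost exactly on a line, R² lands near 1; spread the points out and it falls toward 0.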

Standard Error of the Regression (S)

This standard error of the regression is a goodness of fit measure that provides the typical size of the absolute difference between observed and predicted values. S uses the units of the dependent variable (DV).

  • Small S: Predictions are close to the data values.
  • Large S: Predictions deviate more.

Suppose your model uses body mass index (BMI) to predict the body fat percentage (the DV). Consequently, if your model’s S is 3.5, then you know that its predicted values typically differ from the observed body fat percentages by about 3.5 percentage points.

However, don’t view it in isolation. Compare it with the dependent variable’s units for context.

Learn more about the Standard Error of the Regression .

Akaike’s Information Criterion (AIC)

Akaike’s Information Criterion is a goodness of fit measure that statisticians designed to compare models and help you pick the best one. The AIC value isn’t meaningful itself, but you’re looking for the model with the lowest AIC.

  • Lower AIC: Your model is probably better (when comparing).
  • Adjusts for complexity: Simpler models are preferred when they fit well.

Learn why you want a simpler model, which statisticians refer to as a parsimonious model: What is a Parsimonious Model? Benefits & Selecting .

There are other indicators, like Adjusted R² and BIC. Each has its unique strength. But for a start, focus on these three.

Goodness of Fit for Probability Distributions

Sometimes, your statistical model is that your data follow a particular probability distribution, such as the normal , lognormal , Poisson , or some other distribution. You want to know if your sample’s distribution is consistent with the hypothesized distribution. Learn more about Probability Distributions .

Why does the distribution matter? Because many statistical tests and methods rest on distributional assumptions.

For instance, t-tests and ANOVA assume your data are normal. Conversely, you might expect a Poisson distribution if you’re analyzing the number of daily website visits. Capability analysis in the quality arena depends on knowing precisely which distribution your data follow.

Enter goodness of fit tests.

A goodness of fit test determines whether the differences between your sample data and the distribution are statistically significant. In this context, statistical significance indicates the model does not adequately fit the data. The test results can guide the analytical procedures you’ll use.

I’ll cover two of the many available goodness of fit tests. The Anderson-Darling test works for continuous data , and the chi-square goodness of fit test is for categorical and discrete data.

Anderson-Darling Test

The Anderson-Darling goodness of fit test compares continuous sample data to a particular probability distribution. Statisticians often use it for normality tests, but the Anderson-Darling Test can also assess other probability distributions, making it versatile in statistical analysis.

The hypotheses for the Anderson-Darling test are the following:

  • Null Hypothesis (H₀) : The data follow the specified distribution.
  • Alternative Hypothesis (H A ) : The data do not follow the distribution.

When the p-value is less than your significance level , reject the null hypothesis . Consequently, statistically significant results for a goodness of fit test suggest your data do not fit the chosen distribution, prompting further investigation or model adjustments.

Imagine you’re researching the body fat percentages of pre-teen girls, and you want to know if these percentages follow a normal distribution. You can download the CSV data file: body_fat .

After collecting body fat data from 92 girls, you perform the Anderson-Darling Test and obtain the following results.

Statistical results for the normality goodness of fit test.

Because the p-value is less than 0.05, reject the null hypothesis and conclude the sample data do not follow a normal distribution.

Learn how to identify the distribution of this bodyfat dataset using the Anderson-Darling goodness of fit test.
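As a rough sketch of how such a test can be run in Python (an assumption on tooling, since the post uses other software, and the simulated draws below merely stand in for the linked body fat data), scipy's `anderson` function reports the test statistic alongside critical values rather than a p-value:

```python
import numpy as np
from scipy import stats

# Simulated stand-in for the body fat sample: 92 draws from a normal
# distribution (hypothetical values; the real dataset is linked above).
rng = np.random.default_rng(0)
sample = rng.normal(loc=25.0, scale=5.0, size=92)

result = stats.anderson(sample, dist='norm')

# scipy reports the A-D statistic plus critical values at the
# 15%, 10%, 5%, 2.5%, and 1% significance levels (no p-value).
reject_at_5pct = result.statistic > result.critical_values[2]
```

You reject normality at a given level when the statistic exceeds that level's critical value.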

Chi-squared Goodness of Fit Test

The chi square goodness of fit test reveals if the proportions of a discrete or categorical variable follow a distribution with hypothesized proportions.

Statisticians often use the chi square goodness of fit test to evaluate if the proportions of categorical outcomes are all equal. Or the analyst can list the proportions to use in the test. Alternatively, this test can determine if the observed outcomes fit a discrete probability distribution, like the Poisson distribution.

This goodness of fit test does the following:

  • Calculates deviations: Uses the squared difference between observed and expected.
  • P-value < 0.05: Observed and expected frequencies don’t match.

Imagine you’re curious about dice fairness. You roll a six-sided die 600 times, expecting each face to come up 100 times if it’s fair.

The observed counts are 90, 110, 95, 105, 95, and 105 for sides 1 through 6. The observed values don’t exactly match the expected value of 100 for each die face. Let’s run the Chi-square goodness of fit test for these data to see if those differences are statistically significant.

Statistical results for the chi-squared goodness of fit test.

The p-value of 0.700 is greater than 0.05, so you fail to reject the null hypothesis . The observed frequencies don’t differ significantly from the expected frequencies. Your sample data do not support the claim that the die is unfair!
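For readers who want to reproduce the dice result, here is one way to run the same test in Python with scipy (an assumption on tooling; the post itself uses other software):

```python
from scipy import stats

# Observed counts for faces 1-6 from the 600 rolls above;
# a fair die implies 100 expected rolls per face.
observed = [90, 110, 95, 105, 95, 105]
expected = [100] * 6

chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
# chi2_stat is 3.0 and p_value is about 0.700, matching the result above
```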

To explore other examples of the chi square test in action, read the following:

  • Chi-Square Goodness of Fit Test: Uses & Example
  • How the Chi-Square Test of Independence Works

Goodness of fit tells the story of your data and its relationship with a model. It’s like a quality check. For regression, R², S, and AIC are great starters. For probability distributions, the Anderson-Darling and Chi-squared goodness of fit tests are go-tos. Dive in, fit that model, and let the data guide you!


Reader Interactions


October 23, 2023 at 11:02 am

Jim, I have a pricing curve model to estimate the curvature of per unit cost (decrease) as purchased quantity increases. It follows the power law Y=Ax^B.

In my related log-log linear regression, the average residual is $0.00, which makes sense because we kept the Y-intercept in the model. However, in the transformed model in natural units, the residuals no longer average $0.00. Why does that property not carry over to the Y=Ax^B form of regression?

As a side note, I have your book “Regression Analysis,” which I have read several times and learned quite a lot. I believe there are two similar errors in Chapter 13, not related to my question above.

On page 323, when transforming the fitted line in log units back to natural units, the coefficient A in Y=Ax^B should be the common antilog of 0.5758 or 3.7653. Similarly, on page 325, the coefficient A should be the common antilog of 1.879 or 75.6833. This can be visually checked for reasonableness by looking at the graph on page 325. If we look at the x-axis, say at x=1, it appears y should be slightly less than 100. If we evaluate the power expression Y=75.6833x^(-0.6383), the fitted value is 75.68, which seems to be what the graph predicts.

The relevant logarithmic identity is log(ab) = log(a) + log(b). The Y-intercept in the log-log linear model is necessarily in log units, not natural units.


October 23, 2023 at 2:29 pm

Those are good questions.

I’m not exactly sure what is happening in your model but here are my top two possibilities.

When you transform data, especially using non-linear transformations like logarithms, the relationship between the variables can change. In the log-log linear regression, the relationship is linear, and the residuals (differences between observed and predicted values) average out to $0.00. However, when you transform back to the natural units using an exponential function, the relationship becomes non-linear. This non-linearity can cause the residuals to no longer average out to $0.00.

When you re-express the log-log model to its natural units form, there might be some approximation or rounding errors. These errors can accumulate and affect the average of the residuals.

As for the output in the book, that was all calculated by statistical software that I trust (Minitab). I’ll have to look deeper into what is going on, but I trust the results.


Teach yourself statistics

Chi-Square Goodness of Fit Test

This lesson explains how to conduct a chi-square goodness of fit test . The test is applied when you have one categorical variable from a single population. It is used to determine whether sample data are consistent with a hypothesized distribution.

For example, suppose a company printed baseball cards. It claimed that 30% of its cards were rookies; 60% were veterans but not All-Stars; and 10% were veteran All-Stars. We could gather a random sample of baseball cards and use a chi-square goodness of fit test to see whether our sample distribution differed significantly from the distribution claimed by the company. The sample problem at the end of the lesson considers this example.

When to Use the Chi-Square Goodness of Fit Test

The chi-square goodness of fit test is appropriate when the following conditions are met:

  • The sampling method is simple random sampling .
  • The variable under study is categorical .
  • The expected value of the number of sample observations in each level of the variable is at least 5.

This approach consists of four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results.

State the Hypotheses

Every hypothesis test requires the analyst to state a null hypothesis (H o ) and an alternative hypothesis (H a ). The hypotheses are stated in such a way that they are mutually exclusive. That is, if one is true, the other must be false; and vice versa.

For a chi-square goodness of fit test, the hypotheses take the following form.

  • H o : The data are consistent with a specified distribution.
  • H a : The data are not consistent with a specified distribution.

Typically, the null hypothesis (H o ) specifies the proportion of observations at each level of the categorical variable. The alternative hypothesis (H a ) is that at least one of the specified proportions is not true.

Formulate an Analysis Plan

The analysis plan describes how to use sample data to accept or reject the null hypothesis. The plan should specify the following elements.

  • Significance level. Often, researchers choose significance levels equal to 0.01, 0.05, or 0.10; but any value between 0 and 1 can be used.
  • Test method. Use the chi-square goodness of fit test to determine whether observed sample frequencies differ significantly from expected frequencies specified in the null hypothesis. The chi-square goodness of fit test is described in the next section, and demonstrated in the sample problem at the end of this lesson.

Analyze Sample Data

Using sample data, find the degrees of freedom, expected frequency counts, test statistic, and the P-value associated with the test statistic.

  • Degrees of freedom. The degrees of freedom (DF) is equal to the number of levels (k) of the categorical variable minus 1: DF = k - 1.
  • Expected frequency counts. The expected frequency count for each level of the categorical variable is E i = n * p i , where n is the total sample size and p i is the hypothesized proportion of observations in level i.
  • Test statistic. The test statistic is a chi-square random variable (Χ 2 ) defined by the following equation: Χ 2 = Σ [ (O i - E i ) 2 / E i ], where O i is the observed frequency count for level i and E i is the expected frequency count for level i.

  • P-value. The P-value is the probability of observing a sample statistic as extreme as the test statistic. Since the test statistic is a chi-square, use the Chi-Square Distribution Calculator to assess the probability associated with the test statistic. Use the degrees of freedom computed above.

Interpret Results

If the sample findings are unlikely, given the null hypothesis, the researcher rejects the null hypothesis. Typically, this involves comparing the P-value to the significance level , and rejecting the null hypothesis when the P-value is less than the significance level.

Test Your Understanding

Acme Toy Company prints baseball cards. The company claims that 30% of the cards are rookies, 60% veterans but not All-Stars, and 10% are veteran All-Stars.

Suppose a random sample of 100 cards has 50 rookies, 45 veterans, and 5 All-Stars. Is this consistent with Acme's claim? Use a 0.05 level of significance.

The solution to this problem takes four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results. We work through those steps below:

  • Null hypothesis: The proportion of rookies, veterans, and All-Stars is 30%, 60% and 10%, respectively.
  • Alternative hypothesis: At least one of the proportions in the null hypothesis is false.
  • Formulate an analysis plan . For this analysis, the significance level is 0.05. Using sample data, we will conduct a chi-square goodness of fit test of the null hypothesis.

DF = k - 1 = 3 - 1 = 2
E i = n * p i
E 1 = 100 * 0.30 = 30
E 2 = 100 * 0.60 = 60
E 3 = 100 * 0.10 = 10
Χ 2 = Σ [ (O i - E i ) 2 / E i ]
Χ 2 = [ (50 - 30) 2 / 30 ] + [ (45 - 60) 2 / 60 ] + [ (5 - 10) 2 / 10 ]
Χ 2 = (400 / 30) + (225 / 60) + (25 / 10) = 13.33 + 3.75 + 2.50 = 19.58

where DF is the degrees of freedom, k is the number of levels of the categorical variable, n is the number of observations in the sample, E i is the expected frequency count for level i, O i is the observed frequency count for level i, and Χ 2 is the chi-square test statistic.

The P-value is the probability that a chi-square statistic having 2 degrees of freedom is more extreme (bigger) than 19.58. We use the Chi-Square Distribution Calculator to find P(Χ 2 > 19.58) = 0.00006.

  • Interpret results . Since the P-value (0.00006) is less than the significance level (0.05), we reject the null hypothesis.

Note: If you use this approach on an exam, you may also want to mention why this approach is appropriate. Specifically, the approach is appropriate because the sampling method was simple random sampling, the variable under study was categorical, and each level of the categorical variable had an expected frequency count of at least 5.
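The four-step calculation above can be reproduced in a few lines of Python (a sketch using scipy, which is not part of the original lesson):

```python
from scipy.stats import chi2

# Acme's claimed proportions and the sample of 100 cards.
observed = [50, 45, 5]                      # rookies, veterans, All-Stars
claimed = [0.30, 0.60, 0.10]
n = sum(observed)                           # 100 cards

expected = [n * p for p in claimed]         # [30.0, 60.0, 10.0]
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                      # k - 1 = 2
p_value = chi2.sf(chi_square, df)           # right-tail probability
# chi_square ≈ 19.58 and p_value ≈ 0.00006, so reject H0 at the 0.05 level
```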

Logo for Open Library Publishing Platform

Want to create or adapt books like this? Learn more about how Pressbooks supports open publishing practices.

10.4 The Goodness-of-Fit Test

Learning objectives.

  • Conduct and interpret [latex]\chi^2[/latex]-goodness-of-fit hypothesis tests.

Recall that a categorical (or qualitative) variable is a variable where the data can be grouped by specific categories.  Examples of categorical variables include eye colour, blood type, or brand of car.  A categorical variable is a random variable that takes on categories.  Suppose we want to determine whether the data from a categorical variable “fit” a particular distribution or not.  That is, for a categorical variable with a historical or assumed probability distribution, does a new sample from the population support the assumed probability distribution or does the sample indicate that there has been a change in the probability distribution?

The [latex]\chi^2[/latex]-goodness-of-fit test allows us to test whether the sample data from a categorical variable fit the pattern of expected probabilities for the variable.  In a [latex]\chi^2[/latex]-goodness-of-fit test, we are analyzing the distribution of the frequencies for one categorical variable.  This is a hypothesis test where the hypotheses state that the categorical variable does or does not follow an assumed probability distribution, and a [latex]\chi^2[/latex]-distribution is used to determine the p -value for the test.

Steps to Conduct a [latex]\chi^2[/latex]-Goodness-of-Fit Test

Suppose a categorical variable has [latex]k[/latex] possible outcomes (categories) with probabilities [latex]p_1, p_2,...,p_k[/latex].  Suppose [latex]n[/latex] independent observations are taken from this categorical variable.

[latex]\begin{eqnarray*} \\ H_0: & & p_1=p_{1_0}, p_2=p_{2_0}, \ldots, p_k=p_{k_0} \\ H_a: & &\mbox{at least one } p_i\neq p_{i_0} \\ \\ \end{eqnarray*}[/latex]

  • Collect the sample information for the test and identify the significance level [latex]\alpha[/latex].

[latex]\begin{eqnarray*}\chi^2 & = & \sum \frac{(\mbox{observed-expected})^2}{\mbox{expected}} \\ \\ df &= & k-1  \\ \\ \mbox{observed} & = & \mbox{observed frequency from the sample data} \\ \mbox{expected} & = & \mbox{expected frequency from assumed distribution} \\ k & = & \mbox{number of categories} \\ \\ \end{eqnarray*}[/latex]

  • The results of the sample data are significant.  There is sufficient evidence to conclude that the null hypothesis [latex]H_0[/latex] is an incorrect belief and that the alternative hypothesis [latex]H_a[/latex] is most likely correct.
  • The results of the sample data are not significant.  There is not sufficient evidence to conclude that the alternative hypothesis [latex]H_a[/latex] may be correct.
  • Write down a concluding sentence specific to the context of the question.
  • The null hypothesis is the claim that the categorical variable follows the assumed distribution.  That is, the probability [latex]p_i[/latex] of each possible outcome of the categorical variable equals a hypothesized probability [latex]p_{i_0}[/latex].
  • The alternative hypothesis is the claim that the categorical variable does not follow the assumed distribution.  That is, for at least one possible outcome of the categorical variable the probability [latex]p_i[/latex] does not equal the claimed probability [latex]p_{i_0}[/latex].
  • In order to use the [latex]\chi^2[/latex]-goodness-of-fit test, the expected frequency for each category must be at least 5.
  • The p -value for a [latex]\chi^2[/latex]-goodness-of-fit test is always the area in the right tail of the [latex]\chi^2[/latex]-distribution.  So, we use chisq.dist.rt to find the p -value for a [latex]\chi^2[/latex]-goodness-of-fit test.
  • Find the difference between the observed frequency (from the sample) and the expected frequency (from the null hypothesis).  The expected frequency equals [latex]n \times p_{i_0}[/latex] where [latex]n[/latex] is the sample size and [latex]p_{i_0}[/latex] is the assumed probability for the [latex]i[/latex]th outcome claimed in the null hypothesis.
  • Square the difference found in the previous step.
  • Divide the squared difference by the expected frequency.
  • Add up the values of [latex]\displaystyle{\frac{(\mbox{observed-expected})^2}{\mbox{expected}}}[/latex] for each of the outcomes.
  • We expect that there will be a discrepancy between the observed frequency and the expected frequency.  If this discrepancy is very large, the value of [latex]\chi^2[/latex] will be very large and result in a small p -value.

Absenteeism of college students from math classes is a major concern to math instructors because missing class appears to increase the drop rate.  Suppose that a study was done to determine if the actual student absenteeism rate follows faculty perception.  The faculty believe that the distribution of the number of absences per term is as follows:

Number of absences  Percent of students
0–2 50%
3–5 30%
6–8 12%
9–11 6%
12+ 2%

At the end of the semester, a random survey of 300 students across all mathematics courses was taken and the actual (observed) number of absences for the 300 students is recorded.

Number of absences  Observed frequency
0–2 120
3–5 100
6–8 55
9–11 15
12+ 10

At the 5% significance level, determine if the number of absences per term follow the distribution assumed by the faculty.

Let [latex]p_1[/latex] be the probability a student has 0-2 absences,  [latex]p_2[/latex] be the probability a student has 3-5 absences, [latex]p_3[/latex] be the probability a student has 6-8 absences, [latex]p_4[/latex] be the probability a student has 9-11 absences, and [latex]p_5[/latex] be the probability a student has 12 or more absences.

Hypotheses:

[latex]\begin{eqnarray*} H_0: & & p_1=50\%, p_2=30\%, p_3=12\%, p_4=6\%, p_5=2\%  \\ H_a: & & \mbox{at least one of the } p_i's \mbox{ does not equal its stated probability} \end{eqnarray*}[/latex]

From the question, we have [latex]n=300[/latex] and [latex]k=5[/latex].  Now we need to calculate out the [latex]\chi^2[/latex]-score for the test.

The observed frequency for each category is the number of observations in the sample that fall into that category.  This is the information provided in the sample above.

Next, we must calculate out the expected frequencies.  The expected frequency is the number of observations we would expect to see in the sample, assuming the null hypothesis is true.  To calculate the expected frequency for each category, we multiply the sample size [latex]n[/latex] by the probability associated with that category claimed in the null hypothesis.

Number of absences  Observed frequency  Expected frequency
0-2 120 0.5[latex]\times[/latex]300=150
3-5 100 0.3[latex]\times[/latex]300=90
6-8 55 0.12[latex]\times[/latex]300=36
9-11 15 0.06[latex]\times[/latex]300=18
12+ 10 0.02[latex]\times[/latex]300=6

To calculate the [latex]\chi^2[/latex]-score, we work out the quantity [latex]\displaystyle{\frac{(\mbox{observed-expected})^2}{\mbox{expected}}}[/latex] for each category and then add up these quantities.

[latex]\begin{eqnarray*} \chi^2 & = & \sum \frac{(\mbox{observed-expected})^2}{\mbox{expected}} \\ & = & \frac{(120-150)^2}{150}+\frac{(100-90)^2}{90}+\frac{(55-36)^2}{36}+\frac{(15-18)^2}{18}+\frac{(10-6)^2}{6} \\ & = & 20.305... \end{eqnarray*}[/latex]

The degrees of freedom for the [latex]\chi^2[/latex]-distribution is [latex]df=k-1=5-1=4[/latex].  The [latex]\chi^2[/latex]-goodness-of-fit test is a right tailed test, so we use the chisq.dist.rt function to find the p -value:

This is a chi square distribution. Along the horizontal axis the point chi square is labeled. The area in the right tail to the right of chi square is shaded and labeled with p-value.

chisq.dist.rt(20.305…, 4) = 0.0004

So the p -value[latex]=0.0004[/latex].

Conclusion:

Because p -value[latex]=0.0004 \lt 0.05=\alpha[/latex], we reject the null hypothesis in favour of the alternative hypothesis.  At the 5% significance level there is enough evidence to suggest that the number of absences per term does not follow the distribution assumed by faculty.

  • The null hypothesis is the claim that the percent of students that fall into each category is as stated.  That is, 50% of students miss between 0 and 2 classes, 30% of the students miss between 3 and 5 classes, etc.
  • The alternative hypothesis is the claim that at least one of the percent of students that fall into each category is not as stated.  The alternative hypothesis does not say that every [latex]p_i[/latex] does not equal its stated probabilities, only that one of them does not equal its stated probability.
  • Keep all of the decimals throughout the calculation (i.e. in the calculation of the [latex]\chi^2[/latex]-score) to avoid any round-off error in the calculation of the p -value.  This ensures that we get the most accurate value for the p -value.  You can use Excel to calculate the expected frequencies and the [latex]\chi^2[/latex]-score.
  • The function is chisq.dist.rt because we are finding the area in the right tail of a [latex]\chi^2[/latex]-distribution.
  • Field 1 is the value of [latex]\chi^2[/latex].
  • Field 2 is the value of the degrees of freedom [latex]df[/latex].
  • The p -value of 0.0004 is a small probability compared to the significance level, and so is unlikely to happen assuming the null hypothesis is true.  This suggests that the assumption that the null hypothesis is true is most likely incorrect, and so the conclusion of the test is to reject the null hypothesis in favour of the alternative hypothesis.  In other words, student absenteeism does not fit faculty perception.
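Excel's chisq.dist.rt has a direct equivalent in Python's scipy (an assumption on tooling, since the text uses Excel): `chi2.sf` returns the same right-tail area. The whole calculation for this example can be sketched as:

```python
from scipy.stats import chi2

# Observed counts and the expected counts computed above
# (n = 300 times the faculty's assumed percentages).
observed = [120, 100, 55, 15, 10]
expected = [150, 90, 36, 18, 6]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
p_value = chi2.sf(chi_square, df=4)  # right-tail area, like chisq.dist.rt
# chi_square ≈ 20.305 and p_value ≈ 0.0004, matching the text
```

Keeping the calculation in code also avoids the round-off error mentioned above, since no intermediate values are truncated.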

Employers want to know which days of the week employees have the highest number of absences in a five-day work week.  Most employers would like to believe that employees are absent equally during the week.  Suppose a random sample of 60 managers are asked on which day of the week they had the highest number of employee absences.  The results are recorded in the table below.  At the 5% significance level, test if the day of the week with the highest number of absences occurs with equal frequency during a five-day work week.

Day  Observed frequency
Monday 15
Tuesday 11
Wednesday 10
Thursday 9
Friday 15

Let [latex]p_1[/latex] be the probability the highest number of absences occurs on Monday,  [latex]p_2[/latex] be probability the highest number of absences occurs on Tuesday, [latex]p_3[/latex] be the probability the highest number of absences occurs on Wednesday, [latex]p_4[/latex] be the probability the highest number of absences occurs on Thursday, and [latex]p_5[/latex] be the probability the highest number of absences occurs on Friday.

If the day of the week with the highest number of absences occurs with equal frequency, then the probability that any day has the highest number of absences is the same as any other day.  Because there are 5 days (categories), if the frequencies are equal then each day would have a probability of 20% [latex]\left(\mbox{or }\frac{1}{5}\right)[/latex].

[latex]\begin{eqnarray*} H_0: & & p_1=p_2=p_3=p_4=p_5=20\%  \\ H_a: & & \mbox{at least one of the } p_i \neq 20\%  \end{eqnarray*}[/latex]

From the question, we have [latex]n=60[/latex] and [latex]k=5[/latex].  Now we need to calculate out the [latex]\chi^2[/latex]-score for the test.

Day  Observed frequency  Expected frequency
Monday 15 0.2[latex]\times[/latex]60=12
Tuesday 11 0.2[latex]\times[/latex]60=12
Wednesday 10 0.2[latex]\times[/latex]60=12
Thursday 9 0.2[latex]\times[/latex]60=12
Friday 15 0.2[latex]\times[/latex]60=12

[latex]\begin{eqnarray*} \chi^2 & = & \sum \frac{(\mbox{observed-expected})^2}{\mbox{expected}} \\ & = & \frac{(15-12)^2}{12}+\frac{(11-12)^2}{12}+\frac{(10-12)^2}{12}+\frac{(9-12)^2}{12}+\frac{(15-12)^2}{12} \\ & = & 2.666... \end{eqnarray*}[/latex]

This is a chi square distribution. Along the horizontal axis the point chi square is labeled. The area in the right tail to the right of chi square is shaded and labeled with p-value.

chisq.dist.rt(2.666…, 4) = 0.6151

So the p -value[latex]=0.6151[/latex].

Because p -value[latex]=0.6151 \gt 0.05=\alpha[/latex], we do not reject the null hypothesis.  At the 5% significance level there is not enough evidence to suggest that the day of the week with the highest number of absences does not occur with equal frequency during a five-day work week.

  • The null hypothesis is the claim that the probability each day of the week has the highest number of absences is 20%.
  • The alternative hypothesis is the claim that at least one of the probabilities is not 20%.  The alternative hypothesis does not say that every [latex]p_i[/latex] does not equal 20%, only that one of them does not equal 20%.
  • Keep all of the decimals throughout the calculation (i.e. in the calculation of the [latex]\chi^2[/latex]-score) to avoid any round-off error in the calculation of the p -value.  This ensures that we get the most accurate value for the p -value.
  • The p -value of 0.6151 is a large probability compared to the significance level, and so the observed frequencies are not unusual assuming the null hypothesis is true.  The sample provides no evidence against the null hypothesis, and so the conclusion of the test is to not reject the null hypothesis.
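As a side note on tooling (an assumption, since the text works in Excel), scipy's `chisquare` function defaults to equal expected frequencies when none are supplied, which matches this example's null hypothesis exactly:

```python
from scipy.stats import chisquare

# Observed counts Monday through Friday; when f_exp is omitted,
# chisquare assumes equal expected frequencies (12 each here),
# which is exactly this example's null hypothesis.
observed = [15, 11, 10, 9, 15]

chi2_stat, p_value = chisquare(observed)
# chi2_stat ≈ 2.667 and p_value ≈ 0.6151, matching the text
```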

Teachers want to know which night each week their students are doing most of their homework.  Most teachers think that students do homework equally throughout the week.  Suppose a random sample of 49 students are asked on which night of the week they did the most homework.  The results are shown in the table below.  At the 5% significance level, are the nights that students do most of their homework equally distributed?

Night  Observed frequency
Sunday 11
Monday 8
Tuesday 10
Wednesday 7
Thursday 10
Friday 5
Saturday 5

Let [latex]p_1[/latex] be the probability students do their homework on Sunday,  [latex]p_2[/latex] be the probability students do their homework on Monday, [latex]p_3[/latex] be the probability students do their homework on Tuesday, [latex]p_4[/latex] be the probability students do their homework on Wednesday, [latex]p_5[/latex] be the probability students do their homework on Thursday, [latex]p_6[/latex] be the probability students do their homework on Friday, and [latex]p_7[/latex] be the probability students do their homework on Saturday.

[latex]\begin{eqnarray*} H_0: & & p_1=p_2=p_3=p_4=p_5=p_6=p_7=\frac{1}{7}  \\ H_a: & & \mbox{at least one of the } p_i \neq \frac{1}{7}  \end{eqnarray*}[/latex]

From the question, we have [latex]n=49[/latex] and [latex]k=7[/latex].

Night  Observed frequency  Expected frequency
Sunday 11 1/7[latex]\times[/latex]49=7
Monday 8 1/7[latex]\times[/latex]49=7
Tuesday 10 1/7[latex]\times[/latex]49=7
Wednesday 7 1/7[latex]\times[/latex]49=7
Thursday 10 1/7[latex]\times[/latex]49=7
Friday 5 1/7[latex]\times[/latex]49=7
Saturday 5 1/7[latex]\times[/latex]49=7

[latex]\begin{eqnarray*} \chi^2 & = & \sum \frac{(\mbox{observed-expected})^2}{\mbox{expected}} \\ & = & \frac{(11-7)^2}{7}+\frac{(8-7)^2}{7}+\frac{(10-7)^2}{7}+\frac{(7-7)^2}{7} \\ &  & +\frac{(10-7)^2}{7}+\frac{(5-7)^2}{7}+\frac{(5-7)^2}{7} \\ & = & 6.142... \end{eqnarray*}[/latex]

The degrees of freedom for the [latex]\chi^2[/latex]-distribution is [latex]df=k-1=7-1=6[/latex].

chisq.dist.rt(6.142…, 6) = 0.4074

So the p -value[latex]=0.4074[/latex].

Because p -value[latex]=0.4074 \gt 0.05=\alpha[/latex], we do not reject the null hypothesis.  At the 5% significance level there is not enough evidence to suggest that the nights students do most of their homework are not equally distributed.

One study indicates that the number of televisions that American families have is distributed as shown in this table:

Number of televisions  Percent of families
0 10%
1 16%
2 55%
3 11%
4 or more 8%

A researcher wants to determine if the number of televisions that families in the far western part of the U.S. have follows the same distribution as in the above study.  A random sample of 600 families in the far western U.S. is taken and the results are recorded in the following table:

Number of televisions  Observed frequency
0 66
1 119
2 340
3 60
4 or more 15

At the 1% significance level, does it appear that the distribution of the number of televisions for families in the far western U.S. is different from the distribution for the American population as a whole?

Let [latex]p_1[/latex] be the probability a family owns 0 televisions,  [latex]p_2[/latex] be the probability a family owns 1 television, [latex]p_3[/latex] be the probability a family owns 2 televisions, [latex]p_4[/latex] be the probability a family owns 3 televisions, and [latex]p_5[/latex] be the probability a family owns 4 or more televisions.

[latex]\begin{eqnarray*} H_0: & & p_1=10\%, p_2= 16\%, p_3= 55\%, p_4= 11\%, p_5=8\%  \\ H_a: & & \mbox{at least one of the } p_i's \mbox{ does not equal its stated probability} \end{eqnarray*}[/latex]

From the question, we have [latex]n=600[/latex] and [latex]k=5[/latex].

Number of Televisions Observed Frequency Expected Frequency
0 66 0.1[latex]\times[/latex]600=60
1 119 0.16[latex]\times[/latex]600=96
2 340 0.55[latex]\times[/latex]600=330
3 60 0.11[latex]\times[/latex]600=66
4 or more 15 0.08[latex]\times[/latex]600=48

[latex]\begin{eqnarray*} \chi^2 & = & \sum \frac{(\mbox{observed-expected})^2}{\mbox{expected}} \\ & = & \frac{(66-60)^2}{60}+\frac{(119-96)^2}{96}+\frac{(340-330)^2}{330}+\frac{(60-66)^2}{66}  +\frac{(15-48)^2}{48} \\ & = & 29.646... \end{eqnarray*}[/latex]

The degrees of freedom for the [latex]\chi^2[/latex]-distribution is [latex]df=k-1=5-1=4[/latex].

Using the Excel function chisq.dist.rt(29.646…, 4), we get 0.000006.

So the p -value[latex]=0.000006[/latex].

Because p -value[latex]=0.000006 \lt 0.01=\alpha[/latex], we reject the null hypothesis in favour of the alternative hypothesis.  At the 1% significance level there is enough evidence to suggest that the distribution of the number of televisions for families in the far western U.S. is different from the distribution for the American population as a whole.
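The same test can be run in one call with scipy.stats.chisquare (our addition; the text uses Excel):

```python
from scipy.stats import chisquare

observed = [66, 119, 340, 60, 15]
# Expected counts: the stated percentages times n = 600
expected = [0.10 * 600, 0.16 * 600, 0.55 * 600, 0.11 * 600, 0.08 * 600]

result = chisquare(f_obs=observed, f_exp=expected)
print(round(result.statistic, 3))  # 29.646; result.pvalue is about 0.000006
```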

The expected percentage of the number of pets students in the United States have in their homes is distributed as follows:

Number of Pets Percent of Students
0 18%
1 25%
2 30%
3 18%
4 or more 9%

A researcher wants to find out if the distribution of the number of pets students in Canada have is the same as the distribution shown in the U.S.  A random sample of 1,000 students from Canada is taken and the results are shown in the table below:

Number of Pets Observed Frequency
0 210
1 240
2 320
3 140
4 or more 90

At the 1% significance level, is the distribution of the number of pets students in Canada have different from the distribution for the United States?

Let [latex]p_1[/latex] be the probability a student owns 0 pets,  [latex]p_2[/latex] be the probability a student owns 1 pet, [latex]p_3[/latex] be the probability a student owns 2 pets, [latex]p_4[/latex] be the probability a student owns 3 pets, and [latex]p_5[/latex] be the probability a student owns 4 or more pets.

[latex]\begin{eqnarray*} H_0: & & p_1=18\%, p_2= 25\%, p_3= 30\%, p_4= 18\%, p_5=9\%  \\ H_a: & & \mbox{at least one of the } p_i's \mbox{ does not equal its stated probability} \end{eqnarray*}[/latex]

From the question, we have [latex]n=1000[/latex] and [latex]k=5[/latex].

Number of Pets Observed Frequency Expected Frequency
0 210 0.18[latex]\times[/latex]1000=180
1 240 0.25[latex]\times[/latex]1000=250
2 320 0.30[latex]\times[/latex]1000=300
3 140 0.18[latex]\times[/latex]1000=180
4 or more 90 0.09[latex]\times[/latex]1000=90

[latex]\begin{eqnarray*} \chi^2 & = & \sum \frac{(\mbox{observed-expected})^2}{\mbox{expected}} \\ & = & \frac{(210-180)^2}{180}+\frac{(240-250)^2}{250}+\frac{(320-300)^2}{300}+\frac{(140-180)^2}{180}  +\frac{(90-90)^2}{90} \\ & = & 15.622... \end{eqnarray*}[/latex]

The degrees of freedom for the [latex]\chi^2[/latex]-distribution is [latex]df=k-1=5-1=4[/latex].

Using the Excel function chisq.dist.rt(15.622…, 4), we get 0.0036.

So the p -value[latex]=0.0036[/latex].

Because p -value[latex]=0.0036 \lt 0.01=\alpha[/latex], we reject the null hypothesis in favour of the alternative hypothesis.  At the 1% significance level there is enough evidence to suggest that the distribution of the number of pets students in Canada have is different from the distribution for the United States.

Watch this video: Pearson’s chi square test (goodness of fit) | Probability and Statistics | Khan Academy by Khan Academy [11:47]

Concept Review

The [latex]\chi^2[/latex]-goodness-of-fit test is used to determine if a categorical variable follows a hypothesized distribution.  The goodness-of-fit test is a well established process:

  • Write down the null and alternative hypotheses.  The null hypothesis is the claim that the categorical variable follows the hypothesized distribution and the alternative hypothesis is the claim that the categorical variable does not follow the hypothesized distribution.
  • Collect the sample information for the test and identify the significance level.
  • The p -value is the area in the right tail of the [latex]\chi^2[/latex]-distribution where [latex]\displaystyle{\chi^2  =  \sum \frac{(\mbox{observed-expected})^2}{\mbox{expected}}}[/latex] and [latex]df=k-1[/latex].
  • Compare the  p -value to the significance level and state the outcome of the test.
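The four steps above can be collected into a small helper (a sketch; the function name and signature are ours, not from the text):

```python
from scipy.stats import chi2

def goodness_of_fit(observed, probabilities, alpha):
    """Chi-square goodness-of-fit test against hypothesized category probabilities."""
    n = sum(observed)
    expected = [p * n for p in probabilities]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1              # df = k - 1
    p_value = chi2.sf(stat, df)         # right-tail area
    decision = "reject H0" if p_value < alpha else "do not reject H0"
    return stat, p_value, decision

# Television example from above: n = 600, 1% significance level
stat, p, decision = goodness_of_fit([66, 119, 340, 60, 15],
                                    [0.10, 0.16, 0.55, 0.11, 0.08], alpha=0.01)
print(round(stat, 3), decision)  # 29.646 reject H0
```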

Attribution

“ 11.2   Goodness-of-Fit Test “ in Introductory Statistics by OpenStax  is licensed under a  Creative Commons Attribution 4.0 International License.

Introduction to Statistics Copyright © 2022 by Valerie Watts is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.


What is Goodness-of-Fit? A Comprehensive Guide

Goodness-of-fit evaluates the accuracy of a statistical model by assessing its ability to represent observed data. By conducting goodness-of-fit tests, practitioners can determine whether a model’s assumptions hold true, enabling them to refine and improve the model for more accurate predictions and inferences.

What is Goodness-of-Fit?

Goodness-of-fit is a crucial concept in evaluating the performance of statistical models — it indicates the degree to which a statistical model aligns with a collection of observations.

Typically, goodness-of-fit encapsulates the differences between observed values and those expected under the model.

These measures can be applied in statistical hypothesis testing , for instance, to assess the normality of residuals, to determine if two samples originate from the same distributions, or to verify if the frequency of outcomes adheres to a specific distribution.

  • Goodness-of-fit evaluates a statistical model’s accuracy by assessing its ability to represent observed data.
  • The chi-square test compares observed and expected frequencies for categorical data models.
  • The Shapiro-Wilk test assesses normality by comparing a sample’s distribution with a normal one.
  • Test statistics and p-value are crucial for interpreting goodness-of-fit test results.



Types of Goodness-of-Fit Tests

Several goodness-of-fit tests exist, including the Chi-square test, the Kolmogorov-Smirnov test, the Anderson-Darling test, and the Shapiro-Wilk test. Each test serves different purposes and is designed to assess various types of models and data. Therefore, carefully selecting the appropriate test for a specific scenario is essential.

Chi-Square Test:  This test compares observed and expected frequencies for categorical data models and assesses the independence or association between two categorical variables. Significant Chi-square statistics indicate that the null hypothesis of independence should be rejected.

Kolmogorov-Smirnov Test:  This non-parametric test compares continuous or discrete data’s cumulative distribution functions (CDFs), either between a sample and a reference distribution or between two samples. It is more appropriate for larger sample sizes rather than smaller ones.

Lilliefors Test:  This test is an adaptation of the Kolmogorov-Smirnov test for small samples with unknown population parameters, specifically for testing normality and exponentiality.

Anderson-Darling Test:  This test compares a sample’s CDF with a reference CDF and is especially sensitive to deviations in tails. It is suitable for data with extreme values or heavy-tailed distributions.

Cramér-von Mises Test:  This test compares observed and theoretical CDFs and is less sensitive to tail deviations than the Anderson-Darling test.

Shapiro-Wilk Test:  This test assesses normality by comparing a sample’s distribution with a normal distribution and is particularly effective for small sample sizes.

Pearson’s Chi-Square Test for Count Data:  This test compares observed and expected count data frequencies based on specified probability distributions, such as Poisson or negative binomial distributions. It is primarily used for testing the goodness-of-fit of a given distribution.

Jarque-Bera Test:  This test examines the skewness and kurtosis of a dataset to determine deviation from a normal distribution, testing normality.

Hosmer-Lemeshow Test:  This test is used in logistic regression to compare observed and expected event frequencies by dividing data into groups and assessing the model’s goodness-of-fit.
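Most of the tests listed above are available in scipy.stats. A brief sketch with simulated data (our addition, illustrative only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=200)

# Shapiro-Wilk: tests normality directly on the sample
sw_stat, sw_p = stats.shapiro(sample)

# Kolmogorov-Smirnov: compares the sample's empirical CDF to a reference CDF
ks_stat, ks_p = stats.kstest(sample, "norm")

# Anderson-Darling: returns a statistic and critical values instead of a p-value
ad = stats.anderson(sample, dist="norm")

# Chi-square goodness of fit for categorical counts (default: equal expected counts)
chi_stat, chi_p = stats.chisquare([18, 22, 20, 20, 20])
```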

Applications of Goodness-of-Fit Tests

Goodness-of-fit tests have diverse applications across various industries and research fields. Some examples include:

Healthcare:  Assessing the appropriateness of models predicting disease prevalence, patient survival rates, or treatment effectiveness.  Example : Using the Hosmer-Lemeshow test to evaluate the performance of a logistic regression model predicting the likelihood of diabetes based on patient characteristics.

Finance:  Evaluating the accuracy of models forecasting stock prices, portfolio risk, or consumer credit risk.  Example : Applying the Anderson-Darling test to verify if the distribution of stock returns follows a specific theoretical distribution, such as the normal or Student’s t-distribution.

Marketing:  Examining the fit of models predicting consumer behavior, such as purchase decisions, customer churn, or response to marketing campaigns.  Example : Utilizing the Chi-square goodness-of-fit test to determine if a model accurately predicts the distribution of customers across different market segments.

Environmental Studies:  Assessing models predicting environmental phenomena like pollution levels, climate patterns, or species distribution.  Example : Employing the Kolmogorov-Smirnov test to compare observed and predicted rainfall patterns based on a climate model.

Interpreting Goodness-of-Fit Test Results

Interpreting the results of goodness-of-fit tests is a crucial step in the analysis process. Here, we outline the general approach to interpreting test results and provide insights into decision-making based on the outcomes.

Test statistic and p-value:  Goodness-of-fit tests typically provide a test statistic and a p-value. The test statistic measures the discrepancy between the observed data and the model or distribution under consideration. The p-value helps assess the significance of this discrepancy. For example, a lower p-value (usually below a predetermined threshold, such as 0.05) suggests that the observed differences are unlikely due to chance alone, indicating a poor model fit.

Null and alternative hypotheses:  Goodness-of-fit tests are based on null and alternative hypotheses. The null hypothesis (H0) typically states no significant difference between the expected values and the observed data based on the model. The alternative hypothesis (H1) contends that there is a significant difference. If the p-value is below the chosen threshold, we reject the null hypothesis (H0) in favor of the alternative hypothesis (H1), suggesting that the model does not adequately represent the data.

Conclusion and Best Practices

Goodness-of-fit is critical to evaluating statistical models’ performance, ensuring accurate predictions and inferences. Various goodness-of-fit tests, such as the Chi-square, Kolmogorov-Smirnov, and Anderson-Darling, cater to different data types and models. By understanding and applying the appropriate test for a specific scenario, practitioners can effectively assess the adequacy of their models and refine them as needed. Interpreting test results, particularly the test statistic and p-value, is crucial for making informed decisions about a model’s suitability. Ultimately, applying and interpreting goodness-of-fit tests contribute to more accurate and reliable models, benefiting research and decision-making across diverse fields and industries.


Frequently Asked Questions (FAQs)

Goodness-of-fit evaluates the accuracy of a statistical model by assessing its ability to represent observed data.

The Chi-square test compares observed and expected frequencies for categorical data models.

The Kolmogorov-Smirnov test is a non-parametric method that compares cumulative distribution functions; it is better suited to larger sample sizes.

The Anderson-Darling test is sensitive to tail deviations. It is helpful for data with extreme values or heavy-tailed distributions.

The Shapiro-Wilk test assesses normality by comparing a sample’s distribution with a normal distribution.

The Hosmer-Lemeshow test is used in logistic regression to assess model goodness-of-fit.

Goodness-of-fit tests have applications in healthcare, finance, marketing, and environmental studies.

Test statistics and p-value are crucial for interpreting goodness-of-fit test results and determining the model’s adequacy.

Proper application and interpretation of goodness-of-fit tests lead to more accurate and reliable models, benefiting research and decision-making.


Chi-Square Goodness of Fit Test

  • Ph.D., Mathematics, Purdue University
  • M.S., Mathematics, Purdue University
  • B.A., Mathematics, Physics, and Chemistry, Anderson University

The chi-square goodness of fit test is a variation of the more general chi-square test. The setting for this test is a single categorical variable that can have many levels. Often in this situation, we will have a theoretical model in mind for a categorical variable. Through this model we expect certain proportions of the population to fall into each of these levels. A goodness of fit test determines how well the expected proportions in our theoretical model match reality.

Null and Alternative Hypotheses

The null and alternative hypotheses for a goodness of fit test look different than some of our other hypothesis tests. One reason for this is that a chi-square goodness of fit test is a nonparametric method . This means that our test does not concern a single population parameter. Thus the null hypothesis does not state that a single parameter takes on a certain value.

We start with a categorical variable with n levels and let p i be the proportion of the population at level i . Our theoretical model has values of q i for each of the proportions. The statement of the null and alternative hypotheses are as follows:

  • H 0 : p 1 = q 1 , p 2 = q 2 , . . . p n = q n
  • H a : For at least one i , p i is not equal to q i .

Actual and Expected Counts

The calculation of a chi-square statistic involves a comparison between actual counts of variables from the data in our simple random sample and the expected counts of these variables. The actual counts come directly from our sample. The way that the expected counts are calculated depends upon the particular chi-square test that we are using.

For a goodness of fit test, we have a theoretical model for how our data should be proportioned. We simply multiply these proportions by the sample size n to obtain our expected counts.

Computing Test Statistic

The chi-square statistic for goodness of fit test is determined by comparing the actual and expected counts for each level of our categorical variable. The steps to computing the chi-square statistic for a goodness of fit test are as follows:

  • For each level, subtract the observed count from the expected count.
  • Square each of these differences.
  • Divide each of these squared differences by the corresponding expected value.
  • Add all of the numbers from the previous step together. This is our chi-square statistic.

If our theoretical model matches the observed data perfectly, then the expected counts will show no deviation whatsoever from the observed counts of our variable. This will mean that we will have a chi-square statistic of zero. In any other situation, the chi-square statistic will be a positive number.
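The steps translate directly into code (a sketch; the function name is ours):

```python
def chi_square_statistic(observed, expected):
    """Compute the chi-square goodness-of-fit statistic step by step."""
    total = 0.0
    for o, e in zip(observed, expected):
        diff = o - e       # step 1: difference between observed and expected
        sq = diff ** 2     # step 2: square the difference
        total += sq / e    # steps 3 and 4: divide by expected, then sum
    return total

# A perfect fit gives a statistic of exactly zero
print(chi_square_statistic([25, 25, 25, 25], [25, 25, 25, 25]))  # 0.0
```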

Degrees of Freedom

The number of degrees of freedom requires no difficult calculations. All that we need to do is subtract one from the number of levels of our categorical variable. This number will inform us on which of the infinite chi-square distributions we should use.

Chi-square Table and P-Value

The chi-square statistic that we calculated corresponds to a particular location on a chi-square distribution with the appropriate number of degrees of freedom. The p-value determines the probability of obtaining a test statistic this extreme, assuming that the null hypothesis is true. We can use a table of values for a chi-square distribution to determine the p-value of our hypothesis test. If we have statistical software available, then this can be used to obtain a better estimate of the p-value.

Decision Rule

We make our decision on whether to reject the null hypothesis based upon a predetermined level of significance. If our p-value is less than or equal to this level of significance, then we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis.


Module 11: The Chi Square Distribution

Goodness-of-Fit Test

Learning Outcomes

  • Conduct and interpret chi-square goodness-of-fit hypothesis tests

In this type of hypothesis test, you determine whether the data “fit” a particular distribution or not. For example, you may suspect your unknown data fit a binomial distribution. You use a chi-square test (meaning the distribution for the hypothesis test is chi-square) to determine if there is a fit or not. The null and the alternative hypotheses for this test may be written in sentences or may be stated as equations or inequalities.

The test statistic for a goodness-of-fit test is: [latex]\displaystyle{\sum_{k}}\frac{{({O}-{E})}^{{2}}}{{E}}[/latex]

  • O = observed values (data)
  • E = expected values (from theory)
  • k = the number of different data cells or categories

The observed values are the data values and the expected values are the values you would expect to get if the null hypothesis were true. There are k terms of the form [latex]\displaystyle\frac{{({O}-{E})}^{{2}}}{{E}}[/latex].

The number of degrees of freedom is  df = (number of categories – 1).

The goodness-of-fit test is almost always right-tailed. If the observed values and the corresponding expected values are not close to each other, then the test statistic can get very large and will be way out in the right tail of the chi-square curve.

Note:  The expected value for each cell needs to be at least five in order for you to use this test.

Absenteeism of college students from math classes is a major concern to math instructors because missing class appears to increase the drop rate. Suppose that a study was done to determine if the actual student absenteeism rate follows faculty perception. The faculty expected that a group of 100 students would miss class according to this table.

Number of absences per term Expected number of students
0–2 50
3–5 30
6–8 12
9–11 6
12+ 2

A random survey across all mathematics courses was then done to determine the actual number  (observed) of absences in a course. The chart in this table displays the results of that survey.

Number of absences per term Actual number of students
0–2 35
3–5 40
6–8 20
9–11 1
12+ 4

Determine the null and alternative hypotheses needed to conduct a goodness-of-fit test.

H 0 : Student absenteeism fits faculty perception.

The alternative hypothesis is the opposite of the null hypothesis.

H a : Student absenteeism does not fit faculty perception.

  • Can you use the information as it appears in the charts to conduct the goodness-of-fit test?
  • What is the number of degrees of freedom ( df )?

  • No. The expected counts for the “9–11” and “12+” cells are each below five, so those cells must be combined into a single “9+” cell, giving the tables below.
Number of absences per term Expected number of students
0–2 50
3–5 30
6–8 12
9+ 8
Number of absences per term Actual number of students
0–2 35
3–5 40
6–8 20
9+ 5
  • There are four “cells” or categories in each of the new tables. df = number of cells – 1 = 4 – 1 = 3

A factory manager needs to understand how many products are defective versus how many are produced. The number of expected defects is listed in the table.

Number produced Number defective
0–100 5
101–200 6
201–300 7
301–400 8
401–500 10

A random sample was taken to determine the actual number of defects. This table shows the results of the survey.

Number produced Number defective
0–100 5
101–200 7
201–300 8
301–400 9
401–500 11

State the null and alternative hypotheses needed to conduct a goodness-of-fit test, and state the degrees of freedom.

H 0 : The number of defects fits expectations.

H a : The number of defects does not fit expectations.

The degrees of freedom are df = number of cells – 1 = 5 – 1 = 4.
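Carrying the example through (our addition; the text asks only for the hypotheses and degrees of freedom). The observed and expected totals differ here (40 vs. 36), so we compute the statistic directly rather than with scipy.stats.chisquare, which expects the totals to agree:

```python
from scipy.stats import chi2

observed = [5, 7, 8, 9, 11]   # actual defect counts
expected = [5, 6, 7, 8, 10]   # expected defect counts

df = len(observed) - 1        # df = 5 - 1 = 4
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
p_value = chi2.sf(stat, df)   # right-tail p-value

print(df, round(stat, 2), round(p_value, 2))  # 4 0.53 0.97
```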

Employers want to know which days of the week employees are absent in a five-day work week. Most employers would like to believe that employees are absent equally during the week. Suppose a random sample of 60 managers were asked on which day of the week they had the highest number of employee absences. The results were distributed as in the table below. For the population of employees, do the days for the highest number of absences occur with equal frequencies during a five-day work week? Test at a 5% significance level.

Day of the Week Employees were Most Absent

Monday Tuesday Wednesday Thursday Friday
Number of Absences 15 12 9 9 15

The null and alternative hypotheses are:

  • H 0 : The absent days occur with equal frequencies, that is, they fit a uniform distribution.
  • H a : The absent days occur with unequal frequencies, that is, they do not fit a uniform distribution.

If the absent days occur with equal frequencies, then, out of 60 absent days (the total in the sample: 15 + 12 + 9 + 9 + 15 = 60), there would be 12 absences on Monday, 12 on Tuesday, 12 on Wednesday, 12 on Thursday, and 12 on Friday. These numbers are the expected ( E ) values. The values in the table are the observed ( O ) values or data.

This time, calculate the  χ 2 test statistic by hand. Make a chart with the following headings and fill in the columns:

  • Expected ( E ) values (12, 12, 12, 12, 12)
  • Observed ( O ) values (15, 12, 9, 9, 15)
  • ( O – E ) 2
  • [latex]\displaystyle\frac{{({O}-{E})}^{{2}}}{{E}}[/latex]

Now add (sum) the last column. The sum is three. This is the  χ 2 test statistic.

To find the  p -value, calculate P ( χ 2 > 3). This test is right-tailed. (Use a computer or calculator to find the p -value. You should get p -value = 0.5578.)

The degrees of freedom are df = number of cells – 1 = 5 – 1 = 4.
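The hand calculation can be verified with scipy.stats.chisquare, which defaults to equal expected frequencies (our addition; the text itself uses a calculator):

```python
from scipy.stats import chisquare

# Observed absences Monday through Friday; with no f_exp given,
# chisquare uses equal expected counts (60 / 5 = 12 per day)
result = chisquare([15, 12, 9, 9, 15])

print(result.statistic, round(result.pvalue, 4))  # 3.0 0.5578
```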

Press  2nd DISTR . Arrow down to  χ2cdf . Press ENTER . Enter (3,10^99,4) . Rounded to four decimal places, you should see 0.5578, which is the p-value.

Next, complete a graph like the following one with the proper labeling and shading. (You should shade the right tail.)

This is a blank nonsymmetrical chi-square curve for the test statistic of the days of the week absent.

The decision is not to reject the null hypothesis.

Conclusion: At a 5% level of significance, from the sample data, there is not sufficient evidence to conclude that the absent days do not occur with equal frequencies.

TI-83+ and some TI-84 calculators do not have a special program for the test statistic for the goodness-of-fit test. The next example has the calculator instructions. The newer TI-84 calculators have in  STAT TESTS the test Chi2 GOF . To run the test, put the observed values (the data) into a first list and the expected values (the values you expect if the null hypothesis is true) into a second list. Press STAT TESTS  and Chi2 GOF . Enter the list names for the Observed list and the Expected list. Enter the degrees of freedom and press  calculate or draw . Make sure you clear any lists before you start. To Clear Lists in the calculators: Go into  STAT EDIT and arrow up to the list name area of the particular list. Press  CLEAR and then arrow down. The list will be cleared. Alternatively, you can press STAT and press 4 (for  ClrList ). Enter the list name and press ENTER .

Teachers want to know which night each week their students are doing most of their homework. Most teachers think that students do homework equally throughout the week. Suppose a random sample of 56 students were asked on which night of the week they did the most homework. The results were distributed as in the table.

Sunday Monday Tuesday Wednesday Thursday Friday Saturday
Number of Students 11 8 10 7 10 5 5

From the population of students, do the nights for the highest number of students doing the majority of their homework occur with equal frequencies during a week? What type of hypothesis test should you use?

p -value = 0.6093

We decline to reject the null hypothesis. There is not enough evidence to support that students do not do the majority of their homework equally throughout the week.
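As a check (our addition), scipy reproduces this p-value; with no f_exp argument, chisquare assumes equal expected frequencies (56/7 = 8 students per night):

```python
from scipy.stats import chisquare

# Observed counts Sunday through Saturday
stat, p = chisquare([11, 8, 10, 7, 10, 5, 5])

print(round(stat, 1), round(p, 4))  # 4.5 0.6093
```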

One study indicates that the number of televisions that American families have is distributed (this is the  given distribution for the American population) as in the table.

Number of Televisions Percent
0 10
1 16
2 55
3 11
4+ 8

The table contains expected ( E ) percents.

A random sample of 600 families in the far western United States resulted in the data in this table.

Number of Televisions Frequency
Total = 600
0 66
1 119
2 340
3 60
4+ 15

The table contains observed ( O ) frequency values.

At the 1% significance level, does it appear that the distribution “number of televisions” of far western United States families is different from the distribution for the American population as a whole?

This problem asks you to test whether the far western United States families distribution fits the distribution of the American families. This test is always right-tailed.

The first table contains expected percentages. To get expected ( E ) frequencies, multiply the percentage by 600. The expected frequencies are shown in this table.

Number of Televisions Percent Expected Frequency
0 10 (0.10)(600) = 60
1 16 (0.16)(600) = 96
2 55 (0.55)(600) = 330
3 11 (0.11)(600) = 66
4+ 8 (0.08)(600) = 48

Therefore, the expected frequencies are 60, 96, 330, 66, and 48. In the TI calculators, you can let the calculator do the math. For example, instead of 60, enter 0.10*600.

H 0 : The “number of televisions” distribution of far western United States families is the same as the “number of televisions” distribution of the American population.

H a : The “number of televisions” distribution of far western United States families is different from the “number of televisions” distribution of the American population.

Distribution for the test: [latex]\displaystyle\chi^{2}_{4}[/latex] where df = (the number of cells) – 1 = 5 – 1 = 4.

Note : [latex]df\neq600-1[/latex]

Calculate the test statistic: χ 2 = 29.65

This is a nonsymmetric chi-square curve with values of 0, 4, and 29.65 labeled on the horizontal axis. The value 4 coincides with the peak of the curve. A vertical upward line extends from 29.65 to the curve, and the region to the right of this line is shaded. The shaded area is equal to the p-value.

Probability statement: p -value = P ( χ 2 > 29.65) = 0.000006

Compare α and the p -value:

α = 0.01 p -value = 0.000006

So, α > p -value.

Make a decision: Since α > p -value, reject H o .

This means you reject the belief that the distribution for the far western states is the same as that of the American population as a whole.

Conclusion: At the 1% significance level, from the data, there is sufficient evidence to conclude that the “number of televisions” distribution for the far western United States is different from the “number of televisions” distribution for the American population as a whole.

Press STAT and ENTER . Make sure to clear lists L1 , L2 , and L3 if they have data in them (see the note at the end of Example 2). Into L1 , put the observed frequencies 66 , 119 , 340 , 60 , 15 . Into L2 , put the expected frequencies .10*600, .16*600 , .55*600 , .11*600 , .08*600 . Arrow over to list L3 and up to the name area L3 . Enter (L1-L2)^2/L2 and ENTER . Press 2nd QUIT . Press 2nd LIST and arrow over to MATH . Press 5 . You should see "sum" (Enter L3) . Rounded to 2 decimal places, you should see 29.65 . Press 2nd DISTR . Press 7 or Arrow down to 7:χ2cdf and press ENTER . Enter (29.65,1E99,4) . Rounded to four places, you should see 5.77E-6 = .000006  (rounded to six decimal places), which is the p-value.

The newer TI-84 calculators have in STAT TESTS the test Chi2 GOF . To run the test, put the observed values (the data) into a first list and the expected values (the values you expect if the null hypothesis is true) into a second list. Press STAT TESTS and Chi2 GOF . Enter the list names for the Observed list and the Expected list. Enter the degrees of freedom and press calculate or draw . Make sure you clear any lists before you start.

The expected percentage of the number of pets students have in their homes is distributed (this is the given distribution for the student population of the United States) as in this table.

Number of Pets Percent
0 18
1 25
2 30
3 18
4+ 9

A random sample of 1,000 students from the Eastern United States resulted in the data in the table below.

Number of Pets Frequency
0 210
1 240
2 320
3 140
4+ 90

At the 1% significance level, does it appear that the distribution “number of pets” of students in the Eastern United States is different from the distribution for the United States student population as a whole? What is the p -value?

Suppose you flip two coins 100 times. The results are 20 HH , 27 HT , 30 TH , and 23 TT . Are the coins fair? Test at a 5% significance level.

This problem can be set up as a goodness-of-fit problem. The sample space for flipping two fair coins is { HH , HT , TH , TT }. Out of 100 flips, you would expect 25 HH , 25 HT , 25 TH , and 25 TT . This is the expected distribution. The question, “Are the coins fair?” is the same as saying, “Does the distribution of the coins (20 HH , 27 HT , 30 TH , 23 TT ) fit the expected distribution?”

Random Variable: Let X = the number of heads in one flip of the two coins. X takes on the values 0, 1, 2. (There are 0, 1, or 2 heads in the flip of two coins.) Therefore, the number of cells is three . Since X = the number of heads, the observed frequencies are 20 (for two heads), 57 (for one head), and 23 (for zero heads or both tails). The expected frequencies are 25 (for two heads), 50 (for one head), and 25 (for zero heads or both tails). This test is right-tailed.

H 0 : The coins are fair.

H a : The coins are not fair.

Distribution for the test:  [latex]\chi^2_2[/latex] where df = 3 – 1 = 2.

Calculate the test statistic: χ 2 = 2.14

This is a nonsymmetrical chi-square curve with values of 0 and 2.14 labeled on the horizontal axis. A vertical upward line extends from 2.14 to the curve and the region to the right of this line is shaded. The shaded area is equal to the p-value.

Probability statement: p -value = P ( χ 2 > 2.14) = 0.3430

Compare α and the p -value: α = 0.05 and p -value = 0.3430, so α < p -value.

Make a decision: Since α < p -value, do not reject H 0 .

Conclusion: There is insufficient evidence to conclude that the coins are not fair.

Press STAT and ENTER . Make sure you clear lists L1 , L2 , and L3 if they have data in them. Into L1 , put the observed frequencies 20 , 57 , 23 . Into L2 , put the expected frequencies 25 , 50 , 25 . Arrow over to list L3 and up to the name area "L3" . Enter (L1-L2)^2/L2 and ENTER . Press 2nd QUIT . Press 2nd LIST and arrow over to MATH . Press 5 . You should see "sum" . Enter L3 . Rounded to two decimal places, you should see 2.14 . Press 2nd DISTR . Arrow down to 7:χ2cdf (or press 7 ). Press ENTER . Enter 2.14,1E99,2) . Rounded to four places, you should see .3430 , which is the p-value.
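Because df = 2 here, the chi-square survival function reduces to exp(-χ2/2), so the coin example can be verified in a few lines of Python (a sketch, not part of the original text):

```python
import math

# Two-coin example: observed vs. expected frequencies for 2, 1, 0 heads.
observed = [20, 57, 23]
expected = [25, 50, 25]

stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
p_value = math.exp(-stat / 2)  # chi-square survival function when df = 2

print(f"{stat:.2f}")     # 2.14
print(f"{p_value:.4f}")  # 0.3430
```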

Students in a social studies class hypothesize that the literacy rates across the world for every region are 82%. This table shows the actual literacy rates across the world broken down by region. What are the test statistic and the degrees of freedom?

MDG Region Adult Literacy Rate (%)
Developed Regions 99.0
Commonwealth of Independent States 99.5
Northern Africa 67.3
Sub-Saharan Africa 62.5
Latin America and the Caribbean 91.0
Eastern Asia 93.8
Southern Asia 61.9
South-Eastern Asia 91.9
Western Asia 84.5
Oceania 66.4

χ2 test statistic = 26.38

This is a nonsymmetric chi-square curve with df = 9. The values 0, 9, and 26.38 are labeled on the horizontal axis. The value 9 coincides with the peak of the curve. A vertical upward line extends from 26.38 to the curve, and the region to the right of this line is shaded. The shaded area is equal to the p-value.

Press STAT and ENTER . Make sure you clear lists L1, L2, and L3 if they have data in them. Into L1, put the observed frequencies 99, 99.5, 67.3, 62.5, 91, 93.8, 61.9, 91.9, 84.5, 66.4 . Into L2 , put the expected frequencies 82, 82, 82, 82, 82, 82, 82, 82, 82, 82 . Arrow over to list L3 and up to the name area L3 . Enter (L1-L2)^2/L2 and ENTER . Press 2nd QUIT . Press 2nd LIST and arrow over to MATH . Press 5 . You should see "sum" . Enter L3 . Rounded to two decimal places, you should see 26.38 . Press 2nd DISTR . Arrow down to 7:χ2cdf (or press 7 ). Press ENTER . Enter 26.38,1E99,9) . Rounded to four places, you should see .0018 , which is the p -value.
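The test statistic can be checked in a couple of lines of Python (a sketch; the p-value ≈ 0.0018 then comes from χ2cdf with df = 9, as on the calculator):

```python
# Literacy-rate example: compute the goodness-of-fit statistic directly.
rates = [99.0, 99.5, 67.3, 62.5, 91.0, 93.8, 61.9, 91.9, 84.5, 66.4]
expected = [82.0] * len(rates)   # hypothesized rate for every region

stat = sum((o - e) ** 2 / e for o, e in zip(rates, expected))
df = len(rates) - 1

print(f"{stat:.2f}", df)   # 26.38 9
```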

  • OpenStax College, Introductory Statistics. Located at : . License : CC BY: Attribution
  • Introductory Statistics . Authored by : Barbara Illowski, Susan Dean. Provided by : Open Stax. Located at : http://cnx.org/contents/[email protected] . License : CC BY: Attribution . License Terms : Download for free at http://cnx.org/contents/[email protected]
  • Pearson's chi square test (goodness of fit) | Probability and Statistics | Khan Academy . Authored by : Khan Academy. Located at : https://www.youtube.com/embed/2QeDRsxSF9M . License : All Rights Reserved . License Terms : Standard YouTube License

Chi-Square Goodness of Fit Test

The chi-square goodness of fit test is a non-parametric test used to determine whether the observed values of a given phenomenon differ significantly from the expected values. In the chi-square goodness of fit test, the term goodness of fit refers to comparing the observed sample distribution with the expected probability distribution. The test determines how well a theoretical distribution (such as the normal, binomial, or Poisson) fits the empirical distribution. The sample data are divided into intervals, and the number of points that fall into each interval is compared with the expected number of points in that interval.

Procedure for Chi-Square Goodness of Fit Test:

  • Set up the hypothesis:

A. Null hypothesis : The null hypothesis assumes that there is no significant difference between the observed and the expected value.

B. Alternative hypothesis : The alternative hypothesis assumes that there is a significant difference between the observed and the expected value.

  • Compute the value of Chi-Square goodness of fit test using the following formula:

\(\chi^2=\sum\dfrac{(O-E)^2}{E}\), where \(O\) is the observed frequency and \(E\) is the expected frequency.

Degrees of freedom: The degrees of freedom depend on the distribution of the sample. The following table shows the distributions and their associated degrees of freedom:

Type of distribution No. of constraints Degrees of freedom
Binomial distribution 1 n - 1
Poisson distribution 2 n - 2
Normal distribution 3 n - 3

Hypothesis testing: Hypothesis testing proceeds as in other tests, such as the t-test and ANOVA. The calculated value of the chi-square goodness of fit statistic is compared with the table value. If the calculated value is greater than the table value, we reject the null hypothesis and conclude that there is a significant difference between the observed and the expected frequencies. If the calculated value is less than the table value, we fail to reject the null hypothesis and conclude that the data show no significant difference between the observed and expected values.
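The procedure above can be sketched in Python. The observed and expected frequencies here are hypothetical; the critical value 7.815 is the standard chi-square table cutoff for α = 0.05 with df = 3:

```python
# Decision-rule sketch with hypothetical frequencies. The critical value
# 7.815 is the chi-square table value for alpha = 0.05, df = 3.
observed = [30, 14, 34, 45]   # hypothetical observed frequencies
expected = [20, 20, 30, 53]   # hypothetical expected frequencies

stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
critical = 7.815

if stat > critical:
    decision = "reject H0: a significant difference between observed and expected"
else:
    decision = "fail to reject H0: no significant difference detected"

print(f"{stat:.3f}: {decision}")
```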




2.11 - The Lack of Fit F-test

Investigating New Accounts Data

We're almost there! We just need to determine an objective way of deciding when too much of the error in our prediction is due to a lack of model fit. That's where the lack of fit F -test comes into play. Let's return to the first checking account example, ( New Accounts data ):

Jumping ahead to the punchline, here's Minitab's output for the lack of fit F -test for this data set:

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Regression 1 5141 5141 3.14 0.110
Residual Error 9 14742 1638    
Lack of Fit 4 13594 3398 14.80 0.006
Pure Error 5 1148 230    
Total 10 19883      

1 row with no replicates

As you can see, the lack of fit output appears as a portion of the analysis of variance table. In the Sum of Squares (" SS ") column, we see — as we previously calculated — that SSLF = 13594 and SSPE = 1148 sum to SSE = 14742. We also see in the Degrees of Freedom (" DF ") column that — since there are n = 11 data points and c = 6 distinct x values (75, 100, 125, 150, 175, and 200) — the lack of fit degrees of freedom c - 2 = 4 and the pure error degrees of freedom n - c = 5 sum to the error degrees of freedom n - 2 = 9.

Just as is done for the sums of squares in the basic analysis of variance table, the lack of fit sum of squares and the error sum of squares are used to calculate "mean squares." They are even calculated similarly, namely by dividing the sum of squares by their associated degrees of freedom. Here are the formal definitions of the mean squares:

In the Mean Squares (" MS ") column, we see that the lack of fit mean square MSLF is 13594 divided by 4, or 3398. The pure error mean square MSPE is 1148 divided by 5, or 230:

Source DF Adj SS Adj MS F-Value P-Value
Regression 1 5141 5141 3.14 0.110
Residual Error 9 14742 1638    
Lack of Fit 4 13594 3398 14.80 0.006
Pure Error 5 1148 230    
Total 10 19883      
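The mean-square arithmetic in this table can be checked directly with a few lines of Python (a sketch using only the printed sums of squares):

```python
# Check the mean-square arithmetic in the table above.
sslf, df_lf = 13594, 4   # lack of fit sum of squares and df
sspe, df_pe = 1148, 5    # pure error sum of squares and df

mslf = sslf / df_lf      # 3398.5, printed as 3398 in the table
mspe = sspe / df_pe      # 229.6, printed as 230
f_star = mslf / mspe

print(f"{f_star:.2f}")   # 14.80
```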

You might notice that the lack of fit F-statistic is calculated by dividing the lack of fit mean square (MSLF = 3398) by the pure error mean square (MSPE = 230) to get 14.80. How do we know that this F-statistic helps us in testing the hypotheses:

  • \(H_{0 }\): The relationship assumed in the model is reasonable, i.e., there is no lack of fit.
  • \(H_{A }\): The relationship assumed in the model is not reasonable, i.e., there is a lack of fit.

The answer lies in the " expected mean squares ." In our sample of n = 11 newly opened checking accounts, we obtained MSLF = 3398. If we had taken a different random sample of size n = 11, we would have obtained a different value for MSLF . Theory tells us that the average of all of the possible MSLF values we could obtain is:

\(E(MSLF) =\sigma^2+\dfrac{\sum n_i(\mu_i-(\beta_0+\beta_1X_i))^2}{c-2}\)

That is, we should expect MSLF , on average, to equal the above quantity — \(\sigma^{2}\) plus another messy-looking term. Think about that messy term. If the null hypothesis is true, i.e. , if the relationship between the predictor x and the response y is linear, then \(\mu_{i} = \beta_{0} + \beta_{1}X_{i}\) and the messy term becomes 0 and goes away. That is, if there is no lack of fit, we should expect the lack of fit mean square MSLF to equal \(\sigma^{2}\).

What should we expect MSPE to equal? Theory tells us it should, on average, always equal \(\sigma^{2}\):

\(E(MSPE) =\sigma^2\)

Aha — there we go! The logic behind the calculation of the F -statistic is now clear:

  • If there is a linear relationship between x and y , then \(\mu_{i} = \beta_{0} + \beta_{1}X_{i}\). That is, there is no lack of fit in the simple linear regression model. We would expect the ratio MSLF / MSPE to be close to 1.
  • If there is not a linear relationship between x and y , then \(\mu_{i} ≠ \beta_{0} + \beta_{1}X_{i}\). That is, there is a lack of fit in the simple linear regression model. We would expect the ratio MSLF / MSPE to be large, i.e. , a value greater than 1.

So, to conduct the lack of fit test, we calculate the value of the F -statistic:

\(F^*=\dfrac{MSLF}{MSPE}\)

and determine if it is large. To decide if it is large, we compare the F* -statistic to an F -distribution with c - 2 numerator degrees of freedom and n - c denominator degrees of freedom.

In summary

We follow standard hypothesis test procedures in conducting the lack of fit F -test. First, we specify the null and alternative hypotheses:

  • \(H_{0}\): The relationship assumed in the model is reasonable, i.e., there is no lack of fit in the model \(\mu_{i} = \beta_{0} + \beta_{1}X_{i}\).
  • \(H_{A}\): The relationship assumed in the model is not reasonable, i.e., there is lack of fit in the model \(\mu_{i} = \beta_{0} + \beta_{1}X_{i}\).

Second, we calculate the value of the F -statistic:

To do so, we complete the analysis of variance table using the following formulas.

Source DF SS MS F
Regression 1 \(SSR=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(\hat{y}_{ij}-\bar{y})^2\) \(MSR=\dfrac{SSR}{1}\) \(F=\dfrac{MSR}{MSE}\)
Residual Error n - 2 \(SSE=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(y_{ij}-\hat{y}_{ij})^2\) \(MSE=\dfrac{SSE}{n-2}\)  
Lack of Fit c - 2 \(SSLF=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(\bar{y}_{i}-\hat{y}_{ij})^2\) \(MSLF=\dfrac{SSLF}{c-2}\) \(F^*=\dfrac{MSLF}{MSPE}\)
Pure Error n - c \(SSPE=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_{i})^2\) \(MSPE=\dfrac{SSPE}{n-c}\)  
Total n - 1 \(SSTO=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(y_{ij}-\bar{y})^2\)    

In reality, we let statistical software, such as Minitab, determine the analysis of variance table for us.

Third, we use the resulting F* -statistic to calculate the P -value. As always, the P -value is the answer to the question "how likely is it that we’d get an F* -statistic as extreme as we did if the null hypothesis were true?" The P -value is determined by referring to an F- distribution with c - 2 numerator degrees of freedom and n - c denominator degrees of freedom.
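The decomposition described above can be sketched in stdlib-only Python. The data here are hypothetical replicates chosen to show obvious curvature (y grows roughly like x squared), so the lack of fit F-statistic comes out large:

```python
from collections import defaultdict

def lack_of_fit(x, y):
    """Fit a simple linear regression, then split SSE into SSLF + SSPE
    using the formulas from the table above (stdlib-only sketch)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar

    groups = defaultdict(list)         # replicates share the same x value
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    c = len(groups)

    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sspe = sum((yi - sum(ys) / len(ys)) ** 2
               for ys in groups.values() for yi in ys)
    sslf = sse - sspe                  # SSE decomposes as SSLF + SSPE
    f_star = (sslf / (c - 2)) / (sspe / (n - c))
    return sslf, sspe, c - 2, n - c, f_star

# Hypothetical replicated data with clear curvature:
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [1.0, 1.2, 3.9, 4.1, 9.0, 9.2, 15.8, 16.2]

sslf, sspe, df_lf, df_pe, f_star = lack_of_fit(x, y)
print(df_lf, df_pe, round(f_star, 1))  # F* is large, signalling lack of fit
```

With real data, the resulting F*-statistic would then be compared to an F-distribution with c - 2 and n - c degrees of freedom to obtain the P-value.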

Finally, we make a decision:

  • If the P -value is smaller than the significance level \(\alpha\), we reject the null hypothesis in favor of the alternative. We conclude that "there is sufficient evidence at the \(\alpha\) level to conclude that there is a lack of fit in the simple linear regression model ."
  • If the P -value is larger than the significance level \(\alpha\), we fail to reject the null hypothesis. We conclude "there is not enough evidence at the \(\alpha\) level to conclude that there is a lack of fit in the simple linear regression model ."

For our checking account example, we obtain an F* -statistic of 14.80 and a P -value of 0.006. The P -value is smaller than the significance level \(\alpha = 0.05\), so we reject the null hypothesis in favor of the alternative. There is sufficient evidence at the \(\alpha = 0.05\) level to conclude that there is a lack of fit in the simple linear regression model . In light of the scatterplot, the lack of fit test provides the answer we expected.

The lack of fit test

Fill in the missing numbers (??) in the following analysis of variance table resulting from a simple linear regression analysis.


Source DF Adj SS Adj MS F-Value P-Value
Regression ?? 12.597 ?? ?? 0.000
Residual Error ?? ?? ??    
Lack of Fit 3 ?? ?? ?? ??
Pure Error ?? 0.157 ??    
Total 14 15.522      


Why does rejecting the null in goodness-of-fit tests not imply accepting the null?

From All of Statistics by Wasserman:

Goodness-of-fit testing has some serious limitations. If reject $H_0$ then we conclude we should not use the model. But if we do not reject $H_0$ we cannot conclude that the model is correct. We may have failed to reject simply because the test did not have enough power. This is why it is better to use nonparametric methods whenever possible rather than relying on parametric assumptions.

My questions are:

Are goodness-of-fit tests parametric or nonparametric?

why "if we do not reject $H_0$ we cannot conclude that the model is correct"?

If it is because "We may have failed to reject simply because the test did not have enough power", why "it is better to use nonparametric methods whenever possible rather than relying on parametric assumptions"? Doesn't the same reason and conclusion apply to nonparametric methods?

Is it correct that "Goodness-of-fit testing" here mean testing if the distribution of a sample is a specific distribution?

Does the same conclusion "if we do not reject $H_0$ we cannot conclude that the model is correct" also apply to testing if the distribution of a sample and the distribution of another sample are the same, such as z test for two normally distributed groups of samples, and Kolmogorov-Smirnov two-sample test?

Thanks and regards!

  • hypothesis-testing


  • 2 $\begingroup$ " Are goodness-of-fit tests parametric or nonparametric? " --- you already asked it... why ask it again? $\endgroup$ –  Glen_b Commented May 28, 2013 at 3:52
  • $\begingroup$ I wondered if you'd maybe miscopied the text, but that "If reject $H_0$" error is in the original. $\endgroup$ –  Glen_b Commented May 28, 2013 at 4:03
  • $\begingroup$ @Glen_b: Thanks! (1) To figure the second part, I think answering the first part will help. But I knew that I had asked it, so I linked to it. (2) The quote is what it is in the book, on p169. Is there something wrong? $\endgroup$ –  Tim Commented May 28, 2013 at 10:49
  • $\begingroup$ " The quote is what it is in the book " - yes, I know, I checked; that was what I was saying. Someone reading it is likely - as I did - to think the sentence is mistyped, since it appears not to be correct English. I was confirming that it is exactly as in the original. $\endgroup$ –  Glen_b Commented May 28, 2013 at 10:58
  • $\begingroup$ @Glen_b: Thanks for your confirmation! I was wondering What you think about "it is better to use nonparametric methods whenever possible rather than relying on parametric assumptions"? Doesn't the same reason and conclusion applied to goodness-of-fitness tests also apply to nonparametric methods? $\endgroup$ –  Tim Commented May 28, 2013 at 11:07

Question 3: that depends on the goodness of fit test. So it is always a good idea to read up on the specific goodness of fit test you want to apply to figure out what exactly the null hypothesis is that is being tested.

Question 2: To understand this you need to see that a goodness of fit test is just like any other statistical test, and understand exactly what the logic is behind statistical tests. The outcome of a statistical test is a $p$-value, which is the probability of finding data that deviates from $H_0$ at least as much as the data you have observed when $H_0$ is true. So it is a thought experiment with the following steps:

  • Assume a population in which $H_0$ is true, that is, your model is correct in some specific sense depending on the goodness of fit test.
  • We draw many samples at random from this population, fit the model, and compute the goodness of fit test in each of these samples.
  • Since you have drawn samples at random, some of these samples will be "weird", i.e. deviate from $H_0$.
  • The $p$-value is the expected proportion of samples that are "as weird or weirder" than the data you have observed.

If you find data with a small $p$-value then that data is unlikely to have come from a population in which the $H_0$ is true, and the fact that you have observed that data is considered evidence against $H_0$. If the $p$-value is below some pre-defined but arbitrary cut off point $\alpha$ (common values are 5% or 1%), then we call it "significant" and reject the $H_0$.

Notice what the opposite, not-significant, means: we have not found enough information to reject $H_0$. This is a case of "absence of evidence", which is not the same thing as "evidence of absence". So, "not rejecting $H_0$" is not the same thing as "accepting $H_0$".

Another way to answer your question would be to ask: "could it be that the $H_0$ is true?" the answer is simply no. In a goodness of fit test, the $H_0$ is that the model is in some sense true. The definition of a model is that it is a simplification of reality and simplification is just another word for "wrong in some useful way". So models are by definition wrong, and thus the $H_0$ cannot be true.

This has consequences for the statement you quoted: "If reject $H_0$ then we conclude we should not use the model." This is incorrect: all that the significance of a goodness of fit test tells you is that your model is likely to be wrong, but you already knew that. The interesting question is whether it is so wrong that it is no longer useful. This is a judgement call. Statistical tests can help you in differentiating between patterns that could just be the result of the randomness that comes with sampling and "real" patterns. A significant result tells you that the latter is likely to be true, but that is not enough to conclude that the model is not a useful simplification of reality. You now need to investigate what exactly the deviation is, how large that deviation is, and what the consequences are for the performance of your model.


  • $\begingroup$ Thanks! (1) For question 3, I just added two specific tests, such as Z test for two groups of samples, and Kolmogorov-Smirnov two-sample test. What about them? (2) For question 2, why does the quote say that "it is better to use nonparametric methods whenever possible rather than relying on parametric assumptions"? Doesn't the same reason and conclusion applied to goodness-of-fitness tests also apply to nonparametric methods? $\endgroup$ –  Tim Commented May 28, 2013 at 11:06
  • 1 $\begingroup$ We never accept $H_0$. The two possible outcomes of a statistical test are accept or fail to accept $H_0$. $\endgroup$ –  Maarten Buis Commented May 28, 2013 at 11:18
  • 1 $\begingroup$ As to point (2): That would not be my quote. This quote is more a presentation of a preference of the author for a particular style of doing research. For your concrete application you would have to take into account that many so-called non-parametric test have very low power, so that would represent a real trade-off. $\endgroup$ –  Maarten Buis Commented May 28, 2013 at 11:27
  • $\begingroup$ I was wondering if "We never accept H0" and "The two possible outcomes of a statistical test are accept or fail to accept H0" are consistent with each other? Do they mean that we only fail to accept the null? $\endgroup$ –  Tim Commented May 28, 2013 at 11:30
  • $\begingroup$ Sorry, the last part should be: we either reject or fail to reject $H_0$. $\endgroup$ –  Maarten Buis Commented May 28, 2013 at 11:32




A comprehensive comparison of goodness-of-fit tests for logistic regression models

  • Original Paper
  • Published: 30 August 2024
  • Volume 34 , article number  175 , ( 2024 )


  • Huiling Liu 1 ,
  • Xinmin Li 2 ,
  • Feifei Chen 3 ,
  • Wolfgang Härdle 4 , 5 , 6 &
  • Hua Liang 7  

We introduce a projection-based test for assessing logistic regression models using the empirical residual marked empirical process and suggest a model-based bootstrap procedure to calculate critical values. We comprehensively compare this test and Stute and Zhu’s test with several commonly used goodness-of-fit (GoF) tests: the Hosmer–Lemeshow test, modified Hosmer–Lemeshow test, Osius–Rojek test, and Stukel test for logistic regression models in terms of type I error control and power performance in small ( \(n=50\) ), moderate ( \(n=100\) ), and large ( \(n=500\) ) sample sizes. We assess the power performance for two commonly encountered situations: nonlinear and interaction departures from the null hypothesis. All tests except the modified Hosmer–Lemeshow test and Osius–Rojek test have the correct size in all sample sizes. The power performance of the projection based test consistently outperforms its competitors. We apply these tests to analyze an AIDS dataset and a cancer dataset. For the former, all tests except the projection-based test do not reject a simple linear function in the logit, which has been illustrated to be deficient in the literature. For the latter dataset, the Hosmer–Lemeshow test, modified Hosmer–Lemeshow test, and Osius–Rojek test fail to detect the quadratic form in the logit, which was detected by the Stukel test, Stute and Zhu’s test, and the projection-based test.




Data availability

No datasets were generated or analysed during the current study.

Chen, K., Hu, I., Ying, Z.: Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs. Ann. Stat. 27 (4), 1155–1163 (1999)


Dardis, C.: LogisticDx: diagnostic tests and plots for logistic regression models. R package version 0.3 (2022)

Dikta, G., Kvesic, M., Schmidt, C.: Bootstrap approximations in model checks for binary data. J. Am. Stat. Assoc. 101 , 521–530 (2006)

Ekanem, I.A., Parkin, D.M.: Five year cancer incidence in Calabar, Nigeria (2009–2013). Cancer Epidemiol. 42 , 167–172 (2016)


Escanciano, J.C.: A consistent diagnostic test for regression models using projections. Economet. Theor. 22 , 1030–1051 (2006)

Härdle, W., Mammen, E., Müller, M.: Testing parametric versus semiparametric modeling in generalized linear models. J. Am. Stat. Assoc. 93 , 1461–1474 (1998)


Harrell, F.E.: rms: Regression modeling strategies. R package version 6.3-0 (2022)

Hosmer, D.W., Hjort, N.L.: Goodness-of-fit processes for logistic regression: simulation results. Stat. Med. 21 (18), 2723–2738 (2002)

Hosmer, D.W., Lemesbow, S.: Goodness of fit tests for the multiple logistic regression model. Commun Stat Theory Methods 9 , 1043–1069 (1980)

Hosmer, D.W., Hosmer, T., Le Cessie, S., Lemeshow, S.: A comparison of goodness-of-fit tests for the logistic regression model. Stat. Med. 16 (9), 965–980 (1997)

Hosmer, D., Lemeshow, S., Sturdivant, R.: Applied Logistic Regression. Wiley Series in Probability and Statistics, Wiley, New York (2013)


Jones, L.K.: On a conjecture of Huber concerning the convergence of projection pursuit regression. Ann. Stat. 15 , 880–882 (1987)

Kohl, M.: MKmisc: miscellaneous functions from M. Kohl. R package version, vol. 1, p. 8 (2021)

Kosorok, M.R.: Introduction to Empirical Processes and Semiparametric Inference, vol. 61. Springer, New York (2008)

Lee, S.-M., Tran, P.-L., Li, C.-S.: Goodness-of-fit tests for a logistic regression model with missing covariates. Stat. Methods Med. Res. 31 , 1031–1050 (2022)

Lindsey, J.K.: Applying Generalized Linear Models. Springer, Berlin (2000)

McCullagh, P., Nelder, J.A.: Generalized Linear Models, vol. 37. Chapman and Hall (1989)

Nelder, J.A., Wedderburn, R.W.M.: Generalized linear models. J. R. Stat. Soc. Ser. A 135 , 370–384 (1972)

Oguntunde, P.E., Adejumo, A.O., Okagbue, H.I.: Breast cancer patients in Nigeria: data exploration approach. Data Brief 15 , 47 (2017)

Osius, G., Rojek, D.: Normal goodness-of-fit tests for multinomial models with large degrees of freedom. J. Am. Stat. Assoc. 87 (420), 1145–1152 (1992)

Rady, E.-H.A., Abonazel, M.R., Metawe’e, M.H.: A comparison study of goodness of fit tests of logistic regression in R: simulation and application to breast cancer data. Appl. Math. Sci. 7 , 50–59 (2021)


Stukel, T.A.: Generalized logistic models. J. Am. Stat. Assoc. 83 (402), 426–431 (1988)

Stute, W., Zhu, L.-X.: Model checks for generalized linear models. Scand. J. Stat. Theory Appl. 29 , 535–545 (2002)

van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes. Springer (1996)

van Heel, M., Dikta, G., Braekers, R.: Bootstrap based goodness-of-fit tests for binary multivariate regression models. J. Korean Stat. Soc. 51 (1), 308–335 (2022)

Yin, C., Zhao, L., Wei, C.: Asymptotic normality and strong consistency of maximum quasi-likelihood estimates in generalized linear models. Sci. China Ser. A Math. 49 , 145–157 (2006)


Acknowledgements

Li’s research was partially supported by NNSFC grant 11871294. Härdle gratefully acknowledges support through the European Cooperation in Science & Technology COST Action grant CA19130 - Fintech and Artificial Intelligence in Finance - Towards a transparent financial industry; the project “IDA Institute of Digital Assets”, CF166/15.11.2022, contract number CN760046/ 23.05.2024 financed under the Romanias National Recovery and Resilience Plan, Apel nr. PNRR-III-C9-2022-I8; and the Marie Skłodowska-Curie Actions under the European Union’s Horizon Europe research and innovation program for the Industrial Doctoral Network on Digital Finance, acronym DIGITAL, Project No. 101119635

Author information

Authors and affiliations.

Department of Statistics, South China University of Technology, Guangzhou, China

Huiling Liu

School of Mathematics and Statistics, Qingdao University, Shandong, 266071, China

Center for Statistics and Data Science, Beijing Normal University, Zhuhai, 519087, China

Feifei Chen

BRC Blockchain Research Center, Humboldt-Universität zu Berlin, 10178, Berlin, Germany

Wolfgang Härdle

Dept Information Management and Finance, National Yang Ming Chiao Tung U, Hsinchu, Taiwan

IDA Institute Digital Assets, Bucharest University of Economic Studies, Bucharest, Romania

Department of Statistics, George Washington University, Washington, DC, 20052, USA


Contributions

LHL, LXM, and LH wrote the main manuscript text; LHL and CFF wrote the programs; HW commented on the methodological section. All authors reviewed the manuscript.

Corresponding author

Correspondence to Hua Liang.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Liu, H., Li, X., Chen, F. et al. A comprehensive comparison of goodness-of-fit tests for logistic regression models. Stat Comput 34, 175 (2024). https://doi.org/10.1007/s11222-024-10487-5


Received : 02 December 2023

Accepted : 19 August 2024

Published : 30 August 2024

DOI : https://doi.org/10.1007/s11222-024-10487-5


Keywords: Consistent test · Model based bootstrap (MBB) · Residual marked empirical process (RMEP)

Six Maxims of Statistical Acumen for Astronomical Data Analysis

The acquisition of complex astronomical data is accelerating, especially as newer telescopes produce ever-larger surveys. The increased quantity, complexity, and variety of astronomical data demand a parallel increase in skill and sophistication in developing, deciding, and deploying statistical methods. Understanding limitations and appreciating nuances in statistical and machine learning methods and the reasoning behind them is essential for improving data-analytic proficiency and acumen. Aiming to facilitate such improvement in astronomy, we delineate cautionary tales in statistics via six maxims, with examples drawn from the astronomical literature. Inspired by the significant quality improvement in business and manufacturing processes by the routine adoption of Six Sigma, we hope the routine reflection on these Six Maxims will improve the quality of both data analysis and scientific findings in astronomy.

1 Introduction

Although data science aims to address the myriad challenges arising from the entire life cycle of data (Wing, 2019), there are a number of unique, or at least unusual, characteristics of astronomical data that demarcate the statistical challenges in astronomy and affect our approach to analyzing astronomical data. First, astronomical observations are not obtained from designed experiments in the traditional sense. There are no experimental settings that the astrophysical researcher compares by controlling conditions, such as treatment versus placebo. Consequently, observations are not exactly repeatable, in the sense that observational conditions, instrumental properties, or the astrophysical phenomena themselves are changing over time. For instance, we cannot observe the same supernova explosion multiple times under different controlled conditions. Second, calibration (e.g., of instruments) is a crucial step of the observation process because it allows us to connect the observed signals to the underlying physics (see, e.g., Guainazzi et al., 2015). Unfortunately, calibration is never exact and thus adds uncertainty to the final analysis (e.g., Lee et al., 2011). Third, sparsity is inevitable even with big data, in the sense that researchers are always interested in the most distant objects and the faintest signals, which are studied using small subsets of the full data. For instance, new classes of astronomical phenomena are rarely first discovered as bright sources, but are often among the most interesting scientifically. Fourth, observed astronomical objects are at different stages of their life cycles and evolve on different time scales, all of which are much longer than we can observe. This characteristic can help us understand long time scales via a population study, but homogeneity and completeness of the data may become an issue. Finally, measurement error uncertainties are heteroscedastic and are often given as constants along with the data (Feigelson & Babu, 1998; Feigelson et al., 2021).

Taken together, these unusual characteristics can lead to challenges, especially when using off-the-shelf data-analytic tools to analyze astronomical data (Siemiginowska et al., 2019 ) . This is because underlying assumptions of standard statistical methods do not typically take account of these features. For example, even a simple linear regression model requires extra modeling assumptions to account for selection effects and measurement errors for astronomical data analysis (Kelly, 2007 ) . Thus, it is important to check these underlying assumptions on a case-by-case basis to ensure sound astronomical data analysis, especially when deploying methods that were developed outside of astronomy.

It is worth pointing out an important distinction in the jargon between the astronomical and statistical literature. A “model” in astrophysics refers to a parsimonious mathematical representation of the expected (or predicted) signal from a physical process that generates emission that is eventually detected via telescopes. This could be the blackbody energy spectrum, or the pulse profile of a pulsar, or the number density of a population of sources in a globular cluster projected onto the sky, etc. In contrast, a “model” in statistics is a stochastic representation of the data-generating process that accounts for discrepancies between the astrophysical model and the data. This stochastic representation is indexed in that it is specified up to a set of unknown model parameters that are fit to the data. It reflects systematic adjustments (including observational constraints and instrument effects), selection effects, stochastic components such as Poisson and Gaussian errors, and anything else that affects the distribution of the data. To take a simple example, the choice of the Poisson(g(θ)) model to represent photon counts forms a statistical model, whereas the astrophysical model stipulates the functional form of g(θ); e.g., g(θ) could be a power law in energy E, i.e., norm × E^(−α), with model parameters θ = {norm, α}. The physical model is designed to describe a physical process without necessarily representing the stochastic aspects of data generation that lead to uncertainty in parameter estimation. Uncertainty quantification, on the other hand, is at the heart of the statistical model, which aims to represent the data and its variability as fully as possible.
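To make the distinction concrete, here is a minimal Python sketch (all numbers are illustrative): the astrophysical model supplies the deterministic expected signal g(θ), while the statistical model wraps it in a Poisson distribution and thereby yields a likelihood for the observed counts.

```python
import numpy as np

rng = np.random.default_rng(42)

# Astrophysical model: the deterministic expected signal g(theta),
# here a power law in energy E with hypothetical parameters theta = {norm, alpha}.
def g(E, norm, alpha):
    return norm * E ** (-alpha)

# Statistical model: observed counts are Poisson-distributed around g(theta)
E = np.linspace(1.0, 10.0, 50)            # energy grid (illustrative units)
true_norm, true_alpha = 200.0, 1.5
counts = rng.poisson(g(E, true_norm, true_alpha))

# Poisson log-likelihood (dropping the additive log(n!) constant)
def log_likelihood(norm, alpha):
    mu = g(E, norm, alpha)
    return np.sum(counts * np.log(mu) - mu)
```

Note that swapping the Poisson distribution for a Gaussian would change the statistical model while leaving the astrophysical model g untouched.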

A statistical model also describes the hierarchical connections between the various processes that translate incoming photons to observed electronic signals (e.g., van Dyk et al., 2001 ) .

As an illustration of the particular difficulties in handling astronomical data, consider the estimation of the time delay between the multiple images of a strongly gravitationally lensed time-varying source. In estimating the time delay between gravitationally lensed light curves of Q0957+561 (Hainline et al., 2012), Tak et al. (2018b) adopt a damped random walk statistical model (also known as a continuous-time auto-regressive model of order one or an Ornstein-Uhlenbeck process) as a data generating process. This model reveals multiple modes in the posterior distribution of the time delay parameter, as illustrated in the top panel of Figure 1. The height of the mode near 400 days is much less than that of the mode near 1100 days. However, it turns out that the highest mode near 1100 days is spurious, caused by model misspecification. The modes near 1100 days disappear when the astronomical model additionally incorporates polynomial regression to account for the effect of microlensing (Tak et al., 2017b) that is known to be present in the data (Hainline et al., 2012); see the bottom panel of Figure 1. Consequently, the mode near 400 days becomes prominent, in agreement with some previous analyses of this quasar (Schild, 1990; Shalyapin et al., 2008).
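As a rough sketch of the kind of statistical model involved (this is not the analysis of Tak et al.; the mean, timescale, and variability values below are purely illustrative), a damped random walk can be simulated exactly from its conditional Gaussian transitions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a damped random walk (an Ornstein-Uhlenbeck process) at irregular
# observation times. mu is the long-run mean magnitude, tau the damping
# timescale, and sigma the short-term variability; all values are illustrative.
def simulate_drw(times, mu=18.0, tau=100.0, sigma=0.1):
    x = np.empty(len(times))
    x[0] = mu
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        a = np.exp(-dt / tau)                        # damping over the gap dt
        # conditional Gaussian transition: mean reverts toward mu,
        # variance grows toward the stationary value sigma^2 * tau / 2
        sd = sigma * np.sqrt(tau / 2.0 * (1.0 - a ** 2))
        x[i] = mu + a * (x[i - 1] - mu) + sd * rng.normal()
    return x

times = np.sort(rng.uniform(0.0, 1000.0, 200))       # irregular sampling
lc = simulate_drw(times)
```

Because the transition is exact for any gap dt, no fine time discretization is needed for irregularly sampled light curves.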


This example points out several important aspects of astronomical data analysis. First, different model fits on the same data can reveal completely different possibilities, e.g., for the time delay of Q0957+561. All of these possibilities are worth proper investigation in the context of available scientific knowledge, in an attempt to determine which are simply the result of model misspecification and which are new scientific discoveries. Second, blindly making inference based on the highest mode of the posterior distribution or likelihood function (or the smallest loss function in machine learning methods) can be misleading, as illustrated in the top panel of Figure 1. Thus it is essential to check whether the model captures important characteristics of the data sufficiently well before drawing any conclusions. Lastly, it is the story embedded in the data that can provide insight for improved modeling of physical phenomena, such as microlensing. The better the statistical and astronomical models reflect the data, the better the quality of what the data reveal to us.

In what follows, we discuss several issues that arise in astronomical data analyses in light of the unique or unusual features of astronomical data. We formulate our observations into the following six maxims, each of which is in the spirit of George Box’s well-known aphorism “all models are wrong but some are useful” (Box & Draper, 1987):

All data have stories, but some are mistold.

All assumptions are meant to be helpful, but some can be harmful.

All prior distributions are informative, even those that are uniform.

All models are subject to interpretation, but some are less contrived.

All statistical tests have thresholds, but some are mis-set.

All model checks consider variations of the data, but some variants are more relevant than others.

While we believe that the statement of each of the maxims is new, the ideas that underlie them are not. Rather, the maxims are merely concise statements that we hope capture a sense of the reasoning that defines statistics as a discipline and that is the culmination of the work of generations of data-facing researchers. Our aim is to encourage researchers to carefully consider their (possibly implicit) modeling and statistical assumptions and how these assumptions may affect scientific findings. We hope that by keeping the maxims in mind as part of their daily data-analysis routine, researchers will improve the quality of both data modeling and scientific findings in astronomy.

2 All data have stories, but some are mistold.

In this section, we explain several issues in modeling astronomical data, such as sampling mechanisms, selection effects, preprocessing, and calibration, and discuss possible solutions to improve the quality of astronomical data analysis.

2.1 Sampling mechanism

Statisticians typically assume that the data are measurements of a statistical sample that is representative of the larger class of objects under study. For example, we might have a sample of white dwarf stars from the Milky Way Galaxy and measure the metallicity of each, or we might have a sample of exoplanets and measure the mass of each. Formally, statisticians may assume that we have obtained a probability sample from the larger class or population. (In a probability sample, all objects in the population have a known non-zero probability of being included in the sample.) Unfortunately, such a sample is nearly impossible to obtain in astronomy. While it is true that measurements of the properties of individual objects have become more accurate, this accuracy does not translate into a more representative sample of objects. In fact, none of the so-called all-sky surveys have uniform coverage, as they all provide preferential or deeper coverage of specific parts of the sky. For example, the Sloan Digital Sky Survey (SDSS) is targeted at the northern celestial hemisphere, the Rubin Observatory has lower coverage in the northern hemisphere, and space-based observatories like TESS and eROSITA have deeper coverage towards the poles. Likewise, narrowly focused pencil-beam surveys like the Hubble Deep Field or Chandra Deep Field surveys have varying sensitivity across the field of view due to the detector or telescope responses.

Modeling such data without paying attention to the exact nature of the sampling mechanism and how well they represent the population of interest can result in biased inferences (Kelly, 2007, Section 5). Astronomers are generally aware of adverse selection effects introduced by the Eddington or Malmquist biases (Landy & Szalay, 1992; Teerikorpi, 2015), but we caution that the systematics of any survey or measurement must be carefully considered on a case-by-case basis.
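A Malmquist-type bias is easy to reproduce in a toy simulation (the luminosity distribution, distances, and flux limit below are all hypothetical): because brighter objects are detectable to larger distances, the detected sample overestimates the population's mean luminosity.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy Malmquist-type bias. Luminosities, distances, and the flux limit are
# hypothetical; no attempt is made to model a realistic survey geometry.
n = 200_000
L = rng.lognormal(mean=0.0, sigma=0.5, size=n)   # intrinsic luminosities
d = rng.uniform(0.1, 10.0, n)                    # distances (arbitrary units)
flux = L / (4.0 * np.pi * d ** 2)                # observed flux

detected = flux > 0.01                           # flux-limited selection
# The detected subsample is biased toward intrinsically bright objects,
# so its mean luminosity exceeds the population mean.
print(L.mean(), L[detected].mean())
```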

A well-known example occurs in Hubble (1929), where systematically high peculiar velocities in the local Galactic neighborhood initially led to a large overestimate of the eponymous Hubble constant, H_0. Indeed, the importance of modeling systematic uncertainty is apparent throughout the history of the measurement of H_0. Figure 2 shows that early estimates, from the mid-to-late twentieth century, were either around 50 km s^-1 Mpc^-1, denoted by the dashed horizontal line (e.g., Sandage & Tammann, 1975), or around 100 km s^-1 Mpc^-1, visualized by a dotted horizontal line (e.g., de Vaucouleurs & Bollinger, 1979; de Vaucouleurs & Peters, 1986). The half-length of each vertical bar around the point estimate represents its 1σ uncertainty.

More recently, significant improvements in instrumentation and techniques have led to better understanding of the systematics involved and have further narrowed the range of the measured values of H_0. However, a statistically significant discrepancy remains among the estimates, raising a question regarding the validity of the standard cosmological model (Verde et al., 2019; Efstathiou, 2020; Riess et al., 2021). For example, Figure 3 (excerpted from Figure 1 of Beaton et al. (2016)) illustrates the tension between estimates of H_0 from the so-called late-Universe measurements calibrated by the Cepheid distance scale (in blue), and early-Universe measurements obtained from the cosmic microwave background (in red). Feeney et al. (2018) show that the Bayesian evidence of the standard cosmological model is about seven times smaller than that of an extended cosmological model that includes an additive deviation from the standard cosmological model. The corresponding Bayes factor between the two cosmological models is 0.15 ± 0.01 given the Planck 2015 XIII data (Planck Collaboration et al., 2016) and the distance-ladder data of Riess et al. (2016) with extra supernova outliers being considered.


The role of systematics is clear in recent work describing the tension among competing estimates of H_0, owing to the extraordinary efforts of the astronomical and cosmological communities to pin down H_0. How does a researcher with fewer resources recognize similar effects in their analysis and remedy them? We posit that this requires an iterative process that implements corrections and appropriately incorporates model complexity in follow-up analyses. Still, it is important to recognize that any analysis remains vulnerable to imperfect knowledge of the story behind its data.

2.2 Selection effect

In addition to non-uniform coverage, astronomical data are often obtained intentionally and purposefully for specific research projects. When such astronomical data become public through various archives, other researchers may download them and use them as if they were randomly and uniformly selected, possibly unaware of the danger of selection effects in their sample. Likewise, when the contents of different surveys are examined together, their individual characteristics can affect the overall interpretation in complex ways. This is well-appreciated when different catalogs are matched (e.g., Budavári & Szalay, 2008; Rots et al., 2017a, b), but less so when population studies are carried out. Catalog data are often used as training sets when applying machine learning methods, even though such training sets may not represent the population of interest well due to the non-uniform coverage and selection effects within the catalog. For example, the Chandra Source Catalog (CSC; Evans et al., 2010) provides a selection of fields, each observed for individual scientific reasons. Such “samples of convenience” are not probabilistic and not representative of the population, in contrast to flux-limited all-sky surveys like the ROSAT All-Sky Survey (RASS; Voges, 1993; Voges et al., 1999; Boller et al., 2016). An example of how to deal with these effects is provided by Revsbech et al. (2017) and Autenrieth et al. (2024), who reduce the effect of a biased training set in classifying Type Ia supernovae via stratification: they form training sets that are more representative of the corresponding strata within the test set. Similarly, Izbicki et al. (2017) propose a non-parametric density estimator for photometric redshift that accounts for selection bias in a non-representative training set by importance reweighting of the training set.
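The importance-reweighting idea can be sketched in a toy setting (the Gaussian populations below are illustrative and are not the estimator of Izbicki et al.): each training point is weighted by the ratio of the target density to the training density.

```python
import numpy as np

rng = np.random.default_rng(1)

# Gaussian density, used for both the target and training populations
def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# Biased training sample: drawn from N(1, 1) although the target population
# follows N(0, 1) (e.g., a catalog that oversamples brighter sources)
x_train = rng.normal(1.0, 1.0, 50_000)

# Importance weights: target density over training density
w = normal_pdf(x_train, 0.0, 1.0) / normal_pdf(x_train, 1.0, 1.0)

naive_mean = x_train.mean()                      # biased toward 1
weighted_mean = np.sum(w * x_train) / np.sum(w)  # close to the target mean 0
```

The self-normalized weighted average corrects the bias at the cost of a reduced effective sample size, which grows worse the more the training and target distributions differ.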

2.3 Preprocessing

Most astronomical data are pre-processed via multi-stage software pipelines specific to a given telescope. As illustrated in Figure 4, in each stage of the pre-processing hierarchy, one astronomer's inference is passed down to be used as an input by the next astronomer. Another inference is then made with the previous inference being treated as the data. Unfortunately, this pre-processing is often ignored even though the pre-processing steps can reveal evidence of potential systematic errors. For example, in the case of solar flare databases, the precision of recorded flare intensities, complex detection/missingness characteristics, temperature effects, and incompleteness in matched features may all cause systematic errors in the data (Ryan et al., 2012; Aggarwal et al., 2018).

As another example, catalog data pre-processing is performed via standard pipelines and assumptions. This pre-processing procedure generally affects the catalog quality and reliability: outliers may arise if measurements are not performed in a consistent way; different definitions of upper limits may cause an issue of censoring; an incorrectly implemented pre-processing procedure may introduce systematic error. Thus, it is important to understand how the catalog quantities are derived from the raw data through a chain of pre-processing stages, especially when different catalogs are compared or merged (Budavári & Szalay, 2008). Whenever possible, the statistical and systematic errors introduced by the pre-processing procedure should be accounted for within the overall statistical model (e.g., Portillo et al., 2017).


2.4 Calibration

Calibration is a foundational part of astronomical inference, more so than in any other physical science. While instrument calibration is indeed used extensively in fields like experimental physics and geophysics, it is of particularly critical importance in astronomy. (Note that the term “calibration” is interpreted differently in the astronomical and statistical literature. In astronomy, it refers to the process of characterizing uncertainties and bias corrections induced by instruments, enabling the translation of measured signals into physically meaningful units. In the statistical literature, however, calibration generally refers to a process of “inverse regression”, where measurements of dependent quantities are used to predict corresponding standard measurements, mediated through a known model function. For example, if a functional form Y = f(X) is learned using a training data set, new measurements Y_0 of a test data set are used to predict X_0 = f^{-1}(Y_0). This is mainly motivated by instrumental calibration in chemistry, where high-quality “standard” measurements X are more time-consuming and expensive than “test” measurements Y. The statistical literature includes theories and methods for various linear, non-linear, multivariate, and dynamic approaches to statistical calibration (Osborne, 1991; Kubokawa & Robert, 1994; Brown, 1994; Oman & Srivastava, 1996; Rivers, 2014; Brown, 2018). Methods are divided into those designed to handle the case where both standard and test measurements have appreciable error, known as comparative calibration (Kelly, 2007; Schafer & Purdy, 1996), and methods where the standard measurements are assumed to be perfect or nearly perfect, known as absolute calibration.) Astronomical data are for the most part obtained through observations of remote sources, with physical quantities inferred by transforming the observed signals from a detector. Each telescope or focal plane instrument has its own specific characteristics that affect this transformation, and considerable effort is put into determining these and tracking changes to them (see, e.g., Guainazzi et al., 2015; Partridge et al., 2016; Payne et al., 2020). Ground-based photometric optical astronomy still relies on obtaining regular observations of “standard stars” with similar airmass to the target being observed, so even atmospheric variations must be adjusted for. High-energy astronomical telescopes construct and store detailed tabular models of the response of a detector to a monochromatic photon (see OGIP Calibration Memo CAL/GEN/92-002 and its addendum, https://heasarc.gsfc.nasa.gov/docs/heasarc/caldb/docs/memos/cal_gen_92_002/cal_gen_92_002.html and https://heasarc.gsfc.nasa.gov/docs/heasarc/caldb/docs/memos/cal_gen_92_002a/cal_gen_92_002a.html), and every mission measures and stores its sensitivity (also called effective area in high-energy astronomy) as a function of photon wavelength. As noted by Villanueva et al. (2021), the details of calibration can have a dramatic impact on the quality of the data.

It is important to understand, however, that the available calibration products are not perfect. They are the result of measurements carried out in controlled conditions, and thus include measurement errors, as well as systematics that manifest themselves once the instruments are deployed (often in harsh space environments where the chance of radiation damage is high). Differences in calibration between different instruments must be weighed when different data streams are considered together. Where available, calibration uncertainty information must be folded into the analysis (Lee et al., 2011; Xu et al., 2014). More recently, efforts have been made to derive corrections to effective areas and to source flux estimates based on simultaneous observations of sources with different instruments, even in the absence of an absolute reference, using multiplicative shrinkage (see, for example, Chen et al., 2019a; Marshall, 2021).

There is a common theme in these examples. Knowing the story behind the data allows one to correct for potential model misspecification, while not knowing the story may leave one oblivious to the same issue. Understanding their data, including limitations in the data collection process and potential selection effects, enables researchers to make appropriate corrections themselves. In this way, being attentive to the story behind their data enables researchers to make more reliable inferences regarding their populations and sources of interest.

3 All assumptions are meant to be helpful, but some can be harmful.

Popular statistical models were developed for specific purposes or motivated by particular problems. Some well-known models, such as Gaussian linear regression, can be applied in a wide variety of settings across various disciplines, including astronomy and astrophysics, with little difficulty. Another example is survival analysis, one of the most popular data analyses in the bio-medical sciences. In fact, classical survival analysis is not directly applicable to astronomical data because censoring in astronomy is due to statistical measurement errors rather than exactly measured failures. Nonetheless, it has been successfully applied to analyzing left- or right-censored data caused by telescope sensitivity in astronomy (Feigelson & Nelson, 1985; Isobe et al., 1986).

However, the use of well-known models without careful consideration of their assumptions must be discouraged, because standard statistical models do not account for unusual features of astronomical data or models. Even the standard linear regression model has an underlying Gaussian assumption, while astronomical data may deviate from Gaussianity due to outlying observations, low Poisson counts, background subtraction, error propagation, binned data, and/or heavy-tailed and asymmetric distributions.

Checking the assumptions of popular models is often facilitated by well-defined model checking procedures. For example, checking model assumptions via residual analysis is common in regression because it can provide insight into possible improvements for the current model fit. Tanaka et al. (1995) improve a poor continuum fit of spectral data via subsequent residual analysis; Bulbul et al. (2014) and Reeves et al. (2009) detect emission lines and absorption lines via residuals; Mandel et al. (2017) compare conventional and proposed models for the color-magnitude relation of Type Ia supernovae by checking their Hubble residuals to see which model is better supported by the data.

Residual analysis often provides hints that can be used to improve model assumptions. For instance, when a model for light curves such as a damped random-walk model relies on a Gaussian measurement error assumption (Kelly et al., 2009), a residual analysis may reveal some evidence against the Gaussian assumption in the presence of outliers. Prompted by such evidence, Tak et al. (2019) and Wang & Taylor (2022) derive a heavy-tailed version of the damped random-walk model that is still able to constrain the same model parameters in a robust manner.
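A minimal version of such a residual check can be sketched as follows (the data are simulated; Student-t noise stands in for outlier-prone measurement errors, and all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# With Gaussian errors, standardized residuals beyond |3| should be rare
# (about 0.3%); a clear surplus hints at heavy tails or outliers.
n = 2000
x = rng.uniform(0.0, 10.0, n)
y = 2.0 + 0.5 * x + rng.standard_t(df=3, size=n)   # heavy-tailed noise

# Ordinary least-squares fit, which implicitly assumes Gaussian errors
A = np.vstack([np.ones(n), x]).T
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

# Standardize with the robust MAD scale so outliers do not inflate the scale
mad = 1.4826 * np.median(np.abs(resid - np.median(resid)))
frac_extreme = np.mean(np.abs(resid / mad) > 3.0)
print(frac_extreme)  # well above the ~0.003 expected under Gaussian errors
```

The robust (MAD-based) scale matters here: standardizing by the sample standard deviation would let the very outliers under scrutiny mask themselves.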

Besides standard model checking procedures, investigating the fitted model in light of domain science knowledge is also crucial, as it can reveal evidence of potential model misspecification. For instance, the sensitivity or dependence of model fits on the starting values of optimizers or Monte Carlo samplers is not necessarily an indication of a numerical problem. It could instead point to a multi-modal outcome on a non-convex surface of the parameter space, providing several distinct model fits at different modes. Considering that a model describes a data generation process, multi-modality also implies that distinct sets of parameter values for a given model could have generated the same observed data, even though each set may not be equally likely to have generated the observed data.

Alternatively, a multi-modal likelihood function may indicate that the model is misspecified or is not elaborate enough to describe the data, either of which can lead to an unreliable fit. Mode-based estimates, such as maximum likelihood estimates and posterior modes, aim to compute the parameter values corresponding to the particular model within the posited class that best matches the data (under a criterion determined, e.g., by the likelihood or posterior). Even the best match within the posited class of models, however, may not be very good if the class is not sufficiently rich. Using the fully Bayesian posterior distribution with a misspecified model can also lead to unreliable results, as emphasized in Figure 1, where the multimodality disappears when we additionally model microlensing. We emphasize that a well specified model is key to any model-based method, and thus it is critical for researchers to check the fit of their posited model in light of domain science knowledge, instead of blindly proceeding with the highest mode or other computed summary as the best model fit.

Another popular estimation tool in astronomy is χ²-minimization, which is built on a Gaussian approximation for the measurement errors. For example, when the data are binned Poisson counts, a Gaussian approximation to the Poisson counts is required for χ²-minimization. (We consider the case where the data are intrinsically binned; when this is not the case, Feigelson & Babu (2012) suggest avoiding issues of arbitrary binning by using the cumulative distribution function for maximum likelihood estimation.) Thus, it is important to understand the limitations that might affect the validity or accuracy of this approximation. The method is often misused in the context of astronomical data analysis, for example, when the estimated variance of the approximate Gaussian distribution is quite different from that of the observed (or average) count, which contradicts the validity of the Gaussian approximation to Poisson counts (Feigelson & Babu, 2012, Chapter 7.4). The approximation itself can be quite misleading when the underlying Poisson assumption is not appropriate, e.g., when the count data are overdispersed. The approximation also becomes less accurate when the counts in some bins are too small. In this case, merging adjacent small bins is one way to improve the accuracy of the approximation while sacrificing the resolution of the data (Greenwood & Nikulin, 1996).

Building a model directly on Poisson counts without using any Gaussian approximation is another possibility. (This has not always been well recognized among astronomers (Hilbe, 2014).) For example, Kelly et al. (2012) adopt Bayesian hierarchical modeling in fitting spectral energy distributions to flux data, instead of using χ²-minimization. Hierarchical modeling also provides a mechanism for effectively handling overdispersed data (Gelman et al., 2013; Tak et al., 2017a). Another benefit that direct likelihood-based modeling has over χ²-minimization is that it facilitates the use of information criteria for model selection, such as the Bayesian Information Criterion (Kass & Raftery, 1995), which depend directly on the likelihood.
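The contrast can be seen in a toy example (a constant-rate model with illustrative numbers): with low counts, minimizing a χ² statistic whose per-bin variances are estimated by the observed counts gives a noticeably biased estimate, while the direct Poisson maximum-likelihood estimate does not.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy illustration: estimating a constant Poisson rate mu from low-count bins.
# "Neyman" chi^2, with the variance estimated by the observed count in each bin,
# is biased low at low counts; the direct Poisson MLE is simply the sample mean.
true_mu = 2.0
counts = rng.poisson(true_mu, 10_000)

# Minimizing sum_i (n_i - mu)^2 / v_i with v_i = max(n_i, 1) (a common fix-up
# for empty bins) gives a weighted mean that down-weights high-count bins
v = np.maximum(counts, 1)
chi2_est = np.sum(counts / v) / np.sum(1.0 / v)

mle_est = counts.mean()                 # Poisson maximum-likelihood estimate
print(chi2_est, mle_est)                # the chi^2 estimate falls well below mu
```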

The central limit theorem is the basis for a Gaussian approximation to Poisson counts and plays an important role in statistical inference more generally. It stipulates that the distribution of the average of independent observations becomes more Gaussian as the number of observations approaches infinity. The theorem underlies many asymptotic results, such as the asymptotic normality of maximum likelihood estimates, the asymptotic χ² distribution of the likelihood ratio test statistic via Wilks' theorem, and the "projection method" for computing error bars (Avni, 1976).

To be confident of the applicability of asymptotic results, researchers must check two things: that the assumptions required for the results are met, and that the data set they are analyzing is sufficiently large. First, all asymptotic results depend critically on their own sets of mathematical assumptions, known as regularity conditions. Even with an arbitrarily large data set, the central limit theorem itself fails, for example, if the expected value of the square of the averaged observations is not finite (e.g., when averaging ratios) or if the number of parameters increases sufficiently quickly compared to the sample size (e.g., as with instrument calibration; Chen et al., 2019b). The likelihood ratio test, which compares the statistical evidence for two posited models, is another example where the regularity conditions play a key role. This is because Wilks' theorem only provides the asymptotic distribution of the likelihood ratio test statistic if, among other conditions, the models being compared are nested (i.e., one model is a special case of the other) and the simpler model is not on the boundary of the parameter space describing the more complex model. Protassov et al. (2002) show that the latter condition fails when testing for an added spectral emission line, because an emission line by definition cannot have a negative normalization, yet the normalization is zero in the simpler model (with no line).

Second, even if their regularity conditions are met, asymptotic statistical methods are only reliable with sufficiently large data sets. A likelihood function that exhibits multiple significant modes, for example, may be evidence either that the model is misspecified (and a regularity condition is not met) or that the data set is not large enough for the asymptotic Gaussian properties of the likelihood to have "kicked in". In practice, it can be difficult to know whether a data set is large enough. Generally, the more fitted parameters, the more data are required. Goodness-of-fit tests are a particular challenge because, in effect, they compare the posited model with a fully flexible model, i.e., a model with a large number of fitted parameters. The C-statistic (a.k.a. Cash statistic), for example, is often used for goodness-of-fit tests in high-energy spectral analysis (Kaastra, 2017). If the C-statistic is applied to high-resolution data with many narrow bins, its asymptotic χ² distribution is only guaranteed if the expected counts are large in all bins. The alternative is to work with fewer, larger bins, but this sacrifices the resolution of the data.

When there are insufficient data for asymptotic results, either Bayesian procedures or bootstrap-based methods can be used with small data sets. Unfortunately, both are computationally more costly than their asymptotic frequentist counterparts. Higher-order asymptotics, which retain more terms in their functional expansions, can sometimes offer advantages in such scenarios. For example, Chen et al. (2024) obtain a computationally efficient and statistically precise procedure for goodness-of-fit tests based on the C-statistic and higher-order asymptotics. The method only involves the calculation of moments and works even in low-count settings where Wilks' theorem (χ² asymptotics for likelihood ratio tests) does not apply.

4 All prior distributions are informative, even those that are uniform.

Although Bayesian analysis has become popular in astronomy (Pierson, 2013; Eadie et al., 2023), it is difficult to find an article that conducts a Bayesian analysis without using uniform priors (often uniform on the logarithmic scale). One possible explanation for this popularity may actually be a misunderstanding, namely a perception that the interpretation of Bayesian inference is more straightforward than that of frequentist inference. For example, one might think that a credible interval is a direct statement about the unknown parameters given the data, while confidence intervals must be interpreted under a hypothetical repeated-sampling scenario for the data. However, it is often forgotten that the interpretation of credible intervals hinges on the interpretation of the prior distribution, which can be philosophically as subtle as the frequentist's repeated-sampling scenarios, because prior distributions are chosen by researchers.

For instance, uniform priors are often assumed on a logarithmic scale, that is, log(X) ~ Unif(a, b), where a and b are real-valued. One may be tempted to interpret the resulting credible interval as if a non-informative prior were used on the original scale, i.e., on X. A uniform prior on log(X), however, can be very informative indeed on X, since it corresponds to a power-law prior distribution on X (d log(x) = dx/x) that puts substantial probability mass near the lower bound, e^a. Thus, the posterior distribution depends strongly on whether a uniform prior is assumed on the original or the logarithmic scale, and the resulting credible interval must be interpreted accordingly. In general, it is a mathematical fact that any prior distribution carries information to be interpreted, as it must specify how likely one state is relative to another; see Section 7 of Craiu et al. (2023) for more discussion.
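The informativeness of a log-uniform prior is easy to quantify. As a sketch with hypothetical bounds, suppose log(X) ~ Unif(log 1, log 1000), so that X ranges over (1, 1000); the implied prior mass on X below a cutoff c is (log c − a)/(b − a):

```python
import math

def log_uniform_mass_below(c, lo, hi):
    # Prior P(X <= c) when log(X) ~ Unif(log(lo), log(hi));
    # the implied density on the original scale is proportional to 1/x.
    return (math.log(c) - math.log(lo)) / (math.log(hi) - math.log(lo))

# X ranges over (1, 1000): the slice (1, 10) covers under 1% of the range ...
mass = log_uniform_mass_below(10.0, 1.0, 1000.0)
# ... yet it carries roughly a third of the prior mass
```

So a "flat" prior on the log scale concentrates a third of its mass in the lowest one percent of the original-scale range, which is anything but non-informative about X.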

In some sense, uniform prior distributions and other so-called "non-informative" prior distributions have lessened the burden of subjectivity and prior interpretation for astronomers, making the likelihood (i.e., the data) the dominant source of posterior variability. In some cases, they also enable Bayesian inference to be conducted relatively easily by researchers who prefer a Bayesian approach, e.g., to handle nuisance parameters or for uncertainty quantification (Gelman et al., 2017), even when the maximum likelihood estimate is nearly identical to the maximum a posteriori estimate. Moreover, uniform prior distributions provide researchers a way to incorporate scientific knowledge via their boundaries. In most articles, the bounds of uniform priors are specified explicitly to avoid potential posterior impropriety (Tak et al., 2018a).

Even bounded uniform prior distributions, however, must be used with care because the bounds are hard bounds that completely exclude a portion of the parameter space. An issue may occur if the bounds partially or completely exclude important regions of the parameter space a priori. For example, the top panel of Figure 5 magnifies the posterior distribution of the time delay under the microlensing model, previously shown in the second panel of Figure 1. The prior distribution for the time delay adopted in Tak et al. (2017b) is uniform between -1178.939 and 1178.939 days, reflecting the widest range of observation times in the analyzed light curves. As an illustration, let us instead set the lower bound of this uniform prior at 430 days, excluding the mode at around 425 days. The bottom panel of Figure 5 exhibits the resulting posterior distribution of the time delay. The posterior mass accumulates near the lower bound, as if the posterior mass in the top panel were pushed from the left up against the bound. This is what happens when the hard bound of a uniform prior zeroes out the likelihood beyond the bound. The likelihood cannot overcome this hard bound, regardless of how large the data set is; even one trillion observations would not allow posterior probability beyond the bound. (Lindley (1985) warns against assigning a probability of zero to events that are not logically impossible, in what is often referred to as Cromwell's rule, named after Oliver Cromwell's plea: "I beseech you [to] think it possible that you may be mistaken".)
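The pile-up at a hard prior bound can be reproduced in a toy one-dimensional example. This is only a sketch: we substitute a hypothetical Gaussian likelihood peaked at 425 days (with an arbitrary 30-day width) for the actual time-delay likelihood:

```python
import math

def truncated_posterior_mode(lo, hi, peak=425.0, width=30.0, step=0.1):
    # Grid posterior: uniform prior on (lo, hi) times a Gaussian likelihood.
    # Outside (lo, hi) the prior, and hence the posterior, is exactly zero.
    grid = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
    post = [math.exp(-0.5 * ((x - peak) / width) ** 2) for x in grid]
    return grid[max(range(len(post)), key=lambda i: post[i])]

wide = truncated_posterior_mode(-1178.939, 1178.939)  # mode near the likelihood peak, ~425
narrow = truncated_posterior_mode(430.0, 1178.939)    # mode pinned to the hard bound, 430
```

However much data sharpen the (hypothetical) likelihood around 425, the truncated prior forces all posterior mass to the right of 430, so the mode sits exactly on the bound.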

[Figure 5: posterior distributions of the time delay under the original (top) and truncated (bottom) uniform priors]

Substantial posterior probability accumulated near the (hard) bounds of a uniform prior distribution may be evidence of mis-specification of the bounded prior distribution. In the astronomical literature, it is not difficult to find examples with substantial posterior mass near the bounds of uniform priors set by researchers. This problem can often be identified by inspecting corner plots (pairwise scatter plots with marginal histograms) in published articles, at least when these plots are provided by the authors. A simulated example, similar to one we found in a literature survey, is shown in Figure 6. (For instance, a search for articles including the word 'Bayesian' published in MNRAS in June 2024 yields 58 articles, of which 17 display corner plots; 7 of these plots show boundary issues.) When researchers identify a boundary issue of this sort, it is critical that they carefully investigate the sensitivity of their results to the bounds of their uniform prior, paying particular attention to the robustness of their scientific conclusions to the choice of bounds. Where there is a natural bound, e.g., where a parameter such as a mass or age must be non-negative or positive, we do not consider the accumulation of posterior mass near this natural bound to be an issue. Therefore, unless there is a strong scientific justification, it is always better to set uniform bounds wide enough not to influence the likelihood.

[Figure 6: simulated corner plot with substantial posterior mass accumulating at the bounds of a uniform prior]

Besides the boundary issues of uniform priors, blind use of jointly uniform prior distributions can be a highly informative choice, despite its seemingly non-informative nature (Gelman, 1996, p. 223). For example, when model parameters are constrained, such as being in increasing order (in astronomy, for example, the unknown break points of multiply broken power laws), a jointly uniform prior on the parameters asymptotically dominates the likelihood function as the number of such parameters increases. Gelman et al. (2017) provide more examples where uniform prior distributions can result in inferences that do not make sense. A jointly improper uniform prior can also be problematic in high dimensions, even when it results in a proper posterior distribution; Section 4.2 of Gelman et al. (2017) discusses a similar problem that arises when independent Gaussian prior distributions are used in high-dimensional parameter spaces.

5 All models are subject to interpretation, but some are less contrived.

Understanding how the statistical/mathematical interpretation of an empirical data analysis should impact our understanding of astrophysical processes can be challenging. We suggest that it is often best to start with the physics and then consider whether the empirical findings make sense in terms of the physics and/or how we can make sense of them.

Kelly et al. (2009), for instance, carefully investigated a sample of quasar light curves, relating the two model parameters of a damped random walk process to the physical properties of a quasar. They show that the timescale (short-term variability) of the fitted process is positively (negatively) correlated with both black hole mass and luminosity. This empirical evidence on the relationships between the parameters of the mathematical model and astrophysical properties has since been intensively investigated and is supported by many astronomers (MacLeod et al., 2010; Kozłowski et al., 2010; Kim et al., 2012; Andrae et al., 2013).

To further refine this model interpretation, the community has also investigated when it does not hold. Mushotzky et al. (2011) and Zu et al. (2013) show that the damped random walk process is not suitable for explaining the stochastic variability of quasars when the source variability is on a very short timescale. Also, Graham et al. (2014) and Kasliwal et al. (2015) warn that the process is too simple to explain all types of stochastic quasar variability. When the underlying true model is not a damped random walk, Kozłowski (2016) points out that the association between the model parameters and physical properties can be misleading because timescale estimates become biased.

This productive discussion has motivated astronomers to consider the more general and flexible class of continuous-time autoregressive moving average processes (Kelly et al., 2014). Substantial community effort has also been devoted to investigating the applicability and physical interpretation of the resulting power spectral densities via empirical evidence (Moreno et al., 2019; Yu et al., 2022). This class of models is promising, as it can be extended to model long-memory auto-correlations (Marquardt, 2006; Marquardt & James, 2007), although it remains limited to stationary time series data.

This collective community effort is key to building time-tested models with widely accepted astrophysical interpretations, as it is crucial both to demonstrate the empirical evidence and to effectively warn of cases where the interpretation of the model parameters is fallible.

6 All statistical tests have thresholds, but some are mis-set.

Hypothesis testing compares two models for the same data; the two models are called the null and the alternative hypotheses. In the standard frequentist setup, the researcher specifies a test statistic with a known (or approximately known) distribution under the null hypothesis. Inconsistency between the observed value of the test statistic (computed from the research data) and this null distribution is viewed as evidence that the data are unlikely to have arisen under the null model, and thus as evidence in favor of the alternative hypothesis. Inconsistency is typically measured with a p-value, the probability under the null distribution of a test statistic value as extreme as or more extreme than the observed value. The principled usage of this paradigm is crucial across scientific fields because it is the procedure that provides data-driven evidence for a scientific discovery or an anomaly relative to well-established theories. Unlike in biomedical research, hypothesis testing in astronomy is less likely to suffer from common issues regarding p-values, such as blind usage of "p-value < 0.05" or p-hacking, for example, collecting more data until the p-value becomes smaller than 0.05 (Wasserstein & Lazar, 2016). This is partly because astronomers typically use more conservative thresholds for establishing statistical significance, such as a 3σ level (Vallisneri et al., 2023), making significance more difficult to achieve by simple data manipulation. This 3σ threshold corresponds to a type I error rate of α = 0.0027 (that is, the probability of rejecting the null when it is correct) for a two-sided test when the test statistic is distributed as N(0, 1) under H_0.

One aspect that astronomers must keep in mind when interpreting the significance levels of multiple test statistics, however, is how to control the family-wise error rate (FWER). The FWER is defined as the probability of committing at least one type I error (false positive) among m hypothesis tests, and is smaller than or equal to 1 - (1 - α_ind)^m, where α_ind denotes the common type I error rate used for each of the m individual hypothesis tests. A good example to illustrate the FWER can be found in Abbott et al. (2016), where the detection of the first gravitational wave is based on the 4.6σ and 5.1σ significance levels of two test statistics. Clearly, each of the reported significance levels is greater than the 3σ threshold. However, naively comparing each reported significance level with the 3σ threshold is equivalent to maintaining an FWER that is less than or equal to 0.0054 (= 1 - (1 - 0.0027)²). That is, the probability of committing at least one type I error in the two tests for the gravitational wave detection is roughly twice as large as the individual type I error rate. To ensure the FWER is less than or equal to 0.0027, as intended, the popular Bonferroni correction (Armstrong, 2014) sets the individual type I error rate to 0.00135 (= 0.0027/2), which requires comparing each of the reported significance levels with a 3.2σ threshold, not 3σ.
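This arithmetic can be reproduced with the standard normal distribution from the Python standard library (the 4.6σ and 5.1σ figures are taken from the text above; everything else follows from normal tail probabilities):

```python
from statistics import NormalDist

nd = NormalDist()

def sigma_to_alpha(s):
    # Two-sided tail probability of an s-sigma threshold under N(0, 1)
    return 2.0 * (1.0 - nd.cdf(s))

def alpha_to_sigma(alpha):
    # Two-sided significance threshold, in sigma units, for type I error rate alpha
    return nd.inv_cdf(1.0 - alpha / 2.0)

alpha_family = sigma_to_alpha(3.0)             # per-test rate at 3 sigma, ~0.0027
m = 2                                          # two test statistics are examined
fwer_naive = 1.0 - (1.0 - alpha_family) ** m   # ~0.0054: roughly double
alpha_bonf = alpha_family / m                  # Bonferroni: 0.00135 per test
sigma_bonf = alpha_to_sigma(alpha_bonf)        # ~3.2 sigma threshold per test
```

Running the conversion back through `alpha_to_sigma` shows why each individual test must clear roughly 3.2σ, rather than 3σ, for the family-wise error rate to stay at the intended 0.0027.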

One possible issue with the Bonferroni method is that it becomes very difficult to reject any null hypothesis when the number of hypothesis tests m is large. For example, to ensure that the FWER is at most 0.0027 across 1,000 hypothesis tests, the individual type I error rate must be 2.7 × 10^-6, a threshold that individual p-values may struggle to achieve. This can lead to almost no rejections among the 1,000 hypothesis tests. As such, the Bonferroni correction can be too conservative for certain testing scenarios in astronomy, for example, when compiling astronomical catalogs.

An alternative is to instead control the false discovery rate (FDR; Benjamini & Hochberg, 1995; Benjamini, 2010). Unlike the FWER, the FDR ensures that the expected proportion of false positives among all of the false and true positives is less than or equal to a preset value. That is, controlling the FDR at 0.0027 means that the expected proportion of false positives (false discoveries) among all rejected null hypotheses (discoveries) is at most 0.27%.

In Abbott et al. (2016), the two p-values corresponding to the reported 4.6σ and 5.1σ significance levels for the detection of the first gravitational wave are 4.2 × 10^-6 and 3.4 × 10^-7, respectively. (We assume that the test statistics are distributed as N(0, 1) under the null for two-sided tests.) In the Benjamini–Hochberg procedure with m = 2 tests, the i-th smallest p-value is compared with (0.0027 × i)/2; the largest index satisfying p_i < (0.0027 × i)/2 is 2, leading to the rejection of the null hypothesis in both tests. Even though the FDR and FWER procedures yield the same rejections in this example, it is worth noting that the former ensures that the FDR is at most 0.0027 while the latter controls the FWER.
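The Benjamini–Hochberg step-up procedure behind this calculation is short enough to sketch directly, using the two p-values quoted above and the FDR level 0.0027:

```python
def benjamini_hochberg(pvals, q):
    # Step-up FDR control: find the largest rank i (1-based) such that
    # p_(i) <= q * i / m, then reject the i smallest p-values.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    largest = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= q * rank / m:
            largest = rank
    return {order[i] for i in range(largest)}  # indices of rejected nulls

# p-values from the reported 4.6-sigma and 5.1-sigma significance levels
pvals = [4.2e-6, 3.4e-7]
rejected = benjamini_hochberg(pvals, q=0.0027)  # both nulls are rejected
```

Note the step-up character of the procedure: even if an intermediate p-value misses its own cutoff, it is still rejected whenever some larger rank satisfies the inequality.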

7 All model checks consider variations of the data, but some variants are more relevant than others.

The notion of replicate data or repeated experiments is fundamental to frequentist statistical methods. In hypothesis testing, for example, evidence is quantified by imagining the distribution of the test statistic that would result if multiple replicate data sets were generated under the null. In Bayesian data analysis, on the other hand, the posterior predictive distribution of additional data given the observed data is used to generate replicate data sets. In both cases, the replicate data represent the statistical variability and possible range of a test/summary statistic, and are used to quantify the expected deviation between the observed data and the null/posited model, thus enabling researchers to quantify uncertainty.

At first blush, generating replicates may seem a well-stipulated proposition. In practice, however, researchers must consider how the replicates should be generated so as to be most comparable with their real data. For example, a researcher may wish to consider only replicate data with the same experimental conditions, instrumental effects, exposure time, and sample size as their real data. These are known quantities; varying them among the replicate data sets can make our uncertainty quantification less relevant to the actual uncertainty we care about. Conditioning on these factors reduces the variability of the replicate data sets and makes them more comparable with the real data. This in turn reduces uncertainty, the error bars on fitted parameters, and the lengths of confidence intervals; similarly, it increases the statistical power to distinguish between the null and the alternative in a hypothesis test.

Such considerations have led to the broad emphasis in statistics on conditioning as much as feasible; see Section 5.2 of Craiu et al. (2023) for a succinct overview. They have also led statistical theorists to consider whether there is flexibility to condition on further attributes of the data in order to further increase statistical power. In a goodness-of-fit test, for example, the aim is to quantify the deviation between the observed data and the fitted model and to assess whether it is greater than would be expected under the null. It seems entirely appropriate in this setting to consider only replicate data that have the same fitted model as the real data, e.g., by conditioning on the fitted model parameters. (Monte Carlo simulations in astronomy, for instance, obtain uncertainties of unknown parameters by generating replicate data sets given the maximum likelihood estimate computed on the observed data, fitting the model to each replicate set, and quantifying the variation of the estimated parameters; e.g., Tewes et al., 2013.) Such a procedure is expected to reduce variability among the replicates (as they all have the same fitted parameters), make them more comparable with the real data, and increase statistical power. In this case, p-values are typically obtained via the parametric bootstrap (Efron, 1985), where the estimated parameters are used as the "ground truth" when generating replicate data sets.

Roe & Woodroofe (1999) consider the specific example of background-contaminated Poisson counts. Letting the observed count Y^obs equal the sum of the unobserved source count, Y_S, and background count, Y_B, Roe & Woodroofe (1999) make the astute observation that while Y_B is unknown, it is bounded by Y^obs, i.e., we know Y_B ≤ Y^obs. By considering only replicate data with Y_B^rep ≤ Y^obs, they devise more coherent confidence intervals for the source intensity.

To give a concrete example of the advantage of conditional goodness-of-fit tests, we consider a Poisson model in which

N_i ~ Poisson(λ) independently for i = 1, …, 10 under the null hypothesis, and N_i ~ Poisson(s_i) under the alternative. (1)

We use the C-statistic (Cash, 1979; Kaastra, 2017), defined as minus twice the logarithm of the ratio of the likelihood under the null to that under the alternative, with both likelihoods evaluated at their respective maximum likelihood estimates. Specifically, the maximum likelihood estimate under the null is λ̂ = Σ_{i=1}^{10} N_i / 10, and under the alternative it is ŝ_i = N_i. We consider two null distributions for the C-statistic. The unconditional null resamples data according to the Poisson(λ̂) distribution. The conditional null, on the other hand, conditions on the maximum likelihood estimate of λ, which is equivalent to conditioning on the total count, Σ_{i=1}^{10} N_i, resulting in resampling data from a multinomial distribution.
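A minimal Monte Carlo sketch of the two null distributions, with hypothetical counts, the standard library only, and far fewer replicates than a real calibration would use, looks like this:

```python
import math
import random

def cstat(counts):
    # C-statistic: 2 * sum_i [mu - N_i + N_i * log(N_i / mu)],
    # where mu is the null MLE (the mean count); N_i = 0 terms contribute mu.
    mu = sum(counts) / len(counts)
    return 2.0 * sum(mu - N + (N * math.log(N / mu) if N > 0 else 0.0)
                     for N in counts)

def rpoisson(lam, rng):
    # Knuth's Poisson sampler; adequate for small expected counts
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def rmultinomial(total, n, rng):
    # Distribute a fixed total count over n equiprobable bins
    counts = [0] * n
    for _ in range(total):
        counts[rng.randrange(n)] += 1
    return counts

rng = random.Random(20240601)
obs = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]   # hypothetical counts in 10 bins
n, total = len(obs), sum(obs)
lam_hat = total / n                    # null MLE
c_obs = cstat(obs)

B = 2000
# Unconditional null: every replicate bin is drawn from Poisson(lam_hat)
p_uncond = sum(cstat([rpoisson(lam_hat, rng) for _ in range(n)]) >= c_obs
               for _ in range(B)) / B
# Conditional null: the total count is fixed at its observed value,
# so replicates follow a multinomial distribution over the n bins
p_cond = sum(cstat(rmultinomial(total, n, rng)) >= c_obs
             for _ in range(B)) / B
```

Both resampling schemes estimate a p-value for the same observed C-statistic; the conditional scheme restricts attention to replicates that share the observed total count, which is the restriction that yields the power gain shown in Figure 7.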

The results are shown in Figure 7. The upper panel illustrates that the power of the conditional test (dashed curve) is uniformly greater than that of the unconditional test (solid curve). To emphasize this improvement, the percentage increase in power is displayed in the bottom panel. Although the percentage improvement decreases as the significance level increases, it remains at least 10% when the significance level is below 0.1. In particular, when the significance level is set to 0.0027 (corresponding to the typical 3σ threshold in astronomy), denoted by the dot-dashed vertical line, the percentage improvement from using the conditional null distribution exceeds 30%. Chen et al. (2024) provide a rigorous study of conditional and unconditional goodness-of-fit tests based on the C-statistic under a general setup designed for realistic high-energy spectral models.

[Figure 7: power of the conditional (dashed) and unconditional (solid) goodness-of-fit tests (top panel), with the percentage improvement in power (bottom panel)]

8 Concluding Remarks

Astronomical data are now being produced at an unprecedented rate and with increasing complexity, and even more large-scale telescopes are expected to come into operation soon. Although the quantity and complexity of modern astronomical data naturally demand sophisticated statistical tools, no single all-purpose statistical tool exists that can be deployed without careful consideration of its limitations and underlying assumptions. Rather, state-of-the-art statistical methods require care, both in selecting an appropriate method and in applying it properly. In some cases, existing methods do not suffice and new techniques must be developed. Altogether, this means that astronomers must be cognizant of the limitations and assumptions of the statistical and machine learning tools they employ, and must be cautious when using them.

We have proposed six statistical maxims to promote statistically sound data-analytic practices and to improve the quality of scientific findings in astronomy. We hope that researchers will be able to easily check these maxims as part of their daily data-analytic routines. These maxims, however, are certainly not sufficient to solve all of the problems that might arise from the myriad data types used in astronomical data analyses. Rather, we hope that this work gives some momentum to the astronomy community to continue the discussion about sound astronomical data analysis.

  • Abbott et al. (2016) Abbott, B. P., Abbott, R., Abbott, T. D., et al. 2016, Phys. Rev. Lett., 116, 061102, doi:  10.1103/PhysRevLett.116.061102
  • Aggarwal et al. (2018) Aggarwal, A., Schanche, N., Reeves, K. K., Kempton, D., & Angryk, R. 2018, The Astrophysical Journal Supplement Series, 236, 15, doi:  10.3847/1538-4365/aab77f
  • Akritas & Bershady (1996) Akritas, M. G., & Bershady, M. A. 1996, The Astrophysical Journal, 470, 706, doi:  10.1086/177901
  • Andrae et al. (2013) Andrae, R., Kim, D.-W., & Bailer-Jones, C. A. L. 2013, A&A, 554, A137, doi:  10.1051/0004-6361/201321335
  • Andreon & Hurn (2013) Andreon, S., & Hurn, M. 2013, Statistical Analysis and Data Mining: The ASA Data Science Journal, 6, 15, doi:  10.1002/sam.11173
  • Armstrong (2014) Armstrong, R. A. 2014, Ophthalmic and Physiological Optics, 34, 502, doi:  10.1111/opo.12131
  • Autenrieth et al. (2024) Autenrieth, M., van Dyk, D. A., Trotta, R., & Stenning, D. C. 2024, Statistical Analysis and Data Mining: The ASA Data Science Journal, 17, e11643, doi:  https://doi.org/10.1002/sam.11643
  • Avni (1976) Avni, Y. 1976, The Astrophysical Journal, 210, 642
  • Beaton et al. (2016) Beaton, R. L., Freedman, W. L., Madore, B. F., et al. 2016, ApJ, 832, 210, doi:  10.3847/0004-637X/832/2/210
  • Benjamini (2010) Benjamini, Y. 2010, Journal of the Royal Statistical Society Series B: Statistical Methodology, 72, 405, doi:  10.1111/j.1467-9868.2010.00746.x
  • Benjamini & Hochberg (1995) Benjamini, Y., & Hochberg, Y. 1995, Journal of the Royal Statistical Society. Series B (Methodological), 57, 289. http://www.jstor.org/stable/2346101
  • Boller et al. (2016) Boller, T., Freyberg, M. J., Trümper, J., et al. 2016, A&A, 588, A103, doi:  10.1051/0004-6361/201525648
  • Bovy et al. (2011) Bovy, J., Hennawi, J. F., Hogg, D. W., et al. 2011, ApJ, 729, 141, doi:  10.1088/0004-637X/729/2/141
  • Box & Draper (1987) Box, G. E. P., & Draper, N. R. 1987, Empirical Model-Building and Response Surfaces, 1st edn. (John Wiley & Sons)
  • Brown (1994) Brown, P. J. 1994, Measurement, Regression, and Calibration (Oxford University Press)
  • Brown (2018) Brown, P. J. 2018, Journal of the Royal Statistical Society: Series B (Methodological), 44, 287, doi:  10.1111/j.2517-6161.1982.tb01209.x
  • Budavári & Szalay (2008) Budavári, T., & Szalay, A. S. 2008, ApJ, 679, 301, doi:  10.1086/587156
  • Bulbul et al. (2014) Bulbul, E., Markevitch, M., Foster, A., et al. 2014, ApJ, 789, 13, doi:  10.1088/0004-637X/789/1/13
  • Cash (1979) Cash, W. 1979, The Astrophysical Journal, 228, 939
  • Chen et al. (2024) Chen, Y., Li, X., Meng, X., et al. 2024, in preparation
  • Chen et al. (2019a) Chen, Y., Meng, X.-L., Wang, X., et al. 2019a, Journal of the American Statistical Association, 114, 1018, doi:  10.1080/01621459.2018.1528978
  • Chen et al. (2019b) —. 2019b, Journal of the American Statistical Association, 114, 1018, doi:  10.1080/01621459.2018.1528978
  • Craiu et al. (2023) Craiu, R. V., Gong, R., & Meng, X.-L. 2023, Annual Review of Statistics and Its Application, 10, 699, doi:  10.1146/annurev-statistics-040220-015348
  • de Vaucouleurs & Bollinger (1979) de Vaucouleurs, G., & Bollinger, G. 1979, ApJ, 233, 433, doi:  10.1086/157405
  • de Vaucouleurs & Peters (1986) de Vaucouleurs, G., & Peters, W. L. 1986, ApJ, 303, 19, doi:  10.1086/164048
  • Eadie et al. (2023) Eadie, G. M., Speagle, J. S., Cisewski-Kehe, J., et al. 2023, arXiv e-prints, arXiv:2302.04703, doi:  10.48550/arXiv.2302.04703
  • Efron (1985) Efron, B. 1985, Biometrika, 72, 45
  • Efstathiou (2020) Efstathiou, G. 2020, arXiv e-prints, arXiv:2007.10716. https://arxiv.org/abs/2007.10716
  • Evans et al. (2010) Evans, I. N., Primini, F. A., Glotfelty, K. J., et al. 2010, ApJS, 189, 37, doi:  10.1088/0067-0049/189/1/37
  • Feeney et al. (2018) Feeney, S. M., Mortlock, D. J., & Dalmasso, N. 2018, MNRAS, 476, 3861, doi:  10.1093/mnras/sty418
  • Feigelson & Babu (1998) Feigelson, E., & Babu, G. 1998, in Symposium-International Astronomical Union, Vol. 179, Cambridge University Press, 363–370, doi:  10.1017/S0074180900129043
  • Feigelson & Babu (2012) Feigelson, E. D., & Babu, G. J. 2012, arXiv preprint. https://arxiv.org/abs/1205.2064
  • Feigelson & Babu (2012) Feigelson, E. D., & Babu, G. J. 2012, Modern Statistical Methods for Astronomy: With R Applications (Cambridge University Press), doi:  10.1017/CBO9781139015653
  • Feigelson et al. (2021) Feigelson, E. D., De Souza, R. S., Ishida, E. E., & Babu, G. J. 2021, Annual Review of Statistics and Its Application, 8, 493, doi:  10.1146/annurev-statistics-042720-112045
  • Feigelson & Nelson (1985) Feigelson, E. D., & Nelson, P. I. 1985, ApJ, 293, 192, doi:  10.1086/163225
  • Fuller (1987) Fuller, W. A. 1987, Measurement Error Models (New York, NY: John Wiley & Sons, Inc.), doi:  10.1002/9780470316665
  • Gelman (1996) Gelman, A. 1996, Statistica Sinica, 6, 215. http://www.jstor.org/stable/24306008
  • Gelman et al. (2013) Gelman, A., Carlin, J. B., Stern, H. S., et al. 2013, Bayesian Data Analysis (Boca Raton, FL, USA: CRC Press)
  • Gelman et al. (2017) Gelman, A., Simpson, D., & Betancourt, M. 2017, Entropy, 19, doi:  10.3390/e19100555
  • Graham et al. (2014) Graham, M. J., Djorgovski, S. G., Drake, A. J., et al. 2014, Monthly Notices of the Royal Astronomical Society, 439, 703, doi:  10.1093/mnras/stt2499
  • Greenwood & Nikulin (1996) Greenwood, P. E., & Nikulin, M. S. 1996, A Guide to Chi-Squared Testing (New York, NY: John Wiley & Sons, Inc.)
  • Guainazzi et al. (2015) Guainazzi, M., David, L., Grant, C. E., et al. 2015, Journal of Astronomical Telescopes, Instruments, and Systems, 1, 047001, doi:  10.1117/1.JATIS.1.4.047001
  • Hainline et al. (2012) Hainline, L. J., Morgan, C. W., Kochanek, C. S., et al. 2012, in American Astronomical Society Meeting Abstracts, Vol. 219, American Astronomical Society Meeting Abstracts #219, 108.02
  • Hilbe (2014) Hilbe, J. M. 2014, Modeling Count Data (Cambridge University Press), doi:  10.1017/CBO9781139236065
  • Hu & Tak (2020) Hu, Z., & Tak, H. 2020, The Astronomical Journal, 160, 265, doi:  10.3847/1538-3881/abc1e2
  • Hubble (1929) Hubble, E. 1929, Contributions from the Mount Wilson Observatory, 3, 23
  • Isobe et al. (1986) Isobe, T., Feigelson, E. D., & Nelson, P. I. 1986, The Astrophysical Journal, 306, 490, doi:  10.1086/164359
  • Izbicki et al. (2017) Izbicki, R., Lee, A. B., & Freeman, P. E. 2017, The Annals of Applied Statistics, 11, 698 , doi:  10.1214/16-AOAS1013
  • Jurić et al. (2023) Jurić, M., Axelrod, T., Becker, A., et al. 2023, Data Products Definition Document, Vera C. Rubin Observatory. https://lse-163.lsst.io/
  • Kaastra (2017) Kaastra, J. 2017, Astronomy & Astrophysics, 605, A51
  • Kasliwal et al. (2015) Kasliwal, V. P., Vogeley, M. S., & Richards, G. T. 2015, Monthly Notices of the Royal Astronomical Society, 451, 4328, doi:  10.1093/mnras/stv1230
  • Kass & Raftery (1995) Kass, R. E., & Raftery, A. E. 1995, Journal of the American Statistical Association, 90, 773, doi:  10.1080/01621459.1995.10476572
  • Kelly (2007) Kelly, B. C. 2007, ApJ, 665, 1489, doi:  10.1086/519947
  • Kelly et al. (2009) Kelly, B. C., Bechtold, J., & Siemiginowska, A. 2009, The Astrophysical Journal, 698, 895, doi:  10.1088/0004-637x/698/1/895
  • Kelly et al. (2014) Kelly, B. C., Becker, A. C., Sobolewska, M., Siemiginowska, A., & Uttley, P. 2014, The Astrophysical Journal, 788, 33, doi:  10.1088/0004-637X/788/1/33
  • Kelly et al. (2012) Kelly, B. C., Shetty, R., Stutz, A. M., et al. 2012, ApJ, 752, 55, doi:  10.1088/0004-637X/752/1/55
  • Kim et al. (2012) Kim, D.-W., Protopapas, P., Trichas, M., et al. 2012, The Astrophysical Journal, 747, 107, doi:  10.1088/0004-637x/747/2/107
  • Kozłowski (2016) Kozłowski, S. 2016, Monthly Notices of the Royal Astronomical Society, 459, 2787, doi:  10.1093/mnras/stw819
  • Kozłowski et al. (2010) Kozłowski, S., Kochanek, C. S., Udalski, A., et al. 2010, The Astrophysical Journal, 708, 927
  • Kubokawa & Robert (1994) Kubokawa, T., & Robert, C. P. 1994, Journal of Multivariate Analysis, 51, 178
  • Landy & Szalay (1992) Landy, S. D., & Szalay, A. S. 1992, ApJ, 391, 494, doi:  10.1086/171365
  • Lee et al. (2011) Lee, H., Kashyap, V. L., van Dyk, D. A., et al. 2011, ApJ, 731, 126, doi:  10.1088/0004-637X/731/2/126
  • Lindley (1985) Lindley, D. V. 1985, Making Decisions (New York, NY: Wiley)
  • MacLeod et al. (2010) MacLeod, C., Ivezić, Ž., Kochanek, C., et al. 2010, The Astrophysical Journal, 721, 1014
  • Mandel et al. (2017) Mandel, K. S., Scolnic, D. M., Shariff, H., Foley, R. J., & Kirshner, R. P. 2017, ApJ, 842, 93, doi:  10.3847/1538-4357/aa6038
  • Marquardt (2006) Marquardt, T. 2006, Bernoulli, 12, 1099 , doi:  10.3150/bj/1165269152
  • Marquardt & James (2007) Marquardt, T., & James, L. F. 2007, Technical report
  • Marshall (2021) Marshall, H. L. 2021, AJ, 162, 134, doi:  10.3847/1538-3881/ac173d
  • Meyer et al. (2023) Meyer, A. D., van Dyk, D. A., Tak, H., & Siemiginowska, A. 2023, ApJ, 950, 37, doi:  10.3847/1538-4357/acbea1
  • Moreno et al. (2019) Moreno, J., Vogeley, M. S., Richards, G. T., & Yu, W. 2019, PASP, 131, 063001, doi:  10.1088/1538-3873/ab1597
  • Mushotzky et al. (2011) Mushotzky, R. F., Edelson, R., Baumgartner, W., & Gand hi, P. 2011, The Astrophysical Journal Letter, 743, L12, doi:  10.1088/2041-8205/743/1/L12
  • Oman & Srivastava (1996) Oman, S. D., & Srivastava, M. 1996, Scandinavian journal of statistics, 473
  • Osborne (1991) Osborne, C. 1991, International Statistical Review, 59, 309, doi:  10.2307/1403690
  • Partridge et al. (2016) Partridge, B., López-Caniego, M., Perley, R. A., et al. 2016, The Astrophysical Journal, 821, 61, doi:  10.3847/0004-637x/821/1/61
  • Payne et al. (2020) Payne, E., Talbot, C., Lasky, P. D., Thrane, E., & Kissel, J. S. 2020, Phys. Rev. D, 102, 122004, doi:  10.1103/PhysRevD.102.122004
  • Pierson (2013) Pierson, S. 2013, AMSTATNEWS. https://magazine.amstat.org/blog/2013/12/01/science-policy-intel/
  • Planck Collaboration et al. (2016) Planck Collaboration, Ade, P. A. R., Aghanim, N., et al. 2016, A&A, 594, A13, doi:  10.1051/0004-6361/201525830
  • Portillo et al. (2017) Portillo, S. K. N., Lee, B. C. G., Daylan, T., & Finkbeiner, D. P. 2017, AJ, 154, 132, doi:  10.3847/1538-3881/aa8565
  • Protassov et al. (2002) Protassov, R., van Dyk, D. A., Connors, A., Kashyap, V. L., & Siemiginowska, A. 2002, ApJ, 571, 545, doi:  10.1086/339856
  • Reeves et al. (2009) Reeves, J. N., O’Brien, P. T., Braito, V., et al. 2009, ApJ, 701, 493, doi:  10.1088/0004-637X/701/1/493
  • Revsbech et al. (2017) Revsbech, E. A., Trotta, R., & van Dyk, D. A. 2017, Monthly Notices of the Royal Astronomical Society, 473, 3969, doi:  10.1093/mnras/stx2570
  • Riess et al. (2021) Riess, A. G., Casertano, S., Yuan, W., et al. 2021, The Astrophysical Journal, 908, L6, doi:  10.3847/2041-8213/abdbaf
  • Riess et al. (2016) Riess, A. G., Macri, L. M., Hoffmann, S. L., et al. 2016, ApJ, 826, 56, doi:  10.3847/0004-637X/826/1/56
  • Rivers (2014) Rivers, D. L. 2014, Dynamic Bayesian Approaches to the Statistical Calibration Problem (Virginia Commonwealth University)
  • Roe & Woodroofe (1999) Roe, B. P., & Woodroofe, M. B. 1999, Phys. Rev. D, 60, 053009, doi:  10.1103/PhysRevD.60.053009
  • Rots et al. (2017a) Rots, A. H., Burke, D. J., Civano, F., Hain, R., & Nguyen, D. 2017a, in American Astronomical Society Meeting Abstracts, Vol. 229, American Astronomical Society Meeting Abstracts #229, 156.03
  • Rots et al. (2017b) Rots, A. H., Nguyen, D., Budavari, T., et al. 2017b, in AAS/High Energy Astrophysics Division, Vol. 16, AAS/High Energy Astrophysics Division #16, 113.01
  • Ryan et al. (2012) Ryan, D. F., Milligan, R. O., Gallagher, P. T., et al. 2012, The Astrophysical Journal Supplement Series, 202, 11, doi:  10.1088/0067-0049/202/2/11
  • Sandage & Tammann (1975) Sandage, A., & Tammann, G. A. 1975, ApJ, 197, 265, doi:  10.1086/153510
  • Schafer & Purdy (1996) Schafer, D. W., & Purdy, K. G. 1996, Biometrika, 83, 813
  • Schild (1990) Schild, R. E. 1990, AJ, 100, 1771, doi:  10.1086/115634
  • Sereno (2016) Sereno, M. 2016, Monthly Notices of the Royal Astronomical Society, 455, 2149, doi:  10.1093/mnras/stv2374
  • Shaffer (1995) Shaffer, J. P. 1995, Annual Review of Psychology, 46, 561, doi:  https://doi.org/10.1146/annurev.ps.46.020195.003021
  • Shalyapin et al. (2008) Shalyapin, V. N., Goicoechea, L. J., Koptelova, E., Ullán, A., & Gil-Merino, R. 2008, A&A, 492, 401, doi:  10.1051/0004-6361:200810447
  • Shy et al. (2022) Shy, S., Tak, H., Feigelson, E. D., Timlin, J. D., & Babu, G. J. 2022, AJ, 164, 6, doi:  10.3847/1538-3881/ac6e64
  • Siemiginowska et al. (2019) Siemiginowska, A., Eadie, G., Czekala, I., et al. 2019, BAAS, 51, 355. https://arxiv.org/abs/1903.06796
  • Tak et al. (2019) Tak, H., Ellis, J. A., & Ghosh, S. K. 2019, Journal of Computational and Graphical Statistics, 28, 415, doi:  10.1080/10618600.2018.1537925
  • Tak et al. (2018a) Tak, H., Ghosh, S. K., & Ellis, J. A. 2018a, Monthly Notices of the Royal Astronomical Society, 481, 277, doi:  10.1093/mnras/sty2326
  • Tak et al. (2017a) Tak, H., Kelly, J., & Morris, C. 2017a, Journal of Statistical Software, 78, 1, doi:  10.18637/jss.v078.i05
  • Tak et al. (2017b) Tak, H., Mandel, K., van Dyk, D. A., et al. 2017b, The Annals of Applied Statistics, 11, 1309, doi:  10.1214/17-AOAS1027
  • Tak et al. (2018b) Tak, H., Meng, X.-L., & van Dyk, D. A. 2018b, Journal of Computational and Graphical Statistics, 27, 479, doi:  10.1080/10618600.2017.1415911
  • Tanaka et al. (1995) Tanaka, Y., Nandra, K., Fabian, A. C., et al. 1995, Nature, 375, 659, doi:  10.1038/375659a0
  • Teerikorpi (2015) Teerikorpi, P. 2015, A&A, 576, A75, doi:  10.1051/0004-6361/201425489
  • Tewes et al. (2013) Tewes, M., Courbin, F., Meylan, G., et al. 2013, Astronomy & Astrophysics, 556, A22, doi:  10.1051/0004-6361/201220352
  • Vallisneri et al. (2023) Vallisneri, M., Meyers, P. M., Chatziioannou, K., & Chua, A. J. K. 2023, Phys. Rev. D, 108, 123007, doi:  10.1103/PhysRevD.108.123007
  • van Dyk et al. (2001) van Dyk, D. A., Connors, A., Kashyap, V. L., & Siemiginowska, A. 2001, ApJ, 548, 224, doi:  10.1086/318656
  • Verde et al. (2019) Verde, L., Treu, T., & Riess, A. G. 2019, Nature Astronomy, 3, 891, doi:  10.1038/s41550-019-0902-0
  • Villanueva et al. (2021) Villanueva, G. L., Cordiner, M., Irwin, P. G. J., et al. 2021, Nature Astronomy, 5, 631, doi:  10.1038/s41550-021-01422-z
  • Voges (1993) Voges, W. 1993, Advances in Space Research, 13, 391, doi:  10.1016/0273-1177(93)90147-4
  • Voges et al. (1999) Voges, W., Aschenbach, B., Boller, T., et al. 1999, A&A, 349, 389. https://arxiv.org/abs/astro-ph/9909315
  • Wang & Taylor (2022) Wang, Q., & Taylor, S. R. 2022, MNRAS, 516, 5874, doi:  10.1093/mnras/stac2679
  • Wasserstein & Lazar (2016) Wasserstein, R. L., & Lazar, N. A. 2016, The American Statistician, 70, 129, doi:  10.1080/00031305.2016.1154108
  • Wing (2019) Wing, J. M. 2019, Harvard Data Science Review, 1, doi:  10.1162/99608f92.e26845b4
  • Xu et al. (2014) Xu, J., van Dyk, D. A., Kashyap, V. L., et al. 2014, ApJ, 794, 97, doi:  10.1088/0004-637X/794/2/97
  • Yu et al. (2022) Yu, W., Richards, G. T., Vogeley, M. S., Moreno, J., & Graham, M. J. 2022, ApJ, 936, 132, doi:  10.3847/1538-4357/ac8351
  • Zu et al. (2013) Zu, Y., Kochanek, C. S., Kozłowski, s., & Udalski, A. 2013, The Astrophysical Journal, 765, 106, doi:  10.1088/0004-637X/765/2/106


  • Open access
  • Published: 31 August 2024

Regulatory transposable elements in the encyclopedia of DNA elements

  • Alan Y. Du,
  • Jason D. Chobirko,
  • Xiaoyu Zhuo,
  • Cédric Feschotte &
  • Ting Wang

Nature Communications volume 15, Article number: 7594 (2024)


  • Epigenomics
  • Gene regulation
  • Genome evolution
  • Mobile elements

Transposable elements (TEs) comprise ~50% of our genome, but knowledge of how TEs affect genome evolution remains incomplete. Leveraging ENCODE4 data, we provide the most comprehensive study to date of TE contributions to the regulatory genome. We find 236,181 (~25%) human candidate cis-regulatory elements (cCREs) are TE-derived, with over 90% lineage-specific since the human-mouse split, accounting for 8–36% of lineage-specific cCREs. Except for SINEs, cCRE-associated transcription factor (TF) motifs in TEs are derived from ancestral TE sequence more than expected by chance. We show that TEs may adopt regulatory activities similar to those of elements near their integration site. Since human-mouse divergence, TEs have contributed 3–56% of TF binding site turnover events across 30 examined TFs. Finally, TE-derived cCREs are similar to non-TE cCREs in terms of MPRA activity and GWAS variant enrichment. Overall, our results substantiate the notion that TEs have played an important role in shaping the human regulatory genome.

Introduction

Barbara McClintock, who discovered TEs in maize 1 , was the first to recognize their ability to act as cis -regulatory elements (CREs) controlling the expression of nearby genes. In the ensuing decades, it became clear that a large fraction of the genome of multicellular organisms consists of interspersed repeats primarily derived from TEs. In mammals, TEs account for 28–75% of the genome sequence 2 . In humans, at least 46% of the ~3.1 Gb haploid genome is derived from TEs 3 . Most TEs in the human genome can be classified into LINE, SINE, LTR and DNA transposon classes. LINEs are autonomous retrotransposons that use target-primed reverse transcription to insert into the genome. SINEs are short, non-autonomous elements that rely on the LINE machinery to mobilize. LTR elements in the human genome are mostly derived from endogenous retroviruses (ERVs) which expand using the retroviral replication mechanism. Unlike the other three classes which use RNA intermediates to transpose, DNA transposons mobilize directly via a “cut-and-paste” DNA mechanism 4 . The vast majority of human TEs have long ceased transposition activity, and only a small subset of LINEs and SINEs are known to be currently capable of mobilization in modern humans.

Although McClintock viewed TEs as essential “controlling elements”, it is clear that the vast majority of TE sequences in the human genome have not evolved under functional constraint and therefore do not appear to contribute significantly to organismal fitness 5 . Still, about 11% of evolutionarily constrained bases in human fall under TEs 5 , and there have been many reports showing that some TEs function as CREs, including promoters and enhancers, regulating important biological processes (reviewed in refs. 6 , 7 , 8 , 9 , 10 , 11 ). However, we still lack a global picture of how many CREs are derived from TEs and how many are truly functional.

Another fundamental question concerns the evolution of TEs from selfish elements to CREs co-opted for gene regulation. One model postulates that TEs ancestrally harbor CREs and transcription factor binding sites (TFBSs) in order to regulate their own genes, which are then occasionally co-opted for regulating host gene expression. Many examples of previously characterized TE-derived CREs are consistent with this ancestral origin model 12 , 13 , 14 , 15 . An alternative model is that TEs acquire TFBSs and cis -regulatory activity post-insertion through mutation over time. This has been observed for P53, PAX-6, and MYC TFBSs in human Alu SINE elements and circadian clock TFBSs in mouse RSINE1 elements, in which imperfect binding motifs matured into canonical binding motifs over evolutionary time 16 , 17 , 18 .

The ENCODE and Roadmap projects have sought to characterize the landscape of CREs in the human genome, providing invaluable resources for scientists all over the world 19 , 20 . Data from these projects have facilitated systematic investigation of TE contributions to regulatory functions in the genome 21 , 22 . In ENCODE phase 3, genome-wide annotations of candidate cis -regulatory elements (cCREs) were created in both human and mouse genomes 19 . Based on four epigenomic assays and gene annotation, cCREs were classified into promoter-like sequence (PLS), proximal enhancer-like sequence (pELS), distal enhancer-like sequence (dELS), high-H3K4me3 elements (DNase-H3K4me3), and potential boundary elements (CTCF-only). Regions with enhancer signal were separated into pELS and dELS based on their distance to annotated transcription start sites (TSSs). DNase-H3K4me3 cCREs represent regions with promoter signal without a nearby annotated TSS. CTCF-only cCREs represent regions that could be genome folding anchors or other architectural elements. Altogether, cCREs comprise 7.9% and 3.4% of human and mouse genomes, respectively. In the latest ENCODE phase 4, functional assays such as the massively parallel reporter assay (MPRA) have been included to validate regulatory element predictions. Here we leverage these resources to quantify the contribution of TEs to the regulatory genome and derive general principles for how TEs become regulatory elements.

Landscape of TE-derived cCREs in human

To broadly characterize the contribution of TEs to the human regulatory genome, we intersected TEs with cCREs from the version 2 registry of cCREs 19 . As a conservative estimate, we considered cCREs with at least 50% of their sequences coming from a single annotated TE to be TE-derived. Using this criterion, we found that TEs supply ~25% (236,181/926,535) of all human cCREs (Fig.  1A ). When cCREs are separated by annotation type, TE contribution ranges from 4.6% in PLS to 38.2% in CTCF-only cCREs. Compared to their genomic proportion, TEs are generally underrepresented in all types of cCREs (Supplementary Fig.  1 ). Notably, TEs are most depleted in PLS, possibly due to a combination of purifying selection against TE insertion in promoters and incomplete annotation of TE promoters. By contrast, DNase-H3K4me3 and CTCF-only cCREs were enriched for the LTR class of elements (log 2 enrichments of 0.42 and 0.46, respectively). These results suggest that LTR elements have been a prominent source of non-canonical promoters and CTCF binding sites, an observation consistent with previous reports 23 , 24 , 25 , 26 .
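The 50% single-TE overlap rule above can be sketched as a plain interval computation. This is an illustrative reimplementation with toy coordinates, not the pipeline used in the paper (which would typically run on RepeatMasker and cCRE BED files with a tool such as bedtools):

```python
def te_overlap_fraction(ccre, tes):
    """Largest fraction of a cCRE interval covered by any single TE.

    ccre: (start, end); tes: list of (start, end); half-open coordinates.
    """
    ccre_len = ccre[1] - ccre[0]
    best = 0
    for te_start, te_end in tes:
        # Length of the intersection between the cCRE and this TE copy
        overlap = max(0, min(ccre[1], te_end) - max(ccre[0], te_start))
        best = max(best, overlap)
    return best / ccre_len

# Toy example: a 300-bp cCRE whose first 180 bp fall inside a TE copy
ccre = (1000, 1300)
tes = [(900, 1180), (5000, 5300)]
frac = te_overlap_fraction(ccre, tes)
is_te_derived = frac >= 0.5  # the paper's conservative criterion
```

Requiring the overlap to come from a single annotated TE, as the paper does, avoids counting cCREs pieced together from several short unrelated fragments.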

Fig. 1

A Proportion of genome and cCREs that are TE-derived. B Number of elements per TE subfamily, grouped by TE class, that are associated with a cCRE. C Enrichment of TE subfamily overlapping with cCREs relative to their abundance in the genome, grouped by TE class. Shown in ( B ) and ( C ) are 326 DNA transposons subfamilies, 184 LINE subfamilies, 595 LTR subfamilies, and 61 SINE subfamilies annotated by RepeatMasker. D Proportion of cCREs that are TE-derived across 25 fully profiled cell/tissue types. E Percentage of cCREs that are TE-derived for cell/tissue specific cCREs to ubiquitously used cCREs. The x-axis is the number of fully profiled cell/tissue types in which the cCRE is found. cCRE candidate cis -regulatory element, dELS distal enhancer-like sequence, pELS proximal enhancer-like sequence, PLS promoter-like sequence, TE transposable element. For all boxplots in this paper, box, interquartile range (IQR); center, median; whiskers, 1.5 × IQR.

Within each TE class, TEs are further subdivided into many subfamilies of variable age and sequence composition. Thus, we next examined TE contributions to the human regulatory genome at the subfamily level. In terms of absolute numbers of cCRE-associated TEs, LINE and SINE classes contribute the most cCREs per subfamily on average (~5 times more than LTR and ~9 times more than DNA elements) (Fig.  1B ). On the other hand, after normalizing to genomic abundance, the LTR class is the most enriched per subfamily on average for cCREs (~4 times more than LINE/SINE and ~2 times more than DNA elements) (Fig.  1C ). These results confirm that LTR elements are generally more likely to supply cCREs in the human genome, possibly because they contain strong promoter and enhancer sequences 10 . However, numerically the majority of TE-derived cCREs come from SINEs and LINEs due to their sheer abundance in the human genome.
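The enrichment values reported above (e.g., log 2 enrichments of 0.42 and 0.46 for LTRs) are, in outline, observed-over-expected log 2 ratios. A minimal sketch with invented shares; the real analysis normalizes each subfamily to its genomic abundance:

```python
import math

def log2_enrichment(share_in_ccres, share_in_genome):
    """log2 of a TE class's share of cCRE bases over its genomic share."""
    return math.log2(share_in_ccres / share_in_genome)

# Invented shares: a class covering 8% of the genome but 11% of a given
# cCRE type's bases comes out around 0.46, similar in scale to the LTR
# enrichments quoted in the text
enr = log2_enrichment(0.11, 0.08)
```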

Considering that regulatory elements are often active in a cell-type-specific manner, we evaluated the contribution of TEs to each of the 25 ENCODE cell/tissue types with full cCRE profiling. Overall, TEs make up between 9 and 19% of cCREs across cell/tissue types (Fig.  1D ). The proportion of TE classes contributing to cCREs stays relatively stable across cell/tissue types (Supplementary Fig.  1 ). Next, we examined whether TE-derived cCREs are more or less likely to be cell-type specific compared to non-TE cCREs. We grouped all cCREs by the number of cell types that share them. As cCREs become more ubiquitously active across the 25 fully profiled cell/tissue types (i.e., less cell-type specific), the percentage that are TE-derived decreases (Fig.  1E ), indicating that cCREs contributed by TEs are more likely to be cell-type specific. This observation is consistent with previous reports that find TEs to contribute cell-type specific regulatory elements 21 , 27 , 28 .

Evolution of TE-derived cCREs across human and mouse

Next, we investigated the contribution of TEs in the evolution of cCREs in the human and mouse lineages. Starting from 926,535 human cCREs, we identified syntenic mouse regions using UCSC liftOver 29 , yielding 601,136 syntenic regions corresponding to ~66% rate of synteny (Fig.  2A , Supplementary Fig.  2 ). This is significantly higher than the ~40% rate of synteny based on whole genome comparison ( p  = 1.5 × 10 −323 , binomial test), which is expected as cCREs should be enriched for functional regulatory elements and therefore more evolutionarily constrained 30 . To identify cCREs derived from TEs orthologous in human and mouse (acquired from their common ancestor), we required that the human cCRE be TE-associated and the corresponding mouse syntenic region contains the same annotated TE (“Methods”). As expected, orthologous TEs are primarily composed of old TE subfamilies that exist in both human and mouse (Supplementary Fig.  2 ). This approach identified 18,010 TE-derived human cCREs (1.9% of all human cCREs) with a mouse orthologous sequence. Overall, 97% (228,670/236,181) of human TE-derived cCREs are only found in the human lineage. We performed the reciprocal analysis starting from mouse cCREs and found similar results: 1.7% (5900/339,815) of mouse cCREs are TE-derived and have human orthology, and 93% (38,815/41,800) of mouse TE-derived cCREs are only found in the mouse lineage. Thus, TE-derived cCREs are overwhelmingly lineage-specific, and ancient TEs are a minor source of cCREs shared between human and mouse.
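The synteny comparison above (~66% observed vs. a ~40% genome-wide background) is so lopsided that even a stdlib normal approximation to the one-sided binomial test makes the point; the exact test gives the reported p = 1.5 × 10 −323 , which underflows double precision:

```python
import math

n = 926_535   # human cCREs tested with liftOver (from the text)
k = 601_136   # cCREs with a syntenic mouse region (~66%)
p0 = 0.40     # genome-wide rate of synteny (from the text)

# z-score of the observed count under Binomial(n, p0); a z this large
# corresponds to a one-sided p-value far below machine precision
mean = n * p0
sd = math.sqrt(n * p0 * (1 - p0))
z = (k - mean) / sd
```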

Fig. 2

A Classification of shared and lineage-specific cCREs for human to mouse comparison. For orthologous TE-cCREs, syntenic cCRE in mouse is not required but can be present. B Percentage of cCREs that are shared or lineage-specific for orthologous TE and syntenic non-TE human anchored cCRE regions. Shared cCREs are split into “same” and “different” categories depending on the syntenic human and mouse cCRE types. Grouping by cCRE type is done using the anchored human cCRE. Multinomial tests for goodness of fit (log-likelihood ratio, exact or Monte Carlo simulations with 1,000,000 random trials) were performed to compare TE and non-TE distributions (dELS Monte Carlo p -value = 0; pELS Monte Carlo p -value = 0; PLS exact p -value = 3.922 × 10 −26 ; DNase-H3K4me3 exact p -value = 3.975 × 10 −16 ; CTCF-only exact p -value = 7.233 × 10 −28 ; no multiple test correction). C 100-way vertebrate phastCons score distributions for orthologous TEs and non-orthologous TEs associated with human cCREs. One-sided Wilcoxon rank-sum test p -value < 2.2 × 10 −16 . D 100-way vertebrate phastCons score distributions for orthologous TEs that have cCRE in both human and mouse vs. human only. One-sided Wilcoxon rank-sum test p -value < 2.2 × 10 −16 . E Percentage of conserved and novel (lineage-specific) cCREs that are TE-derived, split up by cCRE type. Percentages for human and mouse are shown by red and blue dots, respectively. Bars represent the mean percentage between human and mouse. *** p  < 0.001.

To investigate how often orthologous TEs evolve shared function across lineages, we first searched for orthologous TEs with conserved cis -regulatory function in human and mouse. Of 98,278 human cCREs with the same syntenic mouse cCRE, 1575 (1.6%) are derived from orthologous TEs. This is similar to the percentage of human cCREs that are TE-derived and have a mouse TE ortholog (1.9%), indicating that conserved regulatory elements are not enriched for TEs. We next asked whether orthologous TEs and non-TE syntenic sequences have different annotations between human and mouse. We categorized each human-mouse pair of sequences into same cCRE type (same), shared cCREs but different type (different), or lineage-specific cCREs. Regardless of cCRE type, orthologous TEs contributing cCREs in human display a significantly different proportion of same, different, and lineage-specific cCRE annotations compared to non-TE sequences (Fig.  2B ). Contrary to the null expectation where the proportions are the same between TEs and non-TEs, orthologous TEs that contribute cCREs are more lineage-specific than the non-TE syntenic background, ranging from 7.9% difference for dELS to 41.2% difference for PLS in human (Exact multinomial tests, p  < 0.001). We performed the same analyses starting from mouse cCREs and found similar results, with differences ranging from 8.7% for DNase-H3K4me3 to 36% for PLS (Supplementary Fig.  2 , exact multinomial tests, p  < 0.001). This suggests that among cCREs with a shared sequence origin, TEs are more likely to diverge in cis -regulatory activity to provide lineage-specific function compared to non-TE sequences.
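The multinomial goodness-of-fit comparison above (log-likelihood ratio) reduces to a G statistic. The sketch below uses invented counts for the (same, different, lineage-specific) categories and an invented non-TE background; the paper's version additionally uses exact or Monte Carlo p-values:

```python
import math

def g_statistic(observed, background_probs):
    """Log-likelihood-ratio (G) statistic for multinomial goodness of fit.

    Compared against a chi-square with len(observed) - 1 degrees of
    freedom, or against Monte Carlo draws as in the paper.
    """
    n = sum(observed)
    g = 0.0
    for obs, p in zip(observed, background_probs):
        expected = n * p
        if obs > 0:  # zero-count cells contribute nothing to G
            g += 2 * obs * math.log(obs / expected)
    return g

# Invented example: orthologous-TE cCREs skewed toward lineage-specific
# annotations relative to a non-TE background of (40%, 15%, 45%)
g = g_statistic([120, 60, 320], [0.40, 0.15, 0.45])
```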

Sequence conservation is generally a good indicator for conserved function. Since we can be confident that orthologous TEs in human and mouse descend from a common ancestor, we tested whether sequence conservation as measured by phastCons score is correlated with their shared annotation as cCREs. Considering only TE subfamilies that are ancestral to human and mouse, we confirmed that orthologous TE sequences have higher phastCons scores than non-orthologous TEs (Wilcoxon test, p  < 2.2 × 10 −16 ) (Fig.  2C ). Furthermore, we found that orthologous TEs with shared cCRE annotation have higher phastCons scores compared to orthologous TEs with lineage-specific cCRE annotation (Wilcoxon test, p  < 2.2 × 10 −16 ) (Fig.  2D ). This result suggests that TE-derived cCREs shared by both species are under stronger functional constraint than those that are lineage-specific cCREs. Thus, this set of ~1500 orthologous TE-derived cCREs are strong candidates for being co-opted for important and conserved cellular function.
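The one-sided Wilcoxon rank-sum comparisons of phastCons scores reduce to the Mann–Whitney U statistic; a brute-force stdlib sketch with toy scores (real usage would call a statistics library for the p-value):

```python
def mann_whitney_u(x, y):
    """U statistic: how often an x value exceeds a y value (ties count 0.5).

    U near len(x) * len(y) means x is stochastically larger than y.
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Toy phastCons-like scores: orthologous TEs trend higher
orthologous = [0.62, 0.55, 0.71, 0.48, 0.66]
non_orthologous = [0.21, 0.35, 0.18, 0.40, 0.27]
u = mann_whitney_u(orthologous, non_orthologous)  # maximum possible is 25
```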

Given that most human TE-cCREs are not found in mouse and vice versa, we sought to quantify the contribution of TEs to lineage-specific cCREs relative to non-TE sequences. In human, 85% (788,108/926,535) of cCREs were identified as lineage-specific due to either lack of syntenic sequence in mouse or synteny with no mouse cCRE. Among human lineage cCREs, 29% (228,670/788,108) could be attributed to TEs. In mouse, 61.6% (209,338/339,815) of cCREs were identified as lineage-specific, of which 18.5% (38,815) were TE-associated. We found that TEs have contributed between 6 and 38% of lineage-specific cCREs depending on cCRE type, with the lowest contribution to PLS and the highest to CTCF binding sites (Fig.  2E ). Despite more cCRE data being available for human compared to mouse, we observed a similar trend in human and mouse in which TEs supplied 10–40% of human lineage cCREs and 6–33% of mouse lineage cCREs. Overall, these results support the long-standing hypothesis that TEs have had a substantial impact on cis -regulatory innovation during mammalian evolution 31 , 32 , 33 , 34 .
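The lineage-specific percentages quoted in this paragraph follow directly from the reported counts:

```python
# Counts as reported in the text
human_lineage_specific = 788_108 / 926_535   # fraction of human cCREs that are lineage-specific (~85%)
human_te_share = 228_670 / 788_108           # TE-derived share of those (~29%)
mouse_lineage_specific = 209_338 / 339_815   # fraction of mouse cCREs (~61.6%)
mouse_te_share = 38_815 / 209_338            # TE-derived share of those (~18.5%)
```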

Origin of cCRE-associated transcription factor motifs in TEs

As TFBSs are a major component in driving cis -regulatory activity of a sequence, we looked for TF motifs that are associated with cCRE activity in TEs. For each TE subfamily defined by RepeatMasker, we looked for TF motifs that are enriched in cCRE-associated copies of the subfamily relative to non-cCRE copies of the same subfamily (Supplementary Fig.  3A , “Methods”). By using copies of the same TE subfamily as background sequences in this analysis, we minimize the influence of TF motifs that are merely enriched in the TE subfamily compared to the rest of the genome. In total, we could detect 1183 cCRE-associated TF motifs across 376 TE subfamilies (Supplementary Fig.  3B ). The TFs most frequently associated with cCREs include AP1, CTCF, ETS, KLF, and Ebox motifs (Supplementary Data  5 ).
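One standard way to score the cCRE-copy vs. non-cCRE-copy motif enrichment described above is a one-sided hypergeometric (Fisher-style) test. The paper's exact procedure is not detailed here, and the counts below are invented:

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k) when drawing N copies from M total, n of which carry
    the motif: a one-sided enrichment p-value."""
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i)
               for i in range(k, min(n, N) + 1)) / total

# Invented subfamily: 200 copies, 40 of them cCRE-associated; the motif
# occurs in 25 of the 40 cCRE copies but only 35 of the 160 others
# (60 motif-bearing copies total); the null expectation is ~12 of 40
p = hypergeom_sf(25, 200, 60, 40)
```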

We investigated whether cCRE-enriched TF motifs likely originated from the ancestral TE or arose through mutations after insertion. We first asked what percentage of cCRE-enriched motifs can be identified in the TE’s consensus sequence, which represents their ancestral TE sequence. Of 1183 motifs, 541 (46%) are found in consensus sequences (Supplementary Fig.  3C ). Notably, SINEs have the lowest percentage (mean of 12%) among TE classes. To increase resolution and specificity, we extended our analysis to consider motif location for individual TE copies. If a TF motif is truly derived from its ancestral TE insertion, we expect the motif to be in the same relative position as the consensus sequence’s motif. Thus, we inferred the ancestral origin of each TE’s motifs based on the presence or absence of the motif within 10 bp of a consensus motif (Fig.  3A ). At a mean of 7%, SINEs once again have the lowest percentage of ancestrally derived TF motifs (percent ancestral origin) for cCRE-associated motifs across TE subfamilies (Fig.  3B ). We also observed that cCRE-associated TF motifs have significantly higher percent ancestral origin compared to randomly selected motifs for DNA transposons ( p  = 1.2 × 10 −8 ), LINEs ( p  = 8.7 × 10 −8 ), LTRs ( p  < 2.2 × 10 −16 ), and ERV internal regions (ERV-int) ( p  = 0.0067). This suggests that ancestral TE sequences serve as an important source of TF motifs in cCREs for most but not all TE classes. For SINEs, the percentage of cCRE-associated motifs of ancestral origin was not different from random expectation ( p  = 0.86), indicating that the ancestral TE sequence does not generally explain the presence of TF motifs enriched within SINE-derived cCREs.
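The 10-bp ancestral-origin rule can be sketched directly. Positions are assumed to already be in consensus coordinates after aligning each copy to the subfamily consensus, and the values are toys:

```python
def percent_ancestral(copy_motif_positions, consensus_motif_positions, tol=10):
    """Share of motif instances lying within `tol` bp of any consensus motif."""
    if not copy_motif_positions:
        return 0.0
    ancestral = sum(
        1 for pos in copy_motif_positions
        if any(abs(pos - c) <= tol for c in consensus_motif_positions)
    )
    return 100.0 * ancestral / len(copy_motif_positions)

# Toy data: consensus motifs at positions 50 and 200; three of four copy
# motifs fall within 10 bp of one, so they are scored as ancestrally derived
pct = percent_ancestral([48, 55, 205, 400], [50, 200])
```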

figure 3

A Calculation of ancestral origin percentage for cCRE-associated TF motifs. For TF motifs found in the TE subfamily consensus sequence (motif #1), percent ancestral origin was calculated as the percentage of motifs in individual TE copies that align to within 10 bp of the consensus motif (dotted line). For TF motifs that are not found in the consensus sequence (motif #2), we assumed that the TE subfamily’s ancestral sequence did not contain the TF motif and all instances of the motif arose through mutation. The mean percentage was taken across all cCRE-associated TF motifs for a TE subfamily to equally weight each TF motif. B Distribution of mean percent ancestral origin of cCRE-associated TF motifs for each TE subfamily, separated by TE class. Two-sided Wilcoxon rank-sum test with Benjamini–Hochberg correction was performed to compare percent ancestral origin between observed cCRE-associated TF motifs and randomly selected TF motifs (DNA transposons p -value = 1.17 × 10 −8 , LINE p -value = 8.71 × 10 −8 , LTR p -value < 1.1 × 10 −15 , ERV-int p -value = 0.00668, SINE p -value = 0.863), and to compare percent ancestral origin between TE classes (DNA transposon vs. SINE p -value = 0.0097, LTR vs. LINE p -value = 0.0239, LTR vs. SINE p -value = 0.0002). C Correlation between TE subfamily Kimura divergence and cCRE-associated TF motif percent ancestral origin. R-squared and p -values for each linear regression are shown along with the 95% confidence interval band. The TE subfamily Kimura divergence represents the age of the TE subfamily given the neutral evolution of most TEs. * p  < 0.05, ** p  < 0.01, *** p  < 0.001.

Next, we considered the evolutionary fate of TF motifs within TE sequences. If TEs contain ancestral TF motifs that are occasionally retained within cCREs, we expect many ancestrally derived TF motifs to gradually decay away through accumulated mutations as most TE copies neutrally evolve. Consistent with this prediction, TE subfamily age, as measured by the mean Kimura divergence of individual copies from the subfamily consensus, is negatively correlated with the percentage of ancestrally derived TF motifs (Fig.  3C ). When we break down this analysis per TE class, we found that this correlation held for LINEs, LTRs, and ERV-int, but not for DNA transposons or SINEs. As an orthogonal analysis, we calculated the percentage of TE copies containing the TF motif within each subfamily, based on the hypothesis that ancestrally derived motifs should be found in a higher percentage of copies compared to mutation-derived motifs. TE subfamily age was negatively correlated with the percentage of copies with motif for all TE classes except for SINEs (Supplementary Fig.  3E ). Taken together, these findings suggest that most TE subfamilies arrive in the genome already containing cis -regulatory sequence features that are retained within cCREs. SINEs tend to adopt a different trajectory whereby TF motifs do not preexist within their ancestral sequence but evolve subsequently via mutations. It is important to note the considerable variation between different TE subfamilies, highlighting that each TE subfamily has its own unique evolutionary path to maintain or acquire cis -regulatory activity.

Examining the consensus coverage of cCRE-associated TEs compared to non-cCRE TEs revealed an unexpected enrichment over the 5’ end of LINE1, even after controlling for length (Supplementary Figs.  4 and 5 ). This indicates that LINE1 copies that retain the 5’ end disproportionately contribute to cCREs. This observation is consistent with prior findings that the 5’ region of LINE1 harbors its promoter and contains a wealth of TFBSs 35 , 36 , 37 , 38 . These results suggest that the 5’ end of LINEs may be similar to LTRs in providing regulatory sequence.

Genomic context influences the cis -regulatory potential of TEs

As TEs are not evenly distributed throughout the genome, we next sought to explore whether there is any relationship between the genomic loci of TE-derived cCREs and non-TE cCREs. Specifically, we quantified the relative distance (a normalized distance between two genomic loci) from either all TEs or cCRE-associated TEs to the nearest non-TE-derived cCREs. Here, we focus on relative distances from 0 to 0.1, where the differences were most prominent. If TEs develop into cCREs regardless of their insertion location, we should observe a uniform distribution of distances between cCRE-associated TEs and non-TE cCREs (a flat line in the relative distance plot). As expected, TE insertions overall are uniformly distributed in the genome relative to cCREs (left panels in Fig.  4A ). However, TEs associated with PLS, pELS, and DNase-H3K4me3 are significantly closer to other cCREs of the same category (higher proportion at low relative distances; Kolmogorov–Smirnov (KS) test) when using cell type-agnostic annotations (middle panels in Fig.  4A ). While not significantly closer when considering cell type-agnostic annotations of dELS, TEs associated with dELS are significantly closer to non-TE dELS sites after separating dELS by cell or tissue type (brown lines in right panels in Fig.  4A ). This suggests that, despite being uniformly distributed in the genome in general, TE insertions close to other promoters or enhancers are more likely to become promoters or enhancers themselves. At the TE class level, LTR elements associated with cCREs are more likely to be distant from non-TE cCREs (Fig.  4A ), which implies that LTRs are less dependent on genomic context in displaying regulatory activity compared to other TE classes. Lastly, we found that the distances from TEs associated with CTCF-only sites to non-TE CTCF-only sites are consistent with a random distribution in both human and mouse (red lines in Fig.  4A and Supplementary Fig.  6A ), despite abundant B2-derived CTCF binding sites in the mouse genome 25 . This indicates that CTCF binding sites provided by TEs are scattered randomly in the genome, which could facilitate the formation of new chromatin boundaries. We performed the same analysis using mouse cCREs and TEs and found all trends observed in human to be consistent in the mouse genome (Supplementary Fig.  6A ).
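One standard formulation of the relative distance metric, which we assume here (it matches the BEDTools "reldist" definition: distance to the nearest reference feature divided by the spacing between the two flanking features, uniform on [0, 0.5] under random placement), can be sketched as:

```python
import bisect

def relative_distances(query_points, ref_points):
    """For each query position, the distance to the nearest reference
    position divided by the spacing between the two flanking references
    (values in [0, 0.5]); queries outside the reference range are skipped.
    Under random placement the resulting values are uniformly distributed,
    which is the null model tested with the KS test in the text."""
    ref = sorted(ref_points)
    out = []
    for q in query_points:
        i = bisect.bisect_left(ref, q)
        if i == 0 or i == len(ref):
            continue  # no flanking reference on one side
        left, right = ref[i - 1], ref[i]
        out.append(min(q - left, right - q) / (right - left))
    return out

# Toy example: references every 100 bp; a query at 110 sits 10 bp from
# the nearest reference out of a 100 bp gap, giving a relative distance of 0.1.
print(relative_distances([110, 150], [0, 100, 200]))
```

In practice the queries would be TE (or TE-cCRE) midpoints and the references non-TE cCRE midpoints; this sketch uses scalar positions for clarity.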

figure 4

A Relative distance of all TEs to cCREs, cCRE-associated TEs to all cCREs of the same type (cell agnostic), and cCRE-associated TEs to same cell/tissue type cCREs. Relative distances are further separated by TE class and cCRE type. See Supplementary Data 6 for KS test p -values. B Median distances for TF bound TEs (red) and non-bound TEs (blue) across 409 TFs in 535 TF ChIP-seq datasets. One-sided Wilcoxon rank-sum test was performed with Benjamini–Hochberg correction ( p -value < 2.2 × 10 −16 for all TE classes). C Percentage of TE-derived TFBSs for 30 TFs with ChIP-seq in human K562 and mouse MEL cells. TE percentage is further divided into binding sites that are species-specific with no synteny, binding sites that are species-specific with synteny, and binding sites that are shared. D Percentage of putative TFBS turnover events that come from TEs. Each percentage is split up by TE class contribution for the TF. E Browser shot of USF2 binding site turnover in human facilitated by primate lineage insertion of MER5A. Underlying USF2 motif sequence alignment in human and mouse are shown (if available). TF transcription factor, KS test Kolmogorov–Smirnov test. *** p  < 0.001.

To further probe the connection between linear genomic distance and TE-derived cis -regulatory activity, we next examined the distance of TEs to TFBSs in K562 cells. For each of 409 TFs with ChIP-seq datasets where at least one TE subfamily was bound at least 10 times, we compared the distance of TF-bound TEs to their nearest non-TE TFBS of the same TF to that of non-bound TEs of the same subfamily. Across all TFs, we found that TF-bound TE copies are ~10 times closer to non-TE TFBSs of the same TF than non-bound TE copies, regardless of TE class (Fig.  4B ). These results are consistent with our distance analysis with cCREs and suggest that TEs with cis -regulatory activity tend to be proximal to other cis -regulatory elements.

Since TF-bound TEs tend to reside near non-TE TFBSs, we hypothesized that TEs can introduce local redundancy in TF binding, which may promote TFBS turnover during evolution 39 , 40 , 41 , 42 , whereby the TE-derived TFBS can functionally replace the nearby ancestral TFBS. To test this hypothesis, we selected the 30 TFs with high-quality ChIP-seq data in both human K562 and mouse MEL cells, two analogous erythroleukemia cell lines. As reported previously 43 , we found that up to ~40% of TFBSs are contributed by TEs in each cell line (Fig.  4C ). While most TFBSs in TEs are derived from species-specific TEs, 13–54% and 20–58% of TE-derived TFBSs are syntenic in human and mouse, respectively. These syntenic TEs are frequently bound by TFs in just one lineage (syntenic but specific). This suggests that TEs are involved in a dynamic evolutionary process where TFBSs can be gained or lost in one lineage through lineage-specific mutations. In addition to providing raw material for innovation in the form of novel TFBSs, TEs are also thought to help maintain local TF binding through TFBS turnover. To identify putative TFBS turnover events, we searched for lineage-specific TFBSs within 5 kb of a syntenic TFBS in the other lineage and inferred which TFBS was ancestral based on synteny and phastCons score (Supplementary Fig.  7A ). Using this approach, we discovered a total of 6700 and 9245 putative TFBS turnover events across 30 TFs in human and mouse, respectively (Supplementary Fig.  7B ). TEs make up 3–56% of putative turnover events, with most derived from lineage-specific TE insertions (Fig.  4D, E ). The TFs with the highest TE-derived turnover rates are CTCF and RAD21 in mouse, both of which are part of the cohesin loading complex. These results are consistent with previous studies that have found TEs to participate in CTCF binding site turnover after human-mouse divergence 25 . Our findings point to TEs as important drivers of TFBS turnover during evolution.
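The candidate search described above (lineage-specific TFBSs within 5 kb of a syntenic TFBS from the other lineage) can be sketched as follows; positions are hypothetical, and the ancestral-state inference via synteny and phastCons is out of scope for this sketch:

```python
import bisect

def putative_turnover_events(specific_sites, partner_sites, window=5000):
    """Pair each lineage-specific TFBS position with every syntenic TFBS
    from the other lineage (positions projected into the same coordinate
    frame) lying within `window` bp; each pair is a candidate turnover
    event. Which site is ancestral is decided separately in the text
    using synteny and phastCons scores."""
    partner = sorted(partner_sites)
    events = []
    for pos in specific_sites:
        i = bisect.bisect_left(partner, pos - window)
        j = bisect.bisect_right(partner, pos + window)
        events.extend((pos, other) for other in partner[i:j])
    return events

# Toy example: a lineage-specific site at 12,000 lies 3 kb from a syntenic
# site at 15,000, yielding one candidate turnover event; the site at
# 40,000 has no syntenic partner within 5 kb.
print(putative_turnover_events([12000, 40000], [15000]))
```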

TE- and non-TE-derived cCREs have similar features

Since TEs contribute a large proportion of cCREs across the human genome, we explored whether TE-derived cCREs have distinct properties from non-TE cCREs. First, we considered sequence-intrinsic cis -regulatory activity as measured by MPRA. Using ENCODE phase 4 lentivirus-based MPRA (lentiMPRA) data in K562 cells, which assayed all open chromatin sites in K562, we asked whether the tested genomic sequences display differential regulatory activity based on TE annotation 44 . We classified sequences as TE-associated if at least 50% of the sequence is contributed by a single TE, resulting in ~34,000 TE-associated and ~81,000 non-TE sequences tested by MPRA. We further split sequences based on cCRE type. Overall, TE-associated sequences have similar or higher levels of MPRA activity compared to non-TE sequences of the same cCRE type (Fig.  5A ). MPRA activity was significantly higher for TE-associated sequences in all cCRE types except PLS and CTCF-bound cCREs (Wilcoxon rank-sum test, p  < 0.05), but the differences in activity were subtle (median log 2 fold change difference of 0.207, 0.060, 0.060, 0.078, and 0.052 for DNase-H3K4me3, dELS, pELS, DNase-only, and None, respectively). This is consistent with a previous study that found TE sequences of one LTR subfamily to have higher MPRA activity than positive control sequences, albeit with far fewer tested elements 15 . Overall, these results suggest that TE-derived sequences possess sequence potential for cis -regulatory activity as strong as or stronger than that of their non-TE counterparts.
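The classification rule stated above (TE-associated if a single TE contributes at least 50% of the tested sequence) can be sketched as:

```python
def is_te_associated(seq_start, seq_end, te_intervals, min_frac=0.5):
    """A tested MPRA sequence counts as TE-associated when a single TE
    copy covers at least `min_frac` of it; overlap is never pooled
    across multiple TE copies."""
    length = seq_end - seq_start
    for te_start, te_end in te_intervals:
        overlap = min(seq_end, te_end) - max(seq_start, te_start)
        if overlap >= min_frac * length:
            return True
    return False

# Toy example: a 100 bp tested sequence overlapped 60 bp by one TE copy
# qualifies; one overlapped only 20 bp does not.
print(is_te_associated(0, 100, [(40, 200)]))
print(is_te_associated(0, 100, [(80, 300)]))
```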

figure 5

A K562 lentiMPRA activity split by cCRE annotation or lack thereof. Number of TE and non-TE sequences tested for each group is listed. Median log 2 fold change over negative control activity is displayed as a white dash. Two-sided Wilcoxon rank-sum test with Benjamini–Hochberg correction was performed to compare TE and non-TE sequence MPRA activity (None p -value = 4.13 × 10 −91 , DNase-only p -value = 3.51 × 10 −21 , CTCF p -value = 0.166, pELS p -value = 0.0223, dELS p -value = 7.19 × 10 −10 , DNase-H3K4me3 p -value = 5.85 × 10 −6 , PLS p -value = 0.195). B 1000 Genomes Project common variants (allele frequency >1%) in TE and non-TE cCREs. Percentage of cCREs that overlap common variants (top) and variants per 100 bp (bottom) are shown for each cCRE type. Percentages of common variant overlap for upstream and downstream cCRE flanking regions are shown as black dots (top). Mean variants per 100 bp is displayed by a black dot and listed below each boxplot (bottom). Outliers have been removed from boxplots. Comparisons between TE cCREs, non-TE cCREs, and flanking regions were done using permutation tests (1000 permutations, two-sided). The numbers of TE cCREs shown in the boxplot for CTCF-only, dELS, DNase-H3K4me3, pELS, and PLS cCREs are 14,110, 120,930, 5766, 14,787, and 1100. The numbers of non-TE cCREs shown in the boxplot for CTCF-only, dELS, DNase-H3K4me3, pELS, and PLS cCREs are 22,089, 305,405, 10,820, 63,862, and 16,508. The numbers of TE-cCRE flanking regions shown in the boxplot for CTCF-only, dELS, DNase-H3K4me3, pELS, and PLS cCREs are 28,609, 242,971, 11,514, 29,325, and 2185. See Supplementary Data  3 for p -values. C Venn diagram of human-lineage TE and non-TE cCRE overlap with TF ChIP-seq, ATAC-seq, 30-way phastCons, and MPRA activity. One-sided Chi-square test was performed to compare TE and non-TE cCRE overlaps. D Enrichment of GWAS variants by EBI EFO parent term for all cCREs, non-TE cCREs, and TE cCREs. 
Permutation test (1000 permutations, one-sided) was performed to compare observed overlap of cCREs with GWAS variants to shuffled genomic background. See Supplementary Data  2 for p -values. MPRA, massively parallel reporter assay; lentiMPRA, lentivirus-based MPRA. ** p  < 0.01, *** p  < 0.001, not significant (ns).

Next, we examined the frequency of variants found in the human population for TE-derived and non-TE cCREs based on the 1000 Genomes Project 45 . The expectation is that regions under functional constraint, like DNA elements regulating genes, would have fewer common variants, defined here as variants with human population allele frequency greater than 1%. Promoter distal TE-derived cCREs (dELS and CTCF-only) overlap variants less often than regions directly flanking them (Fig.  5B top, Supplementary Fig.  8 ). Furthermore, the frequency of common variants found in TE-derived cCREs is lower than their flanking regions apart from promoter sequences (PLS and DNase-H3K4me3) (Fig.  5B bottom, Supplementary Fig.  8 ). These results suggest that non-promoter TE-derived cCREs are under functional constraint and less tolerant of sequence variation than their surrounding sequences. However, TE-derived PLS, pELS, and dELS cCREs have significantly more common variants compared to non-TE cCREs, though the trend exists for all cCRE types. This indicates that TE-derived cCREs are generally less constrained in the human population than those apparently not derived from TEs.
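The per-element variant density used above can be illustrated with a minimal sketch (positions and allele frequencies are hypothetical; the 1% "common" threshold follows the text):

```python
def common_variants_per_100bp(variants, region, min_af=0.01):
    """Count variants inside `region` = (start, end) with population
    allele frequency above `min_af` (the 'common' threshold of 1% used
    in the text), normalized to a per-100-bp rate so elements of
    different lengths are comparable."""
    start, end = region
    n = sum(1 for pos, af in variants if start <= pos < end and af > min_af)
    return 100.0 * n / (end - start)

# Toy example: two common and one rare variant inside a 200 bp element.
print(common_variants_per_100bp([(10, 0.05), (50, 0.30), (120, 0.002)], (0, 200)))
```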

Besides the epigenomic marks used by ENCODE to define cCREs, we used four additional metrics for identifying regulatory elements to compare the global profiles of TE-derived and non-TE cCREs: ATAC-seq for open chromatin, TF ChIP-seq for TF binding, MPRA for regulatory potential of the underlying sequence, and phastCons score for sequence conservation. As the vast majority of TE-derived cCREs are lineage-specific, we limited our analysis to TE-derived and non-TE cCREs that are found in human but not in mouse, allowing us to compare cCREs of roughly similar ages. Overall, TE-derived cCREs are not significantly different from non-TE cCREs in the proportion of elements that have any combination of ATAC-seq peaks, TF ChIP-seq peaks, MPRA activity, and high phastCons scores (Chi-square test, p  = 0.24, Fig.  5C ). This shows that the genomic features that are commonly used to annotate regulatory elements genome-wide are largely the same between TE and non-TE elements.

Finally, we investigated whether TE-derived cCREs could be physiologically relevant using the NHGRI-EBI GWAS catalog 46 . In addition to the comprehensive list of top GWAS SNPs (GWAS variants), we divided SNPs into different parent terms as defined by EBI for physiologically related diseases and traits. Compared to randomly shuffled genomic coordinates, the general set of cCREs is enriched for GWAS variants across all GWAS parent terms, in line with a previous study (Fig.  5D , Supplementary Fig.  9 ) 47 . TE-derived cCREs are enriched for GWAS variants overall and in 11/17 parent terms (Supplementary Fig.  9 , Supplementary Data  2 ). However, they have consistently lower enrichment for GWAS variants compared to non-TE cCREs, which may be due to underrepresented profiling of SNPs in TEs from genotyping arrays (Supplementary Data  2 ). Altogether, these results suggest that TE-derived cCREs are functionally comparable to non-TE cCREs and carry sequences that are physiologically important for human traits and disease.

Discussion

TEs make up a large portion of most mammalian genomes, and many studies have shown that TEs contribute to their regulatory landscape. However, the extent to which TEs supply different types of regulatory elements, and the factors that allow them to evolve as regulatory elements, are not well understood. In this study, we utilize cCREs to define the contribution of TEs to the human regulatory space, finding that ~25% of all cCREs are TE-derived. This overall contribution is similar to previous estimates by Pehrsson et al., who studied the overlap of TEs with active regulatory states in the Roadmap Epigenomics Project 22 . We observed that TEs do not contribute to the different types of cCREs equally; they contribute more substantially to gene-distal enhancers than to gene-proximal enhancers and promoters. This pattern is likely driven by selection against TE insertions in the proximity of genes 48 . Regardless of their cCRE type, we found that TE-derived cCREs are more likely to be restricted to one or a few cell/tissue types compared to non-TE cCREs. This result suggests that TEs could be important for regulatory innovation by providing gene regulatory elements that are active in a limited number of cellular contexts. The documented contribution of TEs to gene regulation in rapidly evolving processes such as innate immunity and placentation supports this hypothesis 13 , 49 .

Different TEs have invaded and expanded in genomes at various points during evolution, leading to some being shared between species and others being lineage-specific. Andrews et al. had previously found that over 80% of primate-specific TFBS and cCREs overlap with TEs based on alignments of 241 mammalian genomes 50 . In contrast, we only identified ~29% of all human-lineage cCREs to be TE-derived based on cCRE annotations in human and mouse. The two main reasons for the difference in cCREs attributed to TEs are the definition of TE overlap and definition of primate-specific or human-lineage cCREs. First, Andrews et al. counted 1 bp intersection of cCRE and TE as overlap, whereas we were much more stringent in only counting a cCRE as overlapping if at least half of the cCRE sequence intersects with a single TE copy. Second, Andrews et al. looked only at primate-specific regions of the genome while we extended our analysis to include regions with synteny between human and mouse. By examining human-mouse shared cCREs, we found that TEs provide a small fraction (up to 2%) of conserved cCREs orthologous between human and mouse, though this may be an underestimation due to less extensive profiling of cells/tissues in mouse relative to human (839 and 157 cell/tissue types profiled in human and mouse, respectively). Thus, it is possible that we are underestimating the contribution of TEs to conserved regulatory elements and many TE-derived mouse cCREs have yet to be discovered. Including syntenic TEs in human and mouse also allowed us to gain additional insights. Among TEs that contribute to cCREs and are orthologous (shared by descent) between human and mouse, most are retained as or become cCREs in only one lineage, indicating either lineage-specific loss or lineage-specific gain of regulatory activity, respectively. 
Furthermore, a majority of non-orthologous, lineage-specific TE-cCREs come from TE subfamilies that are old enough to be found in both human and mouse (Supplementary Fig.  2 ). Taken together, our results suggest that most TE-derived regulatory elements come from old TEs. This is consistent with a previous study by Villar et al. which found that evolutionarily young enhancers were primarily adapted from ancestral DNA sequences over 100 million years of age 51 .

To broadly understand where cis -regulatory activity in TEs comes from, we investigated the evolutionary origins of cCRE-associated TF motifs. In LINEs, LTR elements, and DNA transposons, cCRE-associated TF motifs originate from their consensus sequences more than expected by chance, suggesting an ancestral source for many important TE-derived TF motifs. As TEs age, the bulk of these TF motifs degrade over time. These results suggest that the amplification of these TEs disperses TF motifs available immediately upon insertion, but only a small subset is co-opted for host gene regulation. SINEs, which are extremely abundant in mammals (Alu elements alone account for 10% of the human genome sequence), show a very different trend compared to the other main TE classes. They have the lowest proportion of cCRE-associated TF motifs stemming from their consensus sequence, and the percentage of motifs that have an ancestral origin does not decrease as they age. Although SINEs have slightly higher substitution rates than L1s and DNA transposons of the same age, this does not explain the lack of relationship between TE subfamily age and percentage of cCRE-enriched TF motifs with ancestral origin 52 . These results suggest that SINEs provide relatively fewer mature TF motifs but instead frequently supply raw sequence material from which additional TF motifs emerge over time by mutation. This model is consistent with previous studies documenting the progressive birth of enhancers from Alu and RSINE1 elements in human and mouse, respectively 18 , 53 . Since SINEs have given rise to ~5% of human cCREs, these findings indicate that this “seed-and-mature” process has been a rich source of new cis -regulatory elements in the human genome.

When TEs insert themselves into the genome, the newly integrated copy and its progenitor are typically identical in sequence and therefore have the same sequence potential for cis -regulatory activity. However, only a small subset of TE copies within any given subfamily overlaps with cCREs. What influences some TE copies to retain or evolve cis -regulatory activity? One likely influential factor is the genomic context of the TE and its proximity to functional sequences such as genes and cis -regulatory elements 18 . Consistent with this model, we demonstrate that TEs with either cCREs or TFBSs have shorter genomic distance to non-TE cCREs or TFBSs compared to other TEs. Among TEs, the distance between LTR elements with cCREs/TFBSs and non-TE cCREs/TFBSs is the highest, indicating the relatively independent activity of LTRs, possibly due to high density of TFBSs in LTRs 54 , 55 . Based on the observation that TF bound TEs are close to non-TE TFBSs, we propose that the clustering of TFBSs from TE insertions introduces functional redundancy that can promote the turnover of TFBSs during evolution, as is prominent in mammals 39 , 40 , 41 , 42 . By examining the binding profiles of 30 TFs in human and mouse leukemia cell lines, we estimate that TEs have contributed 3–56% of all putative TFBS turnover events depending on the TF. Together these results suggest that the insertion of TEs near existing cis -regulatory elements is a major driver of TFBS evolutionary turnover.

An outstanding question for future studies is to determine the extent to which TE-derived cCREs have contributed to human adaptation and phenotypic variation. Our analysis hints that a subset of TE-derived cCREs serves important biological functions. First, we found that the sequences of TE-derived cCREs are generally more evolutionarily constrained than their non-cCRE counterparts. Second, we observed that TE-derived cCREs are enriched for GWAS variants, albeit to a lesser extent than non-TE cCREs. While this could indicate that TEs are less likely to be physiologically relevant, it could also reflect technical shortcomings associated with genotyping within TE sequences. Genotyping arrays, which use short oligonucleotide probes to discern SNPs, are designed to avoid repetitive regions of the genome. Our analysis of nine genotyping arrays from Affymetrix and Illumina shows that between 30 and 36% of SNPs are located in repetitive DNA, short of the 45% TE content of the human genome (Supplementary Data  2 ). These observations suggest that GWAS studies may have missed trait-associated SNPs residing within TE sequences and that there is a need to consider TE-derived variants in follow-up studies to GWAS 56 .

Here, we focused on K562 for analyses related to TF binding and turnover, as well as MPRA functional support, simply because of the abundance of data available for this cell line. However, it is important to note that K562 is a cancer cell line with many chromosomal abnormalities, including a near-triploid genome 57 , 58 . Changes in copy number may affect the results of epigenomic profiling assays, as previously demonstrated by the correlation between whole genome sequencing and POLR2A ChIP-seq 59 . There is also the possibility that some of the TE-cCREs in K562 are cancer-specific and shared across multiple cancer cell lines. We observed that K562 TE-cCREs are biased toward being specific to K562, less shared across cancer cell lines than all cCREs, and depleted for promoter cCREs (Supplementary Fig.  1D, E ). Since these trends in cancer cell lines are consistent with the overall trends for all TE-cCREs, it is likely that our observations for TE-derived cCREs in K562 are generally applicable.

The nature of TEs as repetitive elements with rich and varied evolutionary histories raises challenges during their study. Uniquely mapping short reads back to TEs has been a long-standing problem. The application of long, paired-end reads has largely alleviated this problem, as demonstrated by Sexton and Han 60 . The utilization of paired-end reads in ENCODE4 therefore made it possible to reliably identify cCREs within most TEs. However, we need to bear in mind that cCREs within recent L1 subfamilies could still be undercounted in this study due to limitations in mapping. In addition, technical limitations in identifying particularly old TEs could have impacted several of our analyses. As TEs accumulate mutations over time, their sequences diverge from the consensus sequence used to annotate them. This can lead to incorrect or missing annotation, especially for RepeatMasker-based annotations which rely on alignment to consensus sequences 61 , 62 . In our human-mouse comparison, we observed that ~20% of TEs in syntenic regions were classified as belonging to the same repeat family but not assigned to the same subfamily. Although a few are real instances where different TEs created independent insertions in the same syntenic region, most cases likely arise from a combination of high sequence divergence from the consensus sequence and high similarity between subfamily consensus sequences. Incorrect annotation of TE subfamily elements could also have affected our analyses that compare TE copies within their subfamily, such as those for cCRE-enriched TF motifs and their origins. Finally, since highly conserved regulatory elements are old by definition, TEs that provided the underlying regulatory element sequence may have already decayed past the point of recognition. Any missing TE annotations would have led to underestimating the scale of TE contribution to conserved regulatory elements, possibly including many gene-proximal or broadly used elements.
It would be interesting to see if more sensitive TE detection methods would implicate TEs as significant contributors to regulatory elements of ancient origins.

In summary, we have shown that TEs are substantial contributors to cis -regulatory innovation and maintenance. We confirm previous reports that TEs make up ~25% of the human regulatory genome and provide direct, genome-wide functional evidence from K562 lentiMPRA. This is also reflected in the depletion of common human population variants and enrichment of GWAS variants in TE-derived cCREs relative to background. To gain insights into regulatory innovation, we quantify the proportion of lineage-specific regulatory elements in humans and mice that are derived from TEs (8–36% depending on type). By taking advantage of the phylogenetic relationship between TE copies in the same subfamily, we probe the origins of TF motifs for regulatory TEs, discovering that most TE classes bring TF motifs to be potentially co-opted while SINEs primarily gain regulatory TF motifs through mutations. Although most focus in the field has been on innovation, we demonstrate that TEs are active participants in maintaining regulatory features like TFBSs (3–56% of putative TF turnover events). Lastly, we provide evidence that TE genomic insertion site is potentially a general factor in determining which TE copies become regulatory elements. While many ideas are not completely novel to the field, this study has provided systematic analyses that explore whether the trends described in a handful of TE subfamilies are generally applicable. With TEs becoming increasingly recognized to be intertwined with how genomes have evolved and operate, we believe that our work will serve as an encyclopedia to help advance our understanding of fundamental biology and disease.

Methods

Annotation of TE-derived cCREs

Genomic cCRE annotations in hg38 (cell agnostic and 25 fully profiled ENCODE cell/tissue types) and mm10 were downloaded from ( https://screen.wenglab.org/ ) and the ENCODE portal ( https://www.encodeproject.org/ ) 19 . Genomic TE annotations in hg38 and mm10 were obtained from RepeatMasker ( https://repeatmasker.org/ ) 63 . We used BedTools intersect 64 to find cCREs that are associated with TEs, requiring at least 50% of the cCRE to overlap a single TE.

Enrichment of TEs in cCREs

We calculated the enrichment of TE subfamilies for cCREs as follows.
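The equation itself is not reproduced here; one common formulation, which we assume purely for illustration, is the log 2 ratio of observed to expected overlap under random placement, with the −10 sentinel for subfamilies lacking any overlap:

```python
import math

def log2_enrichment(overlap_bp, te_bp, ccre_bp, genome_bp):
    """Assumed formulation (the original equation is not shown in the
    text): log2 of observed overlap versus the overlap expected if the
    TE subfamily were distributed randomly with respect to cCREs."""
    if overlap_bp == 0:
        return -10.0  # sentinel used in the text for subfamilies with no overlap
    expected = te_bp * ccre_bp / genome_bp
    return math.log2(overlap_bp / expected)

# Toy example: twice as much overlap as expected gives log2 enrichment of 1.
print(log2_enrichment(overlap_bp=2000, te_bp=1_000_000,
                      ccre_bp=3_000_000, genome_bp=3_000_000_000))
```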

For visualization, we included TE subfamilies with no overlap with cCREs as log 2 enrichment of −10, which is lower than any subfamily with cCRE overlap.

Human-mouse cCRE comparison

To characterize human and mouse cCREs as shared or lineage-specific (Supplementary Fig.  2 ), we first used liftOver with -minMatch option of 0.1 to identify syntenic regions in the other species. The syntenic region was determined to be a cCRE or TE-derived if at least 50% of the syntenic region overlaps with a cCRE or TE. Syntenic regions with cCREs were classified as “shared” if the cCRE type was the same in both species and “different” if the cCRE type was different. TEs in syntenic regions of human and mouse were counted as orthologous if both TEs are annotated as belonging to the same TE family (e.g., SINE/Alu). To calculate the total number of human-specific cCREs, we started by summing syntenic human cCRE (394,610), non-syntenic human cCRE (167,134), and human TE-cCRE (215,752) categories from Fig.  2A . Then, we subtracted human TE-cCREs with a syntenic mouse cCRE but no syntenic TE (5361) and added orthologous TEs that are lineage-specific (15,973).
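The bookkeeping for the total number of human-specific cCREs can be reproduced directly from the counts given above:

```python
# All counts are taken from the text / Fig. 2A.
syntenic_ccre = 394_610                    # syntenic human cCREs
non_syntenic_ccre = 167_134                # non-syntenic human cCREs
te_ccre = 215_752                          # human TE-cCREs
te_ccre_mouse_ccre_no_te = 5_361           # subtracted: syntenic mouse cCRE but no syntenic TE
orthologous_te_lineage_specific = 15_973   # added back: orthologous TEs, lineage-specific cCRE

total_human_specific = (syntenic_ccre + non_syntenic_ccre + te_ccre
                        - te_ccre_mouse_ccre_no_te
                        + orthologous_te_lineage_specific)
print(total_human_specific)  # -> 788108
```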

To compare sequence conservation, 100-way phastCons scores 65 were downloaded from https://genome.ucsc.edu/ . Two-sided Wilcoxon rank-sum test was used for comparisons between groups of TE-cCREs.

Identification of cCRE-enriched TF motifs

First, TEs in each subfamily were separated based on overlap with hg38 cCREs; subfamilies lacking any cCRE overlap were removed from the analysis (n = 116). Next, TE subfamilies were split into three groups depending on whether the length distributions of cCRE (foreground) and non-cCRE (background) elements differed significantly by Kolmogorov–Smirnov (KS) test. One group of subfamilies (n = 194) showed no significant difference with all background elements included. For the second group (n = 993), with a significant difference in length distribution between foreground and background elements, background elements were binned and randomly selected to match the proportion of foreground elements in each bin; this random selection was performed 10 times. TE subfamilies for which matched foreground/background length distributions could not be achieved were excluded from further analysis (n = 17).
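The length-matched background selection can be sketched as proportional sampling from length bins. This is a simplified illustration (bin size and data layout are our assumptions, not the authors' implementation):

```python
import random
from collections import Counter

def sample_matched_background(fg_lengths, bg_elements, bin_size=50, seed=0):
    """Sample background elements so their length-bin counts match the foreground.
    fg_lengths: lengths of foreground (cCRE) elements.
    bg_elements: list of (element_id, length) tuples for background elements."""
    rng = random.Random(seed)
    fg_bins = Counter(length // bin_size for length in fg_lengths)
    by_bin = {}
    for elem in bg_elements:
        by_bin.setdefault(elem[1] // bin_size, []).append(elem)
    matched = []
    for b, n_fg in fg_bins.items():
        pool = by_bin.get(b, [])
        # Take as many background elements from this length bin as the foreground has
        matched.extend(rng.sample(pool, min(n_fg, len(pool))))
    return matched
```

In the paper this sampling is repeated 10 times per subfamily to average out sampling noise.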

To identify cCRE-enriched motifs, we ran AME motif enrichment using the HOCOMOCOv11 human core transcription factor motif database 66 for each TE subfamily, with cCRE elements as foreground and non-cCRE elements (all elements or the random selections) as background/control. Enriched motifs were grouped according to motif archetypes 47. To confirm AME results, we scanned TE subfamily elements for the top enriched motif within each archetype and performed Fisher's exact test, further filtering for motifs with a significant association with cCRE annotation (p < 0.05 after multiple-test correction with the Benjamini–Hochberg method), at least 10 elements having both the motif and cCRE annotation, and an odds ratio of at least 2. We also filtered for TF motifs that pass Fisher's exact test using TE subfamily consensus coverage-controlled background sequences, identifying TF motifs that distinguish cCRE-overlapping TE copies from non-overlapping copies based on sequence variation alone.

Origin of cCRE-associated TF motifs

To estimate the percentage of cCRE-enriched motifs derived from an ancestral origin, we first derived consensus sequences for each TE subfamily from RepeatMasker and the RepBase-derived RepeatMasker Library 20170127 (Supplementary Methods). Consensus sequences could not be obtained for four subfamilies (L2d, L2d2, Alu, and MLT1B-int), which were excluded from further analysis. Next, we scanned each consensus sequence for all HOCOMOCOv11 human core motifs. For each motif found in a TE subfamily's consensus sequence, we scanned all elements within the subfamily for the motif. The position of each motif relative to the consensus sequence was determined by aligning each element to its consensus sequence using Needle pairwise alignment 67. Finally, the ancestral origin rate of a given motif was calculated as the percentage of motif instances within 10 bp of the consensus-sequence motif. As motifs were grouped by archetype, we used the ancestral origin rate of the top enriched motif per archetype. If the top motif was not found in the consensus sequence, any other enriched motif in the archetype that was present in the consensus could substitute for it. Any motif archetype with no cCRE-enriched motif in the consensus sequence was assigned an ancestral origin rate of 0.
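Once motif instances have been mapped onto consensus coordinates, the per-motif ancestral origin rate reduces to a simple proportion. A sketch (the coordinate representation is hypothetical):

```python
def ancestral_origin_rate(element_motif_positions, consensus_motif_position, window=10):
    """Percentage of motif instances (positions already mapped onto the consensus
    via pairwise alignment) lying within `window` bp of the consensus motif position."""
    if not element_motif_positions:
        return 0.0
    near = sum(abs(p - consensus_motif_position) <= window
               for p in element_motif_positions)
    return 100.0 * near / len(element_motif_positions)
```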

Relative distance of TE to closest cCRE

To estimate the spatial correlation between TEs and cCREs, we calculated the relative distance first described by Favorov et al. and implemented in the BEDTools suite 64, 68. Briefly, for each cCRE type, TEs were assigned to their closest non-TE cCRE. The distance between cCREs was then split into 100 equal-sized intervals, and the frequency of TEs falling within each interval was counted. We shuffled TE coordinates with bedtools shuffle 100 times to construct the null expectation, and applied the KS test with Bonferroni multiple-test correction to evaluate the difference between the observed and shuffled distributions.
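The relative distance statistic (as in `bedtools reldist`) places each TE between its two flanking cCREs and records how far along that gap it sits, yielding values in [0, 0.5] that are uniform under independence. A simplified, point-based sketch (real implementations work on intervals per chromosome):

```python
import bisect

def relative_distances(te_positions, ccre_positions):
    """For each TE (treated as a point), distance to the closer of its two flanking
    cCREs (also points), scaled by the gap between those cCREs."""
    ccres = sorted(ccre_positions)
    out = []
    for t in te_positions:
        i = bisect.bisect_left(ccres, t)
        if i == 0 or i == len(ccres):
            continue  # skip TEs outside the outermost cCREs
        left, right = ccres[i - 1], ccres[i]
        out.append(min(t - left, right - t) / (right - left))
    return out
```

A TE midway between two cCREs scores 0.5; one sitting on a cCRE scores 0, so an excess of small values indicates spatial association.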

Bound vs unbound TE distance to nearest TF peak

A total of 587 IDR-thresholded TF ChIP-seq peak files in K562 were downloaded from ENCODE after filtering out those with "NOT_COMPLIANT" or "ERROR" audit labels. Supplementary Data 4 lists the TFs and ChIP-seq datasets used. For each TF, individual TEs were classified as bound if they intersected the peak summit and unbound otherwise. We then calculated the linear distance from each TE to the nearest non-overlapping peak. For each TE subfamily with at least 10 TF-bound individual TEs, we randomly sampled an equal number of unbound TEs and ranked them in descending order of distance. After repeating the random sampling 1000 times, we averaged each rank across all 1000 samples to obtain a distribution of average distances to the nearest non-overlapping peak for unbound TEs. P-values were calculated using two-sided Wilcoxon rank-sum tests between bound and unbound TEs within each subfamily. We calculated the log10 ratio of average median distances to the nearest non-overlapping peak between bound and unbound TEs as follows:
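As a reconstruction in our own notation (the direction of the ratio is assumed from the word order "between bound and non-bound"):

$$\log_{10}\!\left(\frac{\tilde{d}_\mathrm{bound}}{\bar{\tilde{d}}_\mathrm{unbound}}\right),$$

where $\tilde{d}_\mathrm{bound}$ is the median distance from bound TEs to the nearest non-overlapping peak and $\bar{\tilde{d}}_\mathrm{unbound}$ is the corresponding median averaged over the 1000 random samples of unbound TEs.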

Identification of putative TF turnover events

IDR-thresholded peaks for 30 TF ChIP-seq datasets with matching K562 TF ChIP-seq were downloaded for mouse MEL from ENCODE 19. Syntenic regions of TF binding peaks were identified with the same method described above for the human-mouse comparison. If a TF peak in one species overlapped at least 50% of a peak in the other species, it was classified as "shared". Otherwise, the TF peak was classified as "syntenic but specific": the sequence is alignable, but TF binding is species-specific. To identify putative TFBS turnover events after the human-mouse divergence, we identified all TF peaks in one species within 5 kb of the syntenic region of a TF peak in the other species (Supplementary Fig. 7A). Each peak was assigned a mean phastCons score using 100-way vertebrate phastCons scores in human or 60-way vertebrate phastCons scores in mouse 65. We calculated the median phastCons score for conserved TF binding peaks in human and mouse to infer human-mouse ancestral TF binding. For each pair of lineage-specific peaks, the human-mouse ancestral TFBS was inferred based on human-mouse synteny and a phastCons score higher than the median for conserved TF binding peaks. Pairs of lineage-specific peaks were identified as putative TFBS turnover events if a single non-ancestral TF binding peak was within 5 kb of an ancestral TF binding peak. TE-derived peaks were classified using the aforementioned criterion of 50% overlap with the TF binding peak.

MPRA comparison

K562 lentiMPRA data were downloaded from the ENCODE portal 44. TE-derived or non-TE-derived cCRE annotations were intersected with lentiMPRA sequence coordinates and assigned the maximum log2 fold change (log2FC) value across both strands. cCREs were grouped by annotation with the following exceptions: "None" = Low-DNase or no intersection; "CTCF" = any classification bound by CTCF. A two-sided Wilcoxon rank-sum test was performed comparing the log2FC values of TE-derived cCREs with non-TE-derived cCREs within the same category, with the alternative hypothesis that TE-derived cCREs have greater log2FC values than non-TE-derived cCREs. P-values for all tests underwent Benjamini–Hochberg multiple-test correction.
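Assigning each cCRE the maximum log2FC across its intersecting MPRA sequences (both strands) is a group-by-max. A sketch with hypothetical inputs (the pairing of cCRE IDs to log2FC values would come from the intersection step):

```python
def max_log2fc_per_ccre(assignments):
    """assignments: iterable of (ccre_id, log2fc) pairs, one per intersecting
    MPRA sequence/strand. Returns a dict mapping ccre_id -> maximum log2FC."""
    best = {}
    for ccre_id, lfc in assignments:
        # Keep the highest log2FC seen for this cCRE so far
        if ccre_id not in best or lfc > best[ccre_id]:
            best[ccre_id] = lfc
    return best
```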

Human population variant frequency in cCREs

We extracted variants in the human population characterized by the 1000 Genomes Project 45 and selected variants with allele frequency >1% as common variants. For each cCRE that did not overlap coding sequence in GENCODEv41 69, we counted the number of intersecting common variants and normalized it per 100 bp of sequence. The percentage of elements with variant overlap and the distribution of variants per 100 bp were obtained for TE-derived cCREs, non-TE cCREs, and cCRE flanking regions. Flanking regions were defined as the non-coding genomic regions directly upstream and downstream with the same length as the cCRE. Permutation tests were then performed to compare the percentage with variant overlap and the mean variants per 100 bp among TE-derived cCREs, non-TE cCREs, and cCRE flanking regions relative to random genomic background (Supplementary Data 3, Supplementary Methods).
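The two summary statistics above reduce to simple per-element bookkeeping. A sketch (input layout is hypothetical; parallel lists of per-element variant counts and lengths):

```python
def variant_density_stats(variant_counts, lengths_bp):
    """Per-element common-variant counts and element lengths ->
    (percent of elements with >= 1 variant, mean variants per 100 bp)."""
    per_100bp = [n * 100 / l for n, l in zip(variant_counts, lengths_bp)]
    pct_with_variant = 100 * sum(n > 0 for n in variant_counts) / len(variant_counts)
    return pct_with_variant, sum(per_100bp) / len(per_100bp)
```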

Venn diagram comparing features for TE-derived and non-TE-derived cCREs

ATAC-seq data in K562 cells were downloaded from ENCODE, and 30-way (27 primates) phastCons scores were downloaded from https://genome.ucsc.edu/ . Non-TE and TE-derived cCREs were classified as accessible (ATAC-seq), bound by a TF (TF ChIP-seq), MPRA active (MPRA), or highly conserved among primates (phastCons). PhastCons scores were binned into 20 bp windows, and each cCRE was assigned the mean of the intersecting phastCons scores; cCREs in the top 10% of phastCons scores were designated as having high sequence conservation. For ATAC-seq and TF ChIP-seq, a cCRE containing a peak summit within its interval was considered accessible or bound, respectively. MPRA log2FC was obtained for each cCRE as described above, and cCREs were classified as MPRA active if the maximum log2FC was greater than 1. Finally, a Venn diagram was generated using the combined classification of cCREs. A chi-square test was performed to test for differences in feature classification between TE- and non-TE-derived cCREs.

Enrichment of cCREs in GWAS variants

The NHGRI-EBI GWAS catalog with added ontology annotations and GWAS-to-EFO mappings was downloaded from http://www.ebi.ac.uk/gwas 46. The strongest SNP was chosen for each reported entry. We used dbSNP153 70 to assign hg38 chromosome positions to SNPs lacking them; SNPs with neither a chromosome position nor a dbSNP153 rs ID were excluded. The number of GWAS SNPs found in cCREs was counted using BEDTools intersect 64, with each SNP counted at most once per parent term. A permutation test with genome-wide shuffling of cCRE coordinates was performed 1000 times to obtain a random expectation for GWAS SNP overlap. Enrichment was calculated as the number of overlapping GWAS variants in cCREs divided by the number of overlapping GWAS variants in the randomly shuffled coordinates. The P-value was defined as the proportion of random shuffles reaching at least the number of overlapping GWAS variants observed in cCREs. As a negative control, we shuffled cCRE coordinates an additional 100 times and took the mean number of GWAS variant overlaps.
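The permutation scheme reduces to comparing the observed overlap count against the shuffled distribution. A sketch with hypothetical counts (here the enrichment denominator uses the mean of the shuffles, one common convention; the paper's exact denominator is not spelled out):

```python
def permutation_enrichment(observed, shuffled_counts):
    """Enrichment = observed / mean(shuffled overlaps);
    empirical p-value = fraction of shuffles with at least the observed overlap."""
    mean_shuffled = sum(shuffled_counts) / len(shuffled_counts)
    enrichment = observed / mean_shuffled
    p_value = sum(c >= observed for c in shuffled_counts) / len(shuffled_counts)
    return enrichment, p_value
```

With 1000 shuffles, the smallest attainable empirical p-value is 1/1000, which is why such tests are often reported as p < 0.001 rather than p = 0.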

Statistics and reproducibility

No statistical method was used to predetermine sample size (all TEs and datasets were used where reasonable). All data points were included in statistical analysis. Statistical analyses and graphical representations were performed using R versions 3.5.1 and 4.0.1.

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Data availability

All accession codes and download links for publicly available data are listed in Supplementary Data 1. Source data are provided with this paper.

Code availability

All code for analysis is available at https://github.com/twlab/ENCODE_TE 71 .

References

1. McClintock, B. The origin and behavior of mutable loci in maize. Proc. Natl Acad. Sci. USA 36, 344 (1950).
2. Osmanski, A. B. et al. Insights into mammalian TE diversity through the curation of 248 genome assemblies. Science 380, eabn1430 (2023).
3. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
4. Wells, J. N. & Feschotte, C. A field guide to eukaryotic transposable elements. Annu. Rev. Genet. 54, 539–561 (2020).
5. Christmas, M. J. et al. Evolutionary constraint and innovation across hundreds of placental mammals. Science 380, eabn3943 (2023).
6. Rebollo, R., Romanish, M. T. & Mager, D. L. Transposable elements: an abundant and natural source of regulatory sequences for host genes. Annu. Rev. Genet. 46, 21–42 (2012).
7. Chuong, E. B., Elde, N. C. & Feschotte, C. Regulatory activities of transposable elements: from conflicts to benefits. Nat. Rev. Genet. 18, 71–86 (2017).
8. Bourque, G. et al. Ten things you should know about transposable elements. Genome Biol. 19, 199 (2018).
9. Sundaram, V. & Wysocka, J. Transposable elements as a potent source of diverse cis-regulatory sequences in mammalian genomes. Philos. Trans. R. Soc. B 375, 20190347 (2020).
10. Fueyo, R., Judd, J., Feschotte, C. & Wysocka, J. Roles of transposable elements in the regulation of mammalian transcription. Nat. Rev. Mol. Cell Biol. 23, 481–497 (2022).
11. Lawson, H. A., Liang, Y. & Wang, T. Transposable elements in mammalian chromatin organization. Nat. Rev. Genet. https://doi.org/10.1038/s41576-023-00609-6 (2023).
12. Wang, T. et al. Species-specific endogenous retroviruses shape the transcriptional network of the human tumor suppressor protein p53. Proc. Natl Acad. Sci. USA 104, 18613–18618 (2007).
13. Chuong, E. B., Elde, N. C. & Feschotte, C. Regulatory evolution of innate immunity through co-option of endogenous retroviruses. Science 351, 1083–1087 (2016).
14. Sundaram, V. et al. Functional cis-regulatory modules encoded by mouse-specific endogenous retrovirus. Nat. Commun. 8, 14550 (2017).
15. Du, A. Y. et al. Functional characterization of enhancer activity during a long terminal repeat's evolution. Genome Res. 32, 1840–1851 (2022).
16. Zemojtel, T., Kielbasa, S. M., Arndt, P. F., Chung, H. R. & Vingron, M. Methylation and deamination of CpGs generate p53-binding sites on a genomic scale. Trends Genet. 25, 63–66 (2009).
17. Zemojtel, T. et al. CpG deamination creates transcription factor-binding sites with high efficiency. Genome Biol. Evol. 3, 1304–1311 (2011).
18. Judd, J., Sanderson, H. & Feschotte, C. Evolution of mouse circadian enhancers from transposable elements. Genome Biol. 22, 1–26 (2021).
19. The ENCODE Project Consortium et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020).
20. Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
21. Trizzino, M., Kapusta, A. & Brown, C. D. Transposable elements generate regulatory novelty in a tissue-specific fashion. BMC Genomics 19, 468 (2018).
22. Pehrsson, E. C., Choudhary, M. N. K., Sundaram, V. & Wang, T. The epigenomic landscape of transposable elements across normal human development and anatomy. Nat. Commun. 10, 1–16 (2019).
23. Brocks, D. et al. DNMT and HDAC inhibitors induce cryptic transcription start sites encoded in long terminal repeats. Nat. Genet. 49, 1052–1060 (2017).
24. Schmidt, D. et al. Waves of retrotransposon expansion remodel genome organization and CTCF binding in multiple mammalian lineages. Cell 148, 335–348 (2012).
25. Choudhary, M. N. K. et al. Co-opted transposons help perpetuate conserved higher-order chromosomal structures. Genome Biol. 21, 1–14 (2020).
26. Choudhary, M. N. K., Quaid, K., Xing, X., Schmidt, H. & Wang, T. Widespread contribution of transposable elements to the rewiring of mammalian 3D genomes. Nat. Commun. 14, 1–12 (2023).
27. Simonti, C. N., Pavličev, M. & Capra, J. A. Transposable element exaptation into regulatory regions is rare, influenced by evolutionary age, and subject to pleiotropic constraints. Mol. Biol. Evol. 34, 2856 (2017).
28. Diehl, A. G., Ouyang, N. & Boyle, A. P. Transposable elements contribute to cell and species-specific chromatin looping and gene regulation in mammalian genomes. Nat. Commun. 11, 1–18 (2020).
29. Kuhn, R. M., Haussler, D. & James Kent, W. The UCSC genome browser and associated tools. Brief. Bioinform. 14, 144–161 (2013).
30. Chinwalla, A. T. et al. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).
31. Jordan, I. K., Rogozin, I. B., Glazko, G. V. & Koonin, E. V. Origin of a substantial fraction of human regulatory sequences from transposable elements. Trends Genet. 19, 68–72 (2003).
32. Van De Lagemaat, L. N., Landry, J. R., Mager, D. L. & Medstrand, P. Transposable elements in mammals promote regulatory variation and diversification of genes with specialized functions. Trends Genet. 19, 530–536 (2003).
33. Lowe, C. B., Bejerano, G. & Haussler, D. Thousands of human mobile element fragments undergo strong purifying selection near developmental genes. Proc. Natl Acad. Sci. USA 104, 8005–8010 (2007).
34. Feschotte, C. Transposable elements and the evolution of regulatory networks. Nat. Rev. Genet. 9, 397–405 (2008).
35. Swergold, G. D. Identification, characterization, and cell specificity of a human LINE-1 promoter. Mol. Cell. Biol. 10, 6718–6729 (1990).
36. Minakami, R. et al. Identification of an internal cis-element essential for the human Li transcription and a nuclear factor(s) binding to the element. Nucleic Acids Res. 20, 3139–3145 (1992).
37. Alexandrova, E. A. et al. Sense transcripts originated from an internal part of the human retrotransposon LINE-1 5′ UTR. Gene 511, 46–53 (2012).
38. Sun, X. et al. Transcription factor profiling reveals molecular choreography and key regulators of human retrotransposon expression. Proc. Natl Acad. Sci. USA https://doi.org/10.1073/pnas.1722565115 (2018).
39. Stefflova, K. et al. Cooperativity and rapid evolution of cobound transcription factors in closely related mammals. Cell 154, 530–540 (2013).
40. Yue, F. et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature 515, 355–364 (2014).
41. Cheng, Y. et al. Principles of regulatory information conservation between mouse and human. Nature 515, 371–375 (2014).
42. Vierstra, J. et al. Mouse regulatory DNA landscapes reveal global principles of cis-regulatory evolution. Science 346, 1007–1012 (2014).
43. Sundaram, V. et al. Widespread contribution of transposable elements to the innovation of gene regulatory networks. Genome Res. 24, 1963–1976 (2014).
44. Agarwal, V. et al. Massively parallel characterization of transcriptional regulatory elements in three diverse human cell types. Preprint at bioRxiv https://doi.org/10.1101/2023.03.05.531189 (2023).
45. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
46. Sollis, E. et al. The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Res. 51, D977–D985 (2023).
47. Vierstra, J. et al. Global reference mapping of human transcription factor footprints. Nature 583, 729–736 (2020).
48. Medstrand, P., Van De Lagemaat, L. N. & Mager, D. L. Retroelement distributions in the human genome: variations associated with age and proximity to genes. Genome Res. 12, 1483–1495 (2002).
49. Lynch, V. J., Leclerc, R. D., May, G. & Wagner, G. P. Transposon-mediated rewiring of gene regulatory networks contributed to the evolution of pregnancy in mammals. Nat. Genet. 43, 1154–1159 (2011).
50. Andrews, G. et al. Mammalian evolution of human cis-regulatory elements and transcription factor binding sites. Science 380, eabn7930 (2023).
51. Villar, D. et al. Enhancer evolution across 20 mammalian species. Cell 160, 554–566 (2015).
52. Pace, J. K. & Feschotte, C. The evolutionary history of human DNA transposons: evidence for intense activity in the primate lineage. Genome Res. 17, 422–432 (2007).
53. Su, M., Han, D., Boyd-Kirkup, J., Yu, X. & Han, J. D. J. Evolution of Alu elements toward enhancers. Cell Rep. 7, 376–385 (2014).
54. Thompson, P. J., Macfarlan, T. S. & Lorincz, M. C. Long terminal repeats: from parasitic elements to building blocks of the transcriptional regulatory repertoire. Mol. Cell 62, 766–776 (2016).
55. Ito, J. et al. Systematic identification and characterization of regulatory elements derived from human endogenous retroviruses. PLOS Genet. 13, e1006883 (2017).
56. Payer, L. M. et al. Structural variants caused by Alu insertions are associated with risks for many human diseases. Proc. Natl Acad. Sci. USA 114, E3984–E3992 (2017).
57. Gribble, S. M. et al. Cytogenetics of the chronic myeloid leukemia-derived cell line K562: karyotype clarification by multicolor fluorescence in situ hybridization, comparative genomic hybridization, and locus-specific fluorescence in situ hybridization. Cancer Genet. Cytogenet. 118, 1–8 (2000).
58. Naumann, S., Reutzel, D., Speicher, M. & Decker, H. J. Complete karyotype characterization of the K562 cell line by combined application of G-banding, multiplex-fluorescence in situ hybridization, fluorescence in situ hybridization, and comparative genomic hybridization. Leuk. Res. 25, 313–322 (2001).
59. Zhou, B. et al. Comprehensive, integrated, and phased whole-genome analysis of the primary ENCODE cell line K562. Genome Res. 29, 472–484 (2019).
60. Sexton, C. E. & Han, M. V. Paired-end mappability of transposable elements in the human genome. Mob. DNA 10, 1–11 (2019).
61. de Koning, A. P. J., Gu, W., Castoe, T. A., Batzer, M. A. & Pollock, D. D. Repetitive elements may comprise over two-thirds of the human genome. PLOS Genet. 7, e1002384 (2011).
62. Matsushima, W., Planet, E. & Trono, D. Ancestral genome reconstruction enhances transposable element annotation by identifying degenerate integrants. Cell Genomics 4, 100497 (2024).
63. Smit, A., Hubley, R. & Green, P. RepeatMasker Open-4.0. 2013–2015 http://www.repeatmasker.org (2015).
64. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
65. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
66. Kulakovskiy, I. V. et al. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res. 46, D252–D259 (2018).
67. Needleman, S. B. & Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453 (1970).
68. Favorov, A. et al. Exploring massive, genome scale datasets with the GenometriCorr package. PLOS Comput. Biol. 8, e1002529 (2012).
69. Frankish, A. et al. GENCODE 2021. Nucleic Acids Res. 49, D916–D923 (2021).
70. Sherry, S. T., Ward, M. & Sirotkin, K. dbSNP—database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome Res. 9, 677–679 (1999).
71. Du, A. Y., Chobirko, J. D., Zhuo, X., Feschotte, C. & Wang, T. Regulatory transposable elements in the encyclopedia of DNA elements. twlab/ENCODE_TE. https://doi.org/10.5281/zenodo.12822146 (2024).

Acknowledgements

This work was supported by NIH grants R01HG007175 and U01HG009391. A.Y.D. was supported by NHGRI training grant T32HG000045. J.D.C. was supported by NIGMS MIRA (2R35GM122550-06).

Author information

These authors contributed equally: Alan Y. Du, Jason D. Chobirko, Xiaoyu Zhuo.

Authors and Affiliations

Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA

Alan Y. Du, Xiaoyu Zhuo & Ting Wang

The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO, USA

Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA

Jason D. Chobirko & Cédric Feschotte

McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA

Contributions

T.W. and C.F. conceived the project. A.Y.D., J.D.C., and X.Z. designed and performed analysis. All authors took part in writing the manuscript.

Corresponding authors

Correspondence to Cédric Feschotte or Ting Wang .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Peer review

Peer review information.

Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information, Peer Review File, Description of Additional Supplementary Files, Supplementary Data 1–6, Reporting Summary, and Source Data files accompany this article.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .

About this article

Cite this article.

Du, A.Y., Chobirko, J.D., Zhuo, X. et al. Regulatory transposable elements in the encyclopedia of DNA elements. Nat. Commun. 15, 7594 (2024). https://doi.org/10.1038/s41467-024-51921-6

Received : 07 October 2023

Accepted : 16 August 2024

Published : 31 August 2024

DOI : https://doi.org/10.1038/s41467-024-51921-6

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

By submitting a comment you agree to abide by our Terms and Community Guidelines . If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Quick links

  • Explore articles by subject
  • Guide to authors
  • Editorial policies

Sign up for the Nature Briefing: Translational Research newsletter — top stories in biotechnology, drug discovery and pharma.

null and alternative hypothesis goodness of fit

IMAGES

  1. Null vs. Alternative Hypothesis: Key Differences and Examples Explained

    null and alternative hypothesis goodness of fit

  2. Solved Hypotheses: The null hypothesis in a chi-square

    null and alternative hypothesis goodness of fit

  3. Null vs. Alternative Hypothesis: Key Differences and Examples Explained

    null and alternative hypothesis goodness of fit

  4. Null vs. Alternative Hypothesis: Key Differences and Examples Explained

    null and alternative hypothesis goodness of fit

  5. Goodness-of-Fit Test

    null and alternative hypothesis goodness of fit

  6. PPT

    null and alternative hypothesis goodness of fit

VIDEO

  1. Hypothesis Testing: the null and alternative hypotheses

  2. HYPOTHESIS TESTING PROBLEM-4 USING Z TEST VIDEO-7

  3. HYPOTHESIS TESTING PROBLEM-2 USING Z TEST VIDEO-5

  4. Hypothesis Tests| Some Concepts

  5. Null and Alternative Hypothosis

  6. t-TEST PROBLEM 1- HYPOTHESIS TESTING VIDEO-16

COMMENTS

  1. Chi-Square Goodness of Fit Test

    The chi-square goodness of fit test is a hypothesis test. It allows you to draw conclusions about the distribution of a population based on a sample. Using the chi-square goodness of fit test, you can test whether the goodness of fit is "good enough" to conclude that the population follows the distribution. ... Example: Null and alternative ...

  2. Chi-Square Goodness of Fit Test: Definition, Formula, and Example

    A Chi-Square goodness of fit test uses the following null and alternative hypotheses: H 0: ... 0.05, and 0.01) then you can reject the null hypothesis. Chi-Square Goodness of Fit Test: Example. A shop owner claims that an equal number of customers come into his shop each weekday. To test this hypothesis, an independent researcher records the ...

  3. 11.3: Goodness-of-Fit Test

    The test statistic for a goodness-of-fit test is: ∑k (O − E)2 E (11.3.1) (11.3.1) ∑ k ( O − E) 2 E. where: The observed values are the data values and the expected values are the values you would expect to get if the null hypothesis were true. There are n n terms of the form (O−E)2 E ( O − E) 2 E.

  4. Chi-Square Goodness of Fit Test: Uses & Examples

    Null: The sample data follow the hypothesized distribution.; Alternative: The sample data do not follow the hypothesized distribution.; When the p-value for the chi-square goodness of fit test is less than your significance level, reject the null hypothesis.Your data favor the hypothesis that the sample does not follow the hypothesized distribution. Let's work through two examples using the ...

  5. 12.2: A Goodness-of-Fit Test

    The null and alternative hypotheses are: \(H_{0}\): The absent days occur with equal frequencies, that is, they fit a uniform distribution. ... To assess whether a data set fits a specific distribution, you can apply the goodness-of-fit hypothesis test that uses the chi-square distribution. The null hypothesis for this test states that the data ...

  6. 11.2 Goodness-of-Fit Test

    Determine the null and alternative hypotheses needed to conduct a goodness-of-fit test. H 0: Student absenteeism fits faculty perception. The alternative hypothesis is the opposite of the null hypothesis. H a: Student absenteeism does not fit faculty perception.

  7. Goodness of Fit: Definition & Tests

    A goodness of fit test determines whether the differences between your sample data and the distribution are statistically significant. In this context, statistical significance indicates the model does not adequately fit the data. ... Null Hypothesis (H₀): The data follow the specified distribution. Alternative Hypothesis (H A): ...

  8. Chi-Square Goodness of Fit Test

    Null hypothesis: The proportion of rookies, veterans, and All-Stars is 30%, 60% and 10%, respectively. Alternative hypothesis: At least one of the proportions in the null hypothesis is false. Formulate an analysis plan. For this analysis, the significance level is 0.05. Using sample data, we will conduct a chi-square goodness of fit test of the ...

  9. 11.2 Goodness-of-Fit Test

    The test statistic for a goodness-of-fit test is: The observed values are the data values and the expected values are the values you would expect to get if the null hypothesis were true. There are n terms of the form (O−E)2 E ( O − E) 2 E. The number of degrees of freedom is df = (number of categories - 1). The goodness-of-fit test is ...

  10. 10.4 The Goodness-of-Fit Test

    The goodness-of-fit test is a well established process: Write down the null and alternative hypotheses. The null hypothesis is the claim that the categorical variable follows the hypothesized distribution and the alternative hypothesis is the claim that the categorical variable does not follow the hypothesized distribution.

  11. 2.4

    We will use this concept throughout the course as a way of checking the model fit. Like in linear regression, in essence, the goodness-of-fit test compares the observed values to the expected (fitted or predicted) values. A goodness-of-fit statistic tests the following hypothesis: \(H_0\colon\) the model \(M_0\) fits. vs.

  12. 10.2: Goodness-of-Fit

    Step 1: Determine the Hypotheses. The goodness of fit test makes claims about the proportions or probabilities for each outcome of a multinomial experiment. If there are k outcomes per trial, then the null hypothesis would be H₀: p₁ = value₁, p₂ = value₂, …, pₖ = valueₖ.

  13. What is Goodness-of-Fit? A Comprehensive Guide

    Null and alternative hypotheses: Goodness-of-fit tests are based on null and alternative hypotheses. The null hypothesis (H0) typically states no significant difference between the expected values and the observed data based on the model. The alternative hypothesis (H1) contends that there is a significant difference.

  14. Chi-Square Goodness of Fit Test

    The null and alternative hypotheses for a goodness of fit test look different than some of our other hypothesis tests. One reason for this is that a chi-square goodness of fit test is a nonparametric method. This means that our test does not concern a single population parameter. Thus the null hypothesis does not state that a single parameter ...

  15. Goodness-of-Fit Test

    The test statistic for a goodness-of-fit test is: Σₖ (O − E)²/E, where: O = observed values (data), E = expected values (from theory), k = the number of different data cells or categories. The observed values are the data values and the expected values are the values you would expect to get if the null hypothesis were true.

  16. Chi-Square Goodness of Fit Test

    Hypothesis testing: Hypothesis testing is the same as in other tests, like t-test, ANOVA, etc. The calculated value of Chi-Square goodness of fit test is compared with the table value. If the calculated value is greater than the table value, we will reject the null hypothesis and conclude that there is a significant difference between the observed and the expected frequency.
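The decision rule described here (compare the calculated value with the table value) can be sketched with SciPy's chi-square quantile function; the calculated statistic and degrees of freedom below are hypothetical.

```python
# Reject H0 when the calculated chi-square exceeds the critical ("table") value.
from scipy.stats import chi2

calculated = 9.2                    # hypothetical test statistic
df = 3                              # hypothetical degrees of freedom
alpha = 0.05
critical = chi2.ppf(1 - alpha, df)  # table value for alpha = 0.05, df = 3

reject_null = calculated > critical
```

For df = 3 and α = 0.05 the table value is about 7.815, so a calculated statistic of 9.2 would lead to rejecting the null hypothesis.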

  17. 2.11

    The P-value is smaller than the significance level \(\alpha = 0.05\) — we reject the null hypothesis in favor of the alternative. There is sufficient evidence at the \(\alpha = 0.05\) level to conclude that there is a lack of fit in the simple linear regression model. In light of the scatterplot, the lack of fit test provides the answer we ...

  18. 11.4: Goodness-of-Fit Test

    The test statistic for a goodness-of-fit test is: Σₖ (O − E)²/E, where: The observed values are the data values and the expected values are the values you would expect to get if the null hypothesis were true. There are n terms of the form (O − E)²/E.

  19. hypothesis testing

    So it is always a good idea to read up on the specific goodness of fit test you want to apply to figure out what exactly the null hypothesis is that is being tested. Question 2: To understand this you need to see that a goodness of fit test is just like any other statistical test, and understand exactly what the logic is behind statistical tests.

  20. A comprehensive comparison of goodness-of-fit tests for logistic

    Under the null hypothesis or alternative hypothesis, if Conditions (A1) and (A3 ... N.L.: Goodness-of-fit processes for logistic regression: simulation results. Stat. Med. 21(18), 2723–2738 (2002). Hosmer, D.W., Lemesbow, S.: Goodness of fit tests for the multiple logistic regression model. Commun Stat Theory ...

  21. 9.1: Goodness-of-Fit Test

    The test statistic for a goodness-of-fit test is: Σₖ (O − E)²/E (9.1.1), where: The observed values are the data values and the expected values are the values you would expect to get if the null hypothesis were true.

  22. Six Maxims of Statistical Acumen for Astronomical Data Analysis

    Goodness-of-fit tests are a particular challenge, because, in effect they compare the posited model with a fully flexible model, i.e., a model with a large number of fitted parameters. ... similarly it increases the statistical power to distinguish between the null and the alternative in a hypothesis test. ... the probability of correctly ...

  23. 10.2: Goodness of Fit Test

    This is a test for three or more proportions within a single population, so use the goodness-of-fit test. We will always use a right-tailed χ²-test. The hypotheses for this example would be: H₀: pA = 0.35, pB = 0.23, pC = 0.25, pD = 0.10, pF = 0.07. H₁: At least one proportion is different. Even though there is an inequality in H₁, the goodness ...
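The grade-distribution hypotheses in this excerpt can be tested end to end. The null proportions are those stated in H₀, while the observed grade counts below are hypothetical; SciPy's chi-square survival function gives the right-tailed p-value.

```python
# Right-tailed chi-square goodness-of-fit test for the grade proportions in H0.
from scipy.stats import chi2

observed = [40, 20, 25, 10, 5]        # hypothetical grade counts (n = 100)
null_props = [0.35, 0.23, 0.25, 0.10, 0.07]
expected = [p * sum(observed) for p in null_props]

stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
p_value = chi2.sf(stat, df)           # right-tailed p-value
```

With these hypothetical counts the statistic is about 1.68 on 4 degrees of freedom (p ≈ 0.79), so the data would be consistent with the stated proportions.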

  24. Regulatory transposable elements in the encyclopedia of DNA ...

    An alternative model is that TEs acquire TFBSs and ... Multinomial tests for goodness of fit ... We shuffled TE coordinates using bedtools shuffle 100 times to constitute the null hypothesis set ...

  25. 12.1: The χ2 Goodness-of-fit Test

    For instance, in the cards example my null hypothesis was that all the four suit probabilities were identical (i.e., P₁ = P₂ = P₃ = P₄ = 0.25), but there's nothing special about that hypothesis. I could just as easily have tested the null hypothesis that P₁ = 0.7 and P₂ = P₃ = P₄ = 0.1 using a goodness of fit test. So it's helpful to the ...