
Statistics By Jim

Making statistics intuitive

Interpreting Correlation Coefficients

By Jim Frost

What are Correlation Coefficients?

Correlation coefficients measure the strength of the relationship between two variables. A correlation between variables indicates that as one variable changes in value, the other variable tends to change in a specific direction. Understanding that relationship is useful because we can use the value of one variable to predict the value of the other. For example, height and weight are correlated—as height increases, weight also tends to increase. Consequently, if we observe an individual who is unusually tall, we can predict that their weight is also above average.

In statistical analysis, correlation coefficients are a quantitative assessment that measures both the direction and the strength of this tendency to vary together. There are different types of correlation coefficients that you can use for different kinds of data. In this post, I cover the most common type of correlation—Pearson’s correlation coefficient.

Before we get into the numbers, let’s graph some data first so we can understand the concept behind what we are measuring.

Graph Your Data to Find Correlations

Scatterplots are a great way to check quickly for correlation between pairs of continuous data. The scatterplot below displays the height and weight of pre-teenage girls. Each dot on the graph represents an individual girl and her combination of height and weight. These data are actual data that I collected during an experiment.

This scatterplot displays a positive correlation between height and weight.

At a glance, you can see that there is a correlation between height and weight. As height increases, weight also tends to increase. However, it’s not a perfect relationship. If you look at a specific height, say 1.5 meters, you can see that there is a range of weights associated with it. You can also find short people who weigh more than taller people. However, the general tendency that height and weight increase together is unquestionably present—a correlation exists.

Pearson’s correlation coefficient takes all of the data points on this graph and represents them as a single number. In this case, the statistical output below indicates that the Pearson’s correlation coefficient is 0.694.

Statistical output that displays Pearson's correlation coefficient and p-value.

What do the Pearson correlation coefficient and p-value mean? We’ll interpret the output soon. First, let’s look at a range of possible correlation coefficients so we can understand how our height and weight example fits in.
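If you'd like to reproduce this kind of output yourself, here's a minimal sketch in Python using SciPy. The height and weight values below are made-up placeholders for illustration, not the actual data from the scatterplot above:

```python
# Minimal sketch: Pearson's r and its p-value with SciPy.
# The data below are hypothetical, not the study data from this post.
from scipy import stats

heights_m = [1.37, 1.42, 1.46, 1.50, 1.52, 1.55, 1.60, 1.63]
weights_kg = [32.0, 36.5, 35.0, 41.0, 44.5, 43.0, 49.0, 52.5]

r, p_value = stats.pearsonr(heights_m, weights_kg)
print(f"Pearson's r = {r:.3f}, p-value = {p_value:.4f}")
```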

Related posts: Using Excel to Calculate Correlation and Guide to Scatterplots

How to Interpret Pearson Correlation Coefficients

Pearson’s correlation coefficient is represented by the Greek letter rho (ρ) for the population parameter and r for a sample statistic. This correlation coefficient is a single number that measures both the strength and direction of the linear relationship between two continuous variables. Values can range from -1 to +1.

The greater the absolute value of the Pearson correlation coefficient, the stronger the relationship.

  • The extreme values of -1 and 1 indicate a perfectly linear relationship where a change in one variable is accompanied by a perfectly consistent change in the other. For these relationships, all of the data points fall on a line. In practice, you won’t see either type of perfect relationship.
  • A coefficient of zero represents no linear relationship. As one variable increases, there is no tendency in the other variable to either increase or decrease.
  • When the value is between 0 and +1 or between 0 and -1, there is a relationship, but the points don’t all fall on a line. As r approaches -1 or +1, the strength of the relationship increases and the data points tend to fall closer to a line.

The sign of the Pearson correlation coefficient represents the direction of the relationship.

  • Positive coefficients indicate that when the value of one variable increases, the value of the other variable also tends to increase. Positive relationships produce an upward slope on a scatterplot.
  • Negative coefficients indicate that as the value of one variable increases, the value of the other variable tends to decrease. Negative relationships produce a downward slope.

Statisticians consider Pearson’s correlation coefficients to be a standardized effect size because they indicate the strength of the relationship between variables using unitless values that fall within a standardized range of -1 to +1. Effect sizes help you understand how important the findings are in a practical sense. To learn more about unstandardized and standardized effect sizes, read my post about Effect Sizes in Statistics.

Learn how to calculate correlation in my post, Correlation Coefficient Formula Walkthrough.
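For reference, the sample Pearson correlation is the covariance of the two variables standardized by their standard deviations:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

The numerator captures how the two variables vary together; the denominator rescales that quantity so r always falls between -1 and +1.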

Covariance is an unstandardized form of correlation. Learn about it in my posts:

  • Covariance: Definition, Formula & Example
  • Covariances vs Correlation: Understanding the Differences

Examples of Positive and Negative Correlation Coefficients

A positive correlation example is the relationship between the speed of a wind turbine and the amount of energy it produces. As the turbine speed increases, electricity production also increases.

A negative correlation example is the relationship between outdoor temperature and heating costs. As the temperature increases, heating costs decrease.

Graphs for Different Correlation Coefficients

Graphs always help bring concepts to life. The scatterplots below represent a spectrum of different Pearson correlation coefficients. I’ve held the horizontal and vertical scales of the scatterplots constant to allow for valid comparisons between them.

This scatterplot displays a perfect positive correlation of +1.

Discussion about the Scatterplots

For the scatterplots above, I created one positive correlation between the variables and one negative relationship between the variables. Then, I varied only the amount of dispersion between the data points and the line that defines the relationship. That process illustrates how correlation measures the strength of the relationship. The stronger the relationship, the closer the data points fall to the line. I didn’t include plots for correlation coefficients closer to zero than 0.6 and -0.6 because they start to look like blobs of dots, which makes the relationships hard to see.

A common misinterpretation is assuming that negative Pearson correlation coefficients indicate that there is no relationship. After all, a negative correlation sounds suspiciously like no relationship. However, the scatterplots for the negative correlations display real relationships. For negative correlation coefficients, high values of one variable are associated with low values of another variable. For example, there is a negative correlation coefficient for school absences and grades. As the number of absences increases, the grades decrease.

Earlier I mentioned how crucial it is to graph your data to understand them better. However, a quantitative measurement of the relationship does have an advantage. Graphs are a great way to visualize the data, but the scaling can exaggerate or weaken the appearance of a correlation. Additionally, the automatic scaling in most statistical software tends to make all data look similar.

Fortunately, Pearson’s correlation coefficients are unaffected by scaling issues. Consequently, a statistical assessment is better for determining the precise strength of the relationship.

Graphs and the relevant statistical measures often work better in tandem.

Pearson’s Correlation Coefficients Measure Linear Relationship

Pearson’s correlation coefficients measure only linear relationships. Consequently, if your data contain a curvilinear relationship, the Pearson correlation coefficient will not detect it. For example, the correlation for the data in the scatterplot below is zero. However, there is a relationship between the two variables—it’s just not linear.

Scatterplot displays a curvilinear relationship that has a Pearson's correlation coefficient of 0.

This example illustrates another reason to graph your data! Just because the coefficient is near zero, it doesn’t necessarily indicate that there is no relationship.
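To see this numerically, here's a quick sketch: with x symmetric around zero, y = x² is perfectly determined by x, yet the linear correlation comes out at essentially zero:

```python
# Pearson's r misses a perfect curvilinear relationship.
import numpy as np

x = np.linspace(-3, 3, 201)
y = x ** 2  # y is completely determined by x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson's r for y = x^2: {r:.4f}")  # essentially 0
```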

Spearman’s correlation is a nonparametric alternative to Pearson’s correlation coefficient. Use Spearman’s correlation for nonlinear, monotonic relationships and for ordinal data. For more information, read my post Spearman’s Correlation Explained!
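As a rough illustration of the difference, here's a sketch contrasting the two coefficients on a monotonic but strongly curved relationship:

```python
# Pearson vs. Spearman on a monotonic, nonlinear relationship.
import numpy as np
from scipy import stats

x = np.linspace(1, 10, 100)
y = np.exp(x)  # always increasing, but far from a straight line

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)
print(f"Pearson's r    = {pearson_r:.3f}")     # noticeably below 1
print(f"Spearman's rho = {spearman_rho:.3f}")  # 1.000 for monotonic data
```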

Hypothesis Test for Correlation Coefficients

Correlation coefficients have a hypothesis test. As with any hypothesis test, this test takes sample data and evaluates two mutually exclusive statements about the population from which the sample was drawn. For Pearson correlations, the two hypotheses are the following:

  • Null hypothesis: There is no linear relationship between the two variables. ρ = 0.
  • Alternative hypothesis: There is a linear relationship between the two variables. ρ ≠ 0.

Correlation coefficients that equal zero indicate no linear relationship exists. If your p-value is less than your significance level, the sample contains sufficient evidence to reject the null hypothesis and conclude that the Pearson correlation coefficient does not equal zero. In other words, the sample data support the notion that the relationship exists in the population.
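Under the hood, the test converts r into a t-statistic with n - 2 degrees of freedom: t = r√(n - 2) / √(1 - r²). Here's a hedged sketch; the sample size is made up because the post doesn't state the actual n for the height and weight data:

```python
# Sketch of the t-test behind the p-value for Pearson's r.
import math
from scipy import stats

r = 0.694  # correlation from the height/weight example
n = 40     # hypothetical sample size for illustration only

t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
p_value = 2 * stats.t.sf(abs(t), df=n - 2)  # two-sided p-value
print(f"t = {t:.2f}, p = {p_value:.6f}")
```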

Related post: Overview of Hypothesis Tests

Interpreting our Height and Weight Correlation Example

Now that we have seen a range of positive and negative relationships, let’s see how our Pearson correlation coefficient of 0.694 fits in. We know that it’s a positive relationship. As height increases, weight tends to increase. Regarding the strength of the relationship, the graph shows that it’s not a very strong relationship where the data points tightly hug a line. However, it’s not an entirely amorphous blob with a very low correlation. It’s somewhere in between. That description matches our moderate correlation coefficient of 0.694.

For the hypothesis test, our p-value equals 0.000. This p-value is less than any reasonable significance level. Consequently, we can reject the null hypothesis and conclude that the relationship is statistically significant. The sample data support the notion that the relationship between height and weight exists in the population of preteen girls.

Correlation Does Not Imply Causation

I’m sure you’ve heard this expression before, and it is a crucial warning. Correlation between two variables indicates that changes in one variable are associated with changes in the other variable. However, correlation does not mean that the changes in one variable actually cause the changes in the other variable.

Sometimes it is clear that there is a causal relationship. For the height and weight data, it makes sense that adding more vertical structure to a body causes the total mass to increase. Or, increasing the wattage of lightbulbs causes the light output to increase.

However, in other cases, a causal relationship is not possible. For example, ice cream sales and shark attacks have a positive correlation coefficient. Clearly, selling more ice cream does not cause shark attacks (or vice versa). Instead, a third variable, outdoor temperatures, causes changes in the other two variables. Higher temperatures increase both sales of ice cream and the number of swimmers in the ocean, which creates the apparent relationship between ice cream sales and shark attacks.

Beware of spurious correlations!

In statistics, you typically need to perform a randomized, controlled experiment to determine that a relationship is causal rather than merely correlational. Conversely, correlational studies will find relationships quickly and easily, but they are not suitable for establishing causality.

Learn more about Correlation vs. Causation: Understanding the Differences.

Related posts: Using Random Assignment in Experiments and Observational Studies

How Strong of a Correlation is Considered Good?

What is a good correlation? How high should correlation coefficients be? These are commonly asked questions. I have seen several schemes that attempt to classify correlations as strong, medium, and weak.

However, there is only one correct answer. A Pearson correlation coefficient should accurately reflect the strength of the relationship. Take a look at the correlation between the height and weight data, 0.694. It’s not a very strong relationship, but it accurately represents our data. An accurate representation is the best-case scenario for using a statistic to describe an entire dataset.

The strength of any relationship naturally depends on the specific pair of variables. Some research questions involve weaker relationships than other subject areas. Case in point, humans are hard to predict. Studies that assess relationships involving human behavior tend to have correlation coefficients weaker than +/- 0.6.

However, if you analyze two variables in a physical process and have very precise measurements, you might expect correlations near +1 or -1. There is no one-size-fits-all answer for how strong a relationship should be. The correct values for correlation coefficients depend on your study area.

Taking Correlation to the Next Level with Regression Analysis

Wouldn’t it be nice if instead of just describing the strength of the relationship between height and weight, we could define the relationship itself using an equation? Regression analysis does just that. That analysis finds the line and corresponding equation that provides the best fit to our dataset. We can use that equation to understand how much weight increases with each additional unit of height and to make predictions for specific heights. Read my post where I talk about the regression model for the height and weight data.

Regression analysis allows us to expand on correlation in other ways. If we have more variables that explain changes in weight, we can include them in the model and potentially improve our predictions. And, if the relationship is curved, we can still fit a regression model to the data.

Additionally, a form of the Pearson correlation coefficient shows up in regression analysis. R-squared is a primary measure of how well a regression model fits the data. This statistic represents the percentage of variation in one variable that other variables explain. For a pair of variables, R-squared is simply the square of the Pearson’s correlation coefficient. For example, squaring the height-weight correlation coefficient of 0.694 produces an R-squared of 0.482, or 48.2%. In other words, height explains about half the variability of weight in preteen girls.
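As a quick check of that arithmetic:

```python
# Squaring Pearson's r gives R-squared for the two-variable case.
r = 0.694
print(f"R-squared = {r**2:.3f} ({r**2:.1%})")  # 0.482 (48.2%)
```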

If you’re learning about statistics and like the approach I use in my blog, check out my Introduction to Statistics book! It’s available at Amazon and other retailers.

Cover of my Introduction to Statistics: An Intuitive Guide ebook.


Reader Interactions


August 17, 2024 at 2:43 pm

Great, thank you!


August 15, 2024 at 9:33 am

Hi Jim. I had a query. Like if we say there is a correlation of 0.68 between x and y variable, then what exactly does this “0.68” as a “number” indicate apart from the fact that we can say there is a moderate association between x and y.

May 7, 2024 at 9:18 am

Is there any benefit to doing both a correlation and a regression test? I don’t think there is – I believe that a regression output will give you the same information a correlation output would plus more. Please could you let me know if that is correct or am I missing something?


May 7, 2024 at 2:08 pm

Hi Charlotte,

In general, you are correct for simple regression, where you have one independent variable and the dependent variable. The R-square for that model is literally the square of the Pearson’s correlation (r) for those two variables. As you mention, regression gives you additional output along with the strength of the relationship.

But there are a few caveats.

Regression is much more flexible than correlation because it allows you to add other variables, fit curvature and include interaction effects. For example, regression allows you to fit curvature between the two variables using polynomials. So, there are cases where using Pearson’s correlation is inappropriate because the data violate some of the assumptions but regression analysis can handle those data acceptably.

But what you say is correct when you’re looking at a straight line relationship between a pair of variables. In that specific case, simple regression and Pearson’s correlation provide consistent information with regression providing more details.


March 12, 2024 at 4:11 am

Hi, if you are finding the trend between one type of quantitative discrete data and one type of qualitative ordinal data, what correlation test do you use?


September 9, 2023 at 4:46 am

It could be that the sharks are using ice cream as bait. Maybe the sharks are smarter than we think… Seriously, the ice cream as a cause is not likely, but sometimes a perfectly sensible hypothesis with lots of data behind it can be just plain wrong.

September 9, 2023 at 11:43 pm

It can be wrong in a causal sense, but if ice cream cones have a non-causal correlation with the number of shark attacks, they can still help you make predictions. Now, if you thought limiting ice cream sales would reduce shark attacks, that’s not going to work!


June 9, 2023 at 1:56 am

What is to be done when two positive items show a negative correlation within one variable? E.g., an increase in house help decreases “no interruptions in work”? It’s confusing as both are positive questions.

June 10, 2023 at 1:09 am

It’s possibly the result of other variables, known as confounding variables (or confounders), that you might not even have recorded. For example, there might be some other variable that correlates with both “house help” and “interruptions at work” that explains the unexpected negative correlation. Perhaps individuals with house help have more activities occurring throughout the day at home. Those activities would then cause more interruptions. So, you might have a chain of correlations where “home activities” and “house help” have positive correlations. Additionally, “home activities” and “interruptions” might have a negative correlation. Given this arrangement, it wouldn’t be surprising to see a negative correlation between “house help” and “interruptions.”

It goes to show that you need to understand the larger context when analyzing data. Technically, this phenomenon is known as omitted variable bias. Your model (pairwise correlation) omits an important variable (a confounder), which is biasing the results. Click the link to learn more.

The answer is to identify and record the confounding variables and include them in your model, likely a regression model or partial correlation.


May 8, 2023 at 12:58 pm

What if my Pearson’s r is 0.187 and the p-value is 0.001? Do I reject the null hypothesis?

May 8, 2023 at 2:56 pm

Yes! That p-value is below any reasonable significance level. Hence, you can reject the null hypothesis. However, be aware that while the correlation is statistically significant, it is so weak that it probably isn’t practically significant in the real world. In other words, it probably exists in the population you’re assessing but it is too weak to be noticeable/meaningful.

November 30, 2022 at 4:53 am

Thank you, Jim. I really appreciate your help. I will read your post about statistical v practical significance – that sounds really useful. I love how you explain things in such an accessible way.

I have one more question that I was hoping you would be able to help me with, please?

If I have done a correlation test and I have found an extremely weak negative relationship (e.g., -.02) but the relationship is not statistically significant, would this mean that although I have found a very weak negative correlation between the variables in the sample data, it would be unlikely to be found in the population? Therefore, I would fail to reject the null hypothesis that the correlation in the population equals zero.

Thank you again for your help and for this wonderful blog.

December 1, 2022 at 1:57 am

You’re very welcome!

In the case where the correlation is not significant, it indicates that you have insufficient evidence to conclude that it does not equal zero. That’s a mouthful, but there’s a reason for the convoluted wording. Insignificant results don’t prove that there is no effect; they just indicate that your test didn’t detect an effect in the population. It could be that the effect doesn’t exist in the population, OR it could be that your sample size was too small or there’s too much variability in the data.

In short, we say that you failed to reject the null hypothesis.

Basically, you can’t prove a negative (no effect). All you can say is that your study didn’t detect an effect. In this case, it didn’t detect a non-zero correlation.

You can read more about the reason behind the wording “failing to reject the null hypothesis” and what it means precisely.

November 29, 2022 at 12:39 pm

Thank you for this webpage. It is great. I have a question, which I was hoping you’d be able to help me with please.

I have carried out a correlation test, and from my understanding a null hypothesis would be that there is no relationship between the two variables (the variables are independent – there is no correlation).

The p value is statistically significant (.000), and the Pearson correlation result is -.036.

My understanding is that if there is a statistically significant relationship then I would reject the null hypothesis (which suggests there is no relationship between the two variables). My issue is then whether -.036 suggests a very weak relationship or no relationship at all given how close to 0 it is. If it is the latter, would I then say I have failed to reject the null hypothesis even though there is a statistically significant relationship? Or would I say that I have rejected the null hypothesis because there is a statistically significant relationship, but the correlation is very weak.

Any help would be appreciated. Kind regards.

November 29, 2022 at 4:10 pm

What you’re seeing is the difference between statistical significance and practical significance. Yes, your results are statistically significant. You can reject the null hypothesis that rho (the correlation in the population) equals zero. Your data provide enough evidence to conclude that the negative correlation exists in the population (not just your sample).

However, as you say, it’s an extremely weak relationship. Even though it’s not zero, it is essentially zero in a practical sense. Statistically significant results don’t automatically mean that the effect size (correlation in this case) is meaningful in the real world. When a test has very high statistical power (e.g., sometimes due to a very large sample size), it can detect trivial effects. Those effects are real but they’re small in size.

I write more about this in my post about statistical vs. practical significance . But, in a nutshell, your correlation coefficient is statistically significant, but it is not a meaningful effect in the real world.


September 28, 2022 at 10:44 am

I have a simple question, only to frame how to use correlation. Imagine a trial with plants, testing different phosphate (Pi) concentrations (like 8) and its effect on plant growth (assessed as mean plant size per Pi concentration, from enough replicates and data validity to perform classical parametric statistics).

In case A, I have a strong (positive) and significant Pearson correlation between these two parameters, and in particular, the 8 average size values show statistically significant differences (ANOVA) between all the Pi concentrations tested.

In case B, I have the same strong (positive) significant Pearson correlation, but there is no statistically significant difference in terms of size between any of the Pi concentrations tested.

My guess is that it may be possible to interpret case A as Pi being correlated with plant growth; but in case B, no interpretation can be provided given that no significant difference is seen between Pi concentrations on plant size, even if a correlation is obtained. Is this right? But in this case, if I have 3 out of the 8 Pi concentrations for which I obtained a significant difference in plant size, should I perform the correlation only between the significant Pi groups, or could I still take all 8 Pi groups to make interpretations? Thanks in advance!

September 29, 2022 at 7:02 pm

I don’t fully understand your trial. You say that you have a continuous measure of Pi concentration and then average plant sizes. Pearson correlations work with two continuous measures–not a group average. So, you’d need to correlate the Pi concentration with plant size, not average plant size. Or perhaps I’m misunderstanding your description. Please clarify your process. Thanks!

In a more general sense, you have to remember that statistical significance doesn’t necessarily indicate there is a real-world, practical significance to your results. That’s possibly what you’re finding in case B. Although again it’s hard to say if you’re applying correlation to averages.

Statistical significance just indicates that you have reason to believe that a relationship/effect exists in the population. It doesn’t necessarily mean that the effect is large enough to be practically meaningful. For more information, read my post about Practical vs. Statistical Significance .


August 16, 2022 at 11:16 am

This was very educative and easy to follow through for a statistics noob such as me. Thanks! I like your books. Which one is most suited for a beginner level of knowledge?

August 17, 2022 at 12:20 am

My Introduction to Statistics book is the best to get started with for beginners. Click the link to see a post where I discuss it and include a full table of contents.

After reading that, you’d be ready to read both of my other books: Hypothesis Testing and Regression Analysis.


May 16, 2022 at 2:45 pm

Jim, Nassim Taleb makes the point on YouTube (search for Taleb and correlation) that an r = 0.10 is much closer to zero than to an r = 0.20, implying that the distribution function for r is very dependent on the r in the population and the sample size, and that the scale of -1.0 to +1.0 is not a scale separated by equal units. He then warns of significance tests because r is a random variable and subject to sampling fluctuations, and r = .25 could easily be zero due to sampling error (especially for small sample sizes). Can you please discuss if the scale of r = -1.0 to 1.0 is set in equidistant units, or units that only superficially look like they are equidistant?

May 16, 2022 at 6:41 pm

I did a quick search and found a video where he’s talking about using correlation in the financial and investment areas. He seems to be saying that correlation is not the correct tool for that context. I can’t talk to that point because I’m not familiar with the context.

However, yes, I can help you out with most of the other points!

I’ll start with the fact that the scale of -1 to +1 is, in some ways, not consistent. To start, correlation coefficients are a standardized effect. As such, they are unitless. You can’t link them to anything real, but they help you compare between disparate types of studies. In other words, they excel at providing a standard basis of comparison between studies. However, they’re not as good for knowing what the statistic actually means, except for a few specific values, -1, +1, and 0. And perhaps that’s why Taleb isn’t fond of them. (At 20 minutes, I didn’t watch the entire video.)

However, we can convert r to R-squared and it becomes more meaningful. R-squared tells us how much of the variance the relationship accounts for. And, as the name implies, you simply square r to get R-squared. It’s in R-squared where you see that the difference between r of 0.1 and 0.2 is different from say 0.8 and 0.9. When you go from 0.1 to 0.2, R-squared increases from 0.01 to 0.04, an increase of 3%. And note that at those correlations, we’re only explaining between 1 – 4% of the variance. Virtually nothing! Now, if we look at going from an r of 0.8 to 0.9, R-squared increases from 0.64 to 0.81, or 17%. So, we have the same size increase in r (0.1) in both cases, but R-squared increases by 3% in one case and 17% in the other. Also, notice how at a r of 0.5, you’re only accounting for 25% of the variance. That’s not very much. You need an r of 0.707 to explain half the variance (50%). Another way to think of it is that the range of r [0, 0.7] accounts for half the variance while r [0.7, 1] accounts for the other half.

I agree with the point that r = 0.1 is virtually nothing. In fact, you need an r of 0.316 to explain even a tenth (10%) of the variability. I also agree that fixed differences in r (e.g., 0.1) indicate different changes in the strength of the relationship, as I illustrate above. I think those points are valid.

Below, I include a graph showing r vs. R-squared and the curved line indicates that the relationship between the two statistics changes (the inconsistency you mention). If the relationship was consistent, it would be a straight line. For me, R-squared is the better statistic, particularly in conjunction with regression analysis, which provides more information about the nature of the relationships. Of course, the negative range of r produces the mirror graph but the same ideas apply.

Graph displaying the relationship between r and R-squared.

I think correlation coefficients (r) have some other shortcomings. They describe the strength of the relationship but not the actual relationship. And they don’t account for other variables. Regression analysis handles those aspects and I generally prefer that methodology. For me, simple correlation just doesn’t provide enough information by itself in most cases. You also typically don’t get residual plots so you can be sure that you’re satisfying the assumptions (Pearson’s correlation (r) is essentially a linear model).

The sample r does depend on the relationship in the population. But that’s true for all sample statistics–as I write in my post, Sample Statistics Are Always Wrong to Some Extent! I don’t think it’s any worse for correlation than other types of sample statistics. As you increase your sample size, the estimate’s precision will increase (i.e., the error bars become smaller).

I think significance tests are valid for correlation. Yes, it’s subject to sampling fluctuations (sampling error), but so are all sample-based statistics. Hypothesis testing is designed to factor that in. In fact, significance testing specifically helps you distinguish between cases where the sample r = 0.25 might represent 0 in the population vs. cases where that is unlikely. That’s the very intention of significance testing, so I strongly disagree with that point!


April 9, 2022 at 2:20 am

Thank you for the fast response!! I have also read the Spearman’s Rho article (very insightful). In my scatterplot it is suggesting that there is no correlation (completely random distribution). However, I would still like to test the correlation, but in the Spearman’s Rho article you mentioned that if there is no correlation, both the Spearman’s rho value and Pearson’s correlation value would be close to zero. Is it also possible that one value is positive and one is negative? My results right now are R2 Linear = 0.003, Pearson correlation = .058, and Spearman’s correlation coefficient = -0.19. Should I base the rejection of either of my hypotheses on Spearman’s value or Pearson’s value?

Thank you so much!!!

April 9, 2022 at 10:42 pm

I’m glad that it was helpful! It’s definitely possible for correlations to switch directions like that. That’s especially true because both correlations are barely different from zero. So, it wouldn’t take much to cause them to be on opposite sides of zero. The R-squared is telling you that the Pearson’s correlation explains hardly any of the variability.


April 8, 2022 at 7:05 pm

Thank you for this post!! I was wondering, I did a scatterplot which gave me an R2 value of 0.003. The fit line showed a really weak positive correlation which I wanted to test with Spearman’s rho. However, this value is showing a negative value (negative relationship). Do you maybe know why it is showing different correlations since I am using the exact same values?

April 8, 2022 at 7:51 pm

The R-squared value and slope you’re seeing are related to Pearson’s correlation, which differs from Spearman’s rho. They’re different statistical measures using different methods, so it’s not surprising that their values can be different. For more information, read my post about Spearman’s Rho.


April 6, 2022 at 3:37 am

Hi Jim, I had a question. It’s kinda complicated but I try my best to explain it well.

I ran a correlation test between objective social isolation and subjective social isolation. To measure OSI, I used an instrument called the LSNS-6, while I used the R-UCLA Loneliness Scale to measure SSI. Here is the scoring guide for the instruments:

  • higher score obtained on the LSNS-6 = low objective social isolation
  • higher score obtained on the R-UCLA Loneliness Scale = high subjective social isolation

After I run the correlation test, I found the value was r= -.437.

My question is, does the value represent the correlation between the variables (meaning when someone is objectively isolated, they are less likely to be subjectively isolated, and vice versa) OR the correlation between the scores of the instruments used (meaning when someone scores higher on the LSNS-6, they will have a lower score on the R-UCLA Loneliness Scale, and vice versa)? I had confusion due to the scoring guide. I hope you can help me.

Thank you Jim!

April 8, 2022 at 8:17 pm

This specific correlation is a bit tricky because, based on what you wrote, the LSNS-6 is inverted. High LSNS-6 scores correspond to low objective social isolation. Let’s work through this example.

The negative correlation (-0.437) indicates that high LSNS-6 scores tend to correlate with low R-UCLA scores. Now, if we “translate” the instrument measures into what the scores mean as constructs, low objective social isolation tends to correspond to low subjective social isolation.

In other words, there is a negative correlation between the instrument scores. However, there is a positive correlation between the concepts of objective social isolation and subjective isolation, which makes theoretical sense.

The reason the instrument scores have a negative correlation while the constructs have a positive correlation goes back to the fact that high LSNS-6 scores relate to low objective isolation.
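Here's a small simulated sketch of that point. The data are synthetic; the idea is only that reversing the direction of one scale flips the sign of the correlation without changing its strength:

```python
# Reversing a scale's direction flips the correlation's sign, not its strength.
import numpy as np

rng = np.random.default_rng(1)
objective_isolation = rng.normal(size=100)
subjective_isolation = 0.5 * objective_isolation + rng.normal(size=100)

lsns6_style_score = -objective_isolation  # high score = LOW isolation

r_constructs = np.corrcoef(objective_isolation, subjective_isolation)[0, 1]
r_instruments = np.corrcoef(lsns6_style_score, subjective_isolation)[0, 1]
print(f"constructs:  r = {r_constructs:+.3f}")   # positive
print(f"instruments: r = {r_instruments:+.3f}")  # same magnitude, negative
```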

I hope that helps!


April 2, 2022 at 7:16 am

Thanks so much for the highly helpful statistical resources on this website. I am a bit confused about an analysis I carried out. My scatter plot shows a kind of negative relationship between two variables, but my Pearson’s correlation coefficient results tend to say something different: r = -0.198 and a p-value of 0.082. I would appreciate clarification on this.

April 4, 2022 at 3:56 pm

I’m not sure what is surprising you? Can you be more specific?

It sounds like your scatterplot displays a negative relationship and your correlation coefficient is also negative, which sounds consistent. It’s a fairly weak correlation. The p-value indicates that your data don’t provide quite enough evidence to conclude that the correlation you see in the sample via the scatterplot and correlation coefficient also exists in the population. It might just be sampling error.


January 14, 2022 at 8:31 am

Hi Jim, Andrew here.

I am using a Pearson test for two variables: LifeSatisfaction and JobSatisfaction. I have gotten a P-Value 0.000 whilst my R-Value is 0.338. Can you explain to me what relation this is? Am I right in thinking that is strong significance with a weak correlation? And that there is no significant correlation between the two.

January 14, 2022 at 4:59 pm

What you’re running into is the difference between statistical significance and practical significance in the real world. A statistically significant result, such as your correlation, suggests that the relationship you observe in your sample also exists in the population as a whole. However, statistical significance says nothing about how important that relationship is in a practical sense.

Your correlation results suggest that a positive correlation exists between life satisfaction and job satisfaction amongst the population from which you drew your sample. However, the fairly weak correlation of 0.338 might not be of practical significance. People with satisfying jobs might be a little happier but perhaps not to a noticeable degree.

So, for your correlation, statistical significance–yes! Practical significance–maybe not.

For more information, read my post about statistical significance vs. practical significance where I go into it in more detail.


January 7, 2022 at 7:07 pm

Thank you, Jim, will do.


January 7, 2022 at 5:07 pm

Hello Jim, I just came across this website. I have a query.

I wrote the following for a report: Table 5 shows the associations between all the domains. The correlation coefficients between the environment and the economy, social, and culture domains are rs=0.335 (weak), rs=0.427 (low), and rs=0.374 (weak), respectively. The correlation coefficients between the economy and the social and culture domains are rs=0.224 and rs=0.157, respectively, and are negligible. The correlation coefficient (rs=0.451) between the social and the culture domains is low, positive, and significant. These weak to low correlation coefficient values imply that changes in one domain are not correlated strongly with changes in the related domain.

The comment I received was: Correlation studies are meant to see relationships- not influence- even if there is a positive correlation between x and y, one can never conclude if x or y is the reason for such correlation. It can never determine which variables have the most influence. Thus the caution and need to re-word for some of the lines above. A correlation study also does not take into account any extraneous variables that might influence the correlation outcome.

I am not sure how I should reword? I have checked several sources and their interpretations are similar to mine, Please advise. Thank you

January 7, 2022 at 9:25 pm

Personally, I think your wording is fine. Appropriately, you don’t suggest that correlation implies causation. You state that there is correlation. So, I’m not sure why the reviewer has an issue with it.

Perhaps the reviewer wants an explicit statement to that effect? “As with all correlation studies, these correlations do not necessarily represent causal relationships.”

The second portion of the review comment about extraneous variables is, in my opinion, more relevant. Pairwise correlations don’t control for the effects of other variables. Omitted variable bias can affect these pairs. I write about this in a post about omitted variable bias. These biases can exaggerate or minimize the apparent strength of pairwise correlations.

You can avoid that problem by using partial correlations or multiple regression analysis. Although, it’s not necessarily a problem. It’s just a possibility.

January 5, 2022 at 8:52 pm

Is it possible to compare two correlation coefficients? For example, let’s say that I have three data points (A, B, and C) for each of 75 subjects. If I run a Pearson’s on the A&B survey points and receive a result of .006, while the Pearson’s on the A&C survey points is .215…although both are not significant, can I say that there is a stronger correlation between A&C than between A&B? thank you!

January 6, 2022 at 8:31 pm

I am not aware of a test that will assess whether the difference between two correlation coefficients is statistically significant. I know you can do that with regression coefficients, so you might want to determine whether you can use that approach. Click the link to learn more.

However, I can guess that your two coefficients probably are not significantly different and thus you can’t say one is higher. Each of your hypothesis tests is assessing whether one of the coefficients is significantly different from zero. In both cases (0.006 and 0.215), neither is significantly different from zero. Because both of your coefficients are on the same side of zero (positive), the distance between them is even smaller than your larger coefficient’s (0.215) distance from zero. Hence, that difference probably is also not statistically significant. However, one muddling issue is that with the two datasets combined you have a larger total sample size than either alone, which might allow a supposed combined test to determine that the smaller difference is significant. But that’s uncertain and probably unlikely.

There’s a more fundamental issue to consider beyond statistical significance . . . practical significance. The correlation of 0.006 is so small it might as well be zero. The other is 0.215 (which, according to the hypothesis test, also might as well be zero). However, in practical terms, a correlation of 0.215 is also a very weak correlation. So, even if its hypothesis test said it was significantly different from zero, it’s a puny correlation that doesn’t provide much predictive power at all. So, you’re looking at the difference between two practically insignificant correlations. Even if the larger sample size for a combined test did indicate the difference is statistically significant, that difference (0.215 – 0.006 = 0.209) almost certainly is not practically significant in a real-world sense.

But, if you really want to know the statistical answer, look into the regression method.

May 16, 2022 at 2:57 pm

Jim – here is a YT video purporting to demonstrate how to compare correlation coefficients for statistical significance. I’m not a statistician and cannot vouch for the contents. https://www.youtube.com/watch?v=ipqUoAN2m4g

May 16, 2022 at 7:22 pm

That seems like a very non-standard approach in the YT video. And, with a sample size of 200 (100 males, 100 females), even very small effect sizes should be significant. So, I have some doubts about that process, but I haven’t dug into it. It might be totally valid, but it seems inefficient in terms of statistical power for the sample size.

Here’s how I would’ve done that analysis. Instead of correlation, I’d use regression with an interaction effect. I’d want to model the relationship between the amount of time studying for a test and the scores. Additionally, I’d gather 100 males and 100 females and want to see if the relationship between time studying and test scores differs between genders. In regression, that’s an interaction effect. It’s the same question the YT video assesses, but using a different approach that provides a whole lot more answers.

To see that approach in action, read my post about Comparing Regression Lines Using Hypothesis Tests. In that post, I refer to comparing the relationships between two conditions, A and B. You can equate those two conditions to gender (male and female). And I look at the relationship between Input and Output, which you can equate to Time Studying and Test Score, respectively. While reading that post, notice how much more information you obtain using that approach than just the two correlation coefficients and whether they’re significantly different.
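For a rough idea of what that looks like in code, here's a sketch using statsmodels' formula interface with simulated data and hypothetical column names; a significant interaction term means the slopes differ between groups:

```python
# Sketch: testing whether a slope differs between two groups via an interaction.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 100
df = pd.DataFrame({
    "hours": rng.uniform(0, 10, size=2 * n),
    "gender": ["male"] * n + ["female"] * n,
})
# Simulate a steeper study-time effect for one group.
slope = np.where(df["gender"] == "male", 3.0, 5.0)
df["score"] = 50 + slope * df["hours"] + rng.normal(scale=5.0, size=2 * n)

# 'hours * gender' expands to both main effects plus their interaction.
model = smf.ols("score ~ hours * gender", data=df).fit()
print(model.summary())
```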

That’s what I mean by generally preferring regression analysis over simple correlation.


December 9, 2021 at 7:33 pm

Hi Jim, thank you very much for this explanation. I’m working on an article, and I want to calculate the sample size in order to critique the sample size that was used. Is it possible to deduce the p-value from the graph and then apply the rule to deduce N?

December 12, 2021 at 11:57 pm

Unfortunately, I don’t speak French. However, I used Google Translate and I think I understand your question.

No, you can’t calculate the p-value by looking at a graph. You need the actual data values to do that. However, there is another approach you can use to determine whether they have a reasonable sample size.

You can use power and sample size software (such as the free G*Power) to determine a good sample size. Keep in mind that the sample size you need depends on the strength of the correlation in the population. If the population has a correlation of 0.3, then you’ll need 67 data points to obtain a statistical power of 0.8. However, if the population correlation is higher, the required sample size declines while maintaining the statistical power of 0.8. For instance, for population correlations of 0.5 and 0.8, you’ll only need sample sizes of 23 and 8, respectively.
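If you don't have G*Power handy, a common approximation uses the Fisher z transformation. Here's a sketch assuming a one-tailed test at alpha = 0.05 and power = 0.8; it's an approximation, so it lands close to, but not exactly on, the exact figures above:

```python
# Approximate sample size for detecting a population correlation rho
# (Fisher z approximation, one-tailed test).
import math
from scipy.stats import norm

def n_for_correlation(rho, alpha=0.05, power=0.80):
    c = 0.5 * math.log((1 + rho) / (1 - rho))  # Fisher z transform of rho
    z_alpha, z_beta = norm.ppf(1 - alpha), norm.ppf(power)
    return math.ceil(((z_alpha + z_beta) / c) ** 2 + 3)

for rho in (0.3, 0.5, 0.8):
    print(rho, n_for_correlation(rho))  # roughly 68, 24, 9
```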

Using this approach, you’ll at least be able to determine whether they’re using a reasonable sample size given the size of correlation that they report even though you won’t know the p-value.

Hopefully, they reported the sample size, but if not, you can just count the number of dots on the scatterplot.


November 19, 2021 at 4:47 pm

Hi Jim. How do I interpret r(12) = -.792, p < .001 for a Pearson correlation coefficient?


October 26, 2021 at 4:53 am

Hi, if the correlation between the two independent constructs/variables and the dependent variable/construct is medium or large, what must the manager do to improve the two independent constructs/variables?


October 7, 2021 at 1:12 am

Hi Jim, first of all thank you, this is an excellent resource and has really helped clarify some queries I had. I have run a Pearson’s r test on some stats software to analyse the relationship between increasing age and need for friendship. The result is r = 0.052 and p = 0.381. Am I right in assuming there is a very slight positive correlation between the variables, but one that is not statistically significant, so the null hypothesis cannot be rejected? Kind regards

October 7, 2021 at 11:26 pm

Hi Victoria,

That correlation is so close to 0 that it essentially means that there is no relationship between your two variables. In fact, it’s so close to zero that calling it a very slight positive correlation might be exaggerating by a bit.

As for the p-value, you’re correct. It’s testing the null hypothesis that the correlation equals zero. Because your p-value is greater than any reasonable significance level, you fail to reject the null. Your data provide insufficient evidence to conclude that the correlation doesn’t equal zero (no effect).

If you haven’t already, you should graph your data in a scatterplot. Perhaps there’s a U-shaped relationship that Pearson’s won’t detect?


July 21, 2021 at 11:23 pm

No Jim, I mean to ask: let’s assume the correlation between variables x and y is 0.91. How do we interpret the remaining 0.09, assuming a correlation of 1 is a strong positive linear correlation?

Is this because of diversification, correlation residual or any error term?

July 21, 2021 at 11:29 pm

Oh, ok. Basically, you’re asking why it’s not a perfect correlation of 1? What explains that difference of 0.09 between the observed correlation and 1? There are several reasons. The typical reason is that most relationships aren’t perfect. There’s usually a certain amount of inherent uncertainty between two variables. It’s the nature of the relationship. Occasionally, you might find very near perfect correlations for relationships governed by physical laws.

If you were to have pair of variables that should have a perfect correlation for theoretical reasons, you might still observe an imperfect correlation thanks to measurement error.

July 20, 2021 at 12:49 pm

If two variables have a correlation of 0.91, what is 0.09 in the equation?

July 21, 2021 at 10:59 pm

I’d need more information/context to be able to answer that question. Is it a regression coefficient?


June 30, 2021 at 4:21 pm

You are a great resource. Thank you for being so responsive. I’m sure I’ll be bugging you some more in the future.

June 30, 2021 at 12:48 pm

Jim, using Excel, I just calculated that the correlation between two variables (A and B) is .57, which I believe you would consider to be “moderate.” My question is, how can I translate that correlation into a statement that predicts what would happen to B if A goes up by 1 point? Thanks in advance for your help and most especially for your clarity.

June 30, 2021 at 2:59 pm

Hi Gerry, to get that type of information, you’ll need to use regression analysis. Read my post about using Excel to perform regression for details. For your example, be sure to use A as the independent variable and B as the dependent variable. Then look at the regression coefficient for A to get your answer!


May 24, 2021 at 11:51 pm

Hey Man, I’m taking my stats final this week and I’m so glad I found you! Thank you for saving random college kids like me!


May 19, 2021 at 8:38 am

Hi, I am Nasib Zaman. The Spearman correlation between high temperature and COVID-19 cases was significant (r = 0.393). The correlation between UV index and COVID-19 cases was also significant (r = 0.386). Is it true?

May 20, 2021 at 1:31 am

Both suggest that as temperature and UV increase, the number of COVID cases increases, although both are weak correlations. I don’t know whether that’s true or not. You’d have to assess the validity of the data to make that determination. Additionally, there might be confounding variables at play, which could bias the correlations. I have no way of knowing.


April 12, 2021 at 1:49 pm

I am using Pearson’s correlation coefficient to express the strength of the relationship between my two variables on happiness. Would this be an appropriate use?

Pearson correlations (Sig. 1-tailed = 0.00 for all pairs; N = 1297 throughout):

                             Happiness   Diet    RelationshipSatisfaction
  Happiness                  1.000       .310    .416
  Diet                       .310        1.000   .193
  RelationshipSatisfaction   .416        .193    1.000

If so, would I be right to say that because the coefficient was r = .193, it suggests that there is not too strong a relationship between the two independent variables? Can I use anything else to indicate significance levels?


March 29, 2021 at 3:12 am

I just want to say that your posts are great, but the QA section in the comments is even greater!

Congrats, Jim.

March 29, 2021 at 2:57 pm

Thanks so much!! 🙂

And, I’m really glad you enjoy the QA in the comments. I always request readers to post their questions in the comments section of the relevant post so the answers benefit everyone!


March 24, 2021 at 1:16 am

Thank you very much. This question had been troubling me for the last few days; thanks for helping.

Have a nice day…

March 24, 2021 at 1:34 am

You’re very welcome, Ronak! I’m glad to help!


March 22, 2021 at 12:56 pm

Nalin here. I found your article to be very clarifying conceptually. I had a doubt.

So there is this dataset I have been working on, and I calculated the Pearson correlation coefficient between the target variable and the predictor variables. I found out that none of the predictor variables had a correlation above 0.1 or below -0.1 with the target variable, hence indicating that no linear relationship exists between them.

How can I verify whether or not any non-linear relationships exist between these pairs of variables? Will a scatterplot confirm my claims?

March 23, 2021 at 3:09 pm

Yes, graphing the data in a scatterplot is always a good idea. While you might not have a linear relationship, you could have a curvilinear relationship. A scatterplot would reveal that.

One other thing to watch out for is omitted variable bias. When you perform correlation on a pair of variables, you’re not factoring in other relevant variables that can be confounding the results. To see what I mean, read my post about omitted variable bias. In it, I start with a correlation that appears to be zero even though there actually is a relationship. After I accounted for another variable, there was a significant relationship between the original pair of variables! Just another thing to watch out for that isn’t obvious!

March 20, 2021 at 3:23 am

Yes, I am also doing well…

I am having some subsequent queries…

By overall trend, you mean that the correlation coefficient will capture how y is changing with respect to x (meaning y is increasing or decreasing with an increase or decrease in x). Am I interpreting that correctly?


March 22, 2021 at 12:25 am

This is something that should be clear by examining the scatterplot. Will a straight line fit the dots? Do the dots fall randomly about a straight line, or are there patterns? If a straight line fits the data, Pearson’s correlation is valid. However, if it does not, then Pearson’s is not valid. Graphing is the best way to make the determination.

Thanks for the image.

March 23, 2021 at 3:41 pm

Hi again Ronak!

On your graph, the data points are the red line (actually lots and lots of data points and not really a line!). And, the green line is the linear fit. You don’t usually think of Pearson’s correlation as modeling the data, but it uses a linear fit. So, the green line is how Pearson’s correlation models your data. You can see that the model doesn’t fit the data adequately. There are systematic (i.e., non-random) departures from the data points. Right there you know that Pearson’s correlation is invalid for these data.

Your data have an upward trend. That is, as X increases, Y also increases. And Pearson’s partially captures that trend. Hence, the positive slope for the green line and the positive correlation you calculated. But, it’s not perfect. You need a better model! In terms of correlation, the graph displays a monotonic relationship and Spearman’s correlation would be a good candidate. Or, you could use regression analysis and include a polynomial to model the curvature. Either of these methods will produce a better fit and more accurate results!

March 18, 2021 at 11:01 am

I am Ronak from India. How are you? Hoping corona has not troubled you much. You have simplified the concept very well. You are doing an amazing job, great work. I have one doubt and want to clarify it.

Question: whenever we talk about the correlation coefficient, we talk in terms of linear relationships. But I have calculated the correlation coefficient for the relationship Y vs X^3.

X variable: 1 to 10000; Y = X^3

The correlation coefficient is coming out around 0.9165. It is strange that even though the relationship is not linear, it is still giving me a very high correlation coefficient.

March 19, 2021 at 3:53 pm

I’m doing well here. Just hunkering down like everyone else! I hope you’re doing well too! 🙂

For your data, I’d recommend graphing them in a scatterplot and fitting a linear trend line. You can do that in Excel. If your data follow an S-shaped cubic relationship, it is still possible to get a relatively strong correlation. You’ll be able to see how that happens in the scatterplot with the trend line. There’s an overall trend to the data that your line follows, but it doesn’t hug the curves. However, if you fit a model with a cubic term to fit the curves, you’ll get a better model.

So, let’s switch from a correlation to R-squared. Your correlation of 0.9165 corresponds to an R-squared of 0.84. I’m literally squaring your correlation coefficient to get the R-squared value. Now, fit a regression model with the quadratic and cubic terms to fit your data. You’ll find that your R-squared for this model is higher than for the linear model.

In short, the linear correlation is capturing the overall trend in the data but doesn’t fit the data points as well as the model designed for curvilinear data. Your correlation seems good but it doesn’t fully fit the data.
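You can verify those numbers quickly; the 0.9165 is a property of the y = x³ relationship over that range:

```python
# Checking the linear correlation for y = x^3 with x from 1 to 10000.
import numpy as np

x = np.arange(1, 10001, dtype=float)
y = x ** 3

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson's r = {r:.4f}")          # ~0.9165, matching the value above
print(f"linear R-squared = {r**2:.2f}")  # ~0.84
```

A regression model that includes the cubic term fits these data exactly, so its R-squared is 1 by construction, which is the improvement described above.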


March 11, 2021 at 10:56 am

Hi Jim, do partial correlations include continuous (scale) variables at all times? Is it possible to include other types of variables (such as nominal or ordinal)? Regards, Jagar

March 16, 2021 at 12:30 am

Pearson correlations are for continuous data that follow a linear relationship. If you have ordinal data or continuous data that follow a monotonic relationship, you can use Spearman’s correlation.

There are correlations specifically for nominal data. I need to write a blog post about those!


March 10, 2021 at 11:45 am

If the correlation coefficient is 0.153, what type of correlation is it?

February 14, 2021 at 1:49 pm


February 12, 2021 at 8:09 pm

If my r value when finding the correlation between two things is -0.0258, what would that be: a weak negative correlation or something else?

February 14, 2021 at 12:08 am

Hi Dez, your correlation coefficient is essentially zero, which indicates no relationship between the variables. As one variable increases, there is no tendency for the other variable to either increase or decrease. There’s just no relationship between them according to your data.


January 9, 2021 at 12:10 pm

my coefficient correlation between my independent variables (anger, anxiety, happiness, satisfaction) and a dependent variable(entrepreneurial decision making behavior) is 0.401, 0.303, 0.369, 0.384.

what does this mean? how do i interpret explain this? what’s the relationship?

January 10, 2021 at 1:33 am

It means that separately each independent variable (IV) has a positive correlation with the dependent variable (DV). As each IV increases, the DV tends to increase. However, these are fairly weak correlations. Additionally, these correlations don’t control for confounding variables. You should perform a regression analysis because you have your IVs and DV. Your model will tell you how much variability the IVs account for in the DV collectively. And, it will control for the other variables in the model, which can help reduce omitted variable bias.

The information in this post should help you interpret your correlation coefficients. Just read through it carefully.


January 4, 2021 at 6:20 am

Hello there, If one were to find out the correlation between the average grade and a variable, could this coefficient be used? Thanks!

January 4, 2021 at 4:03 pm

If you mean something like an average grade per student and the other variable is something like the number of hours each student studies, yes, that’s fine. You just need to be sure that the average grade applies to one person and that the other variable applies to the same person. You can’t pair a class average with a variable measured on individuals.


December 27, 2020 at 8:27 am

I’m helping a friend working on a paper and don’t have the variables. The question centers on the nature of Criterion Referenced Tests (CRTs) in general, i.e., correlations of CRTs vs. Norm Referenced Tests. As you know, Norm Referenced tests compare students to each other across a wide population. In this paper, the student is creating a teacher-made CRT. It measures the proficiency of students of more similar abilities, in a smaller population, against criteria rather than against each other. I suspect that, in general, the CRT doesn’t distinguish as well between students with similar abilities and knowledge, and therefore the reliability coefficients are generally lower. How does this affect high or low correlations?

December 26, 2020 at 9:40 pm

high or lower correlation on a CRT proficiency test good or bad?

December 27, 2020 at 1:30 am

Hi Raymond, I’d have to know more about the variables to have an idea about what the correlation means.


December 8, 2020 at 11:02 pm

I have zero statistics experience but I want to spice up a paper that I’m writing with some quants. And so learned the basics about Pearson correlation on SPSS and I plugged in my data. Now, here’s where it gets “interesting.” Two sets of numbers show up: One on the Pearson Correlation row and below that is the Sig. (2-tailed) row.

I’m too embarrassed to ask folks around me (because I should already know this!). So, let me ask you: which row of numbers should I use in my analysis of the correlation between two variables? For example, my independent variable correlates with the dependent variable at -.002 on the first (Pearson Correlation) row. But below that, the Sig. (2-tailed) row shows .995. What does that mean? And is it necessary to have both numbers?

I would really appreciate your response … and will acknowledge you (if the paper gets published).

Many thanks from an old-school qualitative researcher struggling in the times of quants! 🙂

December 9, 2020 at 12:32 am

The one you want to use for a measure of association is the Pearson Correlation. The other value is the p-value. The p-value is for a hypothesis test that determines whether your correlation value is significantly different from zero (no correlation).

If we take your -0.002 correlation and its p-value (0.995), we’d interpret that as meaning that your sample contains insufficient evidence to conclude that the population correlation is not zero. Given how close the correlation is to zero, that’s not surprising! Zero correlation indicates there is no tendency for one variable to either increase or decrease as the other variable increases. In other words, there is no relationship between them.


November 24, 2020 at 7:55 am

Thank you for the good explanation. I am looking for the source or an article that states that most correlations regarding human behaviour are around .6. What source did you use?

Kind regards, Amy


November 13, 2020 at 5:27 am

This is an informative article and I agree with most of what is said, but this particular sentence might be misleading to readers: “R-squared is a primary measure of how well a regression model fits the data.”. R-squared is in fact based on the assumption that the regression model fits the data to a reasonable extent therefore it cannot also simultaneously be a measure of the goodness of said fit.

The rest of the claims regarding R-squared I completely agree with.

Cheers, Georgi

November 13, 2020 at 2:48 pm

Yes, I make that exact point repeatedly throughout multiple blog posts, particularly my post about R-squared .

Additionally, R-squared is a goodness-of-fit measure, so it is not misleading to say that it measures how well the model fits the data. Yes, it is not a 100% informative measure by itself. You’d also need to assess residual plots in conjunction with the R-squared. Again, that’s a point that I make repeatedly.

I don’t mind disagreements, but I do ask that before disagreeing, you read what I write about a topic to understand what I’m saying. In this case, you would’ve found in my various topics about R-squared and residual plots that we’re saying the same thing.


November 7, 2020 at 12:31 pm

Thank you very much!

November 6, 2020 at 7:34 pm

Hi Jim, I have a question for you – and thank you in advance for responding to it 🙂

Set A has a correlation coefficient of .25 and Set B has a correlation of .9. Which set has the steeper trend line, A or B?

November 6, 2020 at 8:41 pm

Set B has a stronger relationship. However, that’s not quite equivalent to saying it has a steeper trend line. It means the data points fall closer to the line.

If you look at the examples in this post, you’ll notice that all the positive correlations have roughly equal slopes despite having different correlations. Instead, you see the points moving closer to the line as the strength of the relationship increases. The only exception is that a correlation of zero has a slope of zero.

The point being that you can’t tell from the correlation alone which trend line is steeper. However, the relationship in Set B is much stronger than the relationship in Set A.


October 19, 2020 at 6:33 am

Thank you 😊. Now I understand.

October 11, 2020 at 4:49 am

hi, I’m a little confused.

What does it indicate if there is a positive correlation but a negative coefficient in the multiple regression output? In this situation, how should I interpret it: is the relationship negative or positive?

October 13, 2020 at 1:32 pm

This is likely a case of omitted variable bias. A pairwise correlation involves just two variables. Multiple regression analysis involves three variables at a minimum (2 IVs and a DV). Correlation doesn’t control for other variables, while regression analysis controls for the other variables in the model. That can explain the different relationships. Omitted variable bias occurs under specific conditions. Click the link to read about when it occurs. I include an example where I first look at a pair of variables and then three variables and show how that changes the results, similar to your example.


September 30, 2020 at 4:26 pm

Hi Jim, I have 4 objectives in my research, and when I did the correlation between the first one and the others, the results were: ob1 with ob2 is 0.87, ob1 with ob3 is 0.84, and ob1 with ob4 is 0.83. My question is, what does that mean, and can I calculate the correlation coefficient for all of them at one time?


September 28, 2020 at 4:06 pm

Which best describes the correlation coefficient for r=.08?

September 30, 2020 at 4:29 pm

Hi Jolette,

I’d say that is an extremely weak correlation. I’d want to see its p-value. If it’s not significant, then you can’t conclude that the correlation is different from zero (no correlation). Is there something else particular you want to know about it?


September 15, 2020 at 11:50 am

Correlation result between Vul and FCV

t = 3.4535, df = 306, p-value = 0.0006314
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval: 0.08373962 0.29897226
sample estimates: cor = 0.1936854

What does this mean?

September 17, 2020 at 2:53 am

Hi Lakshmi,

It means that your correlation coefficient is ~0.19. That’s the sample estimate. However, because you’re working with a sample, there’s always sampling error, and so the population correlation is probably not exactly equal to the sample value. The confidence interval indicates that you can be 95% confident that the true population correlation falls between ~0.08 and 0.30. The p-value is less than any common significance level. Consequently, you can reject the null hypothesis that the population correlation equals zero and conclude that it does not equal zero. In other words, the correlation you see in the sample is likely to exist in the population.

A correlation of 0.19 is a fairly weak relationship. However, even though it is weak, you have enough evidence to conclude that it exists in the population.
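For readers who want to reproduce that confidence interval themselves, here is a sketch using the Fisher z-transformation (an illustration assuming Python with numpy; df = 306 implies n = 308):

```python
import numpy as np

r, n = 0.1936854, 308  # sample correlation; df = n - 2 = 306

# The Fisher z-transformation yields an approximate 95% CI for the population correlation.
z = np.arctanh(r)
se = 1 / np.sqrt(n - 3)
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
print(f"95% CI: ({lo:.4f}, {hi:.4f})")  # ~(0.0837, 0.2990), matching the output above
```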


September 1, 2020 at 8:16 am

Hi Jim, thank you for your support. I have a question about testing criteria for validity with Pearson correlation, where the r table value is determined by df = N − 2: an item is considered valid if the computed Pearson correlation is greater than the r table value (Pearson correlation > r table), and invalid if the computed Pearson correlation is less than the r table value (Pearson correlation < r table). I got this information from an SPSS tutorial video about Pearson correlation.

But I didn’t find it in other literature. Can you recommend some literature that covers this, or can you clarify more about how to check validity with Pearson correlation?


August 31, 2020 at 3:21 am

Hi Jim, I am Zia from Pakistan. I want to find the correlation of two factors. I got values of 144.6 and 66.93. Is that a positive relation?

August 31, 2020 at 12:39 pm

Hi Zia, I’m sorry but I’m not clear about what you’re asking. Correlation coefficients range between -1 and +1, so those two values are not correlation coefficients. Are they regression coefficients?


August 16, 2020 at 6:47 am

Warmest greetings.

My name is Norshidah Nordin, and I would be very grateful if you could provide answers to the following questions.

1) Can I use two different sets of samples (e.g., students’ academic performance (CGPA) as the dependent variable and teachers’ self-efficacy as the independent variable) to run a Pearson correlation analysis? If yes, could you elaborate on this aspect?

2) What is the minimum sample size to use in multiple regression analysis?

August 17, 2020 at 9:06 pm

Hi Norshidah,

For correlations, you need to have multiple measurements on the same item or person. In your scenario, it sounds like you’re taking different measurements on different people. Pearson’s correlation would not be appropriate.

The minimum sample size for multiple regression depends on the number of terms you need to include in your model. Read my post about overfitting regression models , which occurs when you have too few observations for the number of model terms.

I hope this helps!


July 29, 2020 at 5:27 pm

Greetings sir, question…. Can you do an accurate regression with a Pearson’s correlation coefficient of 0.10? Why or Why not?

July 31, 2020 at 5:33 pm

Hi Monique,

It is possible. First, you should determine whether that correlation is statistically significant. You’re seeing a correlation in your sample, but you want to be confident that it also exists in the larger population you’re studying. There’s a possibility that the correlation only exists in your sample by random chance and does not exist in the population, particularly with such a low coefficient. So, check the p-value for the coefficient. If it’s significant, you have reason to proceed with the regression analysis. Additionally, graph your data. Pearson’s is only for linear relationships. Perhaps your coefficient is low because the relationship is curved?

You can fit the regression model to your data. A correlation of 0.10 equates to an R-squared of only 0.01, which is very low. Perhaps adding more independent variables will increase the R-squared. Even if the R-squared stays very low, if your independent variable is significant, you’re still learning something from your regression model. To understand what you can learn in this situation, read my post about regression models with significant variables and low R-squared values.

So, it is possible to do a valid regression and learn useful information even when the correlation is so low. But, you need to check for significance along the way.


July 8, 2020 at 4:55 am

Hello Jim, first and foremost, thank you for giving us comprehensive information about this! It has totally helped me. But I have a question: my Pearson results show that there’s a moderate positive relationship between my variables, which are Parasocial Interaction and fans’ purchase intention.

But the thing is, if I look at the answers, the majority of my participants answered Neutral regarding purchase intention.

What does this mean? Could you help me figure it out? Thank you in advance! I’m a student from Malaysia currently working on my thesis.

July 8, 2020 at 4:00 pm

Hi Titania,

Have you graphed your data using a scatterplot? I’d highly recommend that because I think it will probably clarify what your data are telling you. Also, are both of your variables continuous? I wonder whether purchase intention is ordinal if one of the values is Neutral. If that’s the case, you’d need to use Spearman’s Rank Correlation rather than Pearson’s.


June 18, 2020 at 8:57 am

Hello Jim! I have a question. I calculated a correlation coefficient between the scale variables and got 0.36, which is relatively weak since it gives about 0.13 if squared. What does the interpretation of a correlation depend on? The sample taken, the type of data measurement, or anything else?

I hope you got my question. Thank you for your help!!

June 18, 2020 at 5:06 pm

I’m not clear what you’re asking exactly. Please clarify. The correlation measures the strength of the relationship between the two continuous variables, as I explain in this article.

Yes, that it is a weak relationship. If you’re going to include this is a regression analysis, you might want to read my article about interpreting low R-squared values .

I’m not sure what you mean by scale variables. However, if these are Likert scale items, you’ll need to use Spearman’s correlation instead of Pearson’s correlation.


May 26, 2020 at 12:08 am

Hi Jim, I am very new to statistics and data analysis. I am doing a quantitative study and my sample size is 200 participants. So far I have only obtained 50 complete responses. Using G*Power for a simple linear regression with a medium effect size, an alpha of .05, and a power level of .80, can I do a data analysis with this small sample?

May 26, 2020 at 3:52 am

Please repost your question in the comments section of the appropriate article. It has nothing to do with correlation coefficients. Use the search bar part way down in the right column and search for power. I have a post about power analysis that is a good fit.


May 24, 2020 at 9:02 pm

Thank you Mr.Jim, it was a great answer for me!😉 Take good care~

May 24, 2020 at 9:46 am

I am a student from Malaysia.

I have a question to ask Mr. Jim about how to determine the validity (the accurate figure) of the data for analysis purposes based on the table of Pearson’s correlation coefficients. Is there a method for it?

For example, since the coefficient between one independent variable and the other variable is below 0.7, the data are valid for analysis purposes.

However, I have read in the table that there is a figure that is more than 0.7. I am not sure about that.

Hope to hear from Mr. Jim soon. Thank you.

May 24, 2020 at 4:20 pm

Hi, I hope you’re doing well!

There is no single correlation coefficient value that determines whether it is valid to study. It partly depends on your subject area. A low-noise physical process might often have a correlation in the very high 0.9s, and 0.8 would be considered unacceptable. However, in a study of human behavior, it’s normal and acceptable to have much lower correlations. For example, a correlation of 0.5 might be considered very good. Of course, I’m writing the positive values, but the same applies to negative correlations too.

It also depends on the purpose of your study. If you’re doing something practical, such as describing the relationship between material composition and strength, there might be very specific requirements about how strong that relationship must be for it to be useful. It’s based on real-world practicalities. On the other hand, if you’re just studying something for the sake of science and expanding knowledge, lower correlations might still be interesting.

So, there’s no single answer. It depends on the subject area you are studying and the purpose of your study.


February 17, 2020 at 3:49 pm

HI Jim, what could be the implication of my result if I obtained a weak relationship between industry experience and instructional effectiveness? thanks in advance

February 20, 2020 at 11:29 am

The best way to think of it is to look at the graphs in this article and compare the higher correlation graphs to the lower correlation graphs. In the higher correlation graphs, if you know the value of one variable, you have a more precise prediction of the value of the other variable. Look along the x-axis and pick a value. In the higher correlation graphs, the range of y-values that correspond to your x-value is narrower. That range is relatively wide for lower correlations.

For your example, I’ll assume there is a positive correlation. As industry experience increases, instructional effectiveness also increases. However, because that relationship is weak, the range of instructional effectiveness for any given value of industry experience is relatively wide.


November 25, 2019 at 9:05 pm

if correlation between X and Y is 0.8 .what is the correlation of -X and -Y

November 26, 2019 at 4:59 pm

If you take all the values of X and multiply them by -1 and do the same for Y, your correlation would still be 0.8.
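A quick numerical demonstration of this (my sketch, assuming Python with numpy; the simulated data are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.6, size=500)

print(np.corrcoef(x, y)[0, 1])    # some positive correlation
print(np.corrcoef(-x, -y)[0, 1])  # identical value: negating BOTH variables changes nothing
print(np.corrcoef(-x, y)[0, 1])   # the sign flips only when you negate just one variable
```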


November 7, 2019 at 3:51 am

This is very helpful, thank you Jim!


November 6, 2019 at 3:16 am

Hi, My data is continuous – the variables are individual shares volatility and oil prices and they were non-normal. I used Kendall’s Tau and did not rank the data or alter it in any way. Can my results be trusted?

November 6, 2019 at 3:32 pm

Hi Lorraine,

Kendall’s Tau is a correlation coefficient for ranked data. Even though you might not have ranked your data, your statistical software must have created the ranks behind the scenes.

Typically, you’ll use Pearson’s correlation when you have continuous data that have a straight line relationship. If your data are ordinal, ranked, or do not have a straight line relationship, using something other than Pearson’s correlation is necessary.

You mention that your data are nonnormal. Technically, you want to graph your data and look at the shape of the relationship rather than assessing the distribution for each variable. Although, nonnormality can make a linear relationship less likely. So, graph your data on a scatterplot and see what it looks like. If it is close to a straight line, you should probably use Pearson’s correlation. If it’s not a straight line relationship, you might need to use something like Kendall’s Tau or Spearman’s rho coefficient, both of which are based on ranked data. While Spearman’s rho is more commonly used, Kendall’s Tau has preferable statistical properties.


October 24, 2019 at 11:56 pm

Hi, Jim. If correlations between continuous variables can be measured using Pearson’s, how is correlation between categorical variables measured? Thank you.

October 25, 2019 at 2:38 pm

There are several possible methods, although unlike with continuous data, there doesn’t seem to be a consensus best approach.

But, first off, if you want to determine whether the relationship between categorical variables is statistically significant, use the chi-square test of independence . This test determines whether the relationship between categorical variables is significant, but it does not tell you the degree of correlation.

For the correlation values themselves, there are different methods, such as Goodman and Kruskal’s lambda, Cramér’s V (or phi) for categorical variables with more than 2 levels, and the Phi coefficient for binary data. There are several others that are available as well. Offhand I don’t know the relative pros and cons of each methodology. Perhaps that would be a good post for the future!
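As one concrete example of those options, here is a sketch that computes Cramér’s V from the chi-square statistic (assuming Python with numpy and scipy; the contingency table is hypothetical):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 3x2 contingency table of counts, purely for illustration.
table = np.array([[30, 10],
                  [20, 25],
                  [10, 35]])

chi2, p, dof, expected = chi2_contingency(table)

n = table.sum()
k = min(table.shape) - 1  # the smaller of (rows - 1) and (columns - 1)
cramers_v = np.sqrt(chi2 / (n * k))
print(f"chi-square = {chi2:.2f}, p = {p:.4f}, Cramér's V = {cramers_v:.3f}")
```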


August 29, 2019 at 7:31 pm

Thanks, great explanations.


April 25, 2019 at 11:58 am

In a multi-variable regression model, is there a method for determining where two predictor variables are correlated in their impact on the outcome variable?

If so, then how is this type of scenario determined, and handled?

Thanks, Curt

April 25, 2019 at 1:27 pm

When predictors are correlated, it’s known as multicollinearity. This condition reduces the precision of the coefficient estimates. I’ve written a post about it: Multicollinearity: Detection, Problems, and Solutions . That post should answer all your questions!


February 3, 2019 at 6:45 am

Hi Jim: Great explanations. One quick thing, because the probability distribution is asymptotic, there is no p=.000. The probability can never be zero. I see students reporting that or p<.000 all of the time. The actual number may be p <.00000001, so setting a level of p < .001 is usually the best thing to do and seems like journal editors want that when reporting data. Your thoughts?

February 4, 2019 at 12:25 am

Hi Susan, yes, you’re correct about that. You can’t have a p-value that equals zero. Sometimes software will round down when it’s a very small value. The underlying issue is that no matter how large the difference between your sample value and the null hypothesis value, there is a non-zero probability that you’d obtain the observed results when the null is true.


January 9, 2019 at 6:41 pm

Sir you are love. Such a nice share


November 21, 2018 at 11:17 am

Awesome stuff, really helpful


November 9, 2018 at 11:48 am

What do you do when you can’t perform randomized controlled experiments, as in the cases of social science or society-wide health issues? Apropos of gun violence in America, there appears to be a correlation between the availability of guns in a society and the number of gun deaths in a society, where, as the number of guns in a society goes up, the number of gun deaths goes up. This is true of individual states in the US where gun availability differs, and also in countries where gun availability differs. But when/how can you come to a determination that lowering the number of guns available in a society could reasonably be said to lower the number of gun deaths in that society?

November 9, 2018 at 12:20 pm

Hi Patrick,

It is difficult proving causality using observational studies rather than randomized experiments.

In my mind, the following approach can help when you’re trying to use observational studies to show that A causes B.

In an observational study, you need to worry about confounding variables because the study is not randomized. These confounding variables can provide alternative explanations for the effects/correlations. If you can include all confounding variables in the analysis, it makes the case stronger because it helps rule out other causes. You must also show that A precedes B. Further, it helps if you can demonstrate the mechanism by which A causes B. That mechanism requires subject-area knowledge beyond just a statistical test.

Those are some ideas that come to my mind after brief reflection. There might well be more and, of course, there will be variations based on the study-area.


September 19, 2018 at 4:55 am

Thank you so much, I am learning a lot of thing from you!

Please, keep doing this great job!

Best regards

September 19, 2018 at 11:45 pm

You bet, Patrik!

September 18, 2018 at 6:04 am

Another question is: should I consider transforming my variables before using Pearson correlation if they do not follow a normal distribution or if the two variables do not have a clear linear relationship? What is the implication of that transformation? How do I interpret the relationship if I used a transformed variable (let’s say log)?

September 18, 2018 at 4:44 pm

Because the data need to follow the bivariate normal distribution to use the hypothesis test, I’d assume the transformation process would be more complex than transforming each variable individually. However, I’m not sure about this.

However, if you just want to make a straight line for the correlation to assess, I’d be careful about that too. The correlation of the transformed data would not apply to the untransformed data. One solution would be to use Spearman’s rank order correlation. Another would be to use regression analysis. In regression analysis, you can fit curves, use transformations, etc., and the assumption that the residuals follow a normal distribution (along with the other assumptions) is easy to check.

If you’re not sure that your data fit the assumptions for Pearson’s correlation, consider using regression instead. There are more tools there for you to use.

September 18, 2018 at 5:36 am

Hi Jim, I am always here following your posts.

I would like it if you could clarify something for me, please! What are the assumptions for Pearson correlation that must hold true in order to apply the correlation coefficient?

I have read something on the internet, but there is a lot of confusion. Some people say the dependent variable (if there is one) must be normally distributed; others say both (dependent and independent) must follow a normal distribution. Therefore, I don’t know which one I should follow. I would greatly appreciate your kind contribution. This is something that I am using for my paper.

Thank you in advance!

September 18, 2018 at 4:34 pm

I’m so glad to see that you’re here reading and learning!

This issue turns out to be a bit complicated!

The assumption is actually that the two variables follow a bivariate normal distribution. I won’t go into that here in much detail, but a bivariate normal distribution is more complex than just each variable following a normal distribution. In a nutshell, if you plot data that follow a bivariate normal distribution on a scatterplot, it’ll appear as an elliptical shape.

In terms of the correlation coefficient itself, it simply describes the relationship between the data. It is what it is, and the data don’t need to follow a bivariate normal distribution as long as you are assessing a linear relationship.

On the other hand, the hypothesis test of Pearson’s correlation coefficient does assume that the data follow a bivariate normal distribution. If you want to test whether the coefficient equals zero, then you need to satisfy this assumption. However, one thing I’m not sure about is whether the test is robust to departures from normality. For example, a 1-sample t-test assumes normality, but with a large enough sample size you don’t need to satisfy this assumption. I’m not sure if a similar sample size requirement applies to this particular test.

I hope this clarifies this issue a bit!


August 29, 2018 at 8:04 am

Hello, thanks for the good explanation. Do variables have to be normally distributed to be analyzed in a Pearson’s correlation? Thanks, Moritz

August 30, 2018 at 1:41 pm

No, the variables do not need to follow a normal distribution to use Pearson’s correlation. However, you do need to graph the data on a scatterplot to be sure that the relationship between the variables is linear rather than curved. For curved relationships, consider using Spearman’s rank correlation.


June 1, 2018 at 9:08 am

Pearson’s correlation measures only linear relationships. But regression can be performed with nonlinear functions, and the software will calculate a value of R^2. What is the meaning of an R^2 value when it accompanies a nonlinear regression?

June 1, 2018 at 9:49 am

Hi Jerry, you raise an important point. R^2 is actually not a valid measure in nonlinear models. To read about why, read my post about R-squared in nonlinear models . In that post, I write about why it’s problematic that many statistical software packages do calculate R-squared values for nonlinear regression. Instead, you should use a different goodness-of-fit measure, such as the standard error of the regression .


May 30, 2018 at 11:59 pm

Hi, fantastic blog, very helpful. I was hoping I could ask a question? You talk about correlation coefficients but I was wondering if you have a section that talks about the slope of an association? For example, am I right in thinking that the slope is equal to the standardized coefficient from a regression?

I refer to the paper of Cameron et al., (The Aging of Elastic and Muscular Arteries. Diabetes Care 26:2133–2138, 2003) where in table 3 they report a correlation and a slope. Is the correlation the r value and the slope the beta value?

Many thanks, Matt

May 31, 2018 at 12:13 pm

Thanks and I’m glad you found the blog to be helpful!

Typically, you’d use regression analysis to obtain the slope and correlation to obtain the correlation coefficient. These statistics represent fairly different types of information. The correlation coefficient (r) is more closely related to R^2 in simple regression analysis because both statistics measure how close the data points fall to a line. Not surprisingly if you square r, you obtain R^2.

However, you can use r to calculate the slope coefficient. To do that, you’ll need some other information–the standard deviation of the X variable and the standard deviation of the Y variable.

The formula for the slope in simple regression = r(standard deviation of Y/standard deviation of X).

For more information, read my post about slope coefficients and their p-values in regression analysis . I think that will answer a lot of your questions.
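Here is a quick numerical check of that slope formula (a sketch assuming Python with numpy and scipy; the simulated data are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(10, 2, size=100)
y = 3 * x + rng.normal(0, 4, size=100)

r, _ = stats.pearsonr(x, y)
slope_from_r = r * (np.std(y, ddof=1) / np.std(x, ddof=1))

fit = stats.linregress(x, y)
print(f"slope from r * (sd_y / sd_x): {slope_from_r:.4f}")
print(f"slope from the regression:    {fit.slope:.4f}")  # the two values match exactly
```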


April 12, 2018 at 5:19 am

Nice post ! About pitfalls regarding correlation’s interpretation, here’s a funny database:

http://www.tylervigen.com/spurious-correlations

And a nice and poetic illustration of the concept of correlation:

https://www.youtube.com/watch?v=VFjaBh12C6s&t=0s&index=4&list=PLCkLQOAPOtT1xqDNK8m6IC1bgYCxGZJb_

Have a nice day

April 12, 2018 at 1:57 pm

Thanks for sharing those links! It always fun finding strange correlations like that.

The link for spurious correlations illustrates an important point. Many of those funny correlations are for time series data where both variables have a long-term trend. If you have two variables that you measure over time and they both have long term trends, those two variables will have a strong correlation even if there is no real connection between them!


April 3, 2018 at 7:05 pm

“In statistics, you typically need to perform a randomized, controlled experiment to determine that a relationship is causal rather than merely correlation.”

Would you please provide an example where you can reasonably conclude that x causes y? And how do you know there isn’t a z that you didn’t control for?

April 3, 2018 at 11:00 pm

That’s a great question. The trick is that when you perform an experiment, you should randomly assign subjects to treatment and control groups. This process randomly distributes any other characteristics that are related to the outcome variable (y). Suppose there is a z that is correlated to the outcome. That z gets randomly distributed between the treatment and control groups. The end result is that z should exist in all groups in roughly equal amounts. This equal distribution should occur even if you don’t know what z is. And, that’s the beautiful thing about random assignment. You don’t need to know everything that can affect the outcome, but random assignment still takes care of it all.

Consequently, if there is a relationship between a treatment and the outcome, you can be pretty certain that the treatment causes the changes in the outcome because all other correlation-only relationships should’ve been randomized away.

I’ll be writing about random assignment in the near future. And, I’ve written about the effectiveness of flu shots , which is based on randomized controlled trials.

Correlation Coefficient | Types, Formulas & Examples

Published on August 2, 2021 by Pritha Bhandari. Revised on June 22, 2023.

A correlation coefficient is a number between -1 and 1 that tells you the strength and direction of a relationship between variables .

In other words, it reflects how similar the measurements of two or more variables are across a dataset.

Correlation coefficient value | Correlation type | Meaning
1 | Perfect positive correlation | When one variable changes, the other variables change in the same direction.
0 | Zero correlation | There is no relationship between the variables.
-1 | Perfect negative correlation | When one variable changes, the other variables change in the opposite direction.

Graphs visualizing perfect positive, zero, and perfect negative correlations

Table of contents

  • What does a correlation coefficient tell you
  • Using a correlation coefficient
  • Interpreting a correlation coefficient
  • Visualizing linear correlations
  • Types of correlation coefficients
  • Pearson’s r
  • Spearman’s rho
  • Other coefficients
  • Other interesting articles
  • Frequently asked questions about correlation coefficients

Correlation coefficients summarize data and help you compare results between studies.

Summarizing data

A correlation coefficient is a descriptive statistic . That means that it summarizes sample data without letting you infer anything about the population. A correlation coefficient is a bivariate statistic when it summarizes the relationship between two variables, and it’s a multivariate statistic when you have more than two variables.

If your correlation coefficient is based on sample data, you’ll need an inferential statistic if you want to generalize your results to the population. You can use an F test or a t test to calculate a test statistic that tells you the statistical significance of your finding.

Comparing studies

A correlation coefficient is also an effect size measure, which tells you the practical significance of a result.

Correlation coefficients are unit-free, which makes it possible to directly compare coefficients between studies.


In correlational research , you investigate whether changes in one variable are associated with changes in other variables.

After data collection , you can visualize your data with a scatterplot by plotting one variable on the x-axis and the other on the y-axis. It doesn’t matter which variable you place on either axis.

Visually inspect your plot for a pattern and decide whether there is a linear or non-linear pattern between variables. A linear pattern means you can fit a straight line of best fit between the data points, while a non-linear or curvilinear pattern can take all sorts of different shapes, such as a U-shape or a line with a curve.

Inspecting a scatterplot for a linear pattern

There are many different correlation coefficients that you can calculate. After removing any outliers, select a correlation coefficient that’s appropriate based on the general shape of the scatter plot pattern. Then you can perform a correlation analysis to find the correlation coefficient for your data.

You calculate a correlation coefficient to summarize the relationship between variables without drawing any conclusions about causation .

Both variables are quantitative and normally distributed with no outliers, so you calculate a Pearson’s r correlation coefficient .

The value of the correlation coefficient always ranges between 1 and -1, and you treat it as a general indicator of the strength of the relationship between variables.

The sign of the coefficient reflects whether the variables change in the same or opposite directions: a positive value means the variables change together in the same direction, while a negative value means they change together in opposite directions.

The absolute value of a number is equal to the number without its sign. The absolute value of a correlation coefficient tells you the magnitude of the correlation: the greater the absolute value, the stronger the correlation.

There are many different guidelines for interpreting the correlation coefficient because findings can vary a lot between study fields. You can use the table below as a general guideline for interpreting correlation strength from the value of the correlation coefficient.

While this guideline is helpful in a pinch, it’s much more important to take your research context and purpose into account when forming conclusions. For example, if most studies in your field have correlation coefficients nearing .9, a correlation coefficient of .58 may be low in that context.

Correlation coefficient | Correlation strength | Correlation type
-.7 to -1 | Very strong | Negative
-.5 to -.7 | Strong | Negative
-.3 to -.5 | Moderate | Negative
0 to -.3 | Weak | Negative
0 | None | Zero
0 to .3 | Weak | Positive
.3 to .5 | Moderate | Positive
.5 to .7 | Strong | Positive
.7 to 1 | Very strong | Positive

The correlation coefficient tells you how closely your data fit on a line. If you have a linear relationship, you’ll draw a straight line of best fit that takes all of your data points into account on a scatter plot.

The closer your points are to this line, the higher the absolute value of the correlation coefficient and the stronger your linear correlation.

If all points are perfectly on this line, you have a perfect correlation.

Perfect positive and perfect negative correlations, with all dots sitting on a line

If all points are close to this line, the absolute value of your correlation coefficient is high .

High positive and high negative correlation, where all dots lie close to the line

If these points are spread far from this line, the absolute value of your correlation coefficient is low .

Low positive and low negative correlation, with dots scattered widely around the line

Note that the steepness or slope of the line isn’t related to the correlation coefficient value. The correlation coefficient doesn’t help you predict how much one variable will change based on a given change in the other, because two datasets with the same correlation coefficient value can have lines with very different slopes.

Two positive correlations with the same correlation coefficient but different slopes


You can choose from many different correlation coefficients based on the linearity of the relationship, the level of measurement of your variables, and the distribution of your data.

For high statistical power and accuracy, it’s best to use the correlation coefficient that’s most appropriate for your data.

The most commonly used correlation coefficient is Pearson’s r because it allows for strong inferences. It’s parametric and measures linear relationships. But if your data do not meet all assumptions for this test, you’ll need to use a non-parametric test instead.

Non-parametric tests of rank correlation coefficients summarize non-linear relationships between variables. The Spearman’s rho and Kendall’s tau have the same conditions for use, but Kendall’s tau is generally preferred for smaller samples whereas Spearman’s rho is more widely used.

The table below is a selection of commonly used correlation coefficients, and we’ll cover the two most widely used coefficients in detail in this article.

Correlation coefficient | Type of relationship | Levels of measurement | Data distribution
Pearson’s r | Linear | Two quantitative (interval or ratio) variables | Normal distribution
Spearman’s rho | Non-linear | Two ordinal, interval or ratio variables | Any distribution
Point-biserial | Linear | One dichotomous (binary) variable and one quantitative (interval or ratio) variable | Normal distribution
Cramér’s V (Cramér’s φ) | Non-linear | Two nominal variables | Any distribution
Kendall’s tau | Non-linear | Two ordinal, interval or ratio variables | Any distribution

The Pearson’s product-moment correlation coefficient, also known as Pearson’s r, describes the linear relationship between two quantitative variables.

These are the assumptions your data must meet if you want to use Pearson’s r:

  • Both variables are on an interval or ratio level of measurement
  • Data from both variables follow normal distributions
  • Your data have no outliers
  • Your data is from a random or representative sample
  • You expect a linear relationship between the two variables

The Pearson’s r is a parametric test, so it has high power. But it’s not a good measure of correlation if your variables have a nonlinear relationship, or if your data have outliers, skewed distributions, or come from categorical variables. If any of these assumptions are violated, you should consider a rank correlation measure.

The formula for the Pearson’s r is complicated, but most computer programs can quickly churn out the correlation coefficient from your data. In a simpler form, the formula divides the covariance between the variables by the product of their standard deviations .

Formula:

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}$$

where:

  • r = strength of the correlation between variables x and y
  • n = sample size
  • Σ = sum of what follows…
  • x = every x-variable value
  • y = every y-variable value
  • xy = the product of each x-variable score and the corresponding y-variable score
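As a quick check of the raw-score formula above, here is a short sketch (an illustration assuming Python with numpy; the data values are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 5, 6])
y = np.array([2.0, 1, 4, 3, 7, 8])
n = len(x)

# Raw-score form of Pearson's r, exactly as written above.
num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
den = np.sqrt((n * np.sum(x**2) - np.sum(x)**2) * (n * np.sum(y**2) - np.sum(y)**2))
print(num / den)                # Pearson's r from the formula
print(np.corrcoef(x, y)[0, 1])  # same value from numpy's built-in routine
```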

Pearson sample vs population correlation coefficient formula

When using the Pearson correlation coefficient formula, you’ll need to consider whether you’re dealing with data from a sample or the whole population.

The sample and population formulas differ in their symbols and inputs. A sample correlation coefficient is called r , while a population correlation coefficient is called rho, the Greek letter ρ.

The sample correlation coefficient uses the sample covariance between variables and their sample standard deviations.

Sample correlation coefficient formula:

$$r_{xy} = \frac{\operatorname{cov}(x,y)}{s_{x}\,s_{y}}$$

where:

  • r_{xy} = strength of the correlation between variables x and y
  • cov(x, y) = covariance of x and y
  • s_x = sample standard deviation of x
  • s_y = sample standard deviation of y

The population correlation coefficient uses the population covariance between variables and their population standard deviations.

Population correlation coefficient formula:

$$\rho_{XY} = \frac{\operatorname{cov}(X,Y)}{\sigma_{X}\,\sigma_{Y}}$$

where:

  • ρ_{XY} = strength of the correlation between variables X and Y
  • cov(X, Y) = covariance of X and Y
  • σ_X = population standard deviation of X
  • σ_Y = population standard deviation of Y

Spearman’s rho, or Spearman’s rank correlation coefficient, is the most common alternative to Pearson’s r . It’s a rank correlation coefficient because it uses the rankings of data from each variable (e.g., from lowest to highest) rather than the raw data itself.

You should use Spearman’s rho when your data fail to meet the assumptions of Pearson’s r . This happens when at least one of your variables is on an ordinal level of measurement or when the data from one or both variables do not follow normal distributions.

While the Pearson correlation coefficient measures the linearity of relationships, the Spearman correlation coefficient measures the monotonicity of relationships.

In a linear relationship, each variable changes in one direction at the same rate throughout the data range. In a monotonic relationship, each variable also always changes in only one direction but not necessarily at the same rate.

  • Positive monotonic: when one variable increases, the other also increases.
  • Negative monotonic: when one variable increases, the other decreases.

Monotonic relationships are less restrictive than linear relationships.

Graphs showing a positive, negative, and zero monotonic relationship

Spearman’s rank correlation coefficient formula

The symbols for Spearman’s rho are ρ for the population coefficient and r_s for the sample coefficient. The formula calculates the Pearson’s r correlation coefficient between the rankings of the variable data.

To use this formula, you’ll first rank the data from each variable separately from low to high: every datapoint gets a rank from first, second, or third, etc.

Then, you’ll find the differences (d_i) between the ranks of your variables for each data pair and take that as the main input for the formula.

Spearman’s rank correlation coefficient formula:

$$r_{s} = 1 - \frac{6\sum d_{i}^{2}}{n\left(n^{2} - 1\right)}$$

where:

  • r_s = strength of the rank correlation between variables
  • d_i = the difference between the x-variable rank and the y-variable rank for each pair of data
  • Σd_i² = sum of the squared differences between x- and y-variable ranks
  • n = sample size

If you have a correlation coefficient of 1, all of the rankings for each variable match up for every data pair. If you have a correlation coefficient of -1, the rankings for one variable are the exact opposite of the ranking of the other variable. A correlation coefficient near zero means that there’s no monotonic relationship between the variable rankings.
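To mirror that ranking procedure in code, here is a sketch (assuming Python with numpy and scipy; the data are arbitrary illustration values). With no tied ranks, the d_i formula agrees exactly with scipy’s spearmanr:

```python
import numpy as np
from scipy import stats

x = np.array([10, 20, 30, 40, 50, 60])
y = np.array([1.1, 0.9, 3.5, 2.8, 6.0, 9.3])

# Rank each variable, then apply r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).
rx = stats.rankdata(x)
ry = stats.rankdata(y)
d = rx - ry
n = len(x)
rs_formula = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

rs_scipy, _ = stats.spearmanr(x, y)
print(rs_formula, rs_scipy)  # identical when there are no tied ranks
```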

The correlation coefficient is related to two other coefficients, and these give you more information about the relationship between variables.

Coefficient of determination

When you square the correlation coefficient, you end up with the coefficient of determination (r²). This is the proportion of common variance between the variables. The coefficient of determination is always between 0 and 1, and it’s often expressed as a percentage.

Coefficient of determination: r², the correlation coefficient multiplied by itself.

The coefficient of determination is used in regression models to measure how much of the variance of one variable is explained by the variance of the other variable.

A regression analysis helps you find the equation for the line of best fit, and you can use it to predict the value of one variable given the value for the other variable.

A high r 2 means that a large amount of variability in one variable is determined by its relationship to the other variable. A low r 2 means that only a small portion of the variability of one variable is explained by its relationship to the other variable; relationships with other variables are more likely to account for the variance in the variable.

The correlation coefficient can often overestimate the relationship between variables, especially in small samples, so the coefficient of determination is often a better indicator of the relationship.

Coefficient of alienation

When you take away the coefficient of determination from unity (one), you’ll get the coefficient of alienation. This is the proportion of common variance not shared between the variables, the unexplained variance between the variables.

Coefficient of alienation: 1 − r², one minus the coefficient of determination.

A high coefficient of alienation indicates that the two variables share very little variance in common. A low coefficient of alienation means that a large amount of variance is accounted for by the relationship between the variables.

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Chi square test of independence
  • Statistical power
  • Descriptive statistics
  • Degrees of freedom
  • Pearson correlation
  • Null hypothesis

Methodology

  • Double-blind study
  • Case-control study
  • Research ethics
  • Data collection
  • Hypothesis testing
  • Structured interviews

Research bias

  • Hawthorne effect
  • Unconscious bias
  • Recall bias
  • Halo effect
  • Self-serving bias
  • Information bias

A correlation reflects the strength and/or direction of the association between two or more variables.

  • A positive correlation means that both variables change in the same direction.
  • A negative correlation means that the variables change in opposite directions.
  • A zero correlation means there’s no relationship between the variables.

A correlation is usually tested for two variables at a time, but you can test correlations between three or more variables.

A correlation coefficient is a single number that describes the strength and direction of the relationship between your variables.

Different types of correlation coefficients might be appropriate for your data based on their levels of measurement and distributions . The Pearson product-moment correlation coefficient (Pearson’s r ) is commonly used to assess a linear relationship between two quantitative variables.

These are the assumptions your data must meet if you want to use Pearson’s r:

  • Both variables are on an interval or ratio level of measurement
  • Data from both variables follow normal distributions
  • Your data have no outliers
  • Your data is from a random or representative sample
  • You expect a linear relationship between the two variables

Correlation coefficients always range between -1 and 1.

The sign of the coefficient tells you the direction of the relationship: a positive value means the variables change together in the same direction, while a negative value means they change together in opposite directions.

No, the steepness or slope of the line isn’t related to the correlation coefficient value. The correlation coefficient only tells you how closely your data fit on a line, so two datasets with the same correlation coefficient can have very different slopes.

To find the slope of the line, you’ll need to perform a regression analysis.


12.4 Testing the Significance of the Correlation Coefficient

The correlation coefficient, r , tells us about the strength and direction of the linear relationship between x and y . However, the reliability of the linear model also depends on how many observed data points are in the sample. We need to look at both the value of the correlation coefficient r and the sample size n , together.

We perform a hypothesis test of the "significance of the correlation coefficient" to decide whether the linear relationship in the sample data is strong enough to use to model the relationship in the population.

The sample data are used to compute r , the correlation coefficient for the sample. If we had data for the entire population, we could find the population correlation coefficient. But because we have only sample data, we cannot calculate the population correlation coefficient. The sample correlation coefficient, r , is our estimate of the unknown population correlation coefficient.

  • The symbol for the population correlation coefficient is ρ , the Greek letter "rho."
  • ρ = population correlation coefficient (unknown)
  • r = sample correlation coefficient (known; calculated from sample data)

The hypothesis test lets us decide whether the value of the population correlation coefficient ρ is "close to zero" or "significantly different from zero". We decide this based on the sample correlation coefficient r and the sample size n .

If the test concludes that the correlation coefficient is significantly different from zero, we say that the correlation coefficient is "significant."

  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between x and y because the correlation coefficient is significantly different from zero.
  • What the conclusion means: There is a significant linear relationship between x and y . We can use the regression line to model the linear relationship between x and y in the population.

If the test concludes that the correlation coefficient is not significantly different from zero (it is close to zero), we say that the correlation coefficient is "not significant."

  • Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship between x and y because the correlation coefficient is not significantly different from zero."
  • What the conclusion means: There is not a significant linear relationship between x and y . Therefore, we CANNOT use the regression line to model a linear relationship between x and y in the population.
  • If r is significant and the scatter plot shows a linear trend, the line can be used to predict the value of y for values of x that are within the domain of observed x values.
  • If r is not significant OR if the scatter plot does not show a linear trend, the line should not be used for prediction.
  • If r is significant and if the scatter plot shows a linear trend, the line may NOT be appropriate or reliable for prediction OUTSIDE the domain of observed x values in the data.

PERFORMING THE HYPOTHESIS TEST

  • Null Hypothesis: H 0 : ρ = 0
  • Alternate Hypothesis: H a : ρ ≠ 0

WHAT THE HYPOTHESES MEAN IN WORDS:

  • Null Hypothesis H 0 : The population correlation coefficient IS NOT significantly different from zero. There IS NOT a significant linear relationship (correlation) between x and y in the population.
  • Alternate Hypothesis H a : The population correlation coefficient IS significantly DIFFERENT FROM zero. There IS A SIGNIFICANT LINEAR RELATIONSHIP (correlation) between x and y in the population.

DRAWING A CONCLUSION: There are two methods of making the decision. The two methods are equivalent and give the same result.

  • Method 1: Using the p -value
  • Method 2: Using a table of critical values

In this chapter of this textbook, we will always use a significance level of 5%, α = 0.05

Using the p -value method, you could choose any appropriate significance level you want; you are not limited to using α = 0.05. But the table of critical values provided in this textbook assumes that we are using a significance level of 5%, α = 0.05. (If we wanted to use a different significance level than 5% with the critical value method, we would need different tables of critical values that are not provided in this textbook.)

METHOD 1: Using a p -value to make a decision

Using the TI-83, 83+, 84, 84+ calculator.

To calculate the p -value using LinRegTTEST: On the LinRegTTEST input screen, on the line prompt for β or ρ , highlight " ≠ 0 " The output screen shows the p-value on the line that reads "p =". (Most computer statistical software can calculate the p -value.)

  • If the p-value is less than the significance level (α = 0.05): Decision: Reject the null hypothesis.
  • Conclusion: "There is sufficient evidence to conclude that there is a significant linear relationship between x and y because the correlation coefficient is significantly different from zero."
  • If the p-value is NOT less than the significance level (α = 0.05): Decision: DO NOT REJECT the null hypothesis.
  • Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship between x and y because the correlation coefficient is NOT significantly different from zero."
  • You will use technology to calculate the p -value. The following describes the calculations to compute the test statistics and the p -value:
  • The p -value is calculated using a t -distribution with n - 2 degrees of freedom.
  • The formula for the test statistic is $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$. To compute it by hand, find the numerator first (multiply r by the square root of n − 2), then the denominator (subtract r² from 1 and take the square root), and divide. The value of the test statistic, t, is shown in the computer or calculator output along with the p-value. The test statistic t has the same sign as the correlation coefficient r.
  • The p -value is the combined area in both tails.

An alternative way to calculate the p -value (p) given by LinRegTTest is the command 2*tcdf(abs(t),10^99, n-2) in 2nd DISTR.
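Off the calculator, the same computation takes a few lines of code. The following is a minimal Python sketch (assuming SciPy is installed) of the test statistic and two-tailed p-value described above; it is only a cross-check, not the textbook's calculator procedure, and it reproduces the third exam/final exam example that follows.

```python
from math import sqrt
from scipy import stats

def corr_test(r, n):
    """Two-tailed test of H0: rho = 0 using a t-distribution with n - 2 df."""
    df = n - 2
    t = r * sqrt(df) / sqrt(1 - r**2)   # test statistic; same sign as r
    p = 2 * stats.t.sf(abs(t), df)      # combined area in both tails
    return t, p

t, p = corr_test(0.6631, 11)            # third exam / final exam example
print(f"t = {t:.2f}, p = {p:.3f}")      # t = 2.66, p = 0.026 -> reject H0
```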

THIRD-EXAM vs FINAL-EXAM EXAMPLE: p-value method

  • Consider the third exam/final exam example.
  • The line of best fit is: ŷ = -173.51 + 4.83 x with r = 0.6631 and there are n = 11 data points.
  • Can the regression line be used for prediction? Given a third exam score ( x value), can we use the line to predict the final exam score (predicted y value)?
  • H 0 : ρ = 0
  • H a : ρ ≠ 0
  • The p -value is 0.026 (from LinRegTTest on your calculator or from computer software).
  • The p -value, 0.026, is less than the significance level of α = 0.05.
  • Decision: Reject the Null Hypothesis H 0
  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between the third exam score ( x ) and the final exam score ( y ) because the correlation coefficient is significantly different from zero.

Because r is significant and the scatter plot shows a linear trend, the regression line can be used to predict final exam scores.

METHOD 2: Using a table of Critical Values to make a decision

The 95% Critical Values of the Sample Correlation Coefficient Table can be used to give you a good idea of whether the computed value of r is significant or not. Compare r to the appropriate critical value in the table. If r is not between the positive and negative critical values, then the correlation coefficient is significant. If r is significant, then you may want to use the line for prediction.
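Where do the table's values come from? Presumably from the same two-tailed t-test: |r| is significant exactly when its test statistic reaches the α = 0.05 cutoff, and solving t = r√(n − 2)/√(1 − r²) for r gives r = t_crit/√(t_crit² + df). A minimal Python sketch (again assuming SciPy) reproduces the table entries used in the examples below:

```python
from math import sqrt
from scipy import stats

def critical_r(n, alpha=0.05):
    """95% critical value of r: the |r| at which the test statistic
    t = r*sqrt(n-2)/sqrt(1-r^2) exactly reaches the two-tailed cutoff."""
    df = n - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return t_crit / sqrt(t_crit**2 + df)

print(round(critical_r(10), 3))   # 0.632 -> table value for df = 8
print(round(critical_r(11), 3))   # 0.602 -> table value for df = 9
```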

Example 12.7

Suppose you computed r = 0.801 using n = 10 data points. df = n – 2 = 10 – 2 = 8. The critical values associated with df = 8 are –0.632 and +0.632. If r < negative critical value or r > positive critical value, then r is significant. Since r = 0.801 and 0.801 > 0.632, r is significant and the line may be used for prediction. If you view this example on a number line, it will help you: r is not significant between –0.632 and +0.632, and r = 0.801 lies to the right of +0.632.

Try It 12.7

For a given line of best fit, you computed that r = 0.6501 using n = 12 data points and the critical value is 0.576. Can the line be used for prediction? Why or why not?

Answer: If the scatter plot looks linear, then yes, the line can be used for prediction, because r = 0.6501 > the positive critical value of 0.576.

Example 12.8

Suppose you computed r = –0.624 with 14 data points. df = 14 – 2 = 12. The critical values are –0.532 and 0.532. Since –0.624 < –0.532, r is significant and the line can be used for prediction.

Try It 12.8

For a given line of best fit, you compute that r = 0.5204 using n = 9 data points, and the critical value is 0.666. Can the line be used for prediction? Why or why not?

Answer: No, the line cannot be used for prediction, because r = 0.5204 < the positive critical value of 0.666.

Example 12.9

Suppose you computed r = 0.776 and n = 6. df = 6 – 2 = 4. The critical values are –0.811 and 0.811. Since –0.811 < 0.776 < 0.811, r is not significant, and the line should not be used for prediction.

Try It 12.9

For a given line of best fit, you compute that r = –0.7204 using n = 8 data points, and the critical value is 0.707. Can the line be used for prediction? Why or why not?

Answer: Yes, the line can be used for prediction, because r = –0.7204 < the negative critical value of –0.707.

THIRD-EXAM vs FINAL-EXAM EXAMPLE: critical value method

Consider the third exam/final exam example . The line of best fit is: ŷ = –173.51+4.83 x with r = 0.6631 and there are n = 11 data points. Can the regression line be used for prediction? Given a third-exam score ( x value), can we use the line to predict the final exam score (predicted y value)?

  • Use the "95% Critical Value" table for r with df = n – 2 = 11 – 2 = 9.
  • The critical values are –0.602 and +0.602
  • Since 0.6631 > 0.602, r is significant.
  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between the third exam score ( x ) and the final exam score ( y ) because the correlation coefficient is significantly different from zero.

Example 12.10

Suppose you computed the following correlation coefficients. Using the table at the end of the chapter, determine if r is significant and the line of best fit associated with each r can be used to predict a y value. If it helps, draw a number line. (The four checks are also reproduced in the short sketch after this list.)

  • r = –0.567 and the sample size, n , is 19. The df = n – 2 = 17. The critical value is –0.456. –0.567 < –0.456 so r is significant.
  • r = 0.708 and the sample size, n , is nine. The df = n – 2 = 7. The critical value is 0.666. 0.708 > 0.666 so r is significant.
  • r = 0.134 and the sample size, n , is 14. The df = 14 – 2 = 12. The critical value is 0.532. 0.134 is between –0.532 and 0.532 so r is not significant.
  • r = 0 and the sample size, n , is five. No matter what the dfs are, r = 0 is between the two critical values so r is not significant.
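As a cross-check, the same four cases can be run through the critical-value computation sketched earlier. This is a minimal Python sketch assuming SciPy is available; the (r, n) pairs are taken from the bullets above.

```python
from math import sqrt
from scipy import stats

# (r, n) pairs from the four bullets above
cases = [(-0.567, 19), (0.708, 9), (0.134, 14), (0.0, 5)]

for r, n in cases:
    df = n - 2
    t_crit = stats.t.ppf(0.975, df)             # two-tailed, alpha = 0.05
    r_crit = t_crit / sqrt(t_crit**2 + df)      # 95% critical value of r
    verdict = "significant" if abs(r) > r_crit else "not significant"
    print(f"r = {r:6.3f}, n = {n:2d}, critical value = ±{r_crit:.3f} -> {verdict}")
```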

Try It 12.10

For a given line of best fit, you compute that r = 0 using n = 100 data points. Can the line be used for prediction? Why or why not?

Answer: No, the line cannot be used for prediction, no matter what the sample size is; r = 0 always lies between the critical values.

Assumptions in Testing the Significance of the Correlation Coefficient

Testing the significance of the correlation coefficient requires that certain assumptions about the data are satisfied. The premise of this test is that the data are a sample of observed points taken from a larger population. We have not examined the entire population because it is not possible or feasible to do so. We are examining the sample to draw a conclusion about whether the linear relationship that we see between x and y in the sample data provides strong enough evidence so that we can conclude that there is a linear relationship between x and y in the population.

The regression line equation that we calculate from the sample data gives the best-fit line for our particular sample. We want to use this best-fit line for the sample as an estimate of the best-fit line for the population. Examining the scatterplot and testing the significance of the correlation coefficient helps us determine if it is appropriate to do this.

The assumptions underlying the test of significance are:

  • There is a linear relationship in the population that models the average value of y for varying values of x . In other words, the expected value of y for each particular value of x lies on a straight line in the population. (We do not know the equation for the line for the population. Our regression line from the sample is our best estimate of this line in the population.)
  • The y values for any particular x value are normally distributed about the line. This implies that there are more y values scattered closer to the line than are scattered farther away. Assumption (1) implies that these normal distributions are centered on the line: the means of these normal distributions of y values lie on the line.
  • The standard deviations of the population y values about the line are equal for each value of x . In other words, each of these normal distributions of y values has the same shape and spread about the line.
  • The residual errors are mutually independent (no pattern).
  • The data are produced from a well-designed, random sample or randomized experiment.

Put together: in the population, the y values for each x value are normally distributed about the line with the same standard deviation. For each x value, the mean of the y values lies on the regression line, and more y values lie near the line than are scattered further away.


Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute OpenStax.

Access for free at https://openstax.org/books/introductory-statistics/pages/1-introduction
  • Authors: Barbara Illowsky, Susan Dean
  • Publisher/website: OpenStax
  • Book title: Introductory Statistics
  • Publication date: Sep 19, 2013
  • Location: Houston, Texas
  • Book URL: https://openstax.org/books/introductory-statistics/pages/1-introduction
  • Section URL: https://openstax.org/books/introductory-statistics/pages/12-4-testing-the-significance-of-the-correlation-coefficient

© Jun 23, 2022 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.



Correlation Coefficient

What is the correlation coefficient?

The correlation coefficient is the specific measure that quantifies the strength of the linear relationship between two variables in a correlation analysis. The coefficient is what we symbolize with the r in a correlation report.

How is the correlation coefficient used?

For two variables, the formula compares the distance of each datapoint from the variable mean and uses this to tell us how closely the relationship between the variables can be fit to an imaginary line drawn through the data. This is what we mean when we say that correlations look at linear relationships.

What are some limitations to consider?

Correlation only looks at the two variables at hand and won’t give insight into relationships beyond the bivariate data. This test won’t detect (and therefore will be skewed by) outliers in the data and can’t properly detect curvilinear relationships.

Correlation coefficient variants

This page focuses on the Pearson product-moment correlation. This is one of the most common types of correlation measures used in practice, but there are others. One closely related variant is the Spearman correlation, which is similar in usage but applicable to ranked data.

What do the values of the correlation coefficient mean?


The correlation coefficient r is a unit-free value between -1 and 1. Statistical significance is indicated with a p-value. Therefore, correlations are typically written with two key numbers: r = and p = .

  • The closer r is to zero, the weaker the linear relationship.
  • Positive r values indicate a positive correlation, where the values of both variables tend to increase together.
  • Negative r values indicate a negative correlation, where the values of one variable tend to increase when the values of the other variable decrease.
  • The values 1 and -1 both represent "perfect" correlations, positive and negative respectively. Two perfectly correlated variables change together at a fixed rate. We say they have a linear relationship; when plotted on a scatterplot, all data points can be connected with a straight line.
  • The p-value helps us determine whether or not we can meaningfully conclude that the population correlation coefficient is different from zero, based on what we observe from the sample.
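In software, both key numbers typically come back from a single call. For example, SciPy's pearsonr returns the coefficient together with its two-tailed p-value; the data below are made-up pairs purely for illustration.

```python
from scipy.stats import pearsonr

# Hypothetical paired measurements, for illustration only
x = [1.2, 2.4, 3.1, 4.8, 5.0, 6.7]
y = [2.0, 3.9, 6.1, 8.2, 9.8, 13.1]

r, p = pearsonr(x, y)               # coefficient and two-tailed p-value
print(f"r = {r:.2f}, p = {p:.4f}")  # the two key numbers a correlation is reported with
```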

What is a p-value?

A p-value is a measure of probability used for hypothesis testing. The goal of hypothesis testing is to determine whether there is enough evidence to support a certain hypothesis about your data. Actually, we formulate two hypotheses: the null hypothesis and the alternative hypothesis. In the case of correlation analysis, the null hypothesis is typically that the observed relationship between the variables is the result of pure chance (i.e. the correlation coefficient is really zero — there is no linear relationship). The alternative hypothesis is that the correlation we’ve measured is legitimately present in our data (i.e. the correlation coefficient is different from zero).

The p-value is the probability of observing a sample correlation coefficient at least as extreme as the one we measured when in fact the null hypothesis is true. A low p-value would lead you to reject the null hypothesis. A typical threshold for rejection of the null hypothesis is a p-value of 0.05. That is, if you have a p-value less than 0.05, you would reject the null hypothesis in favor of the alternative hypothesis—that the correlation coefficient is different from zero.

How do we actually calculate the correlation coefficient?

The sample correlation coefficient can be represented with a formula:

$$ r=\frac{\sum\left[\left(x_i-\overline{x}\right)\left(y_i-\overline{y}\right)\right]}{\sqrt{\mathrm{\Sigma}\left(x_i-\overline{x}\right)^2\ \ast\ \mathrm{\Sigma}(y_i\ -\overline{y})^2}} $$

View Annotated Formula

Let’s step through how to calculate the correlation coefficient using an example with a small set of simple numbers, so that it’s easy to follow the operations.

Let’s imagine that we’re interested in whether we can expect there to be more ice cream sales in our city on hotter days. Ice cream shops start to open in the spring; perhaps people buy more ice cream on days when it’s hot outside. On the other hand, perhaps people simply buy ice cream at a steady rate because they like it so much.

We start to answer this question by gathering data on average daily ice cream sales and the highest daily temperature. Ice Cream Sales and Temperature are therefore the two variables which we’ll use to calculate the correlation coefficient. Sometimes data like these are called bivariate data , because each observation (or point in time at which we’ve measured both sales and temperature) has two pieces of information that we can use to describe it. In other words, we’re asking whether Ice Cream Sales and Temperature seem to move together.

As before, a useful way to take a first look is with a scatterplot:

[Scatterplot of the paired Ice Cream Sales and Temperature data.]

We can also look at these data in a table, which is handy for helping us follow the coefficient calculation for each datapoint. When talking about bivariate data, it’s typical to call one variable X and the other Y (these also help us orient ourselves on a visual plane, such as the axes of a plot). Let’s call Ice Cream Sales X , and Temperature Y .

Notice that each datapoint is paired . Remember, we are really looking at individual points in time, and each time has a value for both sales and temperature.

Ice Cream Sales (X)   Temperature °F (Y)
3                     70
6                     75
9                     80

1. Start by finding the sample means

Now that we’re oriented to our data, we can start with two important subcalculations from the formula above: the sample mean , and the difference between each datapoint and this mean (in these steps, you can also see the initial building blocks of standard deviation ).

The sample means are represented with the symbols  x̅ and y̅ , sometimes called “x bar” and “y bar.” The means for Ice Cream Sales ( x̅ ) and Temperature ( y̅ ) are easily calculated as follows:

$$ \overline{x} =\ [3\ +\ 6\ +\ 9] ÷ 3 = 6 $$

$$ \overline{y} =\ [70\ +\ 75\ +\ 80] ÷ 3 = 75 $$

2. Calculate the distance of each datapoint from its mean

With the mean in hand for each of our two variables, the next step is to subtract the mean of Ice Cream Sales (6) from each of our Sales data points ( x i in the formula), and the mean of Temperature (75) from each of our Temperature data points ( y i in the formula). Note that this operation sometimes results in a negative number or zero!

Ice Cream (X)   Temperature °F (Y)   x i − x̄        y i − ȳ
3               70                   3 − 6 = −3     70 − 75 = −5
6               75                   6 − 6 = 0      75 − 75 = 0
9               80                   9 − 6 = 3      80 − 75 = 5

3. Complete the top of the coefficient equation

This piece of the equation is called the Sum of Products. A product is a number you get after multiplying, so this formula is just what it sounds like: the sum of numbers you multiply.

$$ \sum[(x_i-\overline{x})(y_i-\overline{y})] $$

We take the paired values from each row in the last two columns in the table above, multiply them (remember that multiplying two negative numbers makes a positive!), and sum those results:

$$ [(-3)(-5)] + [(0)(0)] + [(3)(5)] = 30 $$

How does the Sum of Products relate to the scatterplot?


The Sum of Products calculation and the location of the data points in our scatterplot are intrinsically related.

Notice that the Sum of Products is positive for our data. When the Sum of Products (the numerator of our correlation coefficient equation) is positive, the correlation coefficient r will be positive, since the denominator—a square root—will always be positive. We know that a positive correlation means that increases in one variable are associated with increases in the other (like our Ice Cream Sales and Temperature example), and on a scatterplot, the data points angle upwards from left to right. But how does the Sum of Products capture this?

  • The only way we will get a positive value for the Sum of Products is if the products we are summing tend to be positive.
  • The only way to get a positive value for each of the products is if both values are negative or both values are positive.
  • The only way to get a pair of two negative numbers is if both values are below their means (on the bottom left side of the scatter plot), and the only way to get a pair of two positive numbers is if both values are above their means (on the top right side of the scatter plot).

So, the Sum of Products tells us whether data tend to appear in the bottom left and top right of the scatter plot (a positive correlation), or alternatively, if the data tend to appear in the top left and bottom right of the scatter plot (a negative correlation).

4. Complete the bottom of the coefficient equation

The denominator of our correlation coefficient equation looks like this:

$$ \sqrt{\mathrm{\Sigma}{(x_i\ -\ \overline{x})}^2\ \ast\ \mathrm{\Sigma}(y_i\ -\overline{y})^2} $$

Let's tackle the expressions in this equation separately and drop in the numbers from our Ice Cream Sales example:

$$ \mathrm{\Sigma}{(x_i\ -\ \overline{x})}^2=(-3)^2+0^2+3^2=9+0+9=18 $$

$$ \mathrm{\Sigma}{(y_i\ -\ \overline{y})}^2=(-5)^2+0^2+5^2=25+0+25=50 $$

When we multiply the result of the two expressions together, we get:

$$ 18\times50\ =\ 900 $$

This brings the bottom of the equation to:

$$ \sqrt{900}=30 $$

5. Finish the calculation, and compare our result with the scatterplot

Here's our full correlation coefficient equation once again:

$$ r=\frac{\sum\left[\left(x_i-\overline{x}\right)\left(y_i-\overline{y}\right)\right]}{\sqrt{\mathrm{\Sigma}\left(x_i-\overline{x}\right)^2\ \ast\ \mathrm{\Sigma}(y_i\ -\overline{y})^2}} $$

Let's pull in the numbers for the numerator and denominator that we calculated above:

$$ r=\frac{30}{30}=1 $$

A perfect correlation between ice cream sales and hot summer days! Of course, finding a perfect correlation is so unlikely in the real world that had we been working with real data, we’d assume we had done something wrong to obtain such a result.
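One way to double-check the hand calculation is to script it. The NumPy sketch below recomputes the Sum of Products, the denominator, and r for the three ice cream data points, and confirms that the built-in function agrees:

```python
import numpy as np

sales = np.array([3, 6, 9])       # Ice Cream Sales (X)
temps = np.array([70, 75, 80])    # Temperature °F (Y)

dx = sales - sales.mean()         # deviations from x-bar: -3, 0, 3
dy = temps - temps.mean()         # deviations from y-bar: -5, 0, 5

numerator = np.sum(dx * dy)                            # Sum of Products = 30
denominator = np.sqrt(np.sum(dx**2) * np.sum(dy**2))   # sqrt(18 * 50) = 30

print(numerator / denominator)             # 1.0
print(np.corrcoef(sales, temps)[0, 1])     # same result from NumPy's built-in
```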

But this result from the simplified data in our example should make intuitive sense based on simply looking at the data points. Let's look again at our scatterplot:


Now imagine drawing a line through that scatterplot. Would it look like a perfect linear fit?


A picture can be worth 1,000 correlation coefficients!

Scatterplots, and other data visualizations, are useful tools throughout the whole statistical process, not just before we perform our hypothesis tests.

In fact, it’s important to remember that relying exclusively on the correlation coefficient can be misleading—particularly in situations involving curvilinear relationships or extreme outliers. Scatterplots remind us that a correlation coefficient of zero or near zero does not necessarily mean that there is no relationship between the variables; it simply means that there is no linear relationship.

Similarly, looking at a scatterplot can provide insights on how outliers—unusual observations in our data—can skew the correlation coefficient. Let’s look at an example with one extreme outlier. The correlation coefficient indicates that there is a relatively strong positive relationship between X and Y. But when the outlier is removed, the correlation coefficient is near zero.

Pearson Correlation

Pearson correlation analysis examines the relationship between two variables. For example, is there a correlation between a person's age and salary?


More specifically, we can use the Pearson correlation coefficient to measure the linear relationship between two variables.

Strength and direction of correlation

With a correlation analysis we can determine:

  • How strong the correlation is
  • and in which direction the correlation goes.

We can read the strength and direction of the correlation in the Pearson correlation coefficient r , whose value varies between -1 and 1.

Strength of the correlation

The strength of the correlation can be read from a table. An r between 0 and 0.1 indicates no correlation; an r between 0.7 and 1 indicates a very high correlation.

Amount of r    Strength of correlation
0.0 – 0.1      no correlation
0.1 – 0.3      low correlation
0.3 – 0.5      medium correlation
0.5 – 0.7      high correlation
0.7 – 1.0      very high correlation

Direction of the correlation

A positive relationship or correlation exists when large values of one variable are associated with large values of the other variable, or when small values of one variable are associated with small values of the other variable.


A positive correlation results, for example, for height and shoe size. This results in a positive correlation coefficient.


A negative correlation is when large values of one variable are associated with small values of the other variable and vice versa.


A negative correlation is usually found between product price and sales volume. This results in a negative correlation coefficient.


Calculate Pearson correlation

The Pearson correlation coefficient is calculated using the following equation. Here r is the Pearson correlation coefficient, x i are the individual values of one variable (e.g., age), y i are the individual values of the other variable (e.g., salary), and x̄ and ȳ are the mean values of the two variables respectively.

$$ r=\frac{\sum\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{\sqrt{\sum\left(x_i-\bar{x}\right)^2\ \sum\left(y_i-\bar{y}\right)^2}} $$

In the equation, we can see that the respective mean value is first subtracted from both variables.

So in our example, we calculate the mean values of age and salary. We then subtract the mean values from each of age and salary. We then multiply both values. We then sum up the individual results of the multiplication. The expression in the denominator ensures that the correlation coefficient is scaled between -1 and 1.

If we now multiply two positive values we get a positive value. If we multiply two negative values we also get a positive value (minus times minus is plus). So all values that lie in these ranges have a positive influence on the correlation coefficient.

[Scatterplot with the two quadrant regions where both deviations have the same sign shaded green.]

If we multiply a positive value and a negative value we get a negative value (minus times plus is minus). So all values that are in these ranges have a negative influence on the correlation coefficient.

[Scatterplot with the two quadrant regions where the deviations have opposite signs shaded red.]

Therefore, if our values are predominantly in the two green areas from previous two figures, we get a positive correlation coefficient and therefore a positive correlation.

If our scores are predominantly in the two red areas from the figures, we get a negative correlation coefficient and thus a negative correlation.

If the points are distributed over all four areas, the positive terms and the negative terms cancel each other out and we might end up with a very small or no correlation.

Testing correlation coefficients for significance

In general, the correlation coefficient is calculated using data from a sample. In most cases, however, we want to test a hypothesis about the population.


In the case of correlation analysis, we then want to know if there is a correlation in the population.

For this, we test whether the correlation coefficient in the sample is statistically significantly different from zero.

Hypotheses in the Pearson Correlation

The null hypothesis and the alternative hypothesis in Pearson correlation are thus:

  • Null hypothesis: The correlation coefficient is not significantly different from zero (There is no linear relationship).
  • Alternative hypothesis: The correlation coefficient deviates significantly from zero (there is a linear correlation).

Note: a hypothesis test only ever decides whether the null hypothesis is rejected or not rejected.

In our example with the salary and the age of a person, the question is thus: is there a correlation between age and salary in the German population (the population of interest)?

To find out, we draw a sample and test whether the correlation coefficient is significantly different from zero in this sample.

  • The null hypothesis is then: There is no correlation between salary and age in the German population.
  • and the alternative hypothesis: There is a correlation between salary and age in the German population.

Significance and the t-test

Whether the Pearson correlation coefficient is significantly different from zero, based on the sample surveyed, can be checked using a t-test, where r is the correlation coefficient and n is the sample size:

$$ t=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$

A p-value can then be calculated from the test statistic t . If the p-value is smaller than the specified significance level, which is usually 5%, then the null hypothesis is rejected, otherwise it is not.

Assumptions of the Pearson correlation

But what about the assumptions for a Pearson correlation? Here we have to distinguish whether we just want to calculate the Pearson correlation coefficient, or whether we want to test a hypothesis.

To calculate the Pearson correlation coefficient, only two metric variables must be present. Metric variables are, for example, a person's weight, a person's salary, or electricity consumption.

The Pearson correlation coefficient then tells us how large the linear relationship is. If there is a non-linear relationship, we cannot read it from the Pearson correlation coefficient.


However, if we want to test whether the Pearson correlation coefficient is significantly different from zero in the sample, i.e. we want to test a hypothesis, the two variables must also be normally distributed!


If this is not given, the calculated test statistic t or the p-value cannot be interpreted reliably. If the assumptions are not met, Spearman's rank correlation can be used.
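As a sketch of that fallback, SciPy exposes both coefficients with the same calling pattern; the age and salary values below are made up purely for illustration.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical age/salary sample, for illustration only
age    = [23, 30, 34, 41, 45, 52, 58, 63]
salary = [28000, 35000, 42000, 51000, 55000, 60000, 62000, 61000]

r_p, p_p = pearsonr(age, salary)    # assumes normality for the significance test
r_s, p_s = spearmanr(age, salary)   # rank-based alternative when it doesn't hold
print(f"Pearson:  r = {r_p:.2f}, p = {p_p:.4f}")
print(f"Spearman: r = {r_s:.2f}, p = {p_s:.4f}")
```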



Correlation Coefficient

Correlation coefficients measure the strength of association between two variables. The most common correlation coefficient, called the Pearson product-moment correlation coefficient, measures the strength of the linear association between variables measured on an interval or ratio scale.


In this tutorial, when we speak simply of a correlation coefficient, we are referring to the Pearson product-moment correlation. Generally, the correlation coefficient of a sample is denoted by r , and the correlation coefficient of a population is denoted by ρ or R .

How to Interpret a Correlation Coefficient

The sign and the absolute value of a correlation coefficient describe the direction and the magnitude of the relationship between two variables.

  • The value of a correlation coefficient ranges between -1 and 1.
  • The greater the absolute value of the Pearson product-moment correlation coefficient, the stronger the linear relationship.
  • The strongest linear relationship is indicated by a correlation coefficient of -1 or 1.
  • The weakest linear relationship is indicated by a correlation coefficient equal to 0.
  • A positive correlation means that if one variable gets bigger, the other variable tends to get bigger.
  • A negative correlation means that if one variable gets bigger, the other variable tends to get smaller.

Keep in mind that the Pearson product-moment correlation coefficient only measures linear relationships. Therefore, a correlation of 0 does not mean zero relationship between two variables; rather, it means zero linear relationship. (It is possible for two variables to have zero linear relationship and a strong curvilinear relationship at the same time.)

Scatterplots and Correlation Coefficients

The scatterplots below show how different patterns of data produce different degrees of correlation.

Maximum positive correlation (r = 1.0)

Strong positive correlation (r = 0.80)

Zero correlation (r = 0)

Maximum negative correlation (r = -1.0)

Moderate negative correlation (r = -0.43)

Strong correlation & outlier (r = 0.71)

Several points are evident from the scatterplots.

  • When the slope of the line in the plot is negative, the correlation is negative; and vice versa.
  • The strongest correlations (r = 1.0 and r = -1.0 ) occur when data points fall exactly on a straight line.
  • The correlation becomes weaker as the data points become more scattered.
  • If the data points fall in a random pattern, the correlation is equal to zero.
  • Correlation is affected by outliers . Compare the first scatterplot with the last scatterplot. The single outlier in the last plot greatly reduces the correlation (from 1.00 to 0.71).

How to Calculate a Correlation Coefficient

If you look in different statistics textbooks, you are likely to find different-looking (but equivalent) formulas for computing a correlation coefficient. In this section, we present several formulas that you may encounter.

The most common formula for computing a product-moment correlation coefficient (r) is given below.

Product-moment correlation coefficient. The correlation r between two variables is:

$$ r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}} $$

where Σ is the summation symbol, x = x i − x̄, x i is the x value for observation i, x̄ is the mean x value, y = y i − ȳ, y i is the y value for observation i, and ȳ is the mean y value.

The formula below uses population means and population standard deviations to compute a population correlation coefficient (ρ) from population data.

Population correlation coefficient. The correlation ρ between two variables is:

$$ \rho = \frac{1}{N} \sum \left[ \frac{X_i - \mu_X}{\sigma_X} \cdot \frac{Y_i - \mu_Y}{\sigma_Y} \right] $$

where N is the number of observations in the population, Σ is the summation symbol, X i is the X value for observation i, μ X is the population mean for variable X, Y i is the Y value for observation i, μ Y is the population mean for variable Y, σ X is the population standard deviation of X, and σ Y is the population standard deviation of Y.

The formula below uses sample means and sample standard deviations to compute a sample correlation coefficient (r) from sample data.

Sample correlation coefficient. The correlation r between two variables is:

$$ r = \frac{1}{n-1} \sum \left[ \frac{x_i - \bar{x}}{s_x} \cdot \frac{y_i - \bar{y}}{s_y} \right] $$

where n is the number of observations in the sample, Σ is the summation symbol, x i is the x value for observation i, x̄ is the sample mean of x, y i is the y value for observation i, ȳ is the sample mean of y, s x is the sample standard deviation of x, and s y is the sample standard deviation of y.

The interpretation of the sample correlation coefficient depends on how the sample data are collected. With a large simple random sample , the sample correlation coefficient is an unbiased estimate of the population correlation coefficient.

Each of the latter two formulas can be derived from the first formula. Use the first or second formula when you have data from the entire population. Use the third formula when you only have sample data, but want to estimate the correlation in the population. When in doubt, use the first formula.
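To see the equivalence concretely, the short NumPy sketch below (with simulated data) evaluates the first, deviation-score formula and the third, standardized-score formula and confirms they agree:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)   # simulated, loosely correlated data

dx, dy = x - x.mean(), y - y.mean()

# First formula: deviation scores
r1 = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Third formula: standardized scores with sample standard deviations
n = len(x)
r3 = np.sum((dx / x.std(ddof=1)) * (dy / y.std(ddof=1))) / (n - 1)

print(np.isclose(r1, r3))   # True: the two formulas give the same r
```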

Fortunately, you will rarely have to compute a correlation coefficient by hand. Many software packages (e.g., Excel) and most graphing calculators have a correlation function that will do the job for you.

Test Your Understanding

A national consumer magazine reported the following correlations.

  • The correlation between car weight and car reliability is -0.30.
  • The correlation between car weight and annual maintenance cost is 0.20.

Which of the following statements are true?

I. Heavier cars tend to be less reliable.
II. Heavier cars tend to cost more to maintain.
III. Car weight is related more strongly to reliability than to maintenance cost.

(A) I only (B) II only (C) III only (D) I and II only (E) I, II, and III

The correct answer is (E). The correlation between car weight and reliability is negative. This means that reliability tends to decrease as car weight increases. The correlation between car weight and maintenance cost is positive. This means that maintenance costs tend to increase as car weight increases.

The strength of a relationship between two variables is indicated by the absolute value of the correlation coefficient. The correlation between car weight and reliability has an absolute value of 0.30. The correlation between car weight and maintenance cost has an absolute value of 0.20. Therefore, the relationship between car weight and reliability is stronger than the relationship between car weight and maintenance cost.

Correlation in Psychology: Meaning, Types, Examples & coefficient


Correlation means association – more precisely, it measures the extent to which two variables are related. There are three possible results of a correlational study: a positive correlation, a negative correlation, and no correlation.
  • A positive correlation is a relationship between two variables in which both variables move in the same direction. Therefore, one variable increases as the other variable increases, or one variable decreases while the other decreases. An example of a positive correlation would be height and weight. Taller people tend to be heavier.


  • A negative correlation is a relationship between two variables in which an increase in one variable is associated with a decrease in the other. An example of a negative correlation would be the height above sea level and temperature. As you climb the mountain (increase in height), it gets colder (decrease in temperature).


  • A zero correlation exists when there is no relationship between two variables. For example, there is no relationship between the amount of tea drunk and the level of intelligence.


Scatter Plots

A correlation can be expressed visually. This is done by drawing a scatter plot (also known as a scattergram, scatter graph, scatter chart, or scatter diagram).

A scatter plot is a graphical display that shows the relationships or associations between two numerical variables (or co-variables), which are represented as points (or dots) for each pair of scores.

A scatter plot indicates the strength and direction of the correlation between the co-variables.

Types of Correlations: Positive, Negative, and Zero

When you draw a scatter plot, it doesn’t matter which variable goes on the x-axis and which goes on the y-axis.

Remember, in correlations, we always deal with paired scores, so the values of the two variables taken together will be used to make the diagram.

Decide which variable goes on each axis and then simply put a cross at the point where the two values coincide.

Uses of Correlations

  • If there is a relationship between two variables, we can make predictions about one from another.
  • Concurrent validity (correlation between a new measure and an established measure).

Reliability

  • Test-retest reliability (are measures consistent?).
  • Inter-rater reliability (are observers consistent?).

Theory verification

  • Predictive validity.

Correlation Coefficients

Instead of drawing a scatter plot, a correlation can be expressed numerically as a coefficient, ranging from -1 to +1. When working with continuous variables, the correlation coefficient to use is Pearson’s r.

Correlation Coefficient Interpretation

The correlation coefficient ( r ) indicates the extent to which the pairs of numbers for these two variables lie on a straight line. Values over zero indicate a positive correlation, while values under zero indicate a negative correlation.

A correlation of –1 indicates a perfect negative correlation, meaning that as one variable goes up, the other goes down. A correlation of +1 indicates a perfect positive correlation, meaning that as one variable goes up, the other goes up.

There is no rule for determining what correlation size is considered strong, moderate, or weak. The interpretation of the coefficient depends on the topic of study.

When studying things that are difficult to measure, we should expect the correlation coefficients to be lower. In these kinds of studies, we rarely see correlations above 0.6. For this kind of data, we generally consider correlations above 0.4 to be relatively strong; correlations between 0.2 and 0.4 are moderate, and those below 0.2 are considered weak.

When we are studying things that are more easily countable, such as socioeconomic status or other demographic data, we expect higher correlations. For such data, we generally consider correlations above 0.75 to be relatively strong; correlations between 0.45 and 0.75 are moderate, and those below 0.45 are considered weak.

Correlation vs. Causation

Causation means that one variable (often called the predictor variable or independent variable) causes the other (often called the outcome variable or dependent variable).

Experiments can be conducted to establish causation. An experiment isolates and manipulates the independent variable to observe its effect on the dependent variable and controls the environment in order that extraneous variables may be eliminated.

A correlation between variables, however, does not automatically mean that the change in one variable is the cause of the change in the values of the other variable. A correlation only shows if there is a relationship between variables.


While variables are sometimes correlated because one does cause the other, it could also be that some other factor, a confounding variable , is actually causing the systematic movement in our variables of interest.

Correlation does not always prove causation, as a third variable may be involved. For example, being a patient in a hospital is correlated with dying, but this does not mean that one event causes the other, as another third variable might be involved (such as diet and level of exercise).

“Correlation is not causation” means that just because two variables are related it does not necessarily mean that one causes the other.

A correlation identifies variables and looks for a relationship between them; an experiment tests the effect that an independent variable has upon a dependent variable.

This means that the experiment can predict cause and effect (causation) but a correlation can only predict a relationship, as another extraneous variable may be involved that it not known about.

1. Correlation allows the researcher to investigate naturally occurring variables that may be unethical or impractical to test experimentally. For example, it would be unethical to conduct an experiment on whether smoking causes lung cancer.

2. Correlation allows the researcher to clearly and easily see if there is a relationship between variables. This can then be displayed in a graphical form.

Limitations

1. Correlation is not and cannot be taken to imply causation. Even if there is a very strong association between two variables, we cannot assume that one causes the other.

For example, suppose we found a positive correlation between watching violence on T.V. and violent behavior in adolescence.

It could be that the cause of both these is a third (extraneous) variable – for example, growing up in a violent home – and that both the watching of T.V. and the violent behavior is the outcome of this.

2. Correlation does not allow us to go beyond the given data. For example, suppose it was found that there was an association between time spent on homework (1/2 hour to 3 hours) and the number of G.C.S.E. passes (1 to 6).

It would not be legitimate to infer from this that spending 6 hours on homework would likely generate 12 G.C.S.E. passes.

How do you know if a study is correlational?

A study is considered correlational if it examines the relationship between two or more variables without manipulating them. In other words, the study does not involve the manipulation of an independent variable to see how it affects a dependent variable.

One way to identify a correlational study is to look for language that suggests a relationship between variables rather than cause and effect.

For example, the study may use phrases like “associated with,” “related to,” or “predicts” when describing the variables being studied.

Another way to identify a correlational study is to look for information about how the variables were measured. Correlational studies typically involve measuring variables using self-report surveys, questionnaires, or other measures of naturally occurring behavior.

Finally, a correlational study may include statistical analyses such as correlation coefficients or regression analyses to examine the strength and direction of the relationship between variables.

Why is a correlational study used?

Correlational studies are particularly useful when it is not possible or ethical to manipulate one of the variables.

For example, it would not be ethical to manipulate someone’s age or gender. However, researchers may still want to understand how these variables relate to outcomes such as health or behavior.

Additionally, correlational studies can be used to generate hypotheses and guide further research.

If a correlational study finds a significant relationship between two variables, this can suggest a possible causal relationship that can be further explored in future research.

What is the goal of correlational research?

The ultimate goal of correlational research is to increase our understanding of how different variables are related and to identify patterns in those relationships.

This information can then be used to generate hypotheses and guide further research aimed at establishing causality.



1.6 - (Pearson) Correlation Coefficient, \(r\)

The correlation coefficient, r, is directly related to the coefficient of determination \(R^{2}\) in an obvious way. If \(R^{2}\) is represented in decimal form, e.g. 0.39 or 0.87, then all we have to do to obtain r is to take the square root of \(R^{2}\):

\(r= \pm \sqrt{R^2}\)

The sign of r depends on the sign of the estimated slope coefficient \(b_{1}\):

  • If \(b_{1}\) is negative, then r takes a negative sign.
  • If \(b_{1}\) is positive, then r takes a positive sign.

That is, the estimated slope and the correlation coefficient r always share the same sign. Furthermore, because \(R^{2}\) is always a number between 0 and 1, the correlation coefficient r is always a number between -1 and 1.
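To make this relationship concrete, here is a minimal R sketch (using R's built-in cars data set rather than any data from this page) that recovers r from \(R^{2}\) and the sign of the fitted slope:

```r
# Sketch: recover r from R^2 and the sign of the fitted slope.
# Built-in cars data: stopping distance (dist) vs. speed.
fit <- lm(dist ~ speed, data = cars)

r_squared <- summary(fit)$r.squared     # coefficient of determination R^2
b1        <- coef(fit)["speed"]         # estimated slope

r_from_R2 <- sign(b1) * sqrt(r_squared) # attach the slope's sign to sqrt(R^2)
r_direct  <- cor(cars$speed, cars$dist) # Pearson's r computed directly

all.equal(unname(r_from_R2), r_direct)  # TRUE: the two agree
```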

One advantage of r is that it is unitless, allowing researchers to make sense of correlation coefficients calculated on different data sets with different units. The "unitless-ness" of the measure can be seen from an alternative formula for r, namely:

\(r=\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}\)

If x is the height of an individual measured in inches and y is the weight of the individual measured in pounds, then the units for the numerator are inches × pounds. Similarly, the units for the denominator are inches × pounds. Because they are the same, the units in the numerator and denominator cancel each other out, yielding a "unitless" measure.

Another formula for r that you might see in the regression literature is one that illustrates how the correlation coefficient r is a function of the estimated slope coefficient \(b_{1}\):

\(r=\dfrac{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}}{\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}\times b_1\)

We are readily able to see from this version of the formula that:

  • If the estimated slope \(b_{1}\) of the regression line is 0, then the correlation coefficient r must also be 0.
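The slope formula can be checked numerically as well. A short R sketch, again on the built-in cars data:

```r
# Sketch: verify that r equals the slope b1 rescaled by the spread of x
# relative to the spread of y.
x <- cars$speed
y <- cars$dist

b1 <- coef(lm(y ~ x))["x"]
r_from_slope <- b1 * sqrt(sum((x - mean(x))^2)) / sqrt(sum((y - mean(y))^2))

all.equal(unname(r_from_slope), cor(x, y))  # TRUE
```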

That's enough with the formulas! As always, we will let statistical software such as Minitab do the dirty calculations for us. Here's what Minitab's output looks like for the skin cancer mortality and latitude example ( Skin Cancer Data ):

Correlation: Mort, Lat

Pearson correlation of Mort and Lat = -0.825

The output tells us that the correlation between skin cancer mortality and latitude is -0.825 for this data set. Note that the order in which you specify the variables doesn't matter:

Correlation: Lat, Mort

The output tells us that the correlation between skin cancer mortality and latitude is still -0.825. What does this correlation coefficient tell us? That is, how do we interpret the Pearson correlation coefficient r ? In general, there is no nice practical operational interpretation for r as there is for \(r^{2}\). You can only use r to make a statement about the strength of the linear relationship between x and y . In general:

  • If r = -1, then there is a perfect negative linear relationship between x and y .
  • If r = 1, then there is a perfect positive linear relationship between x and y .
  • If r = 0, then there is no linear relationship between x and y .

All other values of r tell us that the relationship between x and y is not perfect. The closer r is to 0, the weaker the linear relationship. The closer r is to -1, the stronger the negative linear relationship. And, the closer r is to 1, the stronger the positive linear relationship. As is true for the \(R^{2}\) value, what is deemed a large correlation coefficient r value depends greatly on the research area.

So, what does the correlation of -0.825 between skin cancer mortality and latitude tell us? It tells us:

  • The relationship is negative. As the latitude increases, the skin cancer mortality rate decreases (linearly).
  • The relationship is quite strong (since the value is pretty close to -1).
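Minitab is not the only option; the same number comes out of one line of R. A sketch, assuming the skin cancer data have been loaded from a hypothetical file skincancer.csv with columns Mort and Lat (both names mirror the output above and are assumptions here):

```r
# Sketch: the equivalent of Minitab's correlation command in R.
# "skincancer.csv" is a hypothetical file name for the skin cancer data,
# assumed to contain the columns Mort (mortality) and Lat (latitude).
skin <- read.csv("skincancer.csv")

cor(skin$Mort, skin$Lat)  # -0.825 for this data set
cor(skin$Lat, skin$Mort)  # same value: cor() is symmetric in its arguments
```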


Correlation Coefficients: Appropriate Use and Interpretation

From the Department of Anesthesiology, VU University Medical Center, Amsterdam, the Netherlands. PMID: 29481436. DOI: 10.1213/ANE.0000000000002864

Correlation in the broadest sense is a measure of an association between variables. In correlated data, the change in the magnitude of 1 variable is associated with a change in the magnitude of another variable, either in the same (positive correlation) or in the opposite (negative correlation) direction. Most often, the term correlation is used in the context of a linear relationship between 2 continuous variables and expressed as Pearson product-moment correlation. The Pearson correlation coefficient is typically used for jointly normally distributed data (data that follow a bivariate normal distribution). For nonnormally distributed continuous data, for ordinal data, or for data with relevant outliers, a Spearman rank correlation can be used as a measure of a monotonic association. Both correlation coefficients are scaled such that they range from -1 to +1, where 0 indicates that there is no linear or monotonic association, and the relationship gets stronger and ultimately approaches a straight line (Pearson correlation) or a constantly increasing or decreasing curve (Spearman correlation) as the coefficient approaches an absolute value of 1. Hypothesis tests and confidence intervals can be used to address the statistical significance of the results and to estimate the strength of the relationship in the population from which the data were sampled. The aim of this tutorial is to guide researchers and clinicians in the appropriate use and interpretation of correlation coefficients.


Spearman's Rank-Order Correlation

What values can the Spearman correlation coefficient, \(r_s\), take?

The Spearman correlation coefficient, \(r_s\), can take values from +1 to -1. An \(r_s\) of +1 indicates a perfect association of ranks, an \(r_s\) of zero indicates no association between ranks, and an \(r_s\) of -1 indicates a perfect negative association of ranks. The closer \(r_s\) is to zero, the weaker the association between the ranks.

An example of calculating Spearman's correlation

To calculate a Spearman rank-order correlation on data without any ties we will use the following data:

 Marks
English 56 75 45 71 62 64 58 80 76 61
Maths 66 70 40 60 65 56 59 77 67 63

We then complete the following table:

| English (mark) | Maths (mark) | Rank (English) | Rank (Maths) | d | d² |
|----------------|--------------|----------------|--------------|---|----|
| 56 | 66 | 9 | 4 | 5 | 25 |
| 75 | 70 | 3 | 2 | 1 | 1 |
| 45 | 40 | 10 | 10 | 0 | 0 |
| 71 | 60 | 4 | 7 | 3 | 9 |
| 62 | 65 | 6 | 5 | 1 | 1 |
| 64 | 56 | 5 | 9 | 4 | 16 |
| 58 | 59 | 8 | 8 | 0 | 0 |
| 80 | 77 | 1 | 1 | 0 | 0 |
| 76 | 67 | 2 | 3 | 1 | 1 |
| 61 | 63 | 7 | 6 | 1 | 1 |

Where \(d\) is the difference between ranks and \(d^2\) is the difference squared.

We then calculate the sum of the squared rank differences:

\(\sum d^2 = 25 + 1 + 0 + 9 + 1 + 16 + 0 + 0 + 1 + 1 = 54\)

We then substitute this into the main equation with the other information as follows:

\(r_s = 1 - \dfrac{6\sum d^2}{n(n^2 - 1)} = 1 - \dfrac{6 \times 54}{10(10^2 - 1)} = 1 - \dfrac{324}{990} = 0.67\)

as n = 10. Hence, we have a ρ (or \(r_s\)) of 0.67. This indicates a strong positive relationship between the ranks individuals obtained in the maths and English exams. That is, the higher you ranked in maths, the higher you ranked in English also, and vice versa.
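To check the hand calculation, here is a short R sketch that reproduces the 0.67 from the marks above, both via the rank-difference formula and via the built-in Spearman option:

```r
# Sketch: Spearman's rho for the English and Maths marks above.
english <- c(56, 75, 45, 71, 62, 64, 58, 80, 76, 61)
maths   <- c(66, 70, 40, 60, 65, 56, 59, 77, 67, 63)

# Rank-difference formula (valid with no ties):
# r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
d <- rank(-english) - rank(-maths)  # rank 1 = highest mark, as in the table
n <- length(english)
1 - 6 * sum(d^2) / (n * (n^2 - 1))  # 0.6727..., which rounds to 0.67

# The built-in computation agrees (no ties, so the formula is exact):
cor(english, maths, method = "spearman")
```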

How do you report a Spearman's correlation?

How you report a Spearman's correlation coefficient depends on whether or not you have determined the statistical significance of the coefficient. If you have simply run the Spearman correlation without any statistical significance tests, you can simply state the value of the coefficient, for example:

\(r_s = .67\)

However, if you have also run statistical significance tests, you need to include some more information, for example:

\(r_s(8) = .67,\ p < .05\)

where df = N – 2, and N = number of pairwise cases (here N = 10, so df = 8).

How do you express the null hypothesis for this test?

The general form of a null hypothesis for a Spearman correlation is:

H 0 : There is no [monotonic] association between the two variables [in the population].

Remember, you are making an inference from your sample to the population that the sample is supposed to represent. However, because this is understood of any inferential statistical test, the bracketed qualifiers are often not included. A null hypothesis statement for the example used earlier in this guide would be:

H 0 : There is no [monotonic] association between maths and English marks.

How do I interpret a statistically significant Spearman correlation?

It is important to realize that statistical significance does not indicate the strength of Spearman's correlation. In fact, the statistical significance testing of the Spearman correlation does not provide you with any information about the strength of the relationship. Thus, achieving a value of p = 0.001, for example, does not mean that the relationship is stronger than if you achieved a value of p = 0.04. This is because the significance test is only investigating whether you can reject or fail to reject the null hypothesis. If you set α = 0.05, a statistically significant Spearman rank-order correlation means that, if the null hypothesis were true, there would be less than a 5% chance of obtaining a coefficient at least as strong as the one you found.
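In R, the coefficient and its significance test come out of a single call. A minimal sketch using the marks from the worked example above (note that cor.test's exact p-value may differ slightly from a t-approximation based on df = N – 2):

```r
# Sketch: significance test for the Spearman correlation on the same marks.
english <- c(56, 75, 45, 71, 62, 64, 58, 80, 76, 61)
maths   <- c(66, 70, 40, 60, 65, 56, 59, 77, 67, 63)

# Reports rho and a p-value for H0: no monotonic association.
cor.test(english, maths, method = "spearman")
```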


Pearson's or Spearman's correlation with non-normal data

I get this question frequently enough in my statistics consulting work that I thought I'd post it here. I have an answer, which is posted below, but I was keen to hear what others have to say.

Question: If you have two variables that are not normally distributed, should you use Spearman's rho for the correlation?


  • 4 $\begingroup$ Why not calculate and report both (Pearson's r and Spearman's ρ)? Their difference (or lack thereof) will provide additional information. $\endgroup$ –  user14650 Commented Sep 9, 2015 at 7:51
  • $\begingroup$ A question comparing the distributional assumptions made when we test a simple regression coefficient beta for significance and when we test the Pearson correlation coefficient (numerically equal to the beta): stats.stackexchange.com/q/181043/3277 $\endgroup$ –  ttnphns Commented Nov 29, 2015 at 9:36
  • 5 $\begingroup$ Pearson's correlation is linear, Spearman's is monotonic, so they're not normally for the same purpose. The Pearson coefficient doesn't need you to assume normality. There's a test for it that does assume normality, but you don't have only that option. $\endgroup$ –  Glen_b Commented Feb 24, 2020 at 8:03

8 Answers

Pearson's correlation is a measure of the linear relationship between two continuous random variables. It does not assume normality although it does assume finite variances and finite covariance. When the variables are bivariate normal, Pearson's correlation provides a complete description of the association.

Spearman's correlation applies to ranks and so provides a measure of a monotonic relationship between two continuous random variables. It is also useful with ordinal data and is robust to outliers (unlike Pearson's correlation).

The distribution of either correlation coefficient will depend on the underlying distribution, although both are asymptotically normal because of the central limit theorem.


  • 19 $\begingroup$ Pearson's $\rho$ does not assume normality, but is only an exhaustive measure of association if the joint distribution is multivariate normal. Given the confusion this distinction elicits, you might want to add it to your answer. $\endgroup$ –  user603 Commented Oct 19, 2010 at 7:42
  • 9 $\begingroup$ Is there a source that can be quoted to support the above statement (Person's r does not assume normality)? We're having the same argument in our department at the moment. $\endgroup$ –  user8891 Commented Feb 1, 2012 at 14:52
  • 13 $\begingroup$ "When the variables are bivariate normal, Pearson's correlation provides a complete description of the association." And when the variables are NOT bivariate normal, how useful is Pearson's correlation? $\endgroup$ –  landroni Commented Sep 16, 2014 at 10:07
  • 8 $\begingroup$ This answer seems rather indirect. "When the variables are bivariate normal ..." And when not? This kind of explanation is why I never get statistics. "Rob, how do you like my new dress?" "The dark color emphasizes your light skin." "Sure, Rob, but do you like how it emphasisez my skin?" "Light skin is considered beautiful in many cultures." "I know, Rob, but do you like it?" "I think the dress is beautiful." "I think so, too, Rob, but is it beautiful on me ?" "You always look beautiful to me, honey." sigh $\endgroup$ –  user14650 Commented Aug 12, 2015 at 11:19
  • 5 $\begingroup$ No, we don't. It's quite possible to do inference for Pearson's correlation without assuming bivariate normality, in at least four different ways. (i) use asymptotic results -- already mentioned above; (ii) make some other parametric distributional assumption and derive or simulate the null distribution of the test statistic; (iii) use a permutation test; (iv) use a bootstrap test. There are probably other approaches $\endgroup$ –  Glen_b Commented Oct 6, 2019 at 3:46

Don't forget Kendall's tau! Roger Newson has argued for the superiority of Kendall's $\tau_a$ over Spearman's correlation $r_S$ as a rank-based measure of correlation in a paper whose full text is now freely available online:

Newson R. Parameters behind "nonparametric" statistics: Kendall's tau, Somers' D and median differences. Stata Journal 2002; 2(1):45-64.

He references (on p. 47) Kendall & Gibbons (1990) as arguing that "...confidence intervals for Spearman's $r_S$ are less reliable and less interpretable than confidence intervals for Kendall's $\tau$-parameters, but the sample Spearman's $r_S$ is much more easily calculated without a computer" (which is no longer of much importance, of course).

Kendall, M. G. and J. D. Gibbons. 1990. Rank Correlation Methods . 5th ed. London: Griffin.
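As a quick illustration of how the three coefficients sit side by side in practice, here is an R sketch on simulated data (the data are illustrative, not from Newson's paper):

```r
# Sketch: Pearson, Spearman, and Kendall coefficients on the same data.
set.seed(1)
x <- rnorm(30)
y <- x + rnorm(30)

cor(x, y, method = "pearson")
cor(x, y, method = "spearman")
cor(x, y, method = "kendall")  # Kendall's tau, as advocated above
```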


  • 7 $\begingroup$ I'm also a big fan of Kendall's tau. Pearson is far too sensitive to influential points/outliers for my taste, and while Spearman doesn't suffer from this problem, I personally find Kendall easier to understand, interpret and explain than Spearman. Of course, your mileage may vary. $\endgroup$ –  Stephan Kolassa Commented Oct 19, 2010 at 7:44
  • 1 $\begingroup$ My recollection from experience is that Kendall's tau still runs a lot slower (in R) than Spearman's. This can be important if your dataset is large. $\endgroup$ –  wordsforthewise Commented May 25, 2018 at 18:07

From an applied perspective, I am more concerned with choosing an approach that summarises the relationship between two variables in a way that aligns with my research question. I think that determining a method for getting accurate standard errors and p-values is a question that should come second. Even if you chose not to rely on asymptotics, there's always the option to bootstrap or change distributional assumptions.

As a general rule, I prefer Pearson's correlation because (a) it generally aligns more with my theoretical interests; (b) it enables more direct comparability of findings across studies, because most studies in my area report Pearson's correlation; and (c) in many settings there is minimal difference between Pearson and Spearman correlation coefficients.

However, there are situations where I think Pearson's correlation on raw variables is misleading.

  • Outliers: Outliers can have great influence on Pearson's correlations. Many outliers in applied settings reflect measurement failures or other factors that the model is not intended to generalise to. One option is to remove such outliers. Univariate outliers do not exist with Spearman's rho because everything is converted to ranks. Thus, Spearman is more robust.
  • Highly skewed variables: When correlating skewed variables, particularly highly skewed variables, a log or some other transformation often makes the underlying relationship between the two variables clearer (e.g., brain size by body weight of animals). In such settings it may be that the raw metric is not the most meaningful metric anyway. Spearman's rho has a similar effect to transformation by converting both variables to ranks. From this perspective, Spearman's rho can be seen as a quick-and-dirty approach (or more positively, it is less subjective) whereby you don't have to think about optimal transformations.

In both cases above, I would advise researchers to either consider adjustment strategies (e.g., transformations, outlier removal/adjustment) before applying Pearson's correlation or use Spearman's rho.
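A small simulation makes the outlier point concrete. The data below are illustrative, and the exact numbers will vary with the seed:

```r
# Sketch: a single gross outlier distorts Pearson's r far more than
# Spearman's rho.
set.seed(42)
x <- rnorm(50)
y <- x + rnorm(50, sd = 0.5)

x_out <- c(x, 10)   # one extreme point, e.g. a measurement failure
y_out <- c(y, -10)

cor(x, y)                               # strong positive Pearson correlation
cor(x_out, y_out)                       # Pearson badly dragged down, may even flip sign
cor(x_out, y_out, method = "spearman")  # Spearman changes only slightly
```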

  • $\begingroup$ The problem with transformation is that, in general, it also transforms the errors associated to each point, and thus the weight. And it doesn't solve the outlier's problem. $\endgroup$ –  skan Commented Mar 9, 2015 at 2:09
  • 1 $\begingroup$ The previous comment is puzzling. Transformation often tames outliers. Also, what to think about errors depends on the scale you choose for analysis. If a logarithmic scale makes sense, for example, additive errors on that scale often make sense too. $\endgroup$ –  Nick Cox Commented Mar 15, 2020 at 10:19

The question asks us to choose between Pearson's and Spearman's method when normality is questioned. Restricted to this concern, I think the following paper should inform anyone's decision:

  • On the Effects of Non-Normality on the Distribution of the Sample Product-Moment Correlation Coefficient (Kowalski, 1975)

It's quite nice and provides a survey of the considerable literature, spanning decades, on this topic -- starting from Pearson's "mutilated and distorted surfaces" and robustness of distribution of $r$. At least part of the contradictory nature of the "facts" is that much of this work was done before the advent of computing power -- which complicated things because the type of non-normality had to be considered and was hard to examine without simulations.

Kowalski's analysis concludes that the distribution of $r$ is not robust in the presence of non-normality and recommends alternative procedures. The entire paper is quite informative and recommended reading, but skip to the very short conclusion at the end of the paper for a summary.

If asked to choose between Spearman's and Pearson's method when normality is violated, the distribution-free alternative, Spearman's method, is worth advocating.

Previously...

Spearman's correlation is a rank based correlation measure; it's non-parametric and does not rest upon an assumption of normality.

The sampling distribution for Pearson's correlation does assume normality; in particular this means that although you can compute it, conclusions based on significance testing may not be sound.

As Rob points out in the comments, with large samples this is not an issue. With small samples, though, where normality is violated, Spearman's correlation should be preferred.

Update: Mulling over the comments and the answers, it seems to me that this boils down to the usual non-parametric vs. parametric tests debate. Much of the literature, e.g. in biostatistics, doesn't deal with large samples. I'm generally not cavalier about relying on asymptotics. Perhaps it's justified in this case, but that's not readily apparent to me.

ars's user avatar

  • 3 $\begingroup$ No. Pearson's correlation does NOT assume normality. It is an estimate of the correlation between any two continuous random variables and is a consistent estimator under relatively general conditions. Even tests based on Pearson's correlation do not require normality if the samples are large enough because of the CLT. $\endgroup$ –  Rob Hyndman Commented Oct 19, 2010 at 1:46
  • 2 $\begingroup$ I am under the impression that Pearson is defined as long as the underlying distributions have finite variances and covariances. So, normality is not required. If the underlying distributions are not normal then the test-statistic may have a different distribution but that is a secondary issue and not relevant to the question at hand. Is that not so? $\endgroup$ –  user28 Commented Oct 19, 2010 at 1:47
  • 2 $\begingroup$ @Rob: Yes, we can always come up with workarounds to make things work out roughly the same. Simply to avoid Spearman's method -- which most non-statisticians can handle with a standard command. I guess my advice remains to use Spearman's method for small samples where normality is questionable. Not sure if that's in dispute here or not. $\endgroup$ –  ars Commented Oct 19, 2010 at 23:43
  • 3 $\begingroup$ @ars. I would use Spearman's if I was interested in monotonic rather than linear association, or if there were outliers or high levels of skewness. I would use Pearson's for linear relationships provided there are no outliers. I don't think the sample size is relevant in making the choice. $\endgroup$ –  Rob Hyndman Commented Oct 20, 2010 at 0:32
  • 5 $\begingroup$ @Rob: OK, thanks for the discussion. I agree with the first part, but doubt the last, and would include that size only plays a role because normal asymptotics don't apply. For example, Kowalski 1972 has a pretty good survey of the history around this, and concludes that the Pearson's correlation is not as robust as thought. See: jstor.org/pss/2346598 $\endgroup$ –  ars Commented Oct 20, 2010 at 1:00

I think these figures (of gross-error sensitivity and asymptotic variance) and a quotation from the paper below will make it a bit clearer:

[Figure: gross-error sensitivity and asymptotic variance of the Spearman and Kendall correlation measures, from Croux and Dehon (2010)]

"The Kendall correlation measure is more robust and slightly more efficient than Spearman’s rank correlation, making it the preferable estimator from both perspectives."

Source: Croux, C. and Dehon, C. (2010). Influence functions of the Spearman and Kendall correlation measures. Statistical Methods and Applications, 19, 497-515.


Even though this is an age-old question, I would like to contribute the (cool) observation that Pearson's $\rho$ is nothing but the slope of the trend line between $Y$ and $X$ after means have been removed and each variable is normalized by its own standard deviation, i.e. Pearson's $r$ is the least-squares solution of $\hat Y = \hat X\hat\beta$ where $\hat Y = Y / \sigma_Y$ and $\hat X = X / \sigma_X$.

This leads to a quite easy decision rule between the two: Plot $Y$ over the $X$ (simple scatter plot) and add a trend line. If the trend looks out of place then don't use Pearson's $\rho$ . Bonus: you get to visualize your data, which is never a bad thing.

If you aren't comfortable with Pearson's $\rho$ , then Spearman's rank makes this a bit better because it rescales both the x-axis and the y-axis in a non-linear way (rank encoding) and then fits the trend line in the embedded (transformed) space. In practice, this seems to work well and it does improve robustness towards outliers or skew as others have pointed out.

In theory, I do think Spearman's rank is a bit funny though, because rank encoding is a transformation that maps real numbers onto a discrete sequence of numbers. Fitting a linear regression onto discrete numbers is nonsense (they are discrete), so what is happening is that we re-embed the sequence into the real numbers using their natural embedding and fit a regression in that space instead. It seems to work well enough in practice, but I do find it funny.

Instead of using Spearman's rank, it may be better to just commit to the rank encoding and go with Kendall's $\tau$ instead; even though we lose the relationship with Pearson's $\rho$ .

Pearson's $\rho$ from Least-Squares

We can start with the desire to fit a linear regression model $Y=X\hat\beta + b$ on our observations using least-squares. Here $X$ is a vector of observations and $Y$ is another vector of matching observations. If we are happy to make the assumption that $X$ and $Y$ had their mean removed ( $\mu_X=\mu_Y=0$ , easy enough to do) then we can reduce the model to $Y=X\hat\beta$ . For this, there exists a closed-form solution $\hat\beta=(X^TX)^{-1}X^TY$ .

Under the vector notation $\text{Cov}(X, Y) = E[XY]-E[X]E[Y] = E[XY] \propto X^TY$ (we removed the means, and the sample-size factor cancels below), and similarly $\sigma_X^2 = \text{Var}(X) = \text{Cov}(X, X) \propto X^TX$. If we now rewrite $\hat\beta$ in terms of $\text{Cov}$ and $\sigma_X^2$ we get $\hat\beta = \dfrac{\text{Cov}(X,Y)}{\sigma_X^2}$.

Plugging this back into the model and normalizing both variables by their standard deviations results in $Y/\sigma_Y = \dfrac{\text{Cov}(X,Y)}{\sigma_X\sigma_Y}\,(X/\sigma_X)$, where the slope is exactly Pearson's $\rho$. $Y/\sigma_Y$ is the expected rescaling of $Y$, since we are interested in a variance-normalized coefficient.
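A numerical sanity check of this observation, in R (simulated data; scale() centers each variable and divides by its standard deviation):

```r
# Sketch: Pearson's r as the least-squares slope between standardized variables.
set.seed(3)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

zx <- as.numeric(scale(x))  # center x and divide by its standard deviation
zy <- as.numeric(scale(y))

coef(lm(zy ~ zx))["zx"]     # slope of the standardized regression...
cor(x, y)                   # ...equals Pearson's r
```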


  • 1 $\begingroup$ Fascinating, I never realized this connection. How does Cov(X,Y) = E[XY] - E[X]E[Y]? $\endgroup$ –  saeranv Commented Jul 22, 2021 at 5:32
  • 1 $\begingroup$ @saeranv It is one of the ways to define covariance (or follows quickly from your chosen definition): en.wikipedia.org/wiki/Covariance#Definition $\endgroup$ –  FirefoxMetzger Commented Jul 22, 2021 at 8:12
  • $\begingroup$ thanks, so obvious, I should have worked it out for myself! I have another thought/question: I am trying to think of an intuitive reason for why $Y$ is equal to $Cov(X,Y)$ normalized by $\sigma_X$ but not $\sigma_Y$? Would it be accurate to say that regression is equal to the $X$ feature vectors shrunk by an "angle factor" between $X$ and $Y$ (since $X \cdot Y = \cos(\theta_{XY})$), and then scaled by the standard deviation of $Y$? $\endgroup$ –  saeranv Commented Jul 23, 2021 at 19:58
  • 1 $\begingroup$ See stats.stackexchange.com/questions/70969 for a list of equivalent characterizations of correlation. $\endgroup$ –  whuber ♦ Commented May 2, 2023 at 21:45

In order to add something that hasn't been said yet in 14 years but in my opinion should be said: it is in my view generally wrong to say that "model assumptions have to be fulfilled" in order to apply a model-based statistical method. Statistical models are idealisations, and their job is not to be "true" in reality; reality doesn't normally behave like formal models. The idea that something that supposedly "assumes" normality cannot be applied if data are not normally distributed is wrong. If this were true, no model-based statistical method could ever be applied, nor could nonparametric methods, as long as they still involve formal model assumptions such as independence.

Model assumptions mean that a method can be shown to perform in a desirable or optimal manner in a certain idealised situation. Model-based statements such as about inference (probabilities for type I or type II error, confidence interval coverage probabilities, Bayesian posteriors) can be used as quantitative orientations, but do not refer to what is true in reality, rather to formal models of it.

Of course one can hope that these results give a good and useful orientation at least in situations in which data look similar to data having been generated by the assumed model, and one may be more skeptical if they don't. This makes some sense, but formalising and proving/checking such an intuition in a mathematical way isn't straightforward. It can be done in more than one way, and results are often ambiguous (for example from simulations in which methods are applied to data generated from models that deviate from the assumptions), i.e., the intuition may often but not always hold.

The thing that is most important is ultimately to understand what the considered statistics are actually doing to the data, and to know in what cases it can be misleading. Also it is important to take into account to what extent inference is of interest, and what the consequences can be. If statistical inference is not of interest, computing statistics such as Pearson correlation is possible without any assumptions, but the question is still relevant to what extent the value tells you what you want to know.

Other answers say much about the details (such as that Pearson correlation is strongly affected by outliers, Spearman less so) and I won't repeat things here, but note that Pearson correlation does not "assume" linearity either (standard inference based on it does, but bootstrap and permutation tests are possible that don't). It is true that a value of 1 or -1 is equivalent to perfect linearity, and Pearson correlation can be interpreted as the slope of a standardised data summary line. Testing whether this is zero (using, for example, a permutation test) is not the worst thing to do even when interested in testing independence against not necessarily linear monotonicity. There are better alternatives in many specific situations (with outliers Spearman will normally be better). However, a Pearson correlation larger than zero, say, means that there is a tendency of one variable to be larger/smaller than its expected value when the other one is (also measuring to what extent the amounts of deviation from the mean in the two variables are proportional in a standardised manner), and this is often a sensible thing to be interested in, whether or not a relationship is really linear.
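As a concrete version of the permutation-test idea mentioned above, here is a minimal R sketch (the data are simulated and deliberately non-normal; 10,000 permutations is an arbitrary but common choice):

```r
# Sketch: permutation test for H0: no association, using Pearson's r
# as the test statistic. No normality assumption is involved.
set.seed(7)
x <- rexp(40)            # deliberately non-normal data
y <- 0.5 * x + rexp(40)

r_obs  <- cor(x, y)
r_perm <- replicate(10000, cor(x, sample(y)))  # shuffling y breaks any association

p_value <- mean(abs(r_perm) >= abs(r_obs))     # two-sided permutation p-value
p_value
```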

Christian Hennig's user avatar

Please be aware that "if inferences are drawn about the relationship (e.g. we set a null hypothesis that r = 0; [no correlation]), then the Pearson's correlation coefficient assumes that the joint distribution of X and Y is ‘bivariate normal’. (...) Bivariate normality implies that both X and Y must come from normal distributions, but the reverse is not necessarily true" ( Source ).

So in case of significance testing you must care about normality. At least in my field no one just calculates Pearson's correlation without testing against the Null, so IMO for the most situations the answer is yes, (bivariate) normality is an assumption and if this is violated you can either use Spearman or maybe alternatives as mentioned in other answers.



Hypothesis Tests for Coefficients in R

Hypothesis testing plays a critical role in statistical modeling, helping us assess whether the independent variables (predictors) in a model significantly impact the dependent variable (outcome). In the context of regression analysis, testing the coefficients allows us to evaluate the significance of each predictor. This article will provide a comprehensive guide on how to perform hypothesis tests for coefficients in R using various methods.

Introduction to Hypothesis Testing in Regression

In regression analysis, hypothesis testing helps us understand whether the relationship between predictors and the outcome variable is statistically significant. The goal is to determine whether the estimated coefficients of the independent variables are significantly different from zero (or any other hypothesized value).

  • Null Hypothesis (H₀) : The coefficient is equal to zero (no effect).
  • Alternative Hypothesis (H₁) : The coefficient is not equal to zero (significant effect).

For each predictor in the model, you can perform a hypothesis test to decide whether to reject or fail to reject the null hypothesis.

Understanding Regression Coefficients

Regression coefficients represent the change in the dependent variable for a one-unit change in the independent variable, holding all other variables constant. If a coefficient is statistically significant, it indicates that the predictor has a meaningful impact on the outcome variable.

Common Hypothesis Tests for Coefficients

The two most common hypothesis tests used to assess regression coefficients are:

  • t-Test : Used to test the significance of individual coefficients.
  • F-Test : Used to test the overall significance of the regression model.

Let’s explore each of these tests in detail and learn how to perform them in the R programming language.

Step 1: Fit a Linear Regression Model Using lm()

We will begin by fitting a linear regression model using the lm() function. For demonstration purposes, we’ll use the built-in mtcars dataset, which contains information about various car models:
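The original code block did not survive conversion, so the following is a plausible reconstruction. It assumes the model regresses mpg on wt, hp, and disp, since those are the predictors the later discussion refers to:

```r
# Reconstruction of the missing code block: fit a linear regression on mtcars.
# The assumed predictors are weight (wt), horsepower (hp), and displacement (disp).
model <- lm(mpg ~ wt + hp + disp, data = mtcars)
summary(model)  # coefficients, standard errors, t-values, and p-values
```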

The summary() function provides a detailed summary of the model, including the estimated coefficients, standard errors, t-values, and p-values for each predictor.

Step 2: Conducting t-Tests for Individual Coefficients

The t-test evaluates whether each coefficient significantly differs from zero. The summary() function provides the t-values and p-values for each coefficient.

  • If the p-value is less than 0.05 (assuming a 95% confidence level), you reject the null hypothesis and conclude that the predictor is statistically significant.
  • If the p-value is greater than 0.05, you fail to reject the null hypothesis, indicating that the predictor is not statistically significant.

Step 3: Conducting an F-Test for Overall Model Significance

The F-test helps determine whether the overall regression model is statistically significant, testing the null hypothesis that all coefficients are equal to zero.

The summary() function in R provides the F-statistic and the corresponding p-value at the bottom of the output.

  • If the p-value is less than 0.05, you reject the null hypothesis, concluding that at least one predictor is statistically significant.
  • If the p-value is greater than 0.05, you fail to reject the null hypothesis, indicating that none of the predictors are statistically significant.

Using the anova() Function to Perform the F-Test

You can also use the anova() function to perform the F-test:
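A sketch of that call, reusing the model object fitted above. Note that anova() on a single model gives a sequential (Type I) F-test per term; comparing against an intercept-only model reproduces the overall F-test from summary():

```r
# Sketch: F-tests via anova(), assuming the 'model' object fitted above.
anova(model)                              # sequential (Type I) F-test for each term

null_model <- lm(mpg ~ 1, data = mtcars)  # intercept-only baseline
anova(null_model, model)                  # overall F-test, matching summary(model)
```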

The anova() function provides F-statistics for the model terms, which can be used to assess the overall fit.

Step 4: Using Confidence Intervals to Test Coefficients

Confidence intervals provide a range of values within which the true coefficient is likely to fall. If the confidence interval does not contain zero, the coefficient is statistically significant.
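A sketch, again reusing the fitted model object:

```r
# Sketch: 95% confidence intervals for the coefficients of the fitted model.
# An interval that excludes zero corresponds to a predictor that is
# statistically significant at the 5% level.
confint(model, level = 0.95)
```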

  • The intercept and wt have p-values less than 0.05, indicating they are statistically significant.
  • hp and disp have p-values greater than 0.05, indicating they are not statistically significant.

The overall F-test shows whether at least one coefficient is statistically significant, while the t-tests provide insight into the significance of individual predictors.

Testing the coefficients in regression models using hypothesis tests is crucial for understanding the significance of predictors and evaluating the overall fit of the model. In R, you can easily perform t-tests, F-tests, and confidence interval calculations using built-in functions like lm() , summary() , and anova() .

