External Validity | Types, Threats & Examples

Published on 3 May 2022 by Pritha Bhandari. Revised on 18 December 2023.

External validity is the extent to which you can generalise the findings of a study to other situations, people, settings, and measures. In other words, can you apply the findings of your study to a broader context?

The aim of scientific research is to produce generalisable knowledge about the real world. Without high external validity, you cannot apply results from the laboratory to other people or the real world.

In qualitative studies, external validity is referred to as transferability.

Table of contents

  • Types of external validity
  • Trade-off between external and internal validity
  • Threats to external validity and how to counter them
  • Frequently asked questions about external validity

There are two main types of external validity: population validity and ecological validity.


Population validity

Population validity refers to whether you can reasonably generalise the findings from your sample to a larger group of people (the population).

Population validity depends on the choice of population and on the extent to which the study sample mirrors that population. Non-probability sampling methods are often used for convenience. With this type of sampling, the generalisability of results is limited to populations that share similar characteristics with the sample.

You recruit over 200 participants. They are science and engineering students; most of them are British, male, 18–20 years old, and from a high socioeconomic background. In a laboratory setting, you administer a mathematics and science test and then ask them to rate how well they think they performed. You find that the average participant believes they are smarter than 66% of their peers.

Here, your sample is not representative of the whole population of students at your university. The findings can only reasonably be generalised to populations that share characteristics with the participants, e.g. university-educated men studying STEM subjects.

For higher population validity, your sample would need to include people with different characteristics (e.g., women, nonbinary people, and students from different fields, countries, and socioeconomic backgrounds).

Samples like this one, from Western, educated, industrialised, rich, and democratic (WEIRD) countries, are used in an estimated 96% of psychology studies, even though they represent only 12% of the world’s population. As outliers in terms of visual perception, moral reasoning, and categorisation (among many other topics), WEIRD samples limit broad population validity in the social sciences.

Ecological validity

Ecological validity refers to whether you can reasonably generalise the findings of a study to other situations and settings in the ‘real world’.

In a laboratory setting, you set up a simple computer-based task to measure reaction times. Participants are told to imagine themselves driving around a racetrack and to click the mouse whenever they see an orange cat appear on the screen. In one round, participants listen to a podcast while completing the task; in the other round, they complete it in silence.

In the example above, it is difficult to generalise the findings to real-life driving conditions. A computer-based task using a mouse does not resemble real-life driving conditions with a steering wheel. Additionally, a static image of an orange cat may not represent common real-life hurdles when driving.

To improve ecological validity in a lab setting, you could use an immersive driving simulator with a steering wheel and foot pedal instead of a computer and mouse. This increases psychological realism by more closely mirroring the experience of driving in the real world.

Alternatively, for higher ecological validity, you could conduct the experiment using a real driving course.


Internal validity is the extent to which you can be confident that the causal relationship established in your experiment cannot be explained by other factors.

There is an inherent trade-off between external and internal validity; the more applicable you make your study to a broader context, the less you can control extraneous factors in your study.

Threats to external validity are important to recognise and counter in a research design for a robust study.

Participants are given a pretest and a post-test measuring how often they experienced anxiety in the past week. During the study, all participants are given an individual mindfulness training and asked to practise mindfulness daily for 15 minutes in the morning.

How to counter threats to external validity

There are several ways to counter threats to external validity:

  • Replications counter almost all threats by enhancing generalisability to other settings, populations and conditions.
  • Field experiments counter testing and situation effects by using natural contexts.
  • Probability sampling counters selection bias by making sure everyone in a population has an equal chance of being selected for a study sample.
  • Recalibration or reprocessing also counters selection bias by using algorithms to correct the weighting of factors (e.g., age) within study samples (a minimal reweighting sketch follows this list).
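
To make the reweighting idea concrete, here is a minimal sketch of post-stratification reweighting in Python. The column name, group labels, outcome values, and target population shares are all hypothetical; a real adjustment would use documented population figures (e.g., census data) and more careful variance estimation.

```python
# Minimal sketch of post-stratification reweighting to counter a skewed sample.
# The age groups, outcome values, and population shares below are hypothetical.
import pandas as pd

sample = pd.DataFrame({
    "age_group": ["18-29"] * 60 + ["30-49"] * 30 + ["50+"] * 10,
    "outcome":   [1] * 40 + [0] * 20 + [1] * 12 + [0] * 18 + [1] * 3 + [0] * 7,
})

# Assumed shares of each age group in the target population (e.g., from a census).
population_share = {"18-29": 0.25, "30-49": 0.40, "50+": 0.35}

sample_share = sample["age_group"].value_counts(normalize=True)
# Each respondent's weight reflects how under- or over-represented their group is.
sample["weight"] = sample["age_group"].map(lambda g: population_share[g] / sample_share[g])

unweighted = sample["outcome"].mean()
weighted = (sample["outcome"] * sample["weight"]).sum() / sample["weight"].sum()
print(f"Unweighted estimate: {unweighted:.3f}, reweighted estimate: {weighted:.3f}")
```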

The external validity of a study is the extent to which you can generalise your findings to different groups of people, situations, and measures.

The two types of external validity are population validity (whether you can generalise to other groups of people) and ecological validity (whether you can generalise to other situations and settings).

There are seven threats to external validity: selection bias, history, the experimenter effect, the Hawthorne effect, the testing effect, aptitude-treatment interaction, and the situation effect.

Attrition bias can skew your sample so that your final sample differs significantly from your original sample. Your sample is biased because some groups from your population are underrepresented.

With a biased final sample, you may not be able to generalise your findings to the original population that you sampled from, so your external validity is compromised.



External Validity – Threats, Examples and Types


Definition:

External validity refers to the extent to which the results of a study can be generalized or applied to a larger population, settings, or conditions beyond the specific context of the study. It is a measure of how well the findings of a study can be considered representative of the real world.

How To Increase External Validity

To increase external validity in research, researchers can employ several strategies to enhance the generalizability of their findings. Here are some common approaches:

Representative Sampling

Ensure that the sample used in the study is representative of the target population of interest. Random sampling techniques, such as simple random sampling or stratified sampling, can help reduce sampling bias and increase the likelihood of obtaining a representative sample.
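
As an illustration of this point, the sketch below draws a stratified random sample with pandas. The sampling frame, the `faculty` stratum variable, and the 10% sampling fraction are hypothetical and only meant to show the mechanics.

```python
# Minimal sketch of stratified random sampling from a hypothetical sampling frame.
import pandas as pd

frame = pd.DataFrame({
    "student_id": range(1000),
    "faculty": ["science"] * 400 + ["arts"] * 350 + ["business"] * 250,
})

# Draw 10% from each stratum so the sample mirrors the frame's composition.
stratified_sample = frame.groupby("faculty").sample(frac=0.10, random_state=42)
print(stratified_sample["faculty"].value_counts())
```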

Diverse Participant Characteristics

Include participants with diverse demographic characteristics, such as age, gender, socioeconomic status, and cultural backgrounds. This helps to ensure that the findings are applicable to a wider range of individuals.

Multiple Settings

Conduct the study in multiple settings or contexts to assess the robustness of the findings across different environments. This could involve replicating the study in different geographical locations, institutions, or organizations.

Large Sample Size

Increasing the sample size can improve the statistical power of the study and enhance the reliability of the findings. Larger samples are generally more representative of the population, making it easier to generalize the results.
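
For a rough sense of how the required sample size scales, the sketch below uses the standard normal-approximation formula for comparing two group means. The effect sizes, alpha, and power values are illustrative examples, not recommendations.

```python
# Minimal sketch: approximate sample size per group for a two-group mean comparison,
# using the normal approximation n = 2 * ((z_{alpha/2} + z_beta) / d)^2.
import math
from scipy.stats import norm

def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants per group to detect a standardized mean difference."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(n_per_group(0.5))  # medium effect -> roughly 63 per group
print(n_per_group(0.2))  # small effect  -> roughly 393 per group
```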

Longitudinal Studies

Consider conducting longitudinal studies that span a longer duration. By observing changes and trends over time, researchers can provide a more comprehensive understanding of the phenomenon under investigation and increase the applicability of their findings.

Real-world Conditions

Strive to create conditions in the study that closely resemble real-world situations. This can be achieved by conducting field experiments, using naturalistic observation, or implementing interventions in real-life settings.

External Validation of Measures

Use established and validated measurement instruments to assess variables of interest. By employing recognized measures, researchers increase the likelihood that their findings can be compared and replicated in other studies.

Meta-Analysis

Conducting a meta-analysis, which involves systematically analyzing and combining the results of multiple studies on the same topic, can provide a more comprehensive view and increase the external validity by pooling findings from various sources.
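
As a concrete illustration, the sketch below pools study results with fixed-effect, inverse-variance weighting. The effect estimates and standard errors are invented for the example; a real meta-analysis would also examine heterogeneity (for instance, with a random-effects model).

```python
# Minimal sketch of a fixed-effect meta-analysis via inverse-variance weighting.
# The effect estimates and standard errors below are hypothetical.
studies = [
    {"name": "Study A", "effect": 0.30, "se": 0.10},
    {"name": "Study B", "effect": 0.45, "se": 0.15},
    {"name": "Study C", "effect": 0.20, "se": 0.08},
]

weights = [1 / s["se"] ** 2 for s in studies]  # inverse-variance weights
pooled = sum(w * s["effect"] for w, s in zip(weights, studies)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5

print(f"Pooled effect: {pooled:.3f} (SE {pooled_se:.3f})")
print(f"95% CI: [{pooled - 1.96 * pooled_se:.3f}, {pooled + 1.96 * pooled_se:.3f}]")
```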

Replication

Encourage replication of the study by other researchers. When multiple studies yield similar results, it strengthens the external validity of the findings.

Transparent Reporting

Clearly document the study design, methodology, and limitations in research publications. Transparent reporting allows readers to evaluate the study’s external validity and consider the potential generalizability of the findings.

Threats to External Validity

There are several threats to external validity that researchers should be aware of when interpreting the generalizability of their findings. These threats include:

Selection Bias

Participants in a study may not be representative of the target population due to the way they were selected or recruited. This can limit the generalizability of the findings to the broader population.

Sampling Bias

Even with random sampling techniques, there is a possibility of sampling bias. This occurs when certain segments of the population are underrepresented or overrepresented in the sample, leading to a skewed representation of the population.

Reactive or Interaction Effects of Testing

The act of participating in a study or being exposed to a specific experimental condition can influence participants’ behaviors or responses. This can lead to artificial results that may not occur in natural settings.

Experimental Setting

The controlled environment of a laboratory or research setting may differ significantly from real-world situations, potentially influencing participant behavior and limiting the generalizability of the findings.

Demand Characteristics

Participants may alter their behavior based on their perception of the study’s purpose or the researcher’s expectations. This can introduce biases and limit the external validity of the findings.

Novelty Effects

Participants may respond differently to novel or unusual conditions in a study, which may not accurately reflect their behavior in everyday life.

Hawthorne Effect

Participants may change their behavior simply because they are aware they are being observed. This effect can distort the findings and limit generalizability.

Experimenter Bias

The actions or behaviors of the researchers conducting the study can inadvertently influence participant responses or outcomes, impacting the generalizability of the findings.

Time-related Threats

The passage of time can affect the external validity of findings. Social, cultural, or technological changes that occur between the study and the application of the findings may limit their relevance.

Specificity of the Intervention or Treatment

If the study involves a specific intervention or treatment, the findings may be limited to that particular intervention and may not generalize to other similar interventions or treatments.

Publication Bias

The tendency of researchers or journals to publish studies with significant or positive findings can introduce a bias in the literature and limit the generalizability of research findings.

Types of External Validity

Types of External Validity are as follows:

Population Validity

Population validity refers to the extent to which the findings of a study can be generalized to the larger target population from which the study sample was drawn. If the sample is representative of the population in terms of relevant characteristics, such as age, gender, socioeconomic status, and ethnicity, the study’s findings are more likely to have high population validity.

Ecological Validity

Ecological validity refers to the extent to which the findings of a study can be generalized to real-world settings or conditions. It assesses whether the experimental conditions and procedures accurately represent the complexity and dynamics of the natural environment. High ecological validity suggests that the findings are applicable to everyday situations.

Temporal Validity

Temporal validity, also known as historical validity or generalizability over time, refers to the extent to which the findings of a study can be generalized across different time periods. It assesses whether the relationships or effects observed during the study remain consistent or change over time.

Cross-Cultural Validity

Cross-cultural validity refers to the extent to which the findings of a study can be generalized to different cultural contexts or populations. It examines whether the relationships or effects observed in one culture hold true in other cultures. Conducting research in multiple cultural settings can help establish cross-cultural validity.

Setting Validity

Setting validity refers to the extent to which the findings of a study can be generalized to different settings or environments. It assesses whether the relationships or effects observed in one specific setting can be replicated in other similar settings.

Task Validity

Task validity refers to the extent to which the findings of a study can be generalized to different tasks or activities. It examines whether the relationships or effects observed during a specific task are applicable to other tasks that share similar characteristics.

Measurement Validity

Measurement validity refers to the extent to which the chosen measurements or instruments accurately capture the constructs or variables of interest. It examines whether the relationships or effects observed are robust across different measurement tools or techniques.

Examples of External Validity

Here are some real-world examples of external validity:

Medical Research: A pharmaceutical company conducts a clinical trial to test the efficacy of a new drug on a specific population group (e.g., adults with diabetes). To ensure external validity, the company includes participants from diverse backgrounds, ages, and geographical locations to ensure that the results can be generalized to a broader population.

Educational Research: A study examines the effectiveness of a teaching method in improving student performance in mathematics. Researchers choose a sample of schools from different regions, representing various socioeconomic backgrounds, to ensure the findings can be applied to a wider range of schools and students.

Opinion Polls: A polling agency conducts a survey to understand public opinion on a particular political issue. To ensure external validity, the agency ensures a representative sample of respondents, considering factors such as age, gender, ethnicity, education level, and geographic location. This approach allows the findings to be generalized to the broader population.

Social Science Research: A study investigates the impact of a social intervention program on reducing crime rates in a specific neighborhood. To enhance external validity, researchers select neighborhoods that represent diverse socio-economic conditions and urban and rural settings. This approach increases the likelihood that the findings can be applied to similar neighborhoods in other locations.

Psychological Research: A psychology study examines the effects of a therapy technique on reducing anxiety levels in individuals. To improve external validity, the researchers recruit a diverse sample of participants, including individuals of different ages, genders, and cultural backgrounds. This ensures that the findings can be applicable to a broader range of individuals experiencing anxiety.

Applications of External Validity

External validity has several practical applications across various fields. Here are some specific applications of external validity:

Policy Development:

External validity helps policymakers make informed decisions by considering research findings from different contexts and populations. By examining the external validity of studies, policymakers can determine the applicability and generalizability of research results to their target population and policy goals.

Program Evaluation:

External validity is crucial in evaluating the effectiveness of programs or interventions. By assessing the external validity of evaluation studies, policymakers and program administrators can determine if the findings are applicable to their target population and whether similar interventions can be implemented in different settings.

Market Research:

External validity is essential in market research to understand consumer behavior and preferences. By conducting studies with representative samples, companies can extrapolate the findings to the broader consumer population, allowing them to make informed marketing and product development decisions.

Health Interventions:

External validity plays a significant role in healthcare research. It helps researchers and healthcare practitioners understand the generalizability of treatment outcomes to diverse patient populations. By considering external validity, healthcare providers can determine if a specific treatment or intervention will be effective for their patients.

Education and Training:

External validity is important in educational research to ensure that instructional methods, educational interventions, and training programs are effective across diverse student populations and different educational settings. It helps educators and trainers make evidence-based decisions about instructional strategies that are likely to have positive outcomes in different contexts.

Public Opinion Research:

External validity is crucial in public opinion research, such as political polling or survey research. By ensuring a representative sample and considering external validity, researchers can generalize their findings to the larger population, providing insights into public sentiment and informing decision-making processes.

Advantages of External Validity

Here are some advantages of external validity:

  • Generalizability: External validity allows researchers to generalize their findings to broader populations, settings, or conditions. It enables them to make inferences about how the results of a study might hold true in real-world situations beyond the controlled environment of the study.
  • Real-world applicability: When a study has high external validity, the findings are more likely to be applicable and relevant to real-world scenarios. This is particularly important in fields such as medicine, psychology, and social sciences, where the goal is often to understand and improve human behavior and well-being.
  • Increased confidence in findings: Studies with high external validity provide stronger evidence and increase confidence in the findings. When the results can be generalized to diverse populations or different contexts, it suggests that the observed effects are more robust and reliable.
  • Enhanced ecological validity: External validity enhances ecological validity, which refers to the degree to which a study reflects real-life situations. When a study has good external validity, it increases the likelihood that the findings accurately represent the complexities and nuances of the real world.
  • Policy implications: Research findings with high external validity are more likely to have practical implications for policy-making. Policymakers are interested in studies that can inform decisions and interventions on a larger scale. Studies with strong external validity provide a basis for making informed decisions and implementing effective policies.
  • Replication and meta-analysis: External validity facilitates replication studies and meta-analyses, which involve combining the results of multiple studies. When studies have high external validity, it becomes easier to replicate the findings in different contexts or conduct meta-analyses to examine the overall effects across a range of studies.
  • Improved understanding of causal relationships: External validity allows researchers to test the generalizability of causal relationships. By replicating studies in different settings or populations, researchers can examine whether the causal relationships observed in one context hold true in other contexts, providing a more comprehensive understanding of the phenomenon under investigation.

Limitations of External Validity

While external validity offers several advantages, it also has limitations that researchers need to consider. Here are some limitations of external validity:

  • Specificity of conditions: The specific conditions and settings of a study may limit the generalizability of the findings. Factors such as the time period, location, and sample characteristics can influence the results. For example, cultural, socioeconomic, or geographical differences between the study sample and the target population may affect the generalizability of the findings.
  • Selection bias: In many studies, participants are recruited through convenience sampling or other non-random methods, which can introduce selection bias. This means that the sample may not be representative of the larger population, reducing the external validity of the findings. Selection bias can limit the generalizability of the results to other populations or contexts.
  • Artificiality of experimental settings: Studies conducted in controlled laboratory or experimental settings may lack ecological validity. The artificial conditions and controlled variables may not accurately reflect real-world complexities. Participants’ behavior in a laboratory setting may differ from their behavior in naturalistic settings, leading to limited generalizability.
  • Novelty and awareness effects: Participants in research studies may behave differently simply because they are aware they are being studied. This awareness can lead to the novelty effect or demand characteristics, where participants alter their behavior in response to the study context or the researchers’ expectations. As a result, the observed effects may not accurately represent real-world behavior.
  • Time-dependent effects: The relevance and applicability of research findings can change over time due to societal, technological, or cultural shifts. What may be true and valid today may not hold true in the future. Therefore, the external validity of a study’s findings may diminish as time progresses.
  • Lack of contextual variation: Studies often focus on a narrow range of contexts or populations, limiting the understanding of how findings may vary across different contexts. The external validity of a study may be compromised if it fails to account for contextual variations that can influence the generalizability of the results.
  • Replication challenges: While replication is important for assessing the external validity of a study, it can be challenging to replicate studies in different contexts or with diverse populations. Replication studies may encounter practical constraints, such as resource limitations, time constraints, or ethical considerations, which can limit the ability to establish external validity.




The 4 Types of Validity in Research | Definitions & Examples

Published on September 6, 2019 by Fiona Middleton. Revised on June 22, 2023.

Validity tells you how accurately a method measures something. If a method measures what it claims to measure, and the results closely correspond to real-world values, then it can be considered valid. There are four main types of validity:

  • Construct validity: Does the test measure the concept that it’s intended to measure?
  • Content validity: Is the test fully representative of what it aims to measure?
  • Face validity: Does the content of the test appear to be suitable to its aims?
  • Criterion validity: Do the results accurately measure the concrete outcome they are designed to measure?

In quantitative research , you have to consider the reliability and validity of your methods and measurements.

Note that this article deals with types of test validity, which determine the accuracy of the actual components of a measure. If you are doing experimental research, you also need to consider internal and external validity , which deal with the experimental design and the generalizability of results.

Table of contents

  • Construct validity
  • Content validity
  • Face validity
  • Criterion validity
  • Other interesting articles
  • Frequently asked questions about types of validity

Construct validity evaluates whether a measurement tool really represents the thing we are interested in measuring. It’s central to establishing the overall validity of a method.

What is a construct?

A construct refers to a concept or characteristic that can’t be directly observed, but can be measured by observing other indicators that are associated with it.

Constructs can be characteristics of individuals, such as intelligence, obesity, job satisfaction, or depression; they can also be broader concepts applied to organizations or social groups, such as gender equality, corporate social responsibility, or freedom of speech.

There is no objective, observable entity called “depression” that we can measure directly. But based on existing psychological research and theory, we can measure depression based on a collection of symptoms and indicators, such as low self-confidence and low energy levels.

What is construct validity?

Construct validity is about ensuring that the method of measurement matches the construct you want to measure. If you develop a questionnaire to diagnose depression, you need to know: does the questionnaire really measure the construct of depression? Or is it actually measuring the respondent’s mood, self-esteem, or some other construct?

To achieve construct validity, you have to ensure that your indicators and measurements are carefully developed based on relevant existing knowledge. The questionnaire must include only relevant questions that measure known indicators of depression.

The other types of validity described below can all be considered as forms of evidence for construct validity.


Content validity assesses whether a test is representative of all aspects of the construct.

To produce valid results, the content of a test, survey or measurement method must cover all relevant parts of the subject it aims to measure. If some aspects are missing from the measurement (or if irrelevant aspects are included), the validity is threatened and the research is likely suffering from omitted variable bias .

A mathematics teacher develops an end-of-semester algebra test for her class. The test should cover every form of algebra that was taught in the class. If some types of algebra are left out, then the results may not be an accurate indication of students’ understanding of the subject. Similarly, if she includes questions that are not related to algebra, the results are no longer a valid measure of algebra knowledge.

Face validity considers how suitable the content of a test seems to be on the surface. It’s similar to content validity, but face validity is a more informal and subjective assessment.

You create a survey to measure the regularity of people’s dietary habits. You review the survey items, which ask questions about every meal of the day and snacks eaten in between for every day of the week. On its surface, the survey seems like a good representation of what you want to test, so you consider it to have high face validity.

As face validity is a subjective measure, it’s often considered the weakest form of validity. However, it can be useful in the initial stages of developing a method.

Criterion validity evaluates how well a test can predict a concrete outcome, or how well the results of your test approximate the results of another test.

What is a criterion variable?

A criterion variable is an established and effective measurement that is widely considered valid, sometimes referred to as a “gold standard” measurement. Criterion variables can be very difficult to find.

What is criterion validity?

To evaluate criterion validity, you calculate the correlation between the results of your measurement and the results of the criterion measurement. If there is a high correlation, this gives a good indication that your test is measuring what it intends to measure.

A university professor creates a new test to measure applicants’ English writing ability. To assess how well the test really does measure students’ writing ability, she finds an existing test that is considered a valid measurement of English writing ability, and compares the results when the same group of students take both tests. If the outcomes are very similar, the new test has high criterion validity.
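
In practice, the comparison described above usually comes down to computing a correlation between the two sets of scores. The sketch below shows this with hypothetical scores for the new test and the established criterion measure.

```python
# Minimal sketch: criterion validity as the correlation between a new test and an
# established "gold standard" measure. The scores below are hypothetical.
from scipy.stats import pearsonr

new_test  = [62, 75, 58, 88, 70, 81, 66, 93, 55, 77]
criterion = [60, 78, 55, 90, 68, 85, 64, 95, 58, 74]

r, p_value = pearsonr(new_test, criterion)
print(f"Criterion validity (Pearson r): {r:.2f}, p = {p_value:.4f}")
```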

If you want to know more about statistics, methodology, or research bias, make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Degrees of freedom
  • Null hypothesis
  • Discourse analysis
  • Control groups
  • Mixed methods research
  • Non-probability sampling
  • Quantitative research
  • Ecological validity

Research bias

  • Rosenthal effect
  • Implicit bias
  • Cognitive bias
  • Selection bias
  • Negativity bias
  • Status quo bias

Face validity and content validity are similar in that they both evaluate how suitable the content of a test is. The difference is that face validity is subjective, and assesses content at surface level.

When a test has strong face validity, anyone would agree that the test’s questions appear to measure what they are intended to measure.

For example, looking at a 4th grade math test consisting of problems in which students have to add and multiply, most people would agree that it has strong face validity (i.e., it looks like a math test).

On the other hand, content validity evaluates how well a test represents all the aspects of a topic. Assessing content validity is more systematic and relies on expert evaluation of each question, analyzing whether each one covers the aspects that the test was designed to cover.

A 4th grade math test would have high content validity if it covered all the skills taught in that grade. Experts (in this case, math teachers) would have to evaluate the content validity by comparing the test to the learning objectives.

Criterion validity evaluates how well a test measures the outcome it was designed to measure. An outcome can be, for example, the onset of a disease.

Criterion validity consists of two subtypes depending on the time at which the two measures (the criterion and your test) are obtained:

  • Concurrent validity is a validation strategy where the scores of a test and the criterion are obtained at the same time.
  • Predictive validity is a validation strategy where the criterion variables are measured after the scores of the test.

Convergent validity and discriminant validity are both subtypes of construct validity . Together, they help you evaluate whether a test measures the concept it was designed to measure.

  • Convergent validity indicates whether a test that is designed to measure a particular construct correlates with other tests that assess the same or similar construct.
  • Discriminant validity indicates whether two tests that should not be highly related to each other are indeed not related. This type of validity is also called divergent validity .

You need to assess both in order to demonstrate construct validity. Neither one alone is sufficient for establishing construct validity.
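
A common way to examine both at once is to correlate the new measure with a related measure (convergent evidence) and with an unrelated one (discriminant evidence). The sketch below simulates scores to show the expected pattern; the construct names and data are invented purely for illustration.

```python
# Minimal sketch: convergent vs. discriminant validity checks on simulated scores.
# The constructs and data are hypothetical, generated only to show the expected pattern.
import numpy as np

rng = np.random.default_rng(0)
n = 200
depression_new   = rng.normal(size=n)                               # new scale
depression_known = depression_new + rng.normal(scale=0.5, size=n)   # same construct
extraversion     = rng.normal(size=n)                                # unrelated construct

convergent_r   = np.corrcoef(depression_new, depression_known)[0, 1]
discriminant_r = np.corrcoef(depression_new, extraversion)[0, 1]

# Evidence for construct validity: high convergent, near-zero discriminant correlation.
print(f"Convergent r:   {convergent_r:.2f}")
print(f"Discriminant r: {discriminant_r:.2f}")
```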

The purpose of theory-testing mode is to find evidence in order to disprove, refine, or support a theory. As such, generalizability is not the aim of theory-testing mode.

Due to this, the priority of researchers in theory-testing mode is to eliminate alternative causes for relationships between variables. In other words, they prioritize internal validity over external validity, including ecological validity.

It’s often best to ask a variety of people to review your measurements. You can ask experts, such as other researchers, or laypeople, such as potential participants, to judge the face validity of tests.

While experts have a deep understanding of research methods , the people you’re studying can provide you with valuable insights you may have missed otherwise.


Research Methods Knowledge Base


External Validity


External validity is related to generalizing. That’s the major thing you need to keep in mind. Recall that validity refers to the approximate truth of propositions, inferences, or conclusions. So, external validity refers to the approximate truth of conclusions that involve generalizations. Put in more pedestrian terms, external validity is the degree to which the conclusions in your study would hold for other persons in other places and at other times.

In science there are two major approaches to how we provide evidence for a generalization. I’ll call the first approach the Sampling Model . In the sampling model, you start by identifying the population you would like to generalize to. Then, you draw a fair sample from that population and conduct your research with the sample. Finally, because the sample is representative of the population, you can automatically generalize your results back to the population. There are several problems with this approach. First, perhaps you don’t know at the time of your study who you might ultimately like to generalize to. Second, you may not be easily able to draw a fair or representative sample. Third, it’s impossible to sample across all times that you might like to generalize to (like next year).

I’ll call the second approach to generalizing the Proximal Similarity Model. ‘Proximal’ means ’nearby’ and ‘similarity’ means… well, it means ‘similarity’. The term proximal similarity was suggested by Donald T. Campbell as an appropriate relabeling of the term external validity (although he was the first to admit that it probably wouldn’t catch on!). Under this model, we begin by thinking about different generalizability contexts and developing a theory about which contexts are more like our study and which are less so. For instance, we might imagine several settings that have people who are more similar to the people in our study or people who are less similar. This also holds for times and places. When we place different contexts in terms of their relative similarities, we can call this implicit theoretical dimension a gradient of similarity. Once we have developed this proximal similarity framework, we are able to generalize. How? We conclude that we can generalize the results of our study to other persons, places or times that are more like (that is, more proximally similar to) our study. Notice that here, we can never generalize with certainty – it is always a question of more or less similar.

Threats to External Validity

A threat to external validity is an explanation of how you might be wrong in making a generalization. For instance, you conclude that the results of your study (which was done in a specific place, with certain types of people, and at a specific time) can be generalized to another context (for instance, another place, with slightly different people, at a slightly later time). There are three major threats to external validity because there are three ways you could be wrong – people, places or times. Your critics could come along, for example, and argue that the results of your study are due to the unusual type of people who were in the study. Or, they could argue that it might only work because of the unusual place you did the study in (perhaps you did your educational study in a college town with lots of high-achieving educationally-oriented kids). Or, they might suggest that you did your study in a peculiar time. For instance, if you did your smoking cessation study the week after the Surgeon General issued the well-publicized results of the latest smoking and cancer studies, you might get different results than if you had done it the week before.

Improving External Validity

How can we improve external validity? One way, based on the sampling model, suggests that you do a good job of drawing a sample from a population. For instance, you should use random selection, if possible, rather than a nonrandom procedure. And, once selected, you should try to assure that the respondents participate in your study and that you keep your dropout rates low. A second approach would be to use the theory of proximal similarity more effectively. How? Perhaps you could do a better job of describing the ways your contexts and others differ, providing lots of data about the degree of similarity between various groups of people, places, and even times. You might even be able to map out the degree of proximal similarity among various contexts with a methodology like concept mapping . Perhaps the best approach to criticisms of generalizations is simply to show them that they’re wrong – do your study in a variety of places, with different people and at different times. That is, your external validity (ability to generalize) will be stronger the more you replicate your study.


External Validity: Types, Research Methods & Examples

External validity is how well the results of a study can be applied to people outside of the study. Learn everything about it in this article.

External validity is one of the main goals of researchers who want to find reliable cause-and-effect relationships in qualitative research.

When research has this validity, the results can be applied to other people, situations, and places. Without it, the analysis cannot be generalized, and researchers cannot apply the results of their studies to the real world; this is one reason why psychology research sometimes needs to be conducted outside a lab setting.

Still, researchers sometimes prefer to focus on how variables affect one another rather than on being able to generalize the results.

In this article, we’ll talk about what external validity means, its types, and its research design methods.


What is external validity?

External validity describes how effectively the findings of an experiment may be generalized to different people, places, or times. Most scientific investigations do not intend to obtain outcomes that only apply to the few persons who participated in the study.

Instead, researchers want to be able to take the results of an experiment and use them with a larger group of people. It is a big part of what inferential statistics try to do.

For example, if you’re looking at a new drug or educational program, you don’t want to know that it works for only a few people. You want to use those results outside the experiment and beyond those participating. It is called “generalizability,” the essential part of this validity.

Types of external validity

Generally, there are three main types of this validity. We’ll discuss each one below and give examples to help you understand.

Population validity

Population validity is a kind of external validity that looks at how well a study’s results apply to a larger group of people. In this case, “population” refers to the group of people about whom a researcher wants to draw conclusions, whereas a sample is the particular group of people who participate in the research.

If the results from the sample can apply to a larger group of people, then the study is valid for a large population.

Example: low population validity

You want to test the theory about how exercise and sleep are linked. You think that adults will sleep better when they do physical activities regularly. Your target group is adults in the United States, but your sample comprises about 300 college students. 

Even though they are all adults, it might be hard to ensure population validity in this case because a sample of college students represents only a small subset of adults in the US.

So, your study has a limited amount of population validity, and you can only apply the results to some of the population.

Ecological validity

Ecological validity is another type of external validity that shows how well the research results can be used in different situations. In simple terms, ecological validity is about whether or not your results can be used in the real world.

So, if a study has a lot of ecological validity, the results can be used in the real world. On the other hand, low validity means that the results can’t be used outside the experiment.

Example: low ecological validity

The Milgram Experiment is a classic example of low ecological validity.

Stanley Milgram studied obedience to authority in the 1960s. He recruited participants and instructed them to administer what they believed were increasingly strong electric shocks to actors who gave wrong answers. The study showed a high level of obedience to authority, even though the shocks and the victims’ reactions were staged.

The results of this study are revolutionary for the field of social psychology. However, it is often criticized because it has little ecological validity. Milgram’s set-up was not like real-life situations.

In the experiment, he set up a situation in which participants felt they could not avoid obeying the instructions, but real-life situations can be very different.

Temporal validity

When assessing external validity, time is just as important as the people involved and any confounding factors.

The concept of temporal validity refers to how findings evolve. Particularly, this form of validity refers to how well the research results can be extended to another period.

High temporal validity means that research results hold at other points in time and that the factors studied will remain relevant in the future.

Imagine that you’re a psychologist studying conformity, i.e., how people change their behavior to match a group.

You found out that social pressure from the majority group has a big effect on the choices of the minority. Because of this, people act similarly. Even though famous psychologist Solomon Asch did this research in the 1950s, the results can still be used in the real world today. 

This study, therefore, retains temporal validity some seventy years later.

Research methods of external validity

There are several methods you can use to improve the external validity of your research. Some of them are given below:

Field experiments

Field experiments involve conducting research in real-world settings rather than in a controlled environment such as a laboratory.

Criteria for inclusion and exclusion

Establishing criteria for who can and cannot participate in the research, and ensuring that the group being examined is properly identified.

Realism in psychology

If you want participants to believe that the events taking place during the study are real, you should provide them with a cover story about the purpose of the research, so that they do not behave any differently than they would in real life.

Replication

Doing the study again with different samples or in different places to see whether you get the same results. When many studies have been done on the same topic, a meta-analysis can be used to check whether the effect of an independent variable replicates, making the conclusion more reliable.

Reprocessing

This means using statistical methods to correct problems with external validity, such as reweighting groups that differ on a characteristic like age.


As stated above, the ability to replicate the results of an experiment is a key component of its external validity, and appropriate sampling methods can further improve it.

Researchers determine external validity by comparing their results with other relevant data and by repeating the research with more people from the target population. External validity is hard to establish, but it is essential if results are to be applied beyond the original study.

  • Open access
  • Published: 06 April 2022

Identification of tools used to assess the external validity of randomized controlled trials in reviews: a systematic review of measurement properties

  • Andres Jung   ORCID: orcid.org/0000-0003-0201-0694 1 ,
  • Julia Balzer   ORCID: orcid.org/0000-0001-7139-229X 2 ,
  • Tobias Braun   ORCID: orcid.org/0000-0002-8851-2574 3 , 4 &
  • Kerstin Luedtke   ORCID: orcid.org/0000-0002-7308-5469 1  

BMC Medical Research Methodology, volume 22, Article number: 100 (2022)


Internal and external validity are the most relevant components when critically appraising randomized controlled trials (RCTs) for systematic reviews. However, there is no gold standard to assess external validity. This might be related to the heterogeneity of the terminology as well as to unclear evidence of the measurement properties of available tools. The aim of this review was to identify tools to assess the external validity of RCTs, to evaluate the quality of the identified tools, and to recommend tools for assessing the external validity of RCTs in future systematic reviews.

A two-phase systematic literature search was performed in four databases: PubMed, Scopus, PsycINFO via OVID, and CINAHL via EBSCO. First, tools to assess the external validity of RCTs were identified. Second, studies investigating the measurement properties of these tools were selected. The measurement properties of each included tool were appraised using an adapted version of the COnsensus based Standards for the selection of health Measurement INstruments (COSMIN) guidelines.

Thirty-eight publications reporting on the development or validation of 28 tools were included. For 61% (17/28) of the included tools, there was no evidence for measurement properties. For the remaining tools, reliability was the most frequently assessed property. Reliability was judged as “sufficient” for three tools (very low certainty of evidence). Content validity was rated as “sufficient” for one tool (moderate certainty of evidence).

Conclusions

Based on these results, no available tool can be fully recommended to assess the external validity of RCTs in systematic reviews. Several steps are required to overcome the identified difficulties to either adapt and validate available tools or to develop a better suitable tool.

Trial registration

Prospective registration at Open Science Framework (OSF): https://doi.org/10.17605/OSF.IO/PTG4D .


Systematic reviews are powerful research formats to summarize and synthesize the evidence from primary research in health sciences [ 1 , 2 ]. In clinical practice, their results are often applied for the development of clinical guidelines and treatment recommendations [ 3 ]. Consequently, the methodological quality of systematic reviews is of great importance. In turn, the informative value of systematic reviews depends on the overall quality of the included controlled trials [ 3 , 4 ]. Accordingly, the evaluation of the internal and external validity is considered a key step in systematic review methodology [ 4 , 5 ].

Internal validity relates to the systematic error or bias in clinical trials [ 6 ] and expresses how methodologically robust the study was conducted. External validity is the inference about the extent to which “a causal relationship holds over variations in persons, settings, treatments and outcomes” [ 7 , 8 ]. There are plenty of definitions for external validity and a variety of different terms. Hence, external validity, generalizability, applicability, and transferability, among others, are used interchangeably in the literature [ 9 ]. Schünemann et al. [ 10 ] suggest that: (1) generalizability “may refer to whether or not the evidence can be generalized from the population from which the actual research evidence is obtained to the population for which a healthcare answer is required”; (2) applicability may be interpreted as “whether or not the research evidence answers the healthcare question asked by a clinician or public health practitioner” and (3) transferability is often interpreted as to “whether research evidence can be transferred from one setting to another”. Four essential dimensions are proposed to evaluate the external validity of controlled clinical trials in systematic reviews: patients, treatment (including comparator) variables, settings, and outcome modalities [ 4 , 11 ]. Its evaluation depends on the specificity of the reviewers´ research question, the review´s inclusion and exclusion criteria compared to the trial´s population, the setting of the study, as well as the quality of reporting these four dimensions.

In health research, however, external validity is often neglected when critically appraising clinical studies [ 12 , 13 ]. One possible explanation might be the lack of a gold standard for assessing the external validity of clinical trials. Systematic and scoping reviews examined published frameworks and tools for assessing the external validity of clinical trials in health research [ 9 , 12 , 14 – 18 ]. A substantial heterogeneity of terminology and criteria as well as a lack of guidance on how to assess the external validity of intervention studies was found [ 9 , 12 , 15 – 18 ]. The results and conclusions of previous reviews were based on descriptive as well as content analysis of frameworks and tools on external validity [ 9 , 14 – 18 ]. Although the feasibility of some frameworks and tools was assessed [ 12 ], none of the previous reviews evaluated the quality regarding the development and validation processes of the used frameworks and tools.

RCTs are considered the most suitable research design for investigating cause and effect mechanisms of interventions [ 19 ]. However, the study design of RCTs is susceptible to a lack of external validity due to the randomization, the use of exclusion criteria and poor willingness of eligible participants to participate [ 20 , 21 ]. There is evidence that the reliability of external validity evaluations with the same measurement tool differed between randomized and non-randomized trials [ 22 ]. In addition, due to differences in requested information from reporting guidelines (e.g. consolidated standards of reporting trials (CONSORT) statement, strengthening the reporting of observational studies in Epidemiology (STROBE) statement), respective items used for assessing the external validity vary between research designs. Acknowledging the importance of RCTs in the medical field, this review focused only on tools developed to assess the external validity of RCTs. The aim was to identify tools to assess the external validity of RCTs in systematic reviews and to evaluate the quality of evidence regarding their measurement properties. Objectives: (1) to identify published measurement tools to assess the external validity of RCTs in systematic reviews; (2) to evaluate the quality of identified tools; (3) to recommend the use of tools to assess the external validity of RCTs in future systematic reviews.

This systematic review was reported in accordance with the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) 2020 Statement [ 23 ] and used an adapted version of the PRISMA flow diagram to illustrate the systematic search strategy used to identify clinimetric papers [ 24 ]. The study was conducted according to an adapted version of the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) methodology for systematic reviews of measurement instruments in health sciences [ 25 – 27 ] and followed the recommendations of the JBI manual for systematic reviews of measurement properties [ 28 ]. The COSMIN methodology was chosen because it is comprehensive and because validation processes do not differ substantially between patient-reported outcome measures (PROMs) and measurement instruments of other latent constructs. According to the COSMIN authors, it is acceptable to use this methodology for non-PROMs [ 26 ]. Furthermore, because of its flexibility, it has already been used in systematic reviews assessing measurement tools that are not health measurement instruments [ 29 – 31 ]. However, adaptations or modifications may be necessary [ 26 ]. The measurement instruments of interest for the current study were reviewer-reported measurement tools. Pilot tests and adaptation processes of the COSMIN methodology are described below (see section "Quality assessment and evidence synthesis"). The definition of each measurement property evaluated in the present review is based on COSMIN's taxonomy, terminology and definitions of measurement properties [ 32 ]. The review protocol was prospectively registered on March 6, 2020 in the Open Science Framework (OSF) with the registration DOI: https://doi.org/10.17605/OSF.IO/PTG4D [ 33 ].

Deviations from the preregistered protocol

One of the aims listed in the review protocol was to evaluate the characteristics and restrictions of measurement tools in terms of terminology and criteria for assessing external validity. This issue has been addressed in two recent reviews with a similar scope [ 9 , 17 ]. Although our eligibility criteria differed, we concluded that no novel data were available for the present review to extract, since the authors of the included tools either did not describe the definition or construct of interest or cited the same reports. This objective was therefore omitted.

Literature search and screening

A literature search was conducted in four databases: PubMed, Scopus, PsycINFO via OVID, and CINAHL via EBSCO. The eligibility criteria and search strategy were predefined in collaboration with a research librarian and are detailed in Table S1 (see Additional file 1 ). The search strategy was designed according to the COSMIN methodology and consisted of the following four key elements: (1) construct (external validity of RCTs from the review authors' perspective), (2) population(s) (RCTs), (3) type of instrument(s) (measurement tools, checklists, surveys, etc.), and (4) measurement properties (e.g. validity and reliability) [ 34 ]. The four key elements were divided into two main searches (adapted from previous reviews [ 24 , 35 , 36 ]): the phase 1 search contained the first three key elements to identify measurement tools for assessing the external validity of RCTs; the phase 2 search aimed to identify studies evaluating the measurement properties of each tool identified and included during phase 1. For this second search, a sensitive PubMed search filter developed by Terwee et al. [ 37 ] was applied. Translations of this filter for the remaining databases were taken from the COSMIN website and from other published COSMIN reviews [ 38 , 39 ] with permission from the authors. Both searches were conducted until March 2021 without restriction on the time of publication (databases were searched from inception). In addition, forward citation tracking was conducted in phase 2 using the 'cited by' function of Scopus, a specialized citation database; the Scopus search filter was then entered into the 'search within results' function. The results of the forward citation tracking with Scopus were added to the database search results in the Rayyan app for screening. Reference lists of the retrieved full-text articles and forward citations in PubMed were scanned manually for additional studies by one reviewer (AJ) and checked by a second reviewer (KL).
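Purely as an illustration of how the two-phase search combines these key elements (the actual, librarian-reviewed strategy is given in Table S1, and the Terwee et al. filter was used in phase 2), the logic can be sketched as below. All search terms, and the assumption that each included tool's name was combined with the measurement-property filter in phase 2, are hypothetical placeholders rather than the strategy actually run.

```python
# Illustrative sketch of the two-phase Boolean search logic described above.
# All term blocks are hypothetical placeholders; the actual search strategy
# is reported in Table S1 and used the filter by Terwee et al. in phase 2.

CONSTRUCT = '("external validity" OR generalizability OR applicability)'
POPULATION = '("randomized controlled trial" OR RCT)'
INSTRUMENT = '(tool OR checklist OR instrument OR scale)'
PROPERTY_FILTER = '(reliability OR validity OR "measurement properties")'

def build_query(*blocks: str) -> str:
    """Combine term blocks with Boolean AND."""
    return " AND ".join(blocks)

# Phase 1: construct AND population AND type of instrument.
phase1 = build_query(CONSTRUCT, POPULATION, INSTRUMENT)

# Phase 2 (assumed structure): name of an included tool AND property filter.
phase2 = build_query('"RITES tool"', PROPERTY_FILTER)

print(phase1)
print(phase2)
```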

Title and abstract screening for both searches and the full-text screening in phase 2 were performed independently by at least two of the three involved researchers (AJ, KL & TB). For pragmatic reasons, full-text screening and tool/data extraction in phase 1 were performed by one reviewer (AJ) and checked by a second reviewer (TB); this approach is acceptable for full-text screening as well as data extraction [ 40 ]. Data extraction for both searches was performed with a predesigned extraction sheet based on the recommendations of the COSMIN user manual [ 34 ]. The Rayyan Qatar Computing Research Institute (QCRI) web app [ 41 ] was used to facilitate the screening process (both searches) according to a priori defined eligibility criteria. A pilot test was conducted for both searches to reach agreement between the reviewers on the screening process. For this purpose, the first 100 records in phase 1 and the first 50 records in phase 2 (sorted by date) in the Rayyan app were screened by two reviewers independently; issues regarding the feasibility of the screening methods were subsequently discussed in a meeting.

Eligibility criteria

Phase 1 search (identification of tools)

Records were considered for inclusion based on their title and abstract according to the following criteria: (1) records that described the development and/or implementation (application), e.g. a manual or handbook, of any tool to assess the external validity of RCTs; (2) systematic reviews that applied tools to assess the external validity of RCTs and explicitly mentioned the tool in the title or abstract; (3) systematic reviews or any other publications potentially using a tool for external validity assessment in which the tool was not explicitly mentioned in the title or abstract; (4) records that referred to, or dealt with, tools for the assessment of the external validity of RCTs, e.g. methods papers and commentaries.

The full-text screening was performed to extract, or to find references to, potential tools. If a tool was cited but not presented or available in the full-text version, the internet was searched for websites presenting the tool so that it could be extracted and reviewed for inclusion. Potential tools were extracted and screened for eligibility as follows: measurement tools aiming to assess the external validity of RCTs and designed for implementation in systematic reviews of intervention studies. Since the terms external validity, applicability, generalizability, relevance and transferability are used interchangeably in the literature [ 10 , 11 ], tools aiming to assess any of these constructs were eligible. Exclusion criteria: (1) the multidimensional tool included at least one item related to external validity, but it was not possible to assess and interpret external validity separately; (2) the tool was developed exclusively for study designs other than RCTs; (3) the tool contained items assessing information not requested in the CONSORT statement [ 42 ] (e.g. cost-effectiveness of the intervention, salary of the health care provider) and these items could not be separated from the items on external validity; (4) the tool was published in a language other than English or German; (5) the tool was explicitly designed for a specific medical profession or field and could not be used in other medical fields.

Phase 2 search (identification of reports on the measurement properties of included tools)

For the phase 2 search, records evaluating the measurement properties of at least one of the included measurement tools were selected. Reports that only used a measurement tool as an outcome measure, without evaluating at least one of its measurement properties, were excluded. Likewise, reports providing data only on the validity or reliability of sum scores of multidimensional tools were excluded if the dimension "external validity" was not evaluated separately.

If data or information were missing (phase 1 or phase 2), the corresponding authors were contacted.

Quality assessment and evidence synthesis

All included reports were systematically evaluated: (1) for their methodological quality, using the adapted COSMIN Risk of Bias (RoB) checklist [ 25 ], and (2) against the updated criteria for good measurement properties [ 26 , 27 ]. Subsequently, all available evidence for each measurement property of each individual tool was summarized, rated against the updated criteria for good measurement properties, and graded for its certainty of evidence according to COSMIN's modified GRADE approach [ 26 , 27 ]. The quality assessment was performed by two independent reviewers (AJ & JB). In cases of irreconcilable disagreement, a third reviewer (TB) was consulted to reach consensus.

The COSMIN RoB checklist [ 25 , 27 , 32 , 43 ] is designed for the systematic evaluation of the methodological quality of studies assessing the measurement properties of health measurement instruments [ 25 ]. Although this checklist was developed specifically for systematic reviews of PROMs, it can also be used for reviews of non-PROMs [ 26 ] or measurement tools of other latent constructs [ 28 , 29 ]. As mentioned in the COSMIN user manual, adaptations of some items in the COSMIN RoB checklist might be necessary, depending on the construct being measured [ 34 ]. Therefore, before data extraction, pilot tests were performed on the assessment of measurement properties of tools assessing the quality of RCTs, aiming to ensure the feasibility of the planned evaluation of the included tools. The pilot tests were performed with a random sample of publications on measurement instruments of potentially relevant tools. After each pilot test, results and problems regarding the comprehensibility, relevance and feasibility of the instructions, items, and response options in relation to the construct of interest were discussed. Where necessary, adaptations and/or supplements were added to the instructions for the evaluation with the COSMIN RoB checklist. Saturation was reached after two rounds of pilot testing. Substantial adaptations or supplements were required for Box 1 ('development process') and Box 10 ('responsiveness') of the COSMIN RoB checklist; minor adaptations were necessary for the remaining boxes. The specification list, including the adaptations, is given in Table S2 (see Additional file 2 ). The methodological quality of the included studies was rated on the four-point rating scale of the COSMIN RoB checklist as "inadequate", "doubtful", "adequate", or "very good" [ 25 ]. The lowest rating of any item in a box determines the overall rating of the methodological quality of each single study on a measurement property [ 25 ].
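As an aside, this "worst score counts" principle can be illustrated with a minimal sketch; the box name and item ratings below are hypothetical examples, not data from this review, and the snippet is not part of the COSMIN methodology itself.

```python
# Minimal sketch of COSMIN's "worst score counts" principle: the overall
# methodological-quality rating of a study on one measurement property is
# the lowest rating given to any item in the corresponding checklist box.
# The item ratings below are hypothetical examples.

RATING_ORDER = ["inadequate", "doubtful", "adequate", "very good"]  # worst to best

def overall_box_rating(item_ratings):
    """Return the lowest (worst) rating among all items in a box."""
    return min(item_ratings, key=RATING_ORDER.index)

if __name__ == "__main__":
    # Hypothetical item ratings for one checklist box:
    reliability_box = ["very good", "adequate", "doubtful", "very good"]
    print(overall_box_rating(reliability_box))  # -> "doubtful"
```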

After the RoB assessment, the result of each single study on a measurement property was rated against the updated criteria for good measurement properties for content validity [ 27 ] and for the remaining measurement properties [ 26 ] as "sufficient" (+), "insufficient" (-), or "indeterminate" (?). These ratings were summarized, and an overall rating for each measurement property was given as "sufficient" (+), "insufficient" (-), "inconsistent" (±), or "indeterminate" (?). However, the overall rating criteria for good content validity were adapted to the research topic of the present review. This method usually requires an additional subjective judgement by the reviewers [ 44 ]. Since one of the biggest limitations in this field of research is the lack of consensus on terminology and criteria, as well as on how to assess external validity [ 9 , 12 ], a subjective judgement by the reviewers was considered inappropriate. After this issue had also been discussed with a leading member of the COSMIN steering committee, the reviewers' rating was omitted. A "sufficient" (+) overall rating was given if there was evidence of face or content validity of the final version of the measurement tool assessed by a user or expert panel. Otherwise, the rating "indeterminate" (?) or "insufficient" (-) was used for content validity.

The summarized evidence for each measurement property of each individual tool was graded using COSMIN's modified GRADE approach [ 26 , 27 ]. The certainty (quality) of evidence was graded as "high", "moderate", "low", or "very low" according to the approach for content validity [ 27 ] and for the remaining measurement properties [ 26 ]. COSMIN's modified GRADE approach distinguishes four factors influencing the certainty of evidence: risk of bias, inconsistency, indirectness, and imprecision. The starting point for all measurement properties is high certainty of evidence, which is subsequently downgraded by one to three levels per factor when there is risk of bias, (unexplained) inconsistency, imprecision (not considered for content validity [ 27 ]), or indirectness of results [ 26 , 27 ]. If there is no study on the content validity of a tool, the starting point for this measurement property is "moderate" and is subsequently downgraded depending on the quality of the development process [ 27 ]. The grading process according to COSMIN [ 26 , 27 ] is described in Table S4. Selective reporting bias and publication bias are not taken into account in COSMIN's modified GRADE approach, because of a lack of registries for studies on measurement properties [ 26 ].
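As a rough illustration of this grading logic, the sketch below starts from "high" certainty and steps down one level per concern; the concern names and the number of levels deducted are simplified assumptions for illustration and do not replace the detailed guidance summarized in Table S4.

```python
# Simplified sketch of the downgrading logic in COSMIN's modified GRADE
# approach: start at "high" certainty and downgrade by a given number of
# levels for each concern. Concerns and deductions below are illustrative.

LEVELS = ["high", "moderate", "low", "very low"]

def grade_certainty(downgrades, start="high"):
    """Downgrade the starting certainty by the summed number of levels."""
    index = LEVELS.index(start) + sum(downgrades.values())
    return LEVELS[min(index, len(LEVELS) - 1)]  # floor at "very low"

if __name__ == "__main__":
    # Hypothetical example: serious risk of bias (-1) and imprecision (-1).
    print(grade_certainty({"risk_of_bias": 1, "inconsistency": 0,
                           "indirectness": 0, "imprecision": 1}))  # -> "low"
```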

The evidence synthesis was performed qualitatively according to the COSMIN methodology [ 26 ]. If several reports revealed homogeneous quantitative data (e.g. same statistics, same population) on the internal consistency, reliability, measurement error or hypotheses testing of a measurement tool, pooling of the results was considered using generic inverse-variance (random effects) methodology, with weighted means and 95% confidence intervals for each measurement property [ 34 ]. No subgroup analysis was planned. However, statistical pooling was not possible in the present review.
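For readers unfamiliar with the pooling approach that would have been used, the sketch below implements one common generic inverse-variance random-effects estimator (DerSimonian-Laird); the effect estimates and variances are hypothetical, and no such pooling was actually performed in this review.

```python
# Illustrative sketch of generic inverse-variance random-effects pooling
# (DerSimonian-Laird estimator). No pooling was possible in this review;
# the reliability coefficients and variances below are hypothetical.
import math

def pool_random_effects(effects, variances):
    """Return the pooled estimate and its 95% confidence interval."""
    w = [1.0 / v for v in variances]                      # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)         # between-study variance
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)

if __name__ == "__main__":
    # Hypothetical interrater-reliability coefficients and their variances:
    print(pool_random_effects([0.70, 0.62, 0.81], [0.004, 0.006, 0.003]))
```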

We used three criteria for the recommendation of a measurement tool, in accordance with the COSMIN manual: (A) "Evidence for sufficient content validity (any level) and at least low-quality evidence for sufficient internal consistency" is required for a tool to be recommended; (B) tools "categorized not in A or C" require further research on their quality before a recommendation can be made; and (C) tools with "high quality evidence for an insufficient psychometric property" should not be recommended [ 26 ].
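Read as a decision rule, these three recommendation categories could be expressed as in the sketch below; the boolean inputs condense the full evidence ratings and are assumptions made for illustration only.

```python
# Sketch of the three COSMIN recommendation categories as a decision rule.
# The boolean flags condense the full evidence ratings and are assumptions
# made for illustration; they do not replace the COSMIN criteria themselves.

def recommendation_category(sufficient_content_validity,
                            sufficient_internal_consistency,
                            high_quality_insufficient_property):
    if high_quality_insufficient_property:
        return "C"  # should not be recommended
    if sufficient_content_validity and sufficient_internal_consistency:
        return "A"  # can be recommended for use
    return "B"      # further research on its quality is required

if __name__ == "__main__":
    # Hypothetical tool without evidence of sufficient internal consistency:
    print(recommendation_category(True, False, False))  # -> "B"
```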

Literature search and selection process

Figure  1 shows the selection process. In the phase 1 search, 5020 of 5397 non-duplicate records were excluded as irrelevant; the remaining 377 reports were screened in full text, and 74 potential tools were extracted. After consensus was reached, 46 tools were excluded (reasons for exclusion are presented in Table S3; see Additional file 3 ), and 28 tools were finally included. Any disagreements during the screening process were resolved through discussion. In one case during the full-text screening of the phase 1 search, the whole review team was involved to reach consensus on the inclusion/exclusion of two tools (the Agency for Healthcare Research and Quality (AHRQ) criteria for applicability and the TRANSFER approach, both listed in Table S3).

In the phase 2 search, 2191 non-duplicate records were screened by title and abstract; 2146 records were excluded because they did not assess any measurement property of the included tools. Of the remaining 45 reports, 8 were included. The most common reason for exclusion was that reports evaluating the measurement properties of multidimensional tools did not evaluate external validity as a separate dimension. For example, one study assessing the interrater reliability of the GRADE method [ 45 ] was identified during full-text screening but had to be excluded because it did not provide separate data on the reliability of the indirectness domain (representing external validity). Two additional reports were included through reference screening. Any disagreements during the screening process were resolved through discussion.

Thirty-eight publications on the development or evaluation of the measurement properties of 28 included tools were included for quality appraisal according to the adapted COSMIN guidelines.

Figure 1

Flow diagram "of systematic search strategy used to identify clinimetric papers" [ 24 ]

We contacted the corresponding authors of three reports [ 46 – 48 ] for additional information. One corresponding author did reply [ 48 ].

Methods to assess the external validity of RCTs

During full-text screening in phase 1, several concepts for assessing the external validity of RCTs were found (Table  1 ). Two main concepts were identified: experimental/statistical methods and non-experimental methods. The experimental/statistical methods were summarized and collated into five subcategories, giving a descriptive overview of the different approaches used to assess external validity. According to our eligibility criteria, however, these methods were excluded, since they were not developed for use in systematic reviews of interventions. In addition, a comparison of these methods, as well as an appraisal of their risk of bias with the COSMIN RoB checklist, would not have been feasible. The experimental/statistical methods described below were therefore not included for further evaluation.

Characteristics of included measurement tools

The included tools and their characteristics are listed in Table  2 . Overall, the tools were heterogeneous with respect to the number of items or dimensions, the response options, and the development processes. The number of items ranged from one to 26, and the response options ranged from 2-point to 5-point scales. Most tools used a 3-point scale ( n  = 20/28, 71%). For 14/28 (50%) of the tools, the development process was not described in detail [ 63 – 76 ]. Seven review authors appear to have developed their own tools but did not provide any information on the development process [ 63 – 68 , 71 ].

The constructs that the tools aimed to measure, and their dimensions of interest, were diverse. Two tools focused on characterizing RCTs on an efficacy-effectiveness continuum [ 47 , 86 ]; three tools focused predominantly on the reporting quality of factors essential to external validity [ 69 , 75 , 88 ] (rather than on external validity itself); 18 tools aimed to assess the representativeness, generalizability or applicability of the population, setting, intervention, and/or outcome measure to usual practice [ 22 , 63 – 65 , 70 , 71 , 73 , 74 , 76 – 78 , 81 – 83 , 92 , 94 , 100 ]; and five tools seemed to measure a mixture of these different constructs related to external validity [ 66 , 68 , 72 , 79 , 98 ]. However, the construct of interest of most tools was not described adequately (see below).

Measurement properties

The results of the methodological quality assessment according to the adapted COSMIN RoB checklist are detailed in Table 3 . If all data on hypotheses testing in an article had the same methodological quality rating, they were combined and summarized in Table 3  in accordance with the COSMIN manual [ 34 ]. The results of the ratings against the updated criteria for good measurement properties and the overall certainty of evidence, according to the modified GRADE approach, can be seen in Table 4 . The detailed grading is described in Table S4 (see Additional file 4 ). Disagreements between reviewers during the quality assessment were resolved through discussion.

Content validity

The methodological quality of the development process was "inadequate" for 19/28 (68%) of the included tools [ 63 – 66 , 68 – 74 , 76 , 78 , 81 , 88 , 98 , 100 ], mainly because of an insufficient description of the construct to be measured or the target population, or because pilot tests were missing. Six development studies had a "doubtful" methodological quality [ 22 , 75 , 77 , 79 , 82 , 83 ] and three had an "adequate" methodological quality [ 47 , 48 , 94 ].

There was evidence for content validation of five tools [ 22 , 47 , 79 , 81 , 98 ]. However, the methodological quality of the content validity studies was "adequate" and "very good" only for the Rating of Included Trials on the Efficacy-Effectiveness Spectrum (RITES) tool [ 47 ] and "doubtful" for Cho's Clinical Relevance Instrument [ 79 ], the "external validity" dimension of the Downs & Black checklist [ 22 ], the "Selection Bias" dimension of the Effective Public Health Practice Project (EPHPP) tool [ 98 ], and the "Clinical Relevance" tool [ 81 ]. The overall certainty of evidence for content validity was "very low" for 19 tools (mainly due to very serious risk of bias and serious indirectness) [ 63 – 76 , 78 , 82 , 86 , 88 , 100 ], "low" for three tools (mainly due to serious risk of bias or serious indirectness) [ 77 , 83 , 94 ] and "moderate" for six tools (mainly due to serious risk of bias or serious indirectness) [ 22 , 47 , 79 , 81 , 92 , 98 ]. All but one tool had an "indeterminate" content validity. The RITES tool [ 47 ] had "moderate" certainty of evidence for "sufficient" content validity.

Internal consistency

One study assessed the internal consistency of one tool (the "external validity" dimension of the Downs & Black checklist) [ 22 ]. The methodological quality of this study was "doubtful" due to a lack of evidence on unidimensionality (or structural validity). Thus, this tool had "very low" certainty of evidence for "indeterminate" internal consistency. Reasons for downgrading were a very serious risk of bias and imprecision.

Reliability

Out of 13 studies assessing the reliability of nine tools, eleven evaluated interrater reliability [ 80 ,  84 , 86 , 87 , 90 , 93 – 95 , 97 , 99 ], one evaluated test-retest reliability [ 98 ], and one evaluated both [ 22 ]. Two studies had an "inadequate" [ 93 , 101 ], two a "doubtful" [ 98 , 99 ], three an "adequate" [ 80 ,  91 , 94 , 95 ], and six a "very good" methodological quality [ 22 , 84 , 86 , 87 ]. The overall certainty of evidence was "very low" for five tools (for reasons for downgrading, refer to Table S4) [ 47 , 73 , 88 , 92 , 94 ]. The certainty of evidence was "low" for the "Selection Bias" dimension of the EPHPP tool (due to serious risk of bias and imprecision) [ 98 ] and "moderate" for Gartlehner's tool [ 86 ], the "external validity" dimension of the Downs & Black checklist [ 22 ], and the clinical relevance instrument [ 79 ] (due to serious risk of bias and indirectness).

Out of the nine evaluated tools, the Downs & Black checklist [ 22 ] had "inconsistent" results on reliability. The Clinical Relevance Instrument [ 79 ], Gartlehner's tool [ 86 ], the "Selection Bias" dimension of the EPHPP tool [ 98 ], the indirectness dimension of the GRADE handbook [ 92 ] and the modified indirectness checklist [ 94 ] had an "insufficient" rating for reliability. Green & Glasgow's tool [ 88 ], the external validity dimension of the U.S. Preventive Services Task Force (USPSTF) manual [ 73 ] and the RITES tool [ 47 ] had "very low" certainty of evidence for "sufficient" reliability.

Measurement error

Measurement error was reported for three tools. Two studies on the measurement error of Gartlehner's tool [ 86 ] and Loyka's external validity framework [ 75 ] had an "adequate" methodological quality, whereas two studies on the measurement error of the external validity dimension of the Downs & Black checklist [ 22 ] had an "inadequate" methodological quality. However, all three tools had "very low" certainty of evidence for "indeterminate" measurement error. Reasons for downgrading were risk of bias, indirectness, and imprecision due to small sample sizes.

Criterion validity

Criterion validity was reported only for Gartlehner's tool [ 86 ]. Since no gold standard was available to assess the criterion validity of this tool, the authors used expert opinion as the reference standard. The study assessing this measurement property had an "adequate" methodological quality. The overall certainty of evidence was "very low" for "sufficient" criterion validity, due to risk of bias, imprecision, and indirectness.

Construct validity (hypotheses testing)

Five studies [ 22 , 90 , 91 , 97 , 98 ] reported on the construct validity of four tools. Three studies had a "doubtful" [ 90 , 91 , 98 ], one an "adequate" [ 22 ] and one a "very good" [ 97 ] methodological quality. The overall certainty of evidence was "very low" for three tools (mainly due to serious risk of bias, imprecision and serious indirectness) [ 22 , 88 , 98 ] and "low" for one tool (due to imprecision and serious indirectness) [ 47 ]. The "Selection Bias" dimension of the EPHPP tool [ 98 ] had "very low" certainty of evidence for "sufficient" construct validity, and the RITES tool [ 47 ] had "low" certainty of evidence for "sufficient" construct validity. Both Green & Glasgow's tool [ 88 ] and the Downs & Black checklist [ 22 ] had "very low" certainty of evidence for "insufficient" construct validity.

Structural validity and cross-cultural validity were not assessed in any of the included studies.

Summary and interpretation of results

To our knowledge, this is the first systematic review to identify and evaluate the measurement properties of tools for assessing the external validity of RCTs. A total of 28 tools were included. For more than half of the included tools (n = 17/28, 61%), no measurement properties were reported, and for 14/28 (50%) the development process was not described. Only five tools had at least one "sufficient" measurement property. Reliability (interrater and/or test-retest reliability) was the measurement property assessed most frequently; only three of the included tools had "sufficient" reliability ("very low" certainty of evidence) [ 47 , 73 , 88 ]. Hypotheses testing was evaluated for four tools, of which two had "sufficient" construct validity ("low" and "very low" certainty of evidence) [ 47 , 98 ]. Measurement error was evaluated for three tools, all with an "indeterminate" quality rating ("low" and "very low" certainty of evidence) [ 22 , 75 , 86 ]. Criterion validity was evaluated for one tool and was "sufficient" with "very low" certainty of evidence [ 86 ]. The RITES tool [ 47 ] was the measurement tool with the strongest evidence for validity and reliability: its content validity, based on international expert consensus, was "sufficient" with "moderate" certainty of evidence, while its reliability and construct validity were rated as "sufficient" with "very low" and "low" certainty of evidence, respectively.

Following the three criteria for the recommendation of a measurement tool, all included tools were categorized as 'B'. Hence, further research is required before any of the included tools can be recommended for or against [ 26 ]. Sufficient internal consistency may not even be relevant for the assessment of external validity, as the measurement models might not be fully reflective; however, none of the authors/developers specified the measurement model of their measurement tool.

Specification of the measurement model is considered a requirement for establishing the appropriateness of a scale or tool for the latent construct of interest during its development [ 102 ]. It could be argued that researchers automatically assume their tool to follow a reflective measurement model. For example, Downs and Black [ 22 ] assessed internal consistency without prior testing of the unidimensionality or structural validity of their tool. Structural validity or unidimensionality is a prerequisite for internal consistency [ 26 ], and both measurement properties are only relevant for reflective measurement models [ 103 , 104 ]. Misspecification, as well as a lack of specification, of the measurement model can limit the development and validation of a scale or tool [ 102 , 105 ]. Hence, the specification of measurement models should be considered in future research.

Content validity is the most important measurement property of health measurement instruments [ 27 ], and a lack of face validity is considered a strong argument for not using a measurement instrument or for stopping its further evaluation [ 106 ]. Only the RITES tool [ 47 ] had evidence of "sufficient" content validity. Nevertheless, this tool does not directly measure the external validity of RCTs; it was developed to classify RCTs on an efficacy-effectiveness continuum. An RCT categorized as highly pragmatic or as having a "strong emphasis on effectiveness" [ 47 ] implies that the study design provides relatively applicable results, but it does not automatically imply high external validity or generalizability of the trial's characteristics to other specific contexts and settings [ 107 ]. Even a highly pragmatic/effectiveness-oriented study might have little applicability or generalizability to a review author's specific research question. An individual assessment of external validity by review authors, in accordance with the research question and other contextual factors, may therefore still be needed.

Another tool that might have some degree of content or face validity is the indirectness dimension of the GRADE method [ 92 ]. GRADE is a widely used and accepted method for research synthesis in the health sciences [ 108 ]. It has evolved over the years based on the work of the GRADE Working Group and on feedback from users worldwide [ 108 ]. It might therefore be assumed that this method has a high degree of face validity, although it has not been systematically tested for content validity.

If all tools in a review are categorized as 'B', the COSMIN guidelines suggest that the measurement instrument "with the best evidence for content validity could be the one to be provisionally recommended for use, until further evidence is provided" [ 34 ]. In accordance with this suggestion, the provisional use of the RITES tool [ 47 ] might be justified until more research on this topic is available. However, users should be aware of its limitations, as described above.

Implication for future research

This study affirms and supplements what is already known from previous reviews [ 9 , 12 , 14 – 18 ]. The heterogeneity of tool characteristics observed in those reviews was also evident in the present review. Although Dyrvig et al. [ 16 ] did not assess the measurement properties of available tools, they reported a lack of empirical support for the items included in measurement tools. The authors of previous reviews could not recommend a measurement tool. Although their conclusions were based mainly on descriptive analysis rather than on an assessment of the quality of the tools, the conclusion of the present systematic review is consistent with theirs.

One major challenge in this field is the substantial heterogeneity in the terminology, criteria and guidance for assessing the external validity of RCTs. Developing new tools and/or further revising (and validating) available tools may not be appropriate before consensus-based standards have been established. More generally, it is debatable whether such methods for assessing external validity in systematic reviews of interventions are suitable at all [ 9 , 12 ]. The experimental/statistical methods presented in Table  1 may offer a more objective approach to evaluating the external validity of RCTs, but they are not feasible to implement in the conduct of systematic reviews. Furthermore, they focus mainly on the characteristics and generalizability of the study populations, which is insufficient to assess the external validity of clinical trials [ 109 ], since they do not consider other relevant dimensions of external validity, such as intervention settings or treatment variables [ 4 , 109 ].

The methodological possibilities of tool/scale development and validation on this topic have not yet been exhausted. More than 20 years ago, there was no consensus regarding the definition of the quality of RCTs. In 1998, Verhagen et al. [ 110 ] performed a Delphi study to achieve consensus on the definition of quality of RCTs and to create a quality criteria list. This criteria list has since guided tool development, and its criteria are still implemented in methodological quality and risk of bias assessment tools (e.g. the Cochrane Collaboration risk of bias tools 1 and 2.0 and the Physiotherapy Evidence Database (PEDro) scale). Consequently, it seems necessary to seek consensus in a similar way in order to overcome the issues regarding the external validity of RCTs. After consensus has been reached, further development and validation is needed, following standard guidelines for scale/tool development (e.g. de Vet et al. [ 106 ], Streiner et al. [ 111 ], DeVellis [ 112 ]). Since the assessment of external validity appears to be highly context-dependent [ 9 , 12 ], this should be taken into account in future research. A conventional checklist approach seems inappropriate [ 9 , 12 , 109 ], and a more comprehensive but flexible approach might be necessary. The experimental/statistical methods (Table  1 ) may offer a reference standard for convergent validity testing of the dimension "patient population" in future research.

This review has highlighted the necessity for more research in this area. Published studies and evaluation tools are important sources of information and should inform the development of a new tool or approach.

Strengths and limitations

One strength of the present review is the two-phase search method, which we believe adequately reduced the likelihood of missing relevant studies. The forward citation tracking using Scopus is another strength. The quality of the included measurement tools was assessed with an adapted, comprehensive methodology (COSMIN); none of the previous reviews had attempted such an assessment.

There are some limitations to the present review. First, a search for grey literature was not performed. Second, we focused on RCTs only and did not include assessment tools for non-randomized or other observational study designs. Third, due to the heterogeneity in terminology, we might have missed some tools with our electronic literature search strategy. Furthermore, it was challenging to find studies on the measurement properties of some included tools that did not have a specific name or abbreviation (such as EVAT). We tried to address this potential limitation by performing comprehensive reference screening and snowballing (including forward citation screening).

Based on the results of this review, no available measurement tool can be fully recommended for use in systematic reviews to assess the external validity of RCTs. Several steps are required to overcome the identified difficulties before a new tool is developed or available tools are further revised and validated.

Availability of data and materials

All data generated or analyzed during this study are included in this published article (and its supplementary information files).

Abbreviations

CASP: Critical Appraisal Skills Programme

Cochrane Collaboration Back Review Group

CCT: controlled clinical trial

COSMIN: COnsensus-based Standards for the selection of health Measurement INstruments

EPHPP: Effective Public Health Practice Project

EVAT: External Validity Assessment Tool

FAME: Feasibility, Appropriateness, Meaningfulness and Effectiveness

GATE: Graphical Appraisal Tool for Epidemiological Studies

GAP: Generalizability, Applicability and Predictability

GRADE: Grading of Recommendations Assessment, Development and Evaluation

HTA: Health Technology Assessment

ICC: intraclass correlation

LEGEND: Let Evidence Guide Every New Decision

NICE: National Institute for Health and Care Excellence

PEDro: Physiotherapy Evidence Database

PRECIS: PRagmatic Explanatory Continuum Indicator Summary

RCT: randomized controlled trial

RITES: Rating of Included Trials on the Efficacy-Effectiveness Spectrum

TREND: Transparent Reporting of Evaluations with Nonrandomized Designs

USPSTF: U.S. Preventive Services Task Force

Bastian H, Glasziou P, Chalmers I. Seventy-five trials and eleven systematic reviews a day: how will we ever keep up? PLoS Med. 2010;7:e1000326.

Aromataris E, Munn Z (eds). JBI Manual for Evidence Synthesis. 2020. https://doi.org/10.46658/jbimes-20-01

Knoll T, Omar MI, Maclennan S, et al. Key Steps in Conducting Systematic Reviews for Underpinning Clinical Practice Guidelines: Methodology of the European Association of Urology. Eur Urol. 2018;73:290–300.

Jüni P, Altman DG, Egger M. Systematic reviews in health care: Assessing the quality of controlled clinical trials. BMJ. 2001;323:42–6.

Büttner F, Winters M, Delahunt E, Elbers R, Lura CB, Khan KM, Weir A, Ardern CL. Identifying the ’incredible’! Part 1: assessing the risk of bias in outcomes included in systematic reviews. Br J Sports Med. 2020;54:798–800.

Boutron I, Page MJ, Higgins JPT, Altman DG, Lundh A, Hróbjartsson A, Group CBM. Considering bias and conflicts of interest among the included studies. Cochrane Handb. Syst. Rev. Interv. 2021; version 6.2 (updated Febr. 2021)

Cook TD, Campbell DT, Shadish W. Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin; 2002.

Avellar SA, Thomas J, Kleinman R, Sama-Miller E, Woodruff SE, Coughlin R, Westbrook TR. External Validity: The Next Step for Systematic Reviews? Eval Rev. 2017;41:283–325.

Weise A, Büchter R, Pieper D, Mathes T. Assessing context suitability (generalizability, external validity, applicability or transferability) of findings in evidence syntheses in healthcare-An integrative review of methodological guidance. Res Synth Methods. 2020;11:760–79.

Schunemann HJ, Tugwell P, Reeves BC, Akl EA, Santesso N, Spencer FA, Shea B, Wells G, Helfand M. Non-randomized studies as a source of complementary, sequential or replacement evidence for randomized controlled trials in systematic reviews on the effects of interventions. Res Synth Methods. 2013;4:49–62.

Atkins D, Chang SM, Gartlehner G, Buckley DI, Whitlock EP, Berliner E, Matchar D. Assessing applicability when comparing medical interventions: AHRQ and the Effective Health Care Program. J Clin Epidemiol. 2011;64:1198–207.

Burchett HED, Blanchard L, Kneale D, Thomas J. Assessing the applicability of public health intervention evaluations from one setting to another: a methodological study of the usability and usefulness of assessment tools and frameworks. Heal Res policy Syst. 2018;16:88.

Dekkers OM, von Elm E, Algra A, Romijn JA, Vandenbroucke JP. How to assess the external validity of therapeutic trials: a conceptual approach. Int J Epidemiol. 2010;39:89–94.

Burchett H, Umoquit M, Dobrow M. How do we know when research from one setting can be useful in another? A review of external validity, applicability and transferability frameworks. J Health Serv Res Policy. 2011;16:238–44.

Cambon L, Minary L, Ridde V, Alla F. Transferability of interventions in health education: a review. BMC Public Health. 2012;12:497.

Dyrvig A-K, Kidholm K, Gerke O, Vondeling H. Checklists for external validity: a systematic review. J Eval Clin Pract. 2014;20:857–64.

Munthe-Kaas H, Nøkleby H, Nguyen L. Systematic mapping of checklists for assessing transferability. Syst Rev. 2019;8:22.

Nasser M, van Weel C, van Binsbergen JJ, van de Laar FA. Generalizability of systematic reviews of the effectiveness of health care interventions to primary health care: concepts, methods and future research. Fam Pract. 2012;29(Suppl 1):i94–103.

Hariton E, Locascio JJ. Randomised controlled trials - the gold standard for effectiveness research: Study design: randomised controlled trials. BJOG. 2018;125:1716.

Pressler TR, Kaizar EE. The use of propensity scores and observational data to estimate randomized controlled trial generalizability bias. Stat Med. 2013;32:3552–68.

Rothwell PM. External validity of randomised controlled trials: “to whom do the results of this trial apply?” Lancet. 2005;365:82–93.

Downs SH, Black N. The feasibility of creating a checklist for the assessment of the methodological quality both of randomised and non-randomised studies of health care interventions. J Epidemiol Community Health. 1998;52:377–84.

Page MJ, Moher D, Bossuyt PM, et al. PRISMA 2020 explanation and elaboration: updated guidance and exemplars for reporting systematic reviews. BMJ. 2021;372:n160.

Clark R, Locke M, Hill B, Wells C, Bialocerkowski A. Clinimetric properties of lower limb neurological impairment tests for children and young people with a neurological condition: A systematic review. PLoS One. 2017;12:e0180031.

Mokkink LB, de Vet HCW, Prinsen CAC, Patrick DL, Alonso J, Bouter LM, Terwee CB. COSMIN Risk of Bias checklist for systematic reviews of Patient-Reported Outcome Measures. Qual Life Res. 2018;27:1171–9.

Prinsen CAC, Mokkink LB, Bouter LM, Alonso J, Patrick DL, de Vet HCW, Terwee CB. COSMIN guideline for systematic reviews of patient-reported outcome measures. Qual Life Res. 2018;27:1147–57.

Terwee CB, Prinsen CAC, Chiarotto A, Westerman MJ, Patrick DL, Alonso J, Bouter LM, de Vet HCW, Mokkink LB. COSMIN methodology for evaluating the content validity of patient-reported outcome measures: a Delphi study. Qual Life Res. 2018;27:1159–70.

Stephenson M, Riitano D, Wilson S, Leonardi-Bee J, Mabire C, Cooper K, Monteiro da Cruz D, Moreno-Casbas MT, Lapkin S. Chap. 12: Systematic Reviews of Measurement Properties. JBI Man Evid Synth. 2020  https://doi.org/10.46658/JBIMES-20-13

Glover PD, Gray H, Shanmugam S, McFadyen AK. Evaluating collaborative practice within community-based integrated health and social care teams: a systematic review of outcome measurement instruments. J Interprof Care. 2021;1–15.  https://doi.org/10.1080/13561820.2021.1902292 . Epub ahead of print.

Maassen SM, Weggelaar Jansen AMJW, Brekelmans G, Vermeulen H, van Oostveen CJ. Psychometric evaluation of instruments measuring the work environment of healthcare professionals in hospitals: a systematic literature review. Int J Qual Heal care J Int Soc Qual Heal Care. 2020;32:545–57.

Al Jabri FYM, Kvist T, Azimirad M, Turunen H. A systematic review of healthcare professionals' core competency instruments. Nurs Health Sci. 2021;23:87–102.

Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, Bouter LM, de Vet HCW. The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. J Clin Epidemiol. 2010;63:737–45.

Jung A, Balzer J, Braun T, Luedtke K. Psychometric properties of tools to measure the external validity of randomized controlled trials: a systematic review protocol. 2020;  https://doi.org/10.17605/OSF.IO/PTG4D

Mokkink LB, Prinsen CAC, Patrick DL, Alonso J, Bouter LM, de Vet HCW, Terwee CB COSMIN manual for systematic reviews of PROMs, user manual. 2018;1–78. https://www.cosmin.nl/wp-content/uploads/COSMIN-syst-review-for-PROMs-manual_version-1_feb-2018-1.pdf . Accessed 3 Feb 2020.

Bialocerkowski A, O’shea K, Pin TW. Psychometric properties of outcome measures for children and adolescents with brachial plexus birth palsy: a systematic review. Dev Med Child Neurol. 2013;55:1075–88.

Matthews J, Bialocerkowski A, Molineux M. Professional identity measures for student health professionals - a systematic review of psychometric properties. BMC Med Educ. 2019;19:308.

Terwee CB, Jansma EP, Riphagen II, De Vet HCW. Development of a methodological PubMed search filter for finding studies on measurement properties of measurement instruments. Qual Life Res. 2009;18:1115–23.

Sierevelt IN, Zwiers R, Schats W, Haverkamp D, Terwee CB, Nolte PA, Kerkhoffs GMMJ. Measurement properties of the most commonly used Foot- and Ankle-Specific Questionnaires: the FFI, FAOS and FAAM. A systematic review. Knee Surg Sports Traumatol Arthrosc. 2018;26:2059–73.

van der Hout A, Neijenhuijs KI, Jansen F, et al. Measuring health-related quality of life in colorectal cancer patients: systematic review of measurement properties of the EORTC QLQ-CR29. Support Care Cancer. 2019;27:2395–412.

Whiting P, Savović J, Higgins JPT, Caldwell DM, Reeves BC, Shea B, Davies P, Kleijnen J, Churchill R. ROBIS: A new tool to assess risk of bias in systematic reviews was developed. J Clin Epidemiol. 2016;69:225–34.

Ouzzani M, Hammady H, Fedorowicz Z, Elmagarmid A. Rayyan-a web and mobile app for systematic reviews. Syst Rev. 2016;5:210.

Moher D, Hopewell S, Schulz KF, Montori V, Gøtzsche PC, Devereaux PJ, Elbourne D, Egger M, Altman DG. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. Int J Surg. 2012;10:28–55.

Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, Bouter LM, de Vet HCW. The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Qual Life Res. 2010;19:539–49.

Terwee CB, Prinsen CA, Chiarotto A, De Vet H, Bouter LM, Alonso J, Westerman MJ, Patrick DL, Mokkink LB. COSMIN methodology for assessing the content validity of PROMs–user manual. Amsterdam VU Univ. Med. Cent. 2018;  https://cosmin.nl/wp-content/uploads/COSMIN-methodology-for-content-validity-user-manual-v1.pdf . Accessed 3 Feb 2020.

Mustafa RA, Santesso N, Brozek J, et al. The GRADE approach is reproducible in assessing the quality of evidence of quantitative evidence syntheses. J Clin Epidemiol. 2013;66:735–6.

Jennings H, Hennessy K, Hendry GJ. The clinical effectiveness of intra-articular corticosteroids for arthritis of the lower limb in juvenile idiopathic arthritis: A systematic review. Pediatr Rheumatol. 2014. https://doi.org/10.1186/1546-0096-12-23 .

Wieland LS, Berman BM, Altman DG, et al. Rating of Included Trials on the Efficacy-Effectiveness Spectrum: development of a new tool for systematic reviews. J Clin Epidemiol. 2017;84:95–104.

Atkins D, Briss PA, Eccles M, et al. Systems for grading the quality of evidence and the strength of recommendations II: pilot study of a new system. BMC Health Serv Res. 2005;5:25.

Abraham NS, Wieczorek P, Huang J, Mayrand S, Fallone CA, Barkun AN. Assessing clinical generalizability in sedation studies of upper GI endoscopy. Gastrointest Endosc. 2004;60:28–33.

Arabi YM, Cook DJ, Zhou Q, et al. Characteristics and Outcomes of Eligible Nonenrolled Patients in a Mechanical Ventilation Trial of Acute Respiratory Distress Syndrome. Am J Respir Crit Care Med. 2015;192:1306–13.

Williams AC de C, Nicholas MK, Richardson PH, Pither CE, et al. Generalizing from a controlled trial: The effects of patient preference versus randomization on the outcome of inpatient versus outpatient chronic pain management. Pain. 1999;83:57–65.

De Jong Z, Munneke M, Jansen LM, Ronday K, Van Schaardenburg DJ, Brand R, Van Den Ende CHM, Vliet Vlieland TPM, Zuijderduin WM, Hazes JMW. Differences between participants and nonparticipants in an exercise trial for adults with rheumatoid arthritis. Arthritis Care Res. 2004;51:593–600.

Hordijk-Trion M, Lenzen M, Wijns W, et al. Patients enrolled in coronary intervention trials are not representative of patients in clinical practice: Results from the Euro Heart Survey on Coronary Revascularization. Eur Heart J. 2006;27:671–8.

Wilson A, Parker H, Wynn A, Spiers N. Performance of hospital-at-home after a randomised controlled trial. J Heal Serv Res Policy. 2003;8:160–4.

Smyth B, Haber A, Trongtrakul K, Hawley C, Perkovic V, Woodward M, Jardine M. Representativeness of Randomized Clinical Trial Cohorts in End-stage Kidney Disease: A Meta-analysis. JAMA Intern Med. 2019;179:1316–24.

Leinonen A, Koponen M, Hartikainen S. Systematic Review: Representativeness of Participants in RCTs of Acetylcholinesterase Inhibitors. PLoS One. 2015;10:e0124500–e0124500.

Chari A, Romanus D, Palumbo A, Blazer M, Farrelly E, Raju A, Huang H, Richardson P. Randomized Clinical Trial Representativeness and Outcomes in Real-World Patients: Comparison of 6 Hallmark Randomized Clinical Trials of Relapsed/Refractory Multiple Myeloma. Clin Lymphoma Myeloma Leuk. 2020;20:8.

Susukida R, Crum RM, Ebnesajjad C, Stuart EA, Mojtabai R. Generalizability of findings from randomized controlled trials: application to the National Institute of Drug Abuse Clinical Trials Network. Addiction. 2017;112:1210–9.

Zarin DA, Young JL, West JC. Challenges to evidence-based medicine: a comparison of patients and treatments in randomized controlled trials with patients and treatments in a practice research network. Soc Psychiatry Psychiatr Epidemiol. 2005;40:27–35.

Gheorghe A, Roberts T, Hemming K, Calvert M. Evaluating the Generalisability of Trial Results: Introducing a Centre- and Trial-Level Generalisability Index. Pharmacoeconomics. 2015;33:1195–214.

He Z, Wang S, Borhanian E, Weng C. Assessing the Collective Population Representativeness of Related Type 2 Diabetes Trials by Combining Public Data from ClinicalTrials.gov and NHANES. Stud Health Technol Inform. 2015;216:569–73.

Schmidt AF, Groenwold RHH, van Delden JJM, van der Does Y, Klungel OH, Roes KCB, Hoes AW, van der Graaf R. Justification of exclusion criteria was underreported in a review of cardiovascular trials. J Clin Epidemiol. 2014;67:635–44.

Carr DB, Goudas LC, Balk EM, Bloch R, Ioannidis JP, Lau J. Evidence report on the treatment of pain in cancer patients. J Natl Cancer Inst Monogr. 2004;32:23–31.

Clegg A, Bryant J, Nicholson T, et al. Clinical and cost-effectiveness of donepezil, rivastigmine and galantamine for Alzheimer’s disease: a rapid and systematic review. Health Technol Assess (Rockv). 2001;5:1–136.

Foy R, Hempel S, Rubenstein L, Suttorp M, Seelig M, Shanman R, Shekelle PG. Meta-analysis: effect of interactive communication between collaborating primary care physicians and specialists. Ann Intern Med. 2010;152:247–58.

Haraldsson BG, Gross AR, Myers CD, Ezzo JM, Morien A, Goldsmith C, Peloso PM, Bronfort G. Massage for mechanical neck disorders. Cochrane database Syst Rev. 2006.  https://doi.org/10.1002/14651858.CD004871.pub3 .

Hawk C, Khorsan R, Lisi AJ, Ferrance RJ. Chiropractic care for nonmusculoskeletal conditions: a systematic review with implications for whole systems research. J Altern Complement Med. 2007;13:491–512.

Karjalainen K, Malmivaara A, van Tulder M, et al. Multidisciplinary rehabilitation for fibromyalgia and musculoskeletal pain in working age adults. Cochrane Database Syst Rev. 2000. https://doi.org/10.1002/14651858.CD001984 .

Liberati A, Himel HN, Chalmers TC. A quality assessment of randomized control trials of primary treatment of breast cancer. J Clin Oncol. 1986;4:942–51.

Averis A, Pearson A. Filling the gaps: identifying nursing research priorities through the analysis of completed systematic reviews. JBI Reports. 2003;1:49–126.

Sorg C, Schmidt J, Büchler MW, Edler L, Märten A. Examination of external validity in randomized controlled trials for adjuvant treatment of pancreatic adenocarcinoma. Pancreas. 2009;38:542–50.

National Institute for Health and Care Excellence. Methods for the development of NICE public health guidance, Third edit. National Institute for Health and Care Excellence. 2012;  https://www.nice.org.uk/process/pmg4/chapter/introduction . Accessed 15 Apr 2020

U.S. Preventive Services Task Force. Criteria for Assessing External Validity (Generalizability) of Individual Studies. US Prev Serv Task Force Appendix VII. 2017;  https://uspreventiveservicestaskforce.org/uspstf/about-uspstf/methods-and-processes/procedure-manual/procedure-manual-appendix-vii-criteria-assessing-external-validity-generalizability-individual-studies . Accessed 15 Apr 2020.

National Health and Medical Research Council NHMRC handbooks. https://www.nhmrc.gov.au/about-us/publications/how-prepare-and-present-evidence-based-information-consumers-health-services#block-views-block-file-attachments-content-block-1 . Accessed 15 Apr 2020.

Loyka CM, Ruscio J, Edelblum AB, Hatch L, Wetreich B, Zabel Caitlin M. Weighing people rather than food: A framework for examining external validity. Perspect Psychol Sci. 2020;15:483–96.

Fernandez-Hermida JR, Calafat A, Becoña E, Tsertsvadze A, Foxcroft DR. Assessment of generalizability, applicability and predictability (GAP) for evaluating external validity in studies of universal family-based prevention of alcohol misuse in young people: systematic methodological review of randomized controlled trials. Addiction. 2012;107:1570–9.

Clark E, Burkett K, Stanko-Lopp D. Let Evidence Guide Every New Decision (LEGEND): an evidence evaluation system for point-of-care clinicians and guideline development teams. J Eval Clin Pract. 2009;15:1054–60.

Bornhöft G, Maxion-Bergemann S, Wolf U, Kienle GS, Michalsen A, Vollmar HC, Gilbertson S, Matthiessen PF. Checklist for the qualitative evaluation of clinical studies with particular focus on external validity and model validity. BMC Med Res Methodol. 2006;6:56.

Cho MK, Bero LA. Instruments for assessing the quality of drug studies published in the medical literature. JAMA J Am Med Assoc. 1994;272:101–4.

Cho MK, Bero LA. The quality of drug studies published in symposium proceedings. Ann Intern Med 1996;124:485–489

van Tulder M, Furlan A, Bombardier C, Bouter L. Updated method guidelines for systematic reviews in the cochrane collaboration back review group. Spine (Phila Pa 1976). 2003;28:1290–9.

Estrada F, Atienzo EE, Cruz-Jiménez L, Campero L. A Rapid Review of Interventions to Prevent First Pregnancy among Adolescents and Its Applicability to Latin America. J Pediatr Adolesc Gynecol. 2021;34:491–503.

Khorsan R, Crawford C. How to assess the external validity and model validity of therapeutic trials: a conceptual approach to systematic review methodology. Evid Based Complement Alternat Med. 2014;2014:694804.

O’Connor SR, Tully MA, Ryan B, Bradley JM, Baxter GD, McDonough SM. Failure of a numerical quality assessment scale to identify potential risk of bias in a systematic review: a comparison study. BMC Res Notes. 2015;8:224.

Chalmers TC, Smith H, Blackburn B, Silverman B, Schroeder B, Reitman D, Ambroz A. A method for assessing the quality of a randomized control trial. Control Clin Trials. 1981;2:31–49.

Gartlehner G, Hansen RA, Nissman D, Lohr KN, Carey TS. A simple and valid tool distinguished efficacy from effectiveness studies. J Clin Epidemiol. 2006;59:1040–8.

Zettler LL, Speechley MR, Foley NC, Salter KL, Teasell RW. A scale for distinguishing efficacy from effectiveness was adapted and applied to stroke rehabilitation studies. J Clin Epidemiol. 2010;63:11–8.

Green LW, Glasgow RE. Evaluating the relevance, generalization, and applicability of research: issues in external validation and translation methodology. Eval Health Prof. 2006;29:126–53.

Glasgow RE, Vogt TM, Boles SM. Evaluating the public health impact of health promotion interventions: the RE-AIM framework. Am J Public Health. 1999;89:1322–7.

Mirza NA, Akhtar-Danesh N, Staples E, Martin L, Noesgaard C. Comparative Analysis of External Validity Reporting in Non-randomized Intervention Studies. Can J Nurs Res. 2014;46:47–64.

Laws RA, St George AB, Rychetnik L, Bauman AE. Diabetes prevention research: a systematic review of external validity in lifestyle interventions. Am J Prev Med. 2012;43:205–14.

Schünemann H, Brożek J, Guyatt G, Oxman A. Handbook for grading the quality of evidence and the strength of recommendations using the GRADE approach (updated October 2013). GRADE Work. Gr. 2013;  https://gdt.gradepro.org/app/handbook/handbook.html . Accessed 15 Apr 2020.

Wu XY, Chung VCH, Wong CHL, Yip BHK, Cheung WKW, Wu JCY. CHIMERAS showed better inter-rater reliability and inter-consensus reliability than GRADE in grading quality of evidence: A randomized controlled trial. Eur J Integr Med. 2018;23:116–22.

Meader N, King K, Llewellyn A, Norman G, Brown J, Rodgers M, Moe-Byrne T, Higgins JPT, Sowden A, Stewart G. A checklist designed to aid consistency and reproducibility of GRADE assessments: Development and pilot validation. Syst Rev. 2014. https://doi.org/10.1186/2046-4053-3-82 .

Article   PubMed   PubMed Central   Google Scholar  

Llewellyn A, Whittington C, Stewart G, Higgins JP, Meader N. The Use of Bayesian Networks to Assess the Quality of Evidence from Research Synthesis: 2. Inter-Rater Reliability and Comparison with Standard GRADE Assessment. PLoS One. 2015;10:e0123511.

Jackson R, Ameratunga S, Broad J, Connor J, Lethaby A, Robb G, Wells S, Glasziou P, Heneghan C. The GATE frame: critical appraisal with pictures. Evid Based Med 2006;11:35 LP– 38

Aves T. The Role of Pragmatism in Explaining Heterogeneity in Meta-Analyses of Randomized Trials: A Methodological Review. 2017; McMaster University. http://hdl.handle.net/11375/22212 . Accessed 12 Jan 2021.

Thomas BH, Ciliska D, Dobbins M, Micucci S. A process for systematically reviewing the literature: providing the research evidence for public health nursing interventions. Worldviews Evidence-Based Nurs. 2004;1:176–84.

Armijo-Olivo S, Stiles CR, Hagen NA, Biondo PD, Cummings GG. Assessment of study quality for systematic reviews: a comparison of the Cochrane Collaboration Risk of Bias Tool and the Effective Public Health Practice Project Quality Assessment Tool: methodological research. J Eval Clin Pract. 2012;18:12–8.

Critical Appraisal Skills Programme. CASP Randomised Controlled Trial Standard Checklist. 2020;  https://casp-uk.net/casp-tools-checklists/ . Accessed 10 Dec 2020.

Aves T, Allan KS, Lawson D, Nieuwlaat R, Beyene J, Mbuagbaw L. The role of pragmatism in explaining heterogeneity in meta-analyses of randomised trials: a protocol for a cross-sectional methodological review. BMJ Open. 2017;7:e017887.

Diamantopoulos A, Riefler P, Roth KP. Advancing formative measurement models. J Bus Res. 2008;61:1203–18.

Fayers PM, Hand DJ. Factor analysis, causal indicators and quality of life. Qual Life Res. 1997. https://doi.org/10.1023/A:1026490117121 .

Streiner DL. Being Inconsistent About Consistency: When Coefficient Alpha Does and Doesn’t Matter. J Pers Assess. 2003;80:217–22.

MacKenzie SB, Podsakoff PM, Jarvis CB. The Problem of Measurement Model Misspecification in Behavioral and Organizational Research and Some Recommended Solutions. J Appl Psychol. 2005;90:710–30.

De Vet HCW, Terwee CB, Mokkink LB, Knol DL. Measurement in medicine: a practical guide. 2011;  https://doi.org/10.1017/CBO9780511996214

Dekkers OM, Bossuyt PM, Vandenbroucke JP. How trial results are intended to be used: is PRECIS-2 a step forward? J Clin Epidemiol. 2017;84:25–6.

Brozek JL, Canelo-Aybar C, Akl EA, et al. GRADE Guidelines 30: the GRADE approach to assessing the certainty of modeled evidence-An overview in the context of health decision-making. J Clin Epidemiol. 2021;129:138–50.

Burchett HED, Kneale D, Blanchard L, Thomas J. When assessing generalisability, focusing on differences in population or setting alone is insufficient. Trials. 2020;21:286.

Verhagen AP, de Vet HCW, de Bie RA, Kessels AGH, Boers M, Bouter LM, Knipschild PG. The Delphi List: A Criteria List for Quality Assessment of Randomized Clinical Trials for Conducting Systematic Reviews Developed by Delphi Consensus. J Clin Epidemiol. 1998;51:1235–41.

Streiner DL, Norman GR, Cairney J. Health measurement scales: a practical guide to their development and use, Fifth edit. Oxford: Oxford University Press; 2015.

DeVellis RF. Scale development: Theory and applications, Fourth edi. Los Angeles: Sage publications; 2017.

Download references

Acknowledgements

We would like to thank Sven Bossmann and Sarah Tiemann for their assistance with the elaboration of the search strategy.

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and affiliations

Institute of Health Sciences, Department of Physiotherapy, Pain and Exercise Research Luebeck (P.E.R.L), Universität zu Lübeck, Ratzeburger Allee 160, 23562, Lübeck, Germany

Andres Jung & Kerstin Luedtke

Faculty of Applied Public Health, European University of Applied Sciences, Werftstr. 5, 18057, Rostock, Germany

Julia Balzer

Division of Physiotherapy, Department of Applied Health Sciences, Hochschule für Gesundheit (University of Applied Sciences), Gesundheitscampus 6‑8, 44801, Bochum, Germany

Tobias Braun

Department of Health, HSD Hochschule Döpfer (University of Applied Sciences), Waidmarkt 9, 50676, Cologne, Germany


Contributions

All authors contributed to the design of the study. AJ designed the search strategy and conducted the systematic search. AJ and TB screened titles and abstracts as well as full-text reports in phase 1; AJ and KL screened titles and abstracts as well as full-text reports in phase 2. Data extraction was performed by AJ and checked by TB. Quality appraisal and data analysis were performed by AJ and JB. AJ drafted the manuscript. JB, TB and KL critically revised the manuscript for important intellectual content. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Andres Jung.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional files 1, 2, 3, and 4.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Jung, A., Balzer, J., Braun, T. et al. Identification of tools used to assess the external validity of randomized controlled trials in reviews: a systematic review of measurement properties. BMC Med Res Methodol 22, 100 (2022). https://doi.org/10.1186/s12874-022-01561-5


Received: 20 August 2021

Accepted: 28 February 2022

Published: 06 April 2022

DOI: https://doi.org/10.1186/s12874-022-01561-5


  • External validity
  • Generalizability
  • Applicability
  • Randomized controlled trial


J Bras Pneumol. 2018;44(3), May-Jun 2018.

Internal and external validity: can you apply research study results to your patients?

Cecilia Maria Patino

1 . Methods in Epidemiologic, Clinical, and Operations Research-MECOR-program, American Thoracic Society/Asociación Latinoamericana del Tórax, Montevideo, Uruguay.

2 . Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA.

Juliana Carvalho Ferreira

3 . Divisão de Pneumologia, Instituto do Coração, Hospital das Clínicas, Faculdade de Medicina, Universidade de São Paulo, São Paulo (SP) Brasil.

CLINICAL SCENARIO

In a multicenter study in France, investigators conducted a randomized controlled trial to test the effect of prone vs. supine positioning ventilation on mortality among patients with early, severe ARDS. They showed that prolonged prone-positioning ventilation decreased 28-day mortality [hazard ratio (HR) = 0.39; 95% CI: 0.25-0.63]. 1
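
To make the arithmetic behind such a result concrete, here is a minimal sketch of how a hazard ratio and its 95% CI are estimated from survival data. It uses the Python lifelines package and a tiny invented dataset; the numbers, column names, and package choice are illustrative assumptions and are not taken from the trial itself.

```python
import pandas as pd
from lifelines import CoxPHFitter  # assumes the lifelines package is installed

# Hypothetical toy data: follow-up time in days, death within 28 days (1 = died),
# and ventilation position (1 = prone, 0 = supine).
df = pd.DataFrame({
    "days":  [12, 20, 28, 28, 28, 5, 9, 14, 28, 28],
    "death": [1,  1,  0,  0,  0,  1, 1, 1,  0,  0],
    "prone": [1,  1,  1,  1,  1,  0, 0, 0,  0,  0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="days", event_col="death")
cph.print_summary()  # exp(coef) for "prone" is the hazard ratio, shown with its 95% CI
```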

STUDY VALIDITY

The validity of a research study refers to how well the results among the study participants represent true findings among similar individuals outside the study. This concept of validity applies to all types of clinical studies, including those about prevalence, associations, interventions, and diagnosis. The validity of a research study includes two domains: internal and external validity.

Internal validity is defined as the extent to which the observed results represent the truth in the population we are studying and, thus, are not due to methodological errors. In our example, if the authors can support that the study has internal validity, they can conclude that prone positioning reduces mortality among patients with severe ARDS. The internal validity of a study can be threatened by many factors, including errors in measurement or in the selection of participants in the study, and researchers should think about and avoid these errors.

Once the internal validity of the study is established, the researcher can proceed to make a judgment regarding its external validity by asking whether the study results apply to similar patients in a different setting (Figure 1). In the example, we would want to evaluate whether the results of the clinical trial apply to ARDS patients in other ICUs. If the patients have early, severe ARDS, probably yes, but the study results may not apply to patients with mild ARDS. External validity refers to the extent to which the results of a study are generalizable to patients in our daily practice, especially for the population that the sample is thought to represent.

[Figure 1]

Lack of internal validity implies that the results of the study deviate from the truth, and, therefore, we cannot draw any conclusions; hence, if the results of a trial are not internally valid, external validity is irrelevant. 2 Lack of external validity implies that the results of the trial may not apply to patients who differ from the study population and, consequently, could lead to low adoption of the treatment tested in the trial by other clinicians.

INCREASING VALIDITY OF RESEARCH STUDIES

To increase internal validity, investigators should ensure careful study planning and adequate quality control and implementation strategies, including adequate recruitment strategies, data collection, data analysis, and sample size. External validity can be increased by using broad inclusion criteria that result in a study population that more closely resembles real-life patients, and, in the case of clinical trials, by choosing interventions that are feasible to apply. 2


Neag School of Education

Educational Research Basics by Del Siegle

External Validity

Note to EPSY 5601 Students: An understanding of the difference between population and ecological validity is sufficient. Mastery of the subcategories for each is not necessary for this course.

External Validity (Generalizability): to whom can the results of the study be applied?

There are two types of study validity: internal (more applicable with experimental research) and external. This section covers external validity.

External validity involves the extent to which the results of a study can be generalized (applied) beyond the sample. In other words, can you apply what you found in your study to other people (population validity) or settings (ecological validity)? A study of fifth graders in a rural school that found one method of teaching spelling was superior to another may not be applicable to third graders (population) in an urban school (ecological).

Threats to External Validity

Population Validity: the extent to which the results of a study can be generalized from the specific sample that was studied to a larger group of subjects.

1. The extent to which one can generalize from the study sample to a defined population. If the sample is drawn from an accessible population, rather than the target population, generalizing the research results from the accessible population to the target population is risky.
2. The extent to which personological variables interact with treatment effects. If the study is an experiment, it may be possible that different results might be found with students at different grades (a personological variable).

Ecological Validity: the extent to which the results of an experiment can be generalized from the set of environmental conditions created by the researcher to other environmental conditions (settings and conditions).

  • Explicit description of the experimental treatment (not sufficiently described for others to replicate) If the researcher fails to adequately describe how he or she conducted a study, it is difficult to determine whether the results are applicable to other settings.
  • Multiple-treatment interference (catalyst effect) If a researcher were to apply several treatments, it is difficult to determine how well each of the treatments would work individually. It might be that only the combination of the treatments is effective.
  • Hawthorne effect (attention causes differences) Subjects perform differently because they know they are being studied. “…External validity of the experiment is jeopardized because the findings might not generalize to a situation in which researchers or others who were involved in the research are not present” (Gall, Borg, & Gall, 1996, p. 475)
  • Novelty and disruption effect (anything different makes a difference) A treatment may work because it is novel and the subjects respond to the uniqueness, rather than the actual treatment. The opposite may also occur: the treatment may not work because it is unique, but given time for the subjects to adjust to it, it might have worked.
  • Experimenter effect (it only works with this experimenter) The treatment might have worked because of the person implementing it. Given a different person, the treatment might not work at all.
  • Pretest sensitization (pretest sets the stage) A treatment might only work if a pretest is given. Because they have taken a pretest, the subjects may be more sensitive to the treatment. Had they not taken a pretest, the treatment would not have worked.
  • Posttest sensitization (posttest helps treatment “fall into place”) The posttest can become a learning experience. “For example, the posttest might cause certain ideas presented during the treatment to ‘fall into place’ ” (p. 477). If the subjects had not taken a posttest, the treatment would not have worked.
  • Interaction of history and treatment effect (…to everything there is a time…) Not only should researchers be cautious about generalizing to other populations, caution should be taken in generalizing to a different time period. As time passes, the conditions under which treatments work change.
  • Measurement of the dependent variable (maybe only works with M/C tests) A treatment may only be evident with certain types of measurements. A teaching method may produce superior results when its effectiveness is tested with an essay test, but show no differences when the effectiveness is measured with a multiple choice test.
  • Interaction of time of measurement and treatment effect (it takes a while for the treatment to kick in) It may be that the treatment effect does not occur until several weeks after the end of the treatment. In this situation, a posttest at the end of the treatment would show no impact, but a posttest a month later might show an impact.

Bracht, G. H., & Glass, G. V. (1968). The external validity of experiments. American Educational Research Journal, 5, 437-474.

Gall, M. D., Borg, W. R., & Gall, J. P. (1996). Educational research: An introduction. White Plains, NY: Longman.

Del Siegle, Ph.D. Neag School of Education – University of Connecticut [email protected] www.delsiegle.com


Internal and external validity: what are they and how do they differ?

Posted on 4th May 2023 by Gabriela Negrete-Tobar

""

Is this study valid? This is a question that should be asked when looking for reliable evidence. To answer it correctly, you must understand how to find the elements of validity in medical research. Additionally, these elements change according to the type of study that is being analysed.

Let’s use an example to bring this concept to life

Your aunt has mild bilateral knee osteoarthritis (OA). She complains of pain and can’t run due to it. According to her doctor, her main treatment plan consists of physical therapy and pain management. Aquatic cycling was offered to her at the gym as a method of pain control for patients with OA so she wants to know if she should sign up for it.

You start your search and find a single-blinded, parallel-group randomized controlled trial. It was independently funded and set at a University Medical Centre in a European country. It aims to compare a supervised 12-week Aquatic Cycling (AC) plan of 45-minute sessions twice a week vs a Usual Care (UC) plan consisting of individualized physical therapy and mobility aids.

Now let’s look at the trial’s design in more detail

The study's participants consisted of patients diagnosed with knee OA with mild-moderate pain and an indication for conservative treatment. Participants were excluded if they had received corticosteroid or hyaluronic acid injections 3-6 months prior or had pain or OA in any other joints. Researchers used randomizing software to allocate participants into either Aquatic Cycling (AC) or Usual Care (UC). Originally, 111 patients were randomized: 55 were assigned to AC and 56 to UC. However, 9 patients in the UC group dropped out after baseline as they didn't want to undergo physical therapy. It is important to note that the researchers who conducted allocation and professional assessments were blinded. In contrast, participants were not blinded and received information on their treatment plan before starting the trial.
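
Allocation of this kind is easy to picture in code. The sketch below is a hypothetical illustration of simple randomisation into the two arms; it is not the software the researchers actually used, and the participant IDs are invented.

```python
import random

random.seed(42)  # fixed seed so the allocation sequence can be reproduced

participant_ids = list(range(1, 112))   # 111 hypothetical participants
arms = ["AC"] * 55 + ["UC"] * 56        # target arm sizes reported in the trial description
random.shuffle(arms)                    # shuffle the arm labels before pairing them with IDs

allocation = dict(zip(participant_ids, arms))
print(sum(arm == "AC" for arm in allocation.values()), "allocated to Aquatic Cycling")
print(sum(arm == "UC" for arm in allocation.values()), "allocated to Usual Care")
```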

During the trial the follow-up time was 24 weeks. Primary outcomes of the amount of pain and knee mobility were measured using the self-reported Knee Injury and Osteoarthritis Outcome Score (KOOS) questionnaire. Equally important, physical therapists applied the lower extremity functional scale, evaluated pain after a 6-minute walk, and asked about fear of injury/re-injury, as secondary outcomes. Significantly, analysis of all the outcomes was performed using the intention-to-treat principle.
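
To see what the intention-to-treat principle means in practice, here is a minimal, invented sketch: every participant is analysed in the arm they were randomised to, even if they dropped out or never received the treatment. All numbers and column names below are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical outcomes: assigned arm, arm actually received, and change in KOOS pain score.
df = pd.DataFrame({
    "assigned":    ["AC", "AC", "AC", "UC", "UC", "UC"],
    "received":    ["AC", "AC", "none", "UC", "none", "UC"],  # "none" = dropped out
    "koos_change": [12, 8, 2, 5, 1, 6],
})

# Intention-to-treat: group by the arm participants were assigned to.
itt = df.groupby("assigned")["koos_change"].mean()

# Per-protocol (shown only for contrast): keep only those who received their assigned arm.
pp = df[df["assigned"] == df["received"]].groupby("assigned")["koos_change"].mean()

print("ITT group means:\n", itt)
print("Per-protocol group means:\n", pp)
```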

What were the authors’ conclusions?

They reported information from 90 participants. These participants had a mean age of 59 years, were predominantly overweight women, and in general both groups had similar pre-trial symptom severity. They reported that in the AC group 80% attended all 24 sessions. During the trial's follow-up period, 51% of participants in the AC group were lost due to comorbidities and other unknown reasons. Additionally, 18% presented with exacerbations of pain after the first AC session as an adverse effect. Despite these reports, it was not clear in the analysis whether these adverse effects altered the reported outcomes.

In comparison, the UC group participants were free to consult a physical therapist according to their symptom severity, but only 32% did. Compared to the AC group, the UC participants didn't report adverse effects. Lastly, as promised by the researchers, after the follow-up period ended the UC participants received complementary AC therapy.

Finally, they concluded that a 12-week training program of AC improved knee pain and physical functioning in patients with osteoarthritis.

What is Validity?

Validity looks at how accurate the measurements made in a study were. It assesses whether the information reported reflects the effects of an intervention on the sample group, and it evaluates whether it is useful for people with similar characteristics. It is divided into internal and external validity.

So, is this trial valid? To answer, you can follow the mnemonic: BOAS bite CALF PAIRS

""

Internal validity

Internal validity shows if the study truthfully reports the intervention’s effect on the selected group of participants. It is influenced by the methods used to reduce the risk of bias and chance.

""

External validity

External validity looks at a study’s clinical usefulness and applicability on a group of patients. In other words, can we generalise the findings from this study to other contexts?

""

Hopefully, boas don't bite calf pairs; however, this mnemonic can assist you in remembering the elements of validity. Do you think that this study is valid? Also, is Aquatic Cycling beneficial for your aunt? You can add your thoughts in the comments below!

Acknowledgments to Dr. Claudia Granados-Rugeles for her contributions on this tutorial.

References (pdf)

You may also be interested in the following blogs to help you understand some of the concepts mentioned in the blog:

A beginner’s guide to confounding

Blinding: taking a better look at the blind side

Allocation concealment: the key to effective randomisation

Sample size: a practical introduction

Many of the blogs from our ‘ Key Concepts ‘ series will be useful to you as well. These help explain concepts that people may need to understand to assess treatment claims.





Internal Validity vs. External Validity in Research

Both help determine how meaningful the results of the study are

Arlin Cuncic, MA, is the author of "Therapy in Focus: What to Expect from CBT for Social Anxiety Disorder" and "7 Weeks to Reduce Anxiety." She has a Master's degree in psychology.


Rachel Goldman, PhD FTOS, is a licensed psychologist, clinical assistant professor, speaker, wellness expert specializing in eating behaviors, stress management, and health behavior change.


Internal validity is a measure of how well a study is conducted (its structure) and how accurately its results reflect the studied group.

External validity relates to how applicable the findings are in the real world. These two concepts help researchers gauge if the results of a research study are trustworthy and meaningful.

Internal validity

  • Conclusions are warranted
  • Controls extraneous variables
  • Eliminates alternative explanations
  • Focus on accuracy and strong research methods

External validity

  • Findings can be generalized
  • Outcomes apply to practical situations
  • Results apply to the world at large
  • Results can be translated into another context

What Is Internal Validity in Research?

Internal validity is the extent to which a research study establishes a trustworthy cause-and-effect relationship. This type of validity depends largely on the study's procedures and how rigorously it is performed.

Internal validity is important because once established, it makes it possible to eliminate alternative explanations for a finding. If you implement a smoking cessation program, for instance, internal validity ensures that any improvement in the subjects is due to the treatment administered and not something else.

Internal validity is not a "yes or no" concept. Instead, we consider how confident we can be with study findings based on whether the research avoids traps that may make those findings questionable. The less chance there is for "confounding," the higher the internal validity and the more confident we can be.

Confounding refers to uncontrollable variables that come into play and can confuse the outcome of a study, making us unsure of whether we can trust that we have identified the cause-and-effect relationship.
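
A short simulation can make confounding concrete. In the invented example below, the "treatment" has no effect at all; a third variable (motivation) drives both who gets treated and the outcome, so the naive comparison looks like an effect, while random assignment makes the spurious difference disappear. All variable names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

motivation = rng.normal(size=n)             # the confounder
outcome = motivation + rng.normal(size=n)   # outcome depends on motivation, not on treatment

# Observational setting: motivated people are more likely to choose the treatment.
self_selected = (motivation + rng.normal(size=n)) > 0
naive_diff = outcome[self_selected].mean() - outcome[~self_selected].mean()

# Randomized setting: treatment is assigned by coin flip, independent of motivation.
randomized = rng.random(n) < 0.5
random_diff = outcome[randomized].mean() - outcome[~randomized].mean()

print(f"Apparent 'effect' with self-selection:    {naive_diff:.2f}")   # clearly non-zero
print(f"Apparent 'effect' with random assignment: {random_diff:.2f}")  # close to zero
```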

In short, you can only be confident that a study is internally valid if you can rule out alternative explanations for the findings. Three criteria are required to assume cause and effect in a research study:

  • The cause preceded the effect in terms of time.
  • The cause and effect vary together.
  • There are no other likely explanations for the relationship observed.

Factors That Improve Internal Validity

To ensure the internal validity of a study, you want to consider aspects of the research design that will increase the likelihood that you can reject alternative hypotheses. Many factors can improve internal validity in research, including:

  • Blinding : Participants—and sometimes researchers—are unaware of what intervention they are receiving (such as using a placebo on some subjects in a medication study) to avoid having this knowledge bias their perceptions and behaviors, thus impacting the study's outcome
  • Experimental manipulation : Manipulating an independent variable in a study (for instance, giving smokers a cessation program) instead of just observing an association without conducting any intervention (examining the relationship between exercise and smoking behavior)
  • Random selection : Choosing participants at random or in a manner in which they are representative of the population that you wish to study
  • Randomization or random assignment : Randomly assigning participants to treatment and control groups, ensuring that there is no systematic bias between the research groups
  • Strict study protocol : Following specific procedures during the study so as not to introduce any unintended effects; for example, doing things differently with one group of study participants than you do with another group

Internal Validity Threats

Just as there are many ways to ensure internal validity, there is also a list of potential threats that should be considered when planning a study.

  • Attrition : Participants dropping out or leaving a study, which means that the results are based on a biased sample of only the people who did not choose to leave (and possibly who all have something in common, such as higher motivation)
  • Confounding : A situation in which changes in an outcome variable can be thought to have resulted from some type of outside variable not measured or manipulated in the study
  • Diffusion : This refers to the results of one group transferring to another through the groups interacting and talking with or observing one another; this can also lead to another issue called resentful demoralization, in which a control group tries less hard because they feel resentful over the group that they are in
  • Experimenter bias : An experimenter behaving in a different way with different groups in a study, which can impact the results (and is eliminated through blinding)
  • Historical events : May influence the outcome of studies that occur over a period of time, such as a change in the political leader or a natural disaster that occurs, influencing how study participants feel and act
  • Instrumentation : This involves "priming" participants in a study in certain ways with the measures used, causing them to react in a way that is different than they would have otherwise reacted
  • Maturation : The impact of time as a variable in a study; for example, if a study takes place over a period of time in which it is possible that participants naturally change in some way (i.e., they grew older or became tired), it may be impossible to rule out whether effects seen in the study were simply due to the impact of time
  • Statistical regression : The tendency of participants who score at the extremes of a measure at baseline to score closer to the group mean when measured again, independent of any intervention; this regression to the mean can masquerade as a treatment effect (see the sketch immediately after this list)
  • Testing : Repeatedly testing participants using the same measures influences outcomes; for example, if you give someone the same test three times, it is likely that they will do better as they learn the test or become used to the testing process, causing them to answer differently
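
Statistical regression (regression to the mean) is easy to demonstrate with an invented simulation: if you select the people with the most extreme baseline scores, their follow-up scores drift back toward the group average even though nothing was done to them. The numbers below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

true_level = rng.normal(50, 10, n)             # each person's stable underlying level
baseline = true_level + rng.normal(0, 10, n)   # baseline score = level + measurement noise
followup = true_level + rng.normal(0, 10, n)   # follow-up score with fresh noise, no treatment

selected = baseline > np.percentile(baseline, 90)  # "enrol" the top 10% at baseline
print("Baseline mean of selected group: ", round(baseline[selected].mean(), 1))
print("Follow-up mean of selected group:", round(followup[selected].mean(), 1))  # closer to 50
```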

What Is External Validity in Research?

External validity refers to how well the outcome of a research study can be expected to apply to other settings. This is important because, if external validity is established, it means that the findings can be generalizable to similar individuals or populations.

External validity affirmatively answers the question: Do the findings apply to similar people, settings, situations, and time periods?

Population validity and ecological validity are two types of external validity. Population validity refers to whether you can generalize the research outcomes to other populations or groups. Ecological validity refers to whether a study's findings can be generalized to additional situations or settings.

Another term, transferability, refers to whether results transfer to situations with similar characteristics. Transferability relates to external validity and applies to qualitative research designs.

Factors That Improve External Validity

If you want to improve the external validity of your study, there are many ways to achieve this goal. Factors that can enhance external validity include:

  • Field experiments : Conducting a study outside the laboratory, in a natural setting
  • Inclusion and exclusion criteria : Setting criteria as to who can be involved in the research, ensuring that the population being studied is clearly defined
  • Psychological realism : Making sure participants experience the events of the study as being real by telling them a "cover story," or a different story about the aim of the study so they don't behave differently than they would in real life based on knowing what to expect or knowing the study's goal
  • Replication : Conducting the study again with different samples or in different settings to see if you get the same results; when many studies have been conducted on the same topic, a meta-analysis can also be used to determine if the effect of an independent variable can be replicated, therefore making it more reliable
  • Reprocessing or calibration : Using statistical methods to adjust for external validity issues, such as reweighting groups if a study had uneven groups for a particular characteristic (such as age)
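
As a minimal sketch of the reweighting idea in the last point, the invented example below adjusts an age-skewed sample so that its estimate reflects a population with a known age distribution. All group shares and outcome values are hypothetical.

```python
import pandas as pd

# Hypothetical sample: 80% young and 20% old participants, with different mean outcomes.
sample = pd.DataFrame({
    "age_group": ["young"] * 80 + ["old"] * 20,
    "outcome":   [1.0] * 80 + [3.0] * 20,
})

population_share = {"young": 0.5, "old": 0.5}   # assumed population age distribution
sample_share = sample["age_group"].value_counts(normalize=True)

# Post-stratification weight = population share / sample share for each person's group.
weights = sample["age_group"].map(lambda g: population_share[g] / sample_share[g])

unweighted = sample["outcome"].mean()                           # 1.4, reflects the skewed sample
weighted = (sample["outcome"] * weights).sum() / weights.sum()  # 2.0, matches the target population
print(unweighted, weighted)
```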

External Validity Threats

External validity is threatened when a study does not take into account the interaction of variables in the real world. Threats to external validity include:

  • Pre- and post-test effects : When the pre- or post-test is in some way related to the effect seen in the study, such that the cause-and-effect relationship disappears without these added tests
  • Sample features : When some feature of the sample used was responsible for the effect (or partially responsible), leading to limited generalizability of the findings
  • Selection bias : Also considered a threat to internal validity, selection bias describes differences between groups in a study that may relate to the independent variable—like motivation or willingness to take part in the study, or specific demographics of individuals being more likely to take part in an online survey
  • Situational factors : Factors such as the time of day of the study, its location, noise, researcher characteristics, and the number of measures used may affect the generalizability of findings

While rigorous research methods can ensure internal validity, external validity may be limited by these methods.

Internal Validity vs. External Validity

Internal validity and external validity are two research concepts that share a few similarities while also having several differences.

Similarities

One of the similarities between internal validity and external validity is that both factors should be considered when designing a study. This is because both have implications in terms of whether the results of a study have meaning.

Both internal validity and external validity are not "either/or" concepts. Therefore, you always need to decide to what degree a study performs in terms of each type of validity.

Each of these concepts is also typically reported in research articles published in scholarly journals . This is so that other researchers can evaluate the study and make decisions about whether the results are useful and valid.

Differences

The essential difference between internal validity and external validity is that internal validity refers to the structure of a study (and its variables) while external validity refers to the universality of the results. But there are further differences between the two as well.

For instance, internal validity focuses on showing that an observed difference is due to the independent variable alone. Conversely, external validity concerns whether the results can be translated to the world at large.

Internal validity and external validity aren't mutually exclusive. You can have a study with good internal validity that is nonetheless irrelevant to the real world. You could also conduct a field study that is highly relevant to the real world but doesn't produce trustworthy results in terms of knowing which variables caused the outcomes.

Examples of Validity

Perhaps the best way to understand internal validity and external validity is with examples.

Internal Validity Example

An example of a study with good internal validity would be if a researcher hypothesizes that using a particular mindfulness app will reduce negative mood. To test this hypothesis, the researcher randomly assigns a sample of participants to one of two groups: those who will use the app over a defined period and those who engage in a control task.

The researcher ensures that there is no systematic bias in how participants are assigned to the groups. They do this by blinding the research assistants so they don't know which groups the subjects are in during the experiment.

A strict study protocol is also used to outline the procedures of the study. Potential confounding variables are measured along with mood , such as the participants' socioeconomic status, gender, age, and other factors. If participants drop out of the study, their characteristics are examined to make sure there is no systematic bias in terms of who stays in.

External Validity Example

An example of a study with good external validity would be if, in the above example, the participants used the mindfulness app at home rather than in the laboratory. This shows that results appear in a real-world setting.

To further ensure external validity, the researcher clearly defines the population of interest and chooses a representative sample . They might also replicate the study's results using different technological devices.

A Word From Verywell

Setting up an experiment so that it has both sound internal validity and external validity involves being mindful from the start about factors that can influence each aspect of your research.

It's best to spend extra time designing a structurally sound study that has far-reaching implications rather than to quickly rush through the design phase only to discover problems later on. Only when both internal validity and external validity are high can strong conclusions be made about your results.




Validity and Validation


3 External Threats to Validity

  • Published: October 2013

This chapter discusses strategies researchers use to address threats to the external validity (i.e. generalizability) of research results. It describes threats to external validity in terms of populations, times, and situations. Within these factors, it considers specific threats to external validity, such as: (1) interactions among different treatments or conditions; (2) interactions between treatments and methods of data collection; (3) interactions of treatments with selection; (4) interactions of situation with treatment; and (5) interactions of history with treatment. Each of these factors can limit the generalizability of research results and, therefore, the validity of claims based on those results.


Scientific Research and Methodology : An introduction to quantitative research and statistics

5 External validity: Sampling

You have learnt to ask an RQ and identify a study design. In this chapter, you will learn to:

  • distinguish, and explain, precision and accuracy.
  • distinguish between random and non-random sampling.
  • explain why random samples are preferred over non-random samples.
  • identify, describe and use different sampling methods.
  • identify ways to obtain samples more likely to be representative.


5.1 The idea of sampling

An RQ implies that every member of the population should be studied (the P in POCI means 'population'). However, doing so is very rare due to cost, time, ethics, logistics and/or practicality. Usually a subset of the population (a sample) is studied, comprising some individuals from the population. Many different samples are possible. A study is externally valid if the results from the sample can be generalised to the population, which is only possible if the sample faithfully represents the population.

The challenge of research is learning about a population from studying just one of the countless possible samples.


Example 5.1 (Samples) A study of the effectiveness of aspirin in treating headaches cannot possibly study every single human who may one day take aspirin. Not only would this be prohibitively expensive, time-consuming, and impractical, but the study would not even study humans yet to be born who might use aspirin.

Using the whole target population is impossible, and a sample must be used.

Studying one of many possible samples raises other questions:

  • Which individuals should be included in the sample?
  • How many individuals should be included in the sample?

The first issue is studied in this chapter. The second issue is studied later (Chap.  30 ), after learning about the implications of studying samples rather than populations.

Many samples are possible, and every sample is likely to be different. Hence, the results of studying a sample depend on which individuals are in the studied sample. These differences are called sampling variation. That is, each sample has different individuals, produces different data, and may lead to different answers to the RQ.

Example 5.2 (Number of samples possible) In a 'population' as small as \(100\), the number of possible samples of size \(25\) is about \(2.4\times 10^{23}\): vastly more than the number of people on earth.
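The scale of this number is easy to check. The short Python sketch below (an illustration added here, not part of the original text) computes the binomial coefficient \(\binom{100}{25}\) and compares it with an assumed world population of roughly eight billion; the exact population figure does not matter for the point being made.

```python
from math import comb

population_size = 100
sample_size = 25

# Number of distinct samples of size 25 from a population of 100:
# the binomial coefficient C(100, 25).
n_possible_samples = comb(population_size, sample_size)

world_population = 8_000_000_000   # rough estimate, used only for scale

print(f"Possible samples: {n_possible_samples:.2e}")                        # about 2.4e+23
print(f"Times the world population: {n_possible_samples / world_population:.1e}")
```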

This is the challenge of research: how to make decisions about populations, using just one of the many possible samples. Perhaps surprisingly, lots can be learnt about the population if we approach the task of selecting a sample correctly.

Almost always, samples are studied, not populations. Many samples are possible, and every sample is likely to be different, and hence the results from every sample are likely to be different. This is called sampling variation.

While we can never be certain about the conclusions from the sample, special tools allow us to make decisions about the population from a sample.

As an illustration of sampling variation, consider dealing hands of ten cards from a fair, shuffled pack. We know that \(50\)% of the cards in the pack are red, but each hand of ten cards can produce a different percentage of red cards (and not always \(50\)%).
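This card illustration is easy to simulate. The sketch below (illustrative only) deals repeated hands of ten cards from a standard 52-card pack, half red and half black, and reports the percentage of red cards in each hand; the estimates vary from hand to hand even though the population value is fixed at \(50\)%.

```python
import random

def percent_red(hand_size=10, seed=None):
    """Deal one hand from a shuffled 52-card pack (26 red, 26 black)
    and return the percentage of red cards in the hand."""
    rng = random.Random(seed)
    pack = ["red"] * 26 + ["black"] * 26
    hand = rng.sample(pack, hand_size)
    return 100 * hand.count("red") / hand_size

# Five hands (five samples) give five different estimates of the
# population value of 50% red: this is sampling variation.
for i in range(5):
    print(f"Hand {i + 1}: {percent_red(seed=i):.0f}% red")
```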

5.2 Precision and accuracy

Two questions concerning sampling, raised in Sect. 5.1, were: which individuals should be in the sample, and how many individuals should be in the sample. The first question addresses the accuracy of a sample value to estimate a population value. The second addresses the precision with which the population value is estimated using a sample. An estimate that is not accurate is called biased (Sect. 5.10; Def. 6.2).

Definition 5.1 (Accuracy) Accuracy refers to how close a sample estimate is likely to be to the population value, on average.

Definition 5.2 (Precision) Precision refers to how similar the sample estimates from different samples are likely to be to each other (that is, how much variation is likely in the sample estimates).

Using this language:

  • The sampling method (i.e., how the sample is selected) impacts the accuracy of the sample estimate (i.e., the external validity of the study).
  • The size of the sample impacts the precision of the sample estimate.

Large samples are more likely to produce precise estimates, but they may or may not be accurate estimates. Similarly, random samples are likely to produce accurate estimates, but they may or may not be precise. As an analogy, consider an archer aiming at a target. The shots can be accurate, or precise... or ideally both (Fig. 5.1).

FIGURE 5.1: Precision and accuracy: Each dot indicates where a shot lands, and is like a sample estimate of the population value (shown by the black central dot)
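The archer analogy can also be mimicked with a small simulation. In the sketch below (all population values are invented for illustration), estimates from larger random samples are more tightly clustered (more precise), while estimates from a sample restricted to an unrepresentative subgroup are shifted away from the true value however large the sample is (less accurate).

```python
import random
import statistics

rng = random.Random(1)

# An invented population: 10,000 ages with a mean of about 40 years.
population = [rng.gauss(40, 15) for _ in range(10_000)]
true_mean = statistics.mean(population)

# An unrepresentative subgroup: only the youngest quarter of the population.
young_only = sorted(population)[: len(population) // 4]

def repeated_sample_means(source, n, repeats=1000):
    """Mean of n randomly chosen values from source, repeated many times."""
    return [statistics.mean(rng.sample(source, n)) for _ in range(repeats)]

scenarios = [("random, n = 10", population, 10),     # accurate, imprecise
             ("random, n = 200", population, 200),   # accurate, precise
             ("biased, n = 200", young_only, 200)]   # precise, inaccurate

for label, source, n in scenarios:
    means = repeated_sample_means(source, n)
    print(f"{label:>15}: centre {statistics.mean(means):5.1f} "
          f"(true mean {true_mean:5.1f}), spread {statistics.stdev(means):4.2f}")
```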


Example 5.3 (Precision and accuracy) To estimate the average age of all Canadians, \(9000\) Canadian school children could be sampled.

The answer may be precise (as the sample is large), but will be inaccurate because the sample is not representative of all Canadians. The sample gives a precise answer to a different question: 'What is the average age of Canadian school children?'

5.3 Types of sampling

One key to obtaining accurate estimates about the population from the sample (maximising external validity) is to ensure that the sample faithfully represents the population. So, how is a representative sample selected from the population?

The individuals selected for the sample can be chosen using either random sampling or non-random sampling. The word random here has a specific meaning, different from its everyday use.

Definition 5.3 (Random) In research and statistics, random means determined completely by impersonal chance.

5.3.1 Random sampling methods

In a random sample, each individual in the population can be selected, and is chosen on the basis of impersonal chance (such as using a random number generator, or a table of random numbers). Some examples of random sampling methods appear in Sects. 5.4 to 5.8, and are summarised in Table 5.1.

The results obtained from a random sample are likely to generalise to the population from which the sample is drawn; that is, random samples are likely to produce externally valid and accurate studies.

Testing a pot of soup is similar. If the soup is stirred (randomised), the whole pot need not be tasted to form an overall impression. An overall impression of the population (or the soup) is not obtained from a non-random sample (tasting unstirred soup).

5.3.2 Non-random sampling methods

A non-random sample is selected using some personal input from the researchers. Examples of non-random samples include:

  • Judgement sample : Individuals are selected based on the researchers' judgement (possibly unconsciously), perhaps because the individuals may appear agreeable, supportive, accessible, or helpful. For example, researchers may select rats that are less aggressive, or plants that are accessible.
  • Convenience sample : Individuals are selected because they are convenient for the researcher. For example, researchers may study beaches that are nearby, or use their friends for a study.
  • Voluntary response (self-selecting) sample: Individuals participate if they wish to. For example, researchers may ask people to volunteer to take a survey.

In non-random sampling, the individuals in the study may be different from those not in the study. That is, non-random samples are not likely to be externally valid.

Using a non-random sample means that the results probably do not generalise to the intended population: they probably do not produce externally valid or accurate studies.


5.4 Simple random sampling


The most straightforward idea for a random sample is a simple random sample.

Definition 5.4 (Simple random sample) In a simple random sample, every possible sample of the same size has the same chance of being selected.

Selecting a simple random sample requires a list of all members of the population, called the sampling frame, from which to select a sample. Often, obtaining the sampling frame is difficult or impossible, and so finding a simple random sample is also difficult. For example, finding a simple random sample of wombats would require having a list and location of all wombats, so some could be selected. This is absurd; other random sampling methods, like special ecological sampling methods, would be used instead (Manly and Alberto 2014).

Definition 5.5 (Sampling frame) The sampling frame is a list of all the individuals in the population.

Selecting a simple random sample from the sampling frame can be performed using random numbers (e.g., from random number tables, or websites like https://www.random.org ). Other random sampling methods use a system to select at random, rather than by human choice, and some avoid the need for a sampling frame.

This book assumes simple random samples, unless otherwise noted.


Example 5.4 (Simple random sampling) Suppose we are interested in this RQ:

For students at a large course at a university, is the average typing speed (in words per minute) the same for those aged under \(25\) ('younger') and \(25\) or over ('older')?

Suppose budget and time constraints mean only \(40\) students (out of \(441\)) can be selected for the study above. The sampling frame is the list of all students enrolled in the course. Obtaining the sampling frame is feasible here; instructors have access to this information for grading.

One way to select a simple random sample using the course enrolment list is to place all \(441\) student names into a spreadsheet (ordered by name, student ID, or in any other way). Then, use random numbers (without repeats) between \(1\) and \(441\) inclusive to select \(40\) students at random. For instance, when I used random.org, the first few random numbers were: 410, 215, 384, 158, 296, ...

Every student chosen using this method becomes part of the study. If a student could not be contacted, more students could be chosen at random to ensure \(40\) students participated. The sample comprises \(25\) older students and \(15\) younger students.
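In practice, drawing a simple random sample like this takes only a few lines of code. The sketch below is illustrative: the positions \(1\) to \(441\) stand in for the real enrolment list (the sampling frame), and the seed is fixed only so the output is reproducible.

```python
import random

rng = random.Random(2024)                 # fixed seed so the output is reproducible

sampling_frame = list(range(1, 442))      # positions 1..441 on the enrolment list
sample = rng.sample(sampling_frame, 40)   # 40 distinct positions, chosen at random

print(sorted(sample))
# Each position identifies one student on the list; if a chosen student cannot
# be contacted, further positions can be drawn at random in the same way.
```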

5.5 Systematic sampling


In systematic sampling, the first case is randomly selected; then, more individuals are selected at regular intervals thereafter. In general, every \(n\)th individual is selected after the initial random selection.

Example 5.5 (Systematic sampling) For the study in Example 5.4, a sample of \(40\) students in a course of \(441\) is needed. To find a systematic random sample, select a random number between \(1\) and \(441/40\) (approximately \(11\)) as a starting point; suppose the random number selected is \(9\).

The first student selected is the \(9\)th person in the student list (which may be ordered alphabetically, by student ID, or by other means). Thereafter, every \(441/40 \approx 11\)th person in the list is selected: people labelled as \(9\), \(20\), \(31\), \(42\), ... The sample comprises \(23\) older students and \(17\) younger students.
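The same idea can be coded as a systematic sample, following Example 5.5: choose a random starting position between \(1\) and \(11\), then take every \(11\)th position thereafter. The sketch below is illustrative only.

```python
import random

rng = random.Random(5)

n_students = 441
step = 11                          # roughly 441 / 40
start = rng.randint(1, step)       # random starting position between 1 and 11

systematic_sample = list(range(start, n_students + 1, step))
print(start, systematic_sample[:5], "... total selected:", len(systematic_sample))
# A start of 9 gives positions 9, 20, 31, 42, ... as in Example 5.5.
```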

Care needs to be taken when using systematic samples to ensure a pattern is not hidden. Consider taking a systematic sample of every \(10\) th residence on a long street. In many countries, odd numbers are usually on one side of the street, and even numbers usually on the other side. Selecting every \(10\) th house (for example) would include houses all on the same side of the street, and hence with similar exposure to the sun, traffic, etc.

5.6 Stratified sampling


In stratified sampling, the population is split into a small number of large (usually homogeneous) groups called strata, then cases are selected using a simple random sample from each stratum. Every individual in the population must be in one, and only one, stratum.

Example 5.6 (Stratified sampling) For the typing study in Example 5.4, \(20\) younger and \(20\) older students could be selected to obtain a sample of size \(40\); the sample is stratified by the age group of the student.

Alternatively, the strata can be sampled in proportion to their size. If about \(67\)% of students in the population are younger, then ensuring that about two-thirds of the sample of size \(40\) comprises younger students means randomly selecting \(27\) younger and \(13\) older students.
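Proportional allocation is straightforward to code. In the sketch below the split of positions into 'younger' and 'older' strata is hypothetical; a simple random sample is drawn from each stratum, with the \(27\)/\(13\) split matching Example 5.6.

```python
import random

rng = random.Random(7)

# Hypothetical strata: positions 1..295 are 'younger' and 296..441 are 'older'
# (about 67% younger, matching the assumption in Example 5.6).
younger = list(range(1, 296))
older = list(range(296, 442))

stratified_sample = rng.sample(younger, 27) + rng.sample(older, 13)
print(len(stratified_sample), "students, e.g.", sorted(stratified_sample)[:6], "...")
```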

5.7 Cluster sampling


In cluster sampling, the population is split into a large number of small groups called clusters. Then, a simple random sample of clusters is selected, and every member of the chosen clusters becomes part of the sample. Every individual in the population must be in one, and only one, cluster.

Example 5.7 (Cluster sampling) For the study in Example 5.4, a simple random sample of (say) three of the many small-group classes for the course could be selected, and every student enrolled in those selected classes constitutes the sample. Due to the classes chosen, the sample size is \(n = 43\) (\(30\) older; \(13\) younger).
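A cluster sample follows the same pattern, except that whole clusters are sampled and then every member of each chosen cluster is kept. The class names and class sizes in the sketch below are invented purely for illustration.

```python
import random

rng = random.Random(11)

# Invented clusters: class name -> list of student IDs in that class.
classes = {f"class_{i:02d}": list(range(i * 100, i * 100 + rng.randint(12, 16)))
           for i in range(1, 21)}

chosen = rng.sample(list(classes), 3)                      # random sample of clusters
cluster_sample = [s for c in chosen for s in classes[c]]   # keep everyone in them

print(chosen, "-> sample size:", len(cluster_sample))
# The final sample size depends on which clusters happen to be chosen,
# as in Example 5.7 (where n = 43).
```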

5.8 Multistage sampling


In multistage sampling, larger collections of individuals are selected using a simple random sample, then smaller collections of individuals within those larger collections are selected using a simple random sample. The simple random sampling continues for as many levels as necessary, until individuals are selected (at random).

Example 5.8 (Multistage sampling) For the study in Example 5.4, a simple random sample of ten of the many small-group classes could be selected (Stage 1), and then four students randomly selected from each of these \(10\) chosen classes (Stage 2). The sample size is \(10\times 4 = 40\), comprising \(24\) older students and \(16\) younger students.
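Multistage sampling simply chains simple random samples: sample the classes first, then sample students within each chosen class. The sketch below follows Example 5.8 (ten classes, four students from each); the class rosters are again invented.

```python
import random

rng = random.Random(13)

# Invented rosters: 25 classes, each with 15-20 student IDs.
classes = {f"class_{i:02d}": [f"s{i:02d}_{j:02d}" for j in range(rng.randint(15, 20))]
           for i in range(1, 26)}

stage1 = rng.sample(list(classes), 10)                           # Stage 1: 10 classes
stage2 = [s for c in stage1 for s in rng.sample(classes[c], 4)]  # Stage 2: 4 per class

print(len(stage2), "students, e.g.", stage2[:4])                 # 10 x 4 = 40
```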

Example 5.9 (Multistage sampling) Multistage sampling is often used by statistics agencies. For example, to obtain a multistage random sample from a country:

  • Stage 1 : Randomly select some cities in the nation;
  • Stage 2 : Randomly select some suburbs in these chosen cities;
  • Stage 3 : Randomly select some streets in these chosen suburbs;
  • Stage 4 : Randomly select some houses in these chosen streets.

This is cheaper than simple random sampling, as data collectors can be deployed in a small number of cities (only those chosen in Stage 1).

5.9 Representative sampling

Obtaining a truly random sample is usually hard or impossible. In practice, sometimes the best compromise is to select a sample diverse enough to be somewhat representative of the diversity in the population: one where those in the sample are not likely to be different from those not in the sample (in any obvious way), at least for the variables of interest.

As always, the results from any non-random sample may not generalise to the intended population (but may generalise to the population which the sample does represent).


Example 5.10 (Representative sample) Suppose we wish to evaluate the functionality of two types of hand prosthetics.

If a randomly chosen group of Alaskan and Texan residents is asked for their feedback, their views would probably (but not certainly) be similar to those of all Americans. There is no obvious reason why residents of Alaska and Texas would differ greatly from residents of the rest of the United States regarding their view of hand-prosthetic functionality.

Even though the sample is not a random sample of all Americans, the results may generalise to all Americans (though we cannot be sure).


Example 5.11 (Non-representative samples) Suppose we wish to determine the average time per day that American households use their air conditioners for cooling in summer.

If a group of Texas residents is asked, this sample would not be expected to represent all Americans: it would over-estimate the average number of hours air conditioners are used for cooling in summer. In this case, those in the sample are very different from those not in the sample regarding their air-conditioner usage for cooling in summer.

In contrast, suppose a group of Alaskans was asked the same question. This second sample would not represent all Americans either (it would under-estimate). Again, those in the sample are likely to be very different from those not in the sample regarding their air-conditioner usage for cooling in summer.

Sometimes, a combination of sampling methods is used.

Example 5.12 (A combination of sampling methods) In a study of pathogens present on magazines in doctors' surgeries in Melbourne, some suburbs can be selected at random, and then (within each suburb) all surgeries are contacted, and some surgeries volunteer to be part of the study.

In a study of diets of children at child-care centres, researchers used samples in 2010 and 2016, described as follows (N. Larson, Loth, and Nanney 2019, 336):

In 2010, a stratified random sampling procedure was used to select representative cross-sections of providers working in licensed center-based programs and licensed providers of family home-based care from publicly available lists. [...] Additional participants were also recruited in 2016 using a combination of stratified random and open, convenience-based sampling.

Sometimes, practicalities dictate how the sample is obtained, which may result in a non-random sample. Even so, the impact of using a non-random sample on the conclusions should be discussed (Chap.  9 ). Sometimes, ways exist to obtain a sample that is more likely to be representative.

Random samples are often difficult to obtain, and sometimes representative samples are the best that can be done. In a representative sample, those in the sample are not obviously different from those not in the sample. Try to ensure that a broad cross-section of the target population appears in the sample.


Example 5.13 (Representative sample) For the typing study in Example 5.4, selecting only students who are attending the gym, or only students who are at a certain cafe, is unlikely to produce a sample representative of the whole student population.

Instead, the researchers could approach:

  • Students at the cafe on Monday at \(8\) am;
  • Students at the gym on Tuesday at 11:30 am; and
  • Students entering the library on Thursdays at \(2\) pm.

This is still not a random sample, but the sample now is likely to comprise a variety of student types. Ideally, students would not be included more than once in our sample, though this is often difficult to ensure.


To determine if the sample is somewhat representative of the population, sometimes information about the sample and population can be compared. The researchers may then be able to make some comment about whether the sample seems reasonably representative. For example, the sex and age of a sample of university students may be recorded; if the proportion of females in the sample, and the average age of students in the sample, are similar to those of the whole university population, then the sample may be somewhat representative of the population (though we cannot be sure).


Example 5.14 (Comparing samples and populations) A study of the adoption of electric vehicles (EVs) by Americans (Egbue, Long, and Samaranayake 2017) used a sample of \(121\) people found through social media (such as Facebook) and professional engineering channels. This is not a random sample.

The authors compared some characteristics of the sample with the American population from the 2010 census. The sample contained a higher percentage of males, a higher percentage of people aged \(18\)–\(44\), and a higher percentage of wealthy individuals compared to the US population.

5.10 Bias in sampling

The sample may fail to represent the population for many reasons, all of which compromise how well the results generalise (i.e., compromise external validity and accuracy). This is called selection bias.

Definition 5.6 (Selection bias) Selection bias is the tendency of a sample to over- or under-estimate a population quantity.

Selection bias is less common in studies with forward directionality than in studies that are non-directional or have backward directionality (Sect. 3.7). Selection bias may occur if the wrong sampling frame is used, or if non-random sampling is used. The sample is biased because those in the sample may be different from those not in the sample (and this may not always be obvious). Biased samples are less likely to produce externally valid studies.

Example 5.15 (Selection bias) Consider Example  5.11 , about estimating the average time per day that air conditioners are used for cooling in summer. Using people only from Alaska in the sample is using the wrong sampling frame: the sampling frame does not represent the target population ('Americans'). This is selection bias .

Non-response bias occurs when chosen participants do not respond for some reason. The problem is that those who do not respond may be different from those who do respond. Non-response bias can occur because of a poorly designed survey, the use of voluntary-response sampling, chosen participants refusing to participate, participants forgetting to return completed surveys, etc.

Example 5.16 (Non-response bias) Consider a study to determine the average number of hours of overtime worked by various professions. People who work a large amount of overtime may be too busy to answer the survey. Those who answer the survey may be likely to work less overtime than those who do not answer the survey. This is an (extreme) example of non-response bias .

Response bias occurs when participants provide incorrect information: the answers provided by the participants may not reflect the truth. This may be intentional (for example, when participants intend to deceive) or unintentional (for example, if the question is poorly written, is personal, or is misunderstood).

Consider using these samples:

  • Obtaining data using a telephone survey.
  • Obtaining data using a TV station's call-in at about 6:15 pm.
  • Sampling students at your university, because it is easier than finding a random sample of all people in your country.

For each of the above samples, give an example of an outcome for which the sample would likely give an over-estimate of the population value.

There are many correct answers; here are some:

  • The percentage of people that own a telephone.
  • The percentage of people that are shift-workers.
  • The percentage of people studying at university.

5.11 Chapter summary

Almost always, the entire population of interest cannot be studied, so a sample (a subset of the population) must be studied. Many samples are possible; we only study one sample. Samples can be random or non-random. Conclusions made from random samples can usually be generalised to the population (that is, they are externally valid and accurate).

Random sampling methods include simple random samples, systematic samples, stratified samples, cluster samples, and multi-stage samples. Random samples are likely to be externally valid and accurate.

Non-random sampling methods include convenience samples, judgement samples, and self-selecting samples. Random samples are often very difficult to obtain, so the best we can do is to aim for reasonably representative samples, where those in the sample are unlikely to be very different from those not in the sample. Non-random samples may not be externally valid or accurate.


5.12 Quick review questions

  • Suppose students are randomly selected and sent postal surveys from their university, but some students have moved and so never receive the survey. What type of bias will this result in? Selection bias. Response bias. Non-response bias.
  • What is the main advantage of using a random sample? It is easier. It is more likely to produce an experimental study. It is more likely to produce an externally-valid study. It is more likely to produce precise estimates.
  • What is the main advantage of using a large sample? It is easier. It is more likely to produce an experimental study. It is more likely to produce an externally-valid study. It is more likely to produce precise estimates.
  • A large sample is always better than a random sample: true or false?
  • Select all the sampling methods that are random sampling methods:
  • Judgement sampling
  • Stratified sampling
  • Simple random sampling
  • Voluntary sampling
  • Cluster sampling
  • Multi-stage sampling
  • Self-selected sampling

5.13 Exercises

Selected answers are available in App.  E .

Exercise 5.1 A researcher has three months in which to collect the data for a study on car park usage. Suppose the researcher wants to take a systematic sample of days, and on each of the selected days records the number of cars in the car park.

To select the days on which to collect data, she decides (using random numbers) to start data collection on a Tuesday, and then collect data every \(7\)th day thereafter.

  • What problem is evident in this sampling scheme?
  • What suggestions would you give to improve the sampling?

Exercise 5.2 Suppose you need to estimate the average number of pages in a book in a university library (with five campuses), using a sample of \(200\) books. Describe how to select a sample of books using:

  • a simple random sample of books.
  • a stratified sample of books.
  • a cluster sample of books.
  • a convenience sample of books.
  • a multi-stage sample of books.

Which sampling scheme would be most practical?

Exercise 5.3 Suppose you need a sample of residents from apartments in a large residential complex, comprising \(30\) floors with \(15\) apartments on each floor. You plan to survey the residents of these apartments. For each of the possible sampling schemes given below, first describe the sampling scheme, and then determine which methods are likely to give a random (or representative) sample (explaining your answers).

The four possible sampling schemes are:

  • Randomly select five floors, then randomly select four apartments from each of those five floors, and interview the oldest resident of each selected apartment.
  • Randomly select one floor, select all \(15\) apartments on that floor, and interview the oldest resident of each apartment.
  • Wait at the ground-floor elevator, and ask people who emerge to complete the survey.
  • Randomly select five floors, then wait by the elevator on those floors and survey residents as they arrive at the elevator.

Exercise 5.4 Suppose a researcher needs a sample of customers at a large, local shopping centre to complete a questionnaire. Four sampling schemes are listed below.

For each, describe the type of sampling. Then, determine which would be the best method (explain why), and determine which (if any) produce a random sample.

  • The researcher locates themselves outside the supermarket at the shopping centre one morning, and approaches every \(10\)th person who walks past.
  • The researcher waits at the main entrance for \(30\) minutes at \(8\) am every morning for a week, and approaches every \(5\)th person.
  • The researcher leaves a pile of survey forms at an unattended booth in the shopping centre, and a locked barrel in which to place completed surveys.
  • The researcher goes to the shopping centre every day for two weeks, at a different time and location each day, and approaches someone every \(15\) minutes.

Exercise 5.5 A study (Ridgewell, Sipe, and Buchanan 2009) investigated how children in Brisbane travel to state schools. Researchers randomly sampled four schools from a list of all Brisbane state schools, and invited every family at each of those four schools to complete a survey.

What type of sampling method is this? How could the researchers determine if the resulting sample was approximately representative?

Exercise 5.6 A study comparing two new malaria vaccines recruited \(200\) Kenyans who had contracted malaria. These recruits were obtained by approaching all patients with a confirmed malaria diagnosis who were admitted to hospitals. Patients could volunteer for the study or not. The study was then conducted to a high standard. Which of the following statements are true?

  • This is a voluntary response sample. TRUE FALSE
  • The study is likely to have high external validity. TRUE FALSE
  • The sample size is too small for the study results to provide useful information. TRUE FALSE

Exercise 5.7 Suppose a natural forest region is classified into two quite different zones. Zone A is mostly dunes and lightly vegetated, and on the coastal side of a ridge; Zone B is more densely vegetated and on the inland side of the ridge.

Exercise 5.8 One (actual) survey in 2001 concluded (Hieger (2001), cited in Bock, Velleman, and De Veaux (2010), p. 283):

All but \(2\)% of the home buyers have at least one computer at home, and \(62\)% have two or more. Of those with a computer, \(99\)% are connected to the internet.

The article later reveals the survey was conducted online (recall the survey was conducted in 2001). The target population is home buyers; however, home buyers with internet access were far more likely to complete the survey than home buyers without internet access.

What type of bias is this?

Exercise 5.9 Researchers are studying the percentage of farms that use a specific management technique. The researchers randomly select \(20\) regions around the country, then select farms within each region by asking farmers to volunteer to be in the study.

Explain why this is not a multistage sample, and what changes are necessary for the researchers to have a multistage sample.

Exercise 5.10 Researchers are comparing the average time that experienced and first-year school teachers spend in the sun. The researchers select some schools by asking school principals to volunteer their schools, then record information for every teacher in those schools.

Explain why this is not a cluster sample, and what changes are necessary for the researchers to have a cluster sample.

Internal vs. External Validity In Psychology

Julia Simkus

Editor at Simply Psychology

BA (Hons) Psychology, Princeton University

Julia Simkus is a graduate of Princeton University with a Bachelor of Arts in Psychology. She is currently studying for a Master's Degree in Counseling for Mental Health and Wellness in September 2023. Julia's research has been published in peer reviewed journals.


Saul Mcleod, PhD

Editor-in-Chief for Simply Psychology

BSc (Hons) Psychology, MRes, PhD, University of Manchester

Saul Mcleod, PhD., is a qualified psychology teacher with over 18 years of experience in further and higher education. He has been published in peer-reviewed journals, including the Journal of Clinical Psychology.

Olivia Guy-Evans, MSc

Associate Editor for Simply Psychology

BSc (Hons) Psychology, MSc Psychology of Education

Olivia Guy-Evans is a writer and associate editor for Simply Psychology. She has previously worked in healthcare and educational sectors.

Internal validity centers on demonstrating clear causal relationships within the bounds of a specific study, while external validity relates to demonstrating the applicability of findings beyond that original study situation or population.

Researchers have to weigh these considerations in designing methodologically rigorous and generalizable studies.


Internal Validity 

Internal validity refers to the degree of confidence that the causal relationship being tested exists and is trustworthy.

It tests how likely it is that your treatment caused the differences in results that you observe. Internal validity is largely determined by the study’s experimental design and methods.

Studies that have a high degree of internal validity provide strong evidence of causality, making it possible to eliminate alternative explanations for a finding.

Studies with low internal validity provide weak evidence of causality. The less chance there is for confounding or extraneous variables , the higher the internal validity and the more confident we can be in our findings. 

In order to assume cause and effect in a research study, the cause must precede the effect in time, the cause and effect must vary together, and there must be no other explanations for the relationship observed. If these three criteria are met, you can be confident that a study is internally valid.

An example of a study with high internal validity would be if you wanted to run an experiment to see if using a particular weight-loss pill will help people lose weight.

To test this hypothesis, you would randomly assign a sample of participants to one of two groups: those who will take the weight-loss pill and those who will take a placebo pill.

You can ensure that there is no bias in how participants are assigned to the groups by blinding the research assistants , so they don’t know which participants are in which groups during the experiment. The participants are also blinded, so they do not know whether they are receiving the intervention or not.

If participants drop out of the study, their characteristics are examined to ensure there is no systematic bias regarding who left.

It is important to have a well-thought-out research procedure to mitigate the threats to internal validity.

External Validity

External validity refers to the extent to which the results of a research study can be applied or generalized to another context.

This is important because if external validity is established, the studies’ findings can be generalized to a larger population as opposed to only the relatively few subjects who participated in the study. Unlike internal validity, external validity doesn’t assess causality or rule out confounders.

There are two types of external validity: ecological validity and population validity.

  • Ecological validity refers to whether a study’s findings can be generalized to other situations or settings. A high ecological validity means that there is a high degree of similarity between the experimental setting and another setting, and thus we can be confident that the results will generalize to that other setting.
  • Population validity refers to how well the experimental sample represents other populations or groups. Using random sampling techniques , such as stratified sampling or cluster sampling, significantly helps increase population validity. 

An example of a study with high external validity would be if you hypothesize that practicing mindfulness two times per week will improve the mental health of those diagnosed with depression.

You recruit people who have been diagnosed with depression for at least a year and are between 18 and 29 years old. Choosing this representative sample with a clearly defined population of interest helps ensure external validity.

You give participants a pre-test and a post-test measuring how often they experienced symptoms of depression in the past week.

During the study, all participants were given individual mindfulness training and asked to practice mindfulness daily for 15 minutes as part of their morning routine. 

You can also replicate the study’s results using different methods of mindfulness or different samples of participants. 

Trade-off Between Internal and External Validity

There tends to be a negative correlation between internal and external validity in experimental research. This means that experiments that have high internal validity will likely have low external validity and vice versa. 

This happens because experimental conditions that produce higher degrees of internal validity (e.g., artificial labs) tend to be highly unlikely to match real-world conditions. So, the external validity is weaker because a lab environment is much different than the real world. 

On the other hand, to produce higher degrees of external validity, you want experimental conditions that match a real-world setting (e.g., observational studies ).

However, this comes at the expense of internal validity because these types of studies increase the likelihood of confounding variables and alternative explanations for differences in outcomes. 

A solution to this trade-off is replication! You want to conduct the research in multiple environments and settings – first in a controlled, artificial environment to establish the existence of a causal relationship and then in a “real-world” setting to analyze if the results are generalizable. 

Threats to Internal Validity

Attrition

Attrition refers to the loss of study participants over time. Participants might drop out or leave the study, which means that the results are based solely on a biased sample of only the people who did not choose to leave.

Differential rates of attrition between treatment and control groups can skew results by affecting the relationship between your independent and dependent variables and thus affect the internal validity of a study. 

Confounders

A confounding variable is an unmeasured third variable that influences, or “confounds,” the relationship between an independent and a dependent variable by suggesting the presence of a spurious correlation.

Confounders are threats to internal validity because you can’t tell whether the predicted independent variable causes the outcome or if the confounding variable causes it.

Participant Selection Bias

This is a bias that may result from the selection or assignment of study groups in such a way that proper randomization is not achieved.

If participants are not randomly assigned to groups, the sample obtained might not be representative of the population intended to be studied. For example, some members of a population might be less likely to be included than others due to motivation, willingness to take part in the study, or demographics. 

Experimenter Bias

Experimenter bias occurs when an experimenter behaves in a different way with different groups in a study, impacting the results and threatening internal validity. This can be eliminated through blinding. 

Social Interaction (Diffusion)

Diffusion refers to when the treatment in research spreads within or between treatment and control groups. This can happen when there is interaction or observation among the groups.

Diffusion poses a threat to internal validity because it can lead to resentful demoralization: the control group becomes less motivated because they feel resentful about the group they are in.

Historical Events

Historical events might influence the outcome of studies that occur over longer periods of time. For example, changes in political leadership, natural disasters, or other unanticipated events might change the conditions of the study and influence the outcomes.

Instrumentation

Instrumentation refers to any change in the dependent variable in a study that arises from changes in the measuring instrument used. This happens when different measures are used in the pre-test and post-test phases. 

Maturation

Maturation refers to the impact of time on a study. If the outcomes of the study vary as a natural result of time, it might not be possible to determine whether the effects seen in the study were due to the study treatment or simply due to the impact of time.

Statistical Regression

Regression to the mean refers to the fact that if one sample of a random variable is extreme, the next sampling of the same random variable is likely going to be closer to its mean.

This is a threat to internal validity as participants at extreme ends of treatment can naturally fall in a certain direction due to the passage of time rather than being a direct effect of an intervention. 

Repeated Testing

Testing your research participants repeatedly with the same measures will influence your research findings because participants will become more accustomed to the testing. Due to familiarity, or awareness of the study’s purpose, many participants might achieve better results over time.

Threats to External Validity

Sample Features

If some feature(s) of the sample used were responsible for the effect, this could lead to limited generalizability of the findings.

This is a bias that may result from the selection or assignment of study groups in such a way that proper randomization is not achieved. If participants are not randomly assigned to groups, the sample obtained might not be representative of the population intended to be studied.

For example, some members of a population might be less likely to be included than others due to motivation, willingness to take part in the study, or demographics. 

Situational Factors

Factors such as the setting, time of day, location, researchers’ characteristics, noise, or the number of measures might affect the generalizability of the findings.

Aptitude-Treatment Interaction

Aptitude-treatment interaction refers to the concept that some treatments are more or less effective for particular individuals depending upon their specific abilities or characteristics.

Hawthorne Effect

The Hawthorne Effect refers to the tendency for participants to change their behaviors simply because they know they are being studied.

Experimenter Effect

Experimenter bias occurs when an experimenter behaves in a different way with different groups in a study, impacting the results and threatening the external validity.

John Henry Effect

The John Henry Effect refers to the tendency for participants in a control group to actively work harder because they know they are in an experiment and want to overcome the “disadvantage” of being in the control group.

Factors that Improve Internal Validity

Blinding

Blinding refers to a practice where the participants (and sometimes the researchers) are unaware of what intervention they are receiving.

This reduces the influence of extraneous factors and minimizes bias, as any differences in outcome can thus be linked to the intervention and not to the participant’s knowledge of whether they were receiving a new treatment or not. 

Random Sampling

Using random sampling to obtain a sample that represents the population that you wish to study will improve internal validity. 

Random Assignment

Using random assignment to assign participants to control and treatment groups ensures that there is no systematic bias among the research groups. 

Strict Study Protocol

Highly controlled experiments tend to improve internal validity. Experiments that occur in lab settings tend to have higher internal validity, as this reduces variability from sources other than the treatment.

Experimental Manipulation

Manipulating an independent variable in a study as opposed to just observing an association without conducting an intervention improves internal validity. 

Factors that Improve External Validity

Replication

Conducting a study more than once with a different sample or in a different setting to see if the results will replicate can help improve external validity.

If multiple studies have been conducted on the same topic, a meta-analysis can be used to determine if the effect of an independent variable can be replicated, thus making it more reliable.

Replication is the strongest method to counter threats to external validity by enhancing generalizability to other settings, populations, and conditions.

Field Experiments

Conducting a study outside the laboratory, in a natural, real-world setting, will improve external validity (however, this may threaten internal validity).

Probability Sampling

Using probability sampling will counter selection bias by making sure everyone in a population has an equal chance of being selected for a study sample.

Recalibration

Recalibration is the use of statistical methods to maintain accuracy, standardization, and repeatability in measurements to assure reliable results.

Reweighting groups when a study has uneven groups for a particular characteristic (such as age) is an example of such recalibration.

Inclusion and Exclusion Criteria

Setting criteria as to who can be involved in the research and who cannot be involved will ensure that the population being studied is clearly defined and that the sample is representative of the population.

Psychological Realism

Psychological realism refers to the process of making sure participants perceive the experimental manipulations as real events, so that the purpose of the study is not revealed and participants do not behave differently than they would in real life.


External Validity

External validity is one of the most difficult of the validity types to achieve, and is at the foundation of every good experimental design.

Many scientific disciplines, especially the social sciences, face a long battle to prove that their findings represent the wider population in real world situations.

The main criterion of external validity is the process of generalization: whether results obtained from a small sample group, often in laboratory surroundings, can be extended to make predictions about the entire population.

The reality is that if a research program has poor external validity, the results will not be taken seriously, so any research design must justify sampling and selection methods.


What is External Validity?

In 1966, Campbell and Stanley proposed the commonly accepted definition of external validity.

“External validity asks the question of generalizability: To what populations, settings, treatment variables and measurement variables can this effect be generalized?”

External validity is usually split into two distinct types, population validity and ecological validity, and they are both essential elements in judging the strength of an experimental design.


Psychology and External Validity

The battle lines are drawn.

External validity often causes a little friction between clinical psychologists and research psychologists.

Clinical psychologists often believe that research psychologists spend all of their time in laboratories, testing mice and humans in conditions that bear little resemblance to the outside world. They claim that the data produced has no external validity, and does not take into account the sheer complexity and individuality of the human mind.

Before we are flamed by irate research psychologists, the truth lies somewhere between the two extremes! Research psychologists find out trends and generate sweeping generalizations that predict the behavior of groups. Clinical psychologists end up picking up the pieces, and study the individuals who lie outside the predictions, hence the animosity.

In most cases, research psychology has very high population validity, because researchers meticulously select groups at random and use large sample sizes, allowing meaningful statistical analysis.

However, the artificial nature of research psychology means that ecological validity is usually low.

Clinical psychologists, on the other hand, often use focused case studies , which cause minimum disruption to the subject and have strong ecological validity. However, the small sample sizes mean that the population validity is often low.
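
To see why small samples limit population validity, consider a rough back-of-the-envelope calculation. The snippet below uses the standard 95% margin-of-error approximation for an estimated proportion, assuming simple random sampling and the worst-case proportion of 0.5; the sample sizes are illustrative only.

    import math

    def margin_of_error(n, z=1.96, p=0.5):
        # Approximate 95% margin of error for a proportion estimated from a
        # simple random sample of size n, using the worst case p = 0.5.
        return z * math.sqrt(p * (1 - p) / n)

    print(f"n = 1000: +/- {margin_of_error(1000):.1%}")  # about +/- 3%
    print(f"n = 10:   +/- {margin_of_error(10):.1%}")    # about +/- 31%

A ten-person case study can describe its participants in rich detail, but any proportion estimated from it carries so much uncertainty that generalizing to the wider population is rarely defensible.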

Ideally, using both approaches provides useful generalizations over time!

Randomization in External Validity and Internal Validity

It is also important to distinguish between external and internal validity, especially where randomization is concerned, since it is easily misinterpreted. Random selection is an important tenet of external validity.

For example, a research design that involves sending out survey questionnaires to students picked at random displays more external validity than one where the questionnaires are given to friends. This is randomization used to improve external validity.

Once you have a representative sample, high internal validity involves randomly assigning subjects to groups, rather than using pre-determined selection factors.

With the student example, randomly assigning the students to test groups, rather than picking pre-determined groups based upon degree type, gender, or age, strengthens the internal validity.
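
The two uses of randomness are easy to confuse in prose but easy to separate in a quick sketch. The following is a minimal, hypothetical illustration (the student list, sample size, and group labels are invented): random selection draws the sample from the wider student population, which supports external validity, while random assignment then splits that sample into test groups by chance, which supports internal validity.

    import random

    # Hypothetical sampling frame: every student who could be surveyed.
    population = [f"student_{i}" for i in range(1, 501)]

    # Random selection (external validity): draw the sample from the whole
    # population rather than handing questionnaires to friends.
    sample = random.sample(population, k=50)

    # Random assignment (internal validity): split the selected sample into
    # test groups by chance, not by degree type, gender, or age.
    random.shuffle(sample)
    treatment_group = sample[:25]
    control_group = sample[25:]

    print(len(treatment_group), len(control_group))  # 25 25

Note that each step addresses a different question: selecting at random says nothing about how the groups are formed, and assigning at random says nothing about where the sample came from.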

Campbell, D. T., & Stanley, J. C. (1966). Experimental and Quasi-Experimental Designs for Research. Skokie, IL: Rand McNally.

Shuttleworth, M. (2009, August 7). External Validity. Retrieved April 3, 2024, from Explorable.com: https://explorable.com/external-validity
