What Is Content Validity? | Definition & Examples

Published on 2 September 2022 by Kassiani Nikolopoulou. Revised on 10 October 2022.

Content validity evaluates how well an instrument (like a test) covers all relevant parts of the construct it aims to measure. Here, a construct is a theoretical concept, theme, or idea – in particular, one that cannot usually be measured directly.

Content validity is one of the four types of measurement validity. The other three are:

  • Face validity : Does the content of the test appear to be suitable for its aims?
  • Criterion validity : Do the results accurately measure the concrete outcome they are designed to measure?
  • Construct validity : Does the test measure the concept that it’s intended to measure?

Table of contents

  • Content validity examples
  • Step-by-step guide: how to measure content validity
  • Frequently asked questions about content validity

Some constructs are directly observable or tangible, and thus easier to measure. For example, height is measured in inches. Other constructs are more difficult to measure. Depression, for instance, consists of several dimensions and cannot be measured directly.

Additionally, in order to achieve content validity, there has to be a degree of general agreement, for example among experts, about what a particular construct represents.

Research has shown that there are at least three different components that make up intelligence: short-term memory, reasoning, and a verbal component.

Construct vs. content validity example

It can be easy to confuse construct validity and content validity, but they are fundamentally different concepts.

Construct validity evaluates how well a test measures what it is intended to measure. If any parts of the construct are missing, or irrelevant parts are included, construct validity will be compromised. Remember that in order to establish construct validity, you must demonstrate both convergent and divergent (or discriminant) validity .

  • Convergent validity shows whether a test that is designed to measure a particular construct correlates with other tests that assess the same construct.
  • Divergent (or discriminant) validity shows you whether two tests that should not be highly related to each other are indeed unrelated. There should be little to no relationship between the scores of two tests measuring two different constructs.

On the other hand, content validity applies to any context where you create a test or questionnaire for a particular construct and want to ensure that the questions actually measure what you intend them to.

  • High content validity. If your survey questions cover all dimensions of health needs –   physical, mental, social, and environmental – your questionnaire will have high content validity.
  • Low content validity. If most of your survey questions relate to the attitude of the study population towards the health services provided to them instead of to health needs , the results are no longer a valid measure of community health needs.
  • Low construct validity. If some dimensions of health needs are left out, then the results may not give an accurate indication of the health needs of the community due to poor operationalisation of the concept.


Measuring content validity correctly is important – a high content validity score shows that the construct was measured accurately. You can measure content validity following the step-by-step guide below:

Step 1: Collect data from experts

Step 2: Calculate the content validity ratio

Step 3: Calculate the content validity index

Measuring content validity requires input from a judging panel of subject matter experts (SMEs). Here, SMEs are people who are in the best position to evaluate the content of a test.

For example, the expert panel for a school math test would consist of qualified math teachers who teach that subject.

For each individual question, the panel must assess whether the component measured by the question is ‘essential’, ‘useful but not essential’, or ‘not necessary’ for measuring the construct.

The higher the agreement among panelists that a particular item is essential, the higher that item’s level of content validity is.

Next, you can use the following formula to calculate the content validity ratio (CVR) for each question:

Content Validity Ratio = (ne − N/2) / (N/2) where:

  • ne = number of SME panelists indicating ‘essential’
  • N = total number of SME panelists
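This arithmetic can be sketched in a few lines of Python; the function name and the example counts below are illustrative, not part of the original guide:

```python
def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """Lawshe's CVR: (ne - N/2) / (N/2), ranging from -1 to +1."""
    half = n_experts / 2
    return (n_essential - half) / half

# Example: 4 out of 5 panelists rate a question as 'essential'
print(content_validity_ratio(4, 5))  # 0.6
```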

Using the same formula, you calculate the CVR for each question.

Note that this formula yields values ranging from −1 to +1. Values above 0 indicate that more than half the SMEs agree that the question is essential. The closer to +1, the higher the content validity.

However, agreement could be due to coincidence. In order to rule that out, you can use the critical values table below. Depending on the number of experts in the panel, the content validity ratio (CVR) for a given question should not fall below a minimum value, also called the critical value.

# of panelists Critical value
5 0.99
6 0.99
7 0.99
8 0.75
9 0.78
10 0.62
11 0.59
12 0.56
20 0.42
30 0.33
40 0.29

To measure the content validity of the entire test, you need to calculate the content validity index (CVI) . The CVI is the average CVR score of all questions in the test. Remember that values closer to 1 denote higher content validity.

To calculate the content validity index (CVI) of the entire test, you take the average of the CVR scores of all questions.

Comparing the CVI with the critical value for a panel of 5 experts (0.99), you notice that the CVI is too low. This means that the test does not accurately measure what you intended it to. You decide to improve the questions with a low CVR, in order to get a higher CVI.
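Because the worked example itself is not reproduced here, the following sketch uses hypothetical CVR values for a seven-question test rated by five experts:

```python
# Hypothetical CVR scores for a seven-question test (not the original worked example)
cvrs = [1.0, 0.6, 0.2, -0.2, 0.6, 1.0, 0.2]

cvi = sum(cvrs) / len(cvrs)  # the CVI is the mean CVR
print(round(cvi, 2))         # 0.49 -- well below the 0.99 critical value for a 5-expert panel
```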

Face validity and content validity are similar in that they both evaluate how suitable the content of a test is. The difference is that face validity is subjective, and assesses content at surface level.

When a test has strong face validity, anyone would agree that the test’s questions appear to measure what they are intended to measure.

For example, looking at a 4th grade math test consisting of problems in which students have to add and multiply, most people would agree that it has strong face validity (i.e., it looks like a math test).

On the other hand, content validity evaluates how well a test represents all the aspects of a topic. Assessing content validity is more systematic and relies on expert evaluation of each question, analysing whether each one covers the aspects that the test was designed to cover.

A 4th grade math test would have high content validity if it covered all the skills taught in that grade. Experts (in this case, math teachers) would have to evaluate the content validity by comparing the test to the learning objectives.


Content validity shows you how accurately a test or other measurement method taps into the various aspects of the specific construct you are researching.

In other words, it helps you answer the question: “does the test measure all aspects of the construct I want to measure?” If it does, then the test has high content validity.

The higher the content validity, the more accurate the measurement of the construct.

If the test fails to include parts of the construct, or irrelevant parts are included, the validity of the instrument is threatened, which brings your results into question.


  • What is content validity?

Last updated 8 February 2023. Reviewed by Cathy Heath.

When we discuss research accuracy, we use the term “validity.” Validity tells us how accurately a method measures what it has been deployed to measure.

There are four different types of validity: face validity, criterion validity , construct validity, and content validity.

Content validity concerns how well a specific research instrument measures what it aims to measure. In this case, “construct” refers to a particular concept that is not directly measurable. Examples include justice, happiness, or beauty. Content validity can be used to determine how accurately a test, experiment, or similar instrument measures such a construct.

  • When is content validity used?

Content validity is typically used to measure a test’s accuracy. The test in question would be used to measure constructs that are too complex to measure directly.

Some constructs, such as height or weight, are simple to measure quantifiably. But consider a concept like health. Some may consider health from a strictly physical perspective. Others believe good health requires high spiritual, physical, mental, emotional, and social levels.

Whether you define health according to one or several dimensions, each comprises multiple aspects that must be measured.

Take physical health as an example. Evaluating a person’s physical health might include assessing their medical history, weight, body composition, activity levels, diet, lifestyle, and sleep routines. A physician or medical researcher might also check for signs of temporary or chronic illness, injury, or substance abuse. Further, some evaluators may only be concerned with specific aspects of health or place more importance on certain aspects.

Content validity helps researchers understand just how precisely the instrument measures that specific construct. It’s crucial that researchers design tests that precisely define the construct they’re trying to measure, using the right attributes and characteristics.

  • Content validity examples

The Scholastic Aptitude Test (SAT) is a well-known example of content validity. Designed and orchestrated by the College Board, the SAT is commonly used to measure college readiness and indicate how successful a student will be in college.

Multiple studies have shown a statistically significant correlation between a combination of good high school grades and SAT scores and first-year college grades. Accordingly, the SAT has been a standard part of the college admissions process for decades.

However, many critics have argued that the SAT doesn’t provide a sufficient gauge of college readiness. They noted design aspects that have led to performance disparities among specific groups of test takers. While the College Board has made changes, academics have conducted many studies concerning the SAT’s content validity. These have reaffirmed its value.

Another example of content validity involves the commonly used measure of obesity known as body mass index (BMI). This measure involves a relatively simple set of calculations that yield appropriate weight ranges for an individual relative to their height.

The healthcare industry uses BMI widely, but questions have been raised about how well it actually measures obesity. Since obesity is defined as an excess of body fat, BMI does not accurately classify heavily muscular individuals with low body fat whose BMI calculation places them in the obese category. Nor does it accurately gauge metabolic obesity (colloquially known as “skinny fat”), which can present just as many health risks as carrying a visible, significant amount of visceral fat.

Academics have published numerous papers over the past few years examining BMI’s face, content, criterion, and construct validity regarding obesity.

  • How to measure content validity

Measuring content validity takes some time and effort, but it is essential to ensure that the research you conduct or use is accurate.

To measure content validity, you’ll need to collect expert data, find the content validity ratio, and calculate the content validity index.

Collecting data from experts

First, you will need to find and assemble a group of experts in the research area you’re evaluating. You will need these subject matter experts (SMEs) to assess the content of the research instrument involved.

For accounting research, you might pull together a group of practicing accountants and accounting professors. If you’re gauging the content validity of a fitness test for a high school physical education class, you could assemble physical education teachers and experts in teaching and sports science.

These SMEs will evaluate the instrument on a three-point (one to three) scale, ranking each question on the questionnaire, test, or survey instrument as either “not necessary,” “useful, but not necessary,” or “essential.”

A question’s content validity is higher the more SMEs rank it as “essential.”

Finding the content validity ratio

Once you’ve gathered this initial data from the SMEs, you’ll calculate each question’s content validity ratio (CVR).

CVR = (Nₑ - N/2) / (N/2)

In this formula, Nₑ refers to the number of SMEs who have indicated a question is essential. N equals the total number of participating SMEs.

When calculating the CVR, you’ll get answers ranging between −1 (perfect disagreement) and +1 (perfect agreement). The closer a question’s CVR is to +1, the higher its content validity.

Now, SMEs may agree that a question is essential by coincidence. Ruling this out involves a critical values table. The critical values table for content validity measurements is below:

# of panelists Critical value
5 0.99
6 0.99
7 0.99
8 0.75
9 0.78
10 0.62
11 0.59
12 0.56
13 0.54
14 0.51
15 0.49
20 0.42
25 0.37
30 0.33
35 0.31
40 0.29

Calculating the content validity index

Once you’ve calculated the CVR for each question, you must find the entire instrument’s content validity. This measure is referred to as the content validity index (CVI). You can find the CVI by taking the average of all your CVRs.

When you have calculated the CVI, you’ll be left with a number between -1 and +1. However, this number on its own doesn’t tell you enough about the precision of your instrument. As with CVR, the closer to +1 your CVI is, the better—but you must also compare your CVI to the appropriate minimum value in your critical values table to determine how precise it is.

Let’s say you have a test with a CVI of 0.27. If you used a panel of six SMEs, you’d find that the minimum value in your critical values table is 0.99. This value is much higher than your CVI, meaning your test isn’t very precise at all. You want your CVI to be higher than the minimum value in your critical values table to attain an appropriate level of precision.
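That decision rule can be sketched as a simple lookup; the dictionary values come from the critical-values table above, and the helper function name is ours:

```python
# Critical values for the CVR/CVI by number of panelists (from the table above)
CRITICAL_VALUES = {
    5: 0.99, 6: 0.99, 7: 0.99, 8: 0.75, 9: 0.78, 10: 0.62,
    11: 0.59, 12: 0.56, 13: 0.54, 14: 0.51, 15: 0.49,
    20: 0.42, 25: 0.37, 30: 0.33, 35: 0.31, 40: 0.29,
}

def is_precise_enough(cvi: float, n_panelists: int) -> bool:
    """True if the CVI meets or exceeds the critical value for this panel size."""
    return cvi >= CRITICAL_VALUES[n_panelists]

print(is_precise_enough(0.27, 6))  # False: 0.27 < 0.99, so the test is not precise enough
```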

  • Content validity vs. face validity

Some people confuse face validity with content validity, as the two terms tackle the same aspect of instrument measurement. However, face validity involves a preliminary evaluation of whether an instrument appears to be appropriate for measuring a construct. Content validity, on the other hand, evaluates the instrument’s precision in measuring a construct.

Evaluating face validity involves examining whether an instrument is appropriate for its intended purpose at the surface level. For example, a survey designed to measure postpartum maternal health that exclusively contains questions about fast food consumption would lack face validity. In contrast, face validity would be high if the survey included questions about a woman’s physical and mental health, diet and exercise, work-life balance, and social engagement.

What is an example of a content validity test?

You can often find tests with content validity in everyday life. Common examples include driver’s license exams, standardized tests such as the SAT and ACT, professional licensing exams such as the NCLEX for nurses, and more.

What is content validity in qualitative research?

Content validity helps researchers determine the measurement efficacy of quantitative and qualitative research instruments.

For example, suppose you were conducting research about the perspectives of baby boomers on today’s political discourse. You would measure how comprehensively and effectively your survey captured the possible range of opinions. Subject matter experts would calculate the survey’s content validity index in the same way they would for a quantitative research instrument.

What is the difference between validity and content validity?

Validity (or measurement validity) generally tells you about a measurement method’s accuracy. Content validity helps you understand whether your method, such as a test or survey, fully represents the concept or idea it’s measuring.

What is the difference between content validity and construct validity?

Construct validity examines how effectively an instrument measures what it is designed to measure relative to other instruments. It is composed of convergent validity and divergent validity. Convergent validity illustrates if a correlation exists between the instrument in question and other instruments that measure the same construct. Divergent validity shows that the instrument is not correlated with other instruments designed to measure different phenomena.

By contrast, content validity is an intrinsic assessment of how well an instrument measures what it’s intended to measure.

How do you quantify content validity?

Content validity is measured in quantifiable terms: calculate the content validity ratio for each instrument question, then the content validity index for the whole instrument, using the formulas above.



Content Validity: Definition, Examples & Measuring

By Jim Frost

What is Content Validity?

Content validity is the degree to which a test or assessment instrument evaluates all aspects of the topic, construct, or behavior that it is designed to measure. Do the items fully cover the subject? High content validity indicates that the test fully covers the topic for the target audience. Lower results suggest that the test does not contain relevant facets of the subject matter.

Checklist for content validity.

For example, imagine that I designed a test that evaluates how well students understand statistics at a level appropriate for an introductory college course. Content validity assesses my test to see if it covers suitable material for that subject area at that level of expertise. In other words, does my test cover all pertinent facets of the content area? Is it missing concepts?

Learn more about other Types of Validity and Reliability vs. Validity.

Content Validity Examples

Evaluating content validity is crucial for instruments like the following, to ensure the tests assess the full range of knowledge and the relevant aspects of the psychological constructs:

  • A test to obtain a license, such as driving or selling real estate.
  • Standardized testing for academic purposes, such as the SAT and GRE.
  • Tests that evaluate knowledge of subject area domains, such as biology, physics, and literature.
  • A scale for assessing anger management.
  • A questionnaire that evaluates coping abilities.
  • A scale to assess problematic drinking.

How to Measure Content Validity

Measuring content validity involves assessing individual questions on a test and asking experts whether each one targets characteristics that the instrument is designed to cover. This process compares the test against its goals and the theoretical properties of the construct. Researchers systematically determine whether each item contributes and check that no aspect is overlooked.

Factor Analysis

Advanced content validity assessments use multivariate factor analysis to find the number of underlying dimensions that the test items cover. In this context, analysts can use factor analysis to determine whether the items collectively measure a sufficient number and type of fundamental factors. If the measurement instrument does not sufficiently cover the dimensions, the researchers should improve it. Learn more in my Guide to Factor Analysis with an Example .
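As a rough illustration of this idea (not the specific analysis described here), one could fit an exploratory factor model to an item-response matrix and inspect how strongly each item loads on the extracted dimensions. The placeholder data, variable names, and the choice of scikit-learn’s FactorAnalysis are assumptions:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# 200 respondents x 12 items on a 1-5 scale (placeholder data standing in for real responses)
responses = rng.integers(1, 6, size=(200, 12)).astype(float)

fa = FactorAnalysis(n_components=3)  # hypothesize three underlying dimensions
fa.fit(responses)

loadings = fa.components_.T          # rows = items, columns = factors
for item, row in enumerate(loadings, start=1):
    print(f"item {item:2d} loadings:", np.round(row, 2))
```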

Content Validity Ratio

For this overview, let’s look at a more intuitive approach.

Most assessment processes in this realm obtain input from subject matter experts. Lawshe* proposed a standard method for measuring content validity in psychology that incorporates expert ratings. This approach involves asking experts to determine whether the knowledge or skill that each item on the test assesses is “essential,” “useful, but not necessary,” or “not necessary.”

His method is essentially a form of inter-rater reliability about the importance of each item. You want all or most experts to agree that each item is “essential.”

Lawshe then proposes that you calculate the content validity ratio (CVR) for each question:

CVR = (Nₑ − N/2) / (N/2)

  • Nₑ = Number of “essentials” for an item.
  • N = Number of experts.

Using this formula, you’ll obtain values ranging from -1 (perfect disagreement) to +1 (perfect agreement) for each question. Values above 0 indicate that more than half the experts agree.

However, it’s essential to consider whether the agreement might be due to chance. Don’t worry! Critical values for the ratio can help you make that determination. These critical values depend on the number of experts. You can find them here: Critical Values for Lawshe’s CVR .

The content validity index (CVI) is the mean CVR for all items, and it provides an overall assessment of the measurement instrument. Values closer to 1 are better.
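Putting these pieces together, a compact sketch going from raw expert ratings to per-item CVRs and the overall CVI might look like this (the items and ratings are invented for illustration):

```python
# Each expert rates every item as 'essential', 'useful', or 'not necessary'
ratings = {
    "item_1": ["essential", "essential", "essential", "useful", "essential"],
    "item_2": ["essential", "useful", "not necessary", "useful", "essential"],
    "item_3": ["essential", "essential", "essential", "essential", "essential"],
}

def cvr(item_ratings):
    n = len(item_ratings)
    n_essential = item_ratings.count("essential")
    return (n_essential - n / 2) / (n / 2)

cvrs = {item: cvr(r) for item, r in ratings.items()}
cvi = sum(cvrs.values()) / len(cvrs)  # mean CVR across items

print(cvrs)           # {'item_1': 0.6, 'item_2': -0.2, 'item_3': 1.0}
print(round(cvi, 2))  # 0.47
```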

Finally, CVR distinguishes between necessary and unnecessary questions, but it does not identify missing facets.

Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology, 28, 563–575.


Content Validity

Shayna Rusticus

Content validity refers to the degree to which an assessment instrument is relevant to, and representative of, the targeted construct it is designed to measure.

Description

Content validation, which plays a primary role in the development of any new instrument, provides evidence about the validity of an instrument by assessing the degree to which the instrument measures the targeted construct (Anastasia, 1988 ). This enables the instrument to be used to make meaningful and appropriate inferences and/or decisions from the instrument scores given the assessment purpose (Messick, 1989 ; Moss, 1995 ). All elements of the instrument (e.g., items, stimuli, codes, instructions, response formats, scoring) that can potentially impact the scores obtained and the interpretations made should be subjected to content validation. There are three key aspects of content validity: domain definition, domain representation, and domain relevance (Sireci, 1998a ). The first aspect, domain...


References

Anastasia, A. (1988). Psychological testing (6th ed.). New York: Macmillan Publishing.

DeVellis, R. F. (1991). Scale development: Theory and applications. Newbury Park, CA: Sage.

Haynes, S. N., Richard, D. C. S., & Kubany, E. S. (1995). Content validity in psychological assessment: A functional approach to concepts and methods. Psychological Assessment, 7, 238–247.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education.

Mosier, C. I. (1947). A critical examination of the concepts of face validity. Educational and Psychological Measurement, 7, 191–205.

Moss, P. A. (1995). Themes and variations in validity theory. Educational Measurement: Issues and Practice, 14, 5–13.

Murphy, K. R., & Davidshofer, C. O. (1994). Psychological testing: Principles and applications (3rd ed.). Upper Saddle River, NJ: Prentice-Hall.

Sireci, S. G. (1998a). Gathering and analyzing content validity data. Educational Assessment, 5, 299–321.

Sireci, S. G. (1998b). The construct of content validity. Social Indicators Research, 45, 83–117.


Validity In Psychology Research: Types & Examples

Saul McLeod, PhD, and Olivia Guy-Evans, MSc

In psychology research, validity refers to the extent to which a test or measurement tool accurately measures what it’s intended to measure. It ensures that the research findings are genuine and not due to extraneous factors.

Validity can be categorized into different types based on internal and external validity .

The concept of validity was formulated by Kelley (1927, p. 14), who stated that a test is valid if it measures what it claims to measure. For example, a test of intelligence should measure intelligence and not something else (such as memory).

Internal and External Validity In Research

Internal validity refers to whether the effects observed in a study are due to the manipulation of the independent variable and not some other confounding factor.

In other words, there is a causal relationship between the independent and dependent variables .

Internal validity can be improved by controlling extraneous variables, using standardized instructions, counterbalancing, and eliminating demand characteristics and investigator effects.

External validity refers to the extent to which the results of a study can be generalized to other settings (ecological validity), other people (population validity), and over time (historical validity).

External validity can be improved by setting experiments more naturally and using random sampling to select participants.

Types of Validity In Psychology

Two main categories of validity are used to assess the validity of the test (i.e., questionnaire, interview, IQ test, etc.): Content and criterion.

  • Content validity refers to the extent to which a test or measurement represents all aspects of the intended content domain. It assesses whether the test items adequately cover the topic or concept.
  • Criterion validity assesses the performance of a test based on its correlation with a known external criterion or outcome. It can be further divided into concurrent (measured at the same time) and predictive (measuring future performance) validity.


Face Validity

Face validity is simply whether the test appears (at face value) to measure what it claims to. This is the least sophisticated measure of content-related validity, and is a superficial and subjective assessment based on appearance.

Tests wherein the purpose is clear, even to naïve respondents, are said to have high face validity. Accordingly, tests wherein the purpose is unclear have low face validity (Nevo, 1985).

A direct measurement of face validity is obtained by asking people to rate the validity of a test as it appears to them. This rater could use a Likert scale to assess face validity.

For example:

  • The test is extremely suitable for a given purpose
  • The test is very suitable for that purpose;
  • The test is adequate
  • The test is inadequate
  • The test is irrelevant and, therefore, unsuitable

It is important to select suitable people to rate a test (e.g., questionnaire, interview, IQ test, etc.). For example, individuals who actually take the test would be well placed to judge its face validity.

Also, people who work with the test could offer their opinion (e.g., employers, university administrators). Finally, the researcher could use members of the general public with an interest in the test (e.g., parents of testees, politicians, teachers, etc.).

The face validity of a test can be considered a robust construct only if a reasonable level of agreement exists among raters.

It should be noted that the term face validity should be avoided when the rating is done by an “expert,” as content validity is more appropriate.

Having face validity does not mean that a test really measures what the researcher intends to measure, but only in the judgment of raters that it appears to do so. Consequently, it is a crude and basic measure of validity.

A test item such as “ I have recently thought of killing myself ” has obvious face validity as an item measuring suicidal cognitions and may be useful when measuring symptoms of depression.

However, the implication of items on tests with clear face validity is that they are more vulnerable to social desirability bias. Individuals may manipulate their responses to deny or hide problems or exaggerate behaviors to present a positive image of themselves.

It is possible for a test item to lack face validity but still have general validity and measure what it claims to measure. This is good because it reduces demand characteristics and makes it harder for respondents to manipulate their answers.

For example, the test item “ I believe in the second coming of Christ ” would lack face validity as a measure of depression (as the purpose of the item is unclear).

This item appeared on the first version of The Minnesota Multiphasic Personality Inventory (MMPI) and loaded on the depression scale.

Because most of the original normative sample of the MMPI were good Christians, only a depressed Christian would think Christ is not coming back. Thus, for this particular religious sample, the item does have general validity but not face validity.

Construct Validity

Construct validity assesses how well a test or measure represents and captures an abstract theoretical concept, known as a construct. It indicates the degree to which the test accurately reflects the construct it intends to measure, often evaluated through relationships with other variables and measures theoretically connected to the construct.

The concept of construct validity was introduced by Cronbach and Meehl (1955). This type of validity refers to the extent to which a test captures a specific theoretical construct or trait, and it overlaps with some of the other aspects of validity.

Construct validity does not concern the simple, factual question of whether a test measures an attribute.

Instead, it is about the complex question of whether test score interpretations are consistent with a nomological network involving theoretical and observational terms (Cronbach & Meehl, 1955).

To test for construct validity, it must be demonstrated that the phenomenon being measured actually exists. So, the construct validity of a test for intelligence, for example, depends on a model or theory of intelligence .

Construct validity entails demonstrating the power of such a construct to explain a network of research findings and to predict further relationships.

The more evidence a researcher can demonstrate for a test’s construct validity, the better. However, there is no single method of determining the construct validity of a test.

Instead, different methods and approaches are combined to present the overall construct validity of a test. For example, factor analysis and correlational methods can be used.

Convergent validity

Convergent validity is a subtype of construct validity. It assesses the degree to which two measures that theoretically should be related are related.

It demonstrates that measures of similar constructs are highly correlated. It helps confirm that a test accurately measures the intended construct by showing its alignment with other tests designed to measure the same or similar constructs.

For example, suppose there are two different scales used to measure self-esteem: Scale A and Scale B. If both scales effectively measure self-esteem, then individuals who score high on Scale A should also score high on Scale B, and those who score low on Scale A should score similarly low on Scale B.

If the scores from these two scales show a strong positive correlation, then this provides evidence for convergent validity because it indicates that both scales seem to measure the same underlying construct of self-esteem.
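A minimal sketch of that check, assuming paired self-esteem scores from the same respondents on both scales (the numbers below are invented):

```python
import numpy as np

scale_a = np.array([32, 41, 25, 38, 29, 45, 33, 27])  # hypothetical Scale A scores
scale_b = np.array([30, 44, 22, 35, 31, 47, 36, 24])  # same respondents on Scale B

r = np.corrcoef(scale_a, scale_b)[0, 1]
print(round(r, 2))  # a strong positive correlation supports convergent validity
```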

Concurrent Validity (i.e., occurring at the same time)

Concurrent validity evaluates how well a test’s results correlate with the results of a previously established and accepted measure, when both are administered at the same time.

It helps in determining whether a new measure is a good reflection of an established one without waiting to observe outcomes in the future.

If the new test is validated by comparison with a currently existing criterion, we have concurrent validity.

Very often, a new IQ or personality test might be compared with an older but similar test known to have good validity already.

Predictive Validity

Predictive validity assesses how well a test predicts a criterion that will occur in the future. It measures the test’s ability to foresee the performance of an individual on a related criterion measured at a later point in time. It gauges the test’s effectiveness in predicting subsequent real-world outcomes or results.

For example, a prediction may be made on the basis of a new intelligence test that high scorers at age 12 will be more likely to obtain university degrees several years later. If the prediction is borne out, then the test has predictive validity.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.

Hathaway, S. R., & McKinley, J. C. (1943). Manual for the Minnesota Multiphasic Personality Inventory. New York: Psychological Corporation.

Kelley, T. L. (1927). Interpretation of educational measurements. New York: Macmillan.

Nevo, B. (1985). Face validity revisited. Journal of Educational Measurement, 22(4), 287–293.



Validity – Types, Examples and Guide

Table of Contents

Validity

Validity is a fundamental concept in research, referring to the extent to which a test, measurement, or study accurately reflects or assesses the specific concept that the researcher is attempting to measure. Ensuring validity is crucial as it determines the trustworthiness and credibility of the research findings.

Research Validity

Research validity pertains to the accuracy and truthfulness of the research. It examines whether the research truly measures what it claims to measure. Without validity, research results can be misleading or erroneous, leading to incorrect conclusions and potentially flawed applications.

How to Ensure Validity in Research

Ensuring validity in research involves several strategies:

  • Clear Operational Definitions : Define variables clearly and precisely.
  • Use of Reliable Instruments : Employ measurement tools that have been tested for reliability.
  • Pilot Testing : Conduct preliminary studies to refine the research design and instruments.
  • Triangulation : Use multiple methods or sources to cross-verify results.
  • Control Variables : Control extraneous variables that might influence the outcomes.

Types of Validity

Validity is categorized into several types, each addressing different aspects of measurement accuracy.

Internal Validity

Internal validity refers to the degree to which the results of a study can be attributed to the treatments or interventions rather than other factors. It is about ensuring that the study is free from confounding variables that could affect the outcome.

External Validity

External validity concerns the extent to which the research findings can be generalized to other settings, populations, or times. High external validity means the results are applicable beyond the specific context of the study.

Construct Validity

Construct validity evaluates whether a test or instrument measures the theoretical construct it is intended to measure. It involves ensuring that the test is truly assessing the concept it claims to represent.

Content Validity

Content validity examines whether a test covers the entire range of the concept being measured. It ensures that the test items represent all facets of the concept.

Criterion Validity

Criterion validity assesses how well one measure predicts an outcome based on another measure. It is divided into two types:

  • Predictive Validity : How well a test predicts future performance.
  • Concurrent Validity : How well a test correlates with a currently existing measure.

Face Validity

Face validity refers to the extent to which a test appears to measure what it is supposed to measure, based on superficial inspection. While it is the least scientific measure of validity, it is important for ensuring that stakeholders believe in the test’s relevance.

Importance of Validity

Validity is crucial because it directly affects the credibility of research findings. Valid results ensure that conclusions drawn from research are accurate and can be trusted. This, in turn, influences the decisions and policies based on the research.

Examples of Validity

  • Internal Validity : A randomized controlled trial (RCT) where the random assignment of participants helps eliminate biases.
  • External Validity : A study on educational interventions that can be applied to different schools across various regions.
  • Construct Validity : A psychological test that accurately measures depression levels.
  • Content Validity : An exam that covers all topics taught in a course.
  • Criterion Validity : A job performance test that predicts future job success.

Where to Write About Validity in A Thesis

In a thesis, the methodology section should include discussions about validity. Here, you explain how you ensured the validity of your research instruments and design. Additionally, you may discuss validity in the results section, interpreting how the validity of your measurements affects your findings.

Applications of Validity

Validity has wide applications across various fields:

  • Education : Ensuring assessments accurately measure student learning.
  • Psychology : Developing tests that correctly diagnose mental health conditions.
  • Market Research : Creating surveys that accurately capture consumer preferences.

Limitations of Validity

While ensuring validity is essential, it has its limitations:

  • Complexity : Achieving high validity can be complex and resource-intensive.
  • Context-Specific : Some validity types may not be universally applicable across all contexts.
  • Subjectivity : Certain types of validity, like face validity, involve subjective judgments.

By understanding and addressing these aspects of validity, researchers can enhance the quality and impact of their studies, leading to more reliable and actionable results.

About the author: Muhammad Hassan, Researcher, Academic Writer, Web Developer

METHODS article

What Do You Think You Are Measuring? A Mixed-Methods Procedure for Assessing the Content Validity of Test Items and Theory-Based Scaling

Ingrid Koller

  • 1 Department of Psychology, Alpen-Adria-Universität Klagenfurt, Klagenfurt, Austria
  • 2 Department of Human Development and Family Sciences, Oregon State University, Corvallis, OR, USA

The valid measurement of latent constructs is crucial for psychological research. Here, we present a mixed-methods procedure for improving the precision of construct definitions, determining the content validity of items, evaluating the representativeness of items for the target construct, generating test items, and analyzing items on a theoretical basis. To illustrate the mixed-methods content-scaling-structure (CSS) procedure, we analyze the Adult Self-Transcendence Inventory, a self-report measure of wisdom (ASTI, Levenson et al., 2005 ). A content-validity analysis of the ASTI items was used as the basis of psychometric analyses using multidimensional item response models ( N = 1215). We found that the new procedure produced important suggestions concerning five subdimensions of the ASTI that were not identifiable using exploratory methods. The study shows that the application of the suggested procedure leads to a deeper understanding of latent constructs. It also demonstrates the advantages of theory-based item analysis.

Introduction

Construct validity is an important criterion of measurement validity. Broadly put, a scale or test is valid if it exhibits good psychometric properties (e.g., unidimensionality) and measures what it is intended to measure (e.g., Haynes et al., 1995 ; de Von et al., 2007 ). The 2014 Standards for Educational and Psychological Testing ( American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 2014 ) state that test content is one of four interrelated sources of validity, the other ones being internal structure, response processes, and relations to other constructs. While elaborate empirical and statistical procedures exist for evaluating at least internal structure and relations to other constructs, the validity of test content is harder to ensure and to quantify. As Webb (2006) wrote in his chapter in the Handbook of Test Development, “Identifying the content of a test designed to measure students' content knowledge and skills is as much an art as it is a science. The science of content specification draws on conceptual frameworks, mathematical models, and replicable procedures. The art of content specification is based on expert judgments, writing effective test items, and balancing the many tradeoffs that have to be made” (p. 155). This may be even more the case when the construct at hand is not a form of knowledge or skill, where responses can be coded as right or wrong, but a personality or attitudinal construct where responses are self-judgments. How can we evaluate the validity of the items that form a test or questionnaire?

The development and evaluation of tests and questionnaires is a complex and lengthy process. The phases of this process have been described, for example, in the Standards for Educational and Psychological Testing by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (2014) , or by the Educational Testing Service (2014) . A good overview is given by Lane et al. (2016) . These standards require, for example, that the procedures for selecting experts and collecting their judgments should be fully described, or that the potential for effects of construct-irrelevant characteristics on the measure should be minimized. Here, we describe a concrete procedure for evaluating and optimizing the content validity of existing and new measures in a theory-based way that is consistent with the requirements of the various standards. That is, we do not focus on ways to develop new test items in a theory-consistent way, which are described at length in the literature (e.g., Haladyna and Rodriguez, 2013 ; Lane et al., 2016 ), but on how to evaluate existing items in an optimal, unbiased way.

Content validity (Rossiter, 2008) is defined as “the degree to which elements of an assessment instrument are relevant to and representative of the targeted construct for a particular assessment purpose” (Haynes et al., 1995, p. 238). Content validity includes several aspects, e.g., the validity and representativeness of the definition of the construct, the clarity of the instructions, linguistic aspects of the items (e.g., content, grammar), representativeness of the item pool, and the adequacy of the response format. The higher the content validity of a test, the more accurate the measurement of the target construct (e.g., de Von et al., 2007). While we all know how important content validity is, it tends to receive little attention in assessment practice and research (e.g., Rossiter, 2008; Johnston et al., 2014). In many cases, test developers assume that content validity is represented by the theoretical definition of the construct, or they do not discuss content validity at all. At the same time, content validity is a necessary condition for other aspects of construct validity. A test or scale that does not really cover the content of the construct it intends to measure will not be related to other constructs or criteria in the way that would be expected for the respective construct.

In a seminal paper, Haynes et al. (1995) emphasized the importance of content validity and gave an overview of methods to assess it. After the publication of this paper, consideration of content validity in assessment studies increased for a short time, especially in the journal where the authors published their work. However, a brief literature search in journals relevant to the topic shows that content validity is still rarely referred to and even less often analyzed systematically. Between 1995 and 2015, “content validity” was mentioned in one article published in Assessment , in 44 articles in Psychological Assessment , where the paper by Haynes et al. (1995) was published, in 22 articles in Educational Assessment , and in seven articles in European Journal of Psychological Assessment . Currently, content validity is rarely mentioned in psychological journals but receives special attention in other disciplines such as nursing research (e.g., de Von et al., 2007 ).

Methods to Evaluate Content Validity

Several approaches to evaluate content validity have been described in the literature. One of the first procedures was probably the Delphi method, which was used since 1940 by the National Aeronautics and Space Administration (NASA) as a systematic method for technical predictions (see Sackman, 1974 ; Hasson and Keeney, 2011 ). The Delphi method, which is predominantly used in medical research, is a structural iterative communication technique where experts assess the importance of characteristics, symptoms, or items for a target construct (e.g., Jobst et al., 2013 ; Kikukawa et al., 2014 ). In the first round, each expert individually rates the importance of symptoms/items for the illness/construct of interest. In the second round, the experts receive summarized results based on the first round and can make further comments or revise their answers of the first round. The process stops when a previously defined homogeneity criterion is achieved.

Most procedures currently used to investigate content validity are based on the quantitative methods described by Lawshe (1975) or Lynn (1986) , who also provided numerical content validity indices (see Sireci, 1998 ). All these methods, as well as the Delphi-method, are based on expert judgments where a number of experts rate the relevance of the items for the construct on 4- to 10-point scales or using percentages ( Haynes et al., 1995 ) or assign the items to the dimensions of the construct (see Moore and Benbasat, 2014 ). Then, indicators of average relevance are calculated. This can be done by calculating simple average percentages (e.g., if the number of experts is low) or by using a cut-off value (usually 70–80%) to decide whether an item measures the respective construct (e.g., Sireci, 1998 ; Newman et al., 2013 ). In other cases, as mentioned above, a content validity index ( Lawshe, 1975 ; Lynn, 1986 ; Polit and Beck, 2006 ; Polit et al., 2007 ; Zamanzadeh et al., 2014 ) is estimated for individual items (e.g., the proportion of agreement among experts concerning scale assignment) or for the whole scale (e.g., the proportion of items that are judged as content valid by all experts).
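As a sketch of the “proportion of agreement” idea described above (the four-point relevance ratings, the 80% cut-off, and the item names are illustrative, not a reproduction of any specific published index):

```python
# Relevance ratings (1 = not relevant ... 4 = highly relevant) from five experts per item
ratings = {
    "item_1": [4, 4, 3, 4, 3],
    "item_2": [2, 3, 1, 2, 3],
    "item_3": [4, 3, 4, 4, 4],
}

CUTOFF = 0.80  # within the 70-80% agreement range mentioned above

def item_agreement(item_ratings, relevant_from=3):
    """Proportion of experts who rated the item as relevant (rating >= relevant_from)."""
    return sum(r >= relevant_from for r in item_ratings) / len(item_ratings)

item_cvis = {item: item_agreement(r) for item, r in ratings.items()}
scale_cvi = sum(v >= CUTOFF for v in item_cvis.values()) / len(item_cvis)

print(item_cvis)   # {'item_1': 1.0, 'item_2': 0.4, 'item_3': 1.0}
print(scale_cvi)   # proportion of items that meet the agreement cut-off
```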

There exists, however, no systematic procedure that could be used as a general guideline for the evaluation of content validity (cf. Newman et al., 2013). Also, Johnston et al. (2014) recently called for a satisfactory, transparent, and systematic method to assess content validity. They described a quantitative “discriminant content validity” (DCV) approach that assesses not only the relevance of items for the construct, but also whether discriminant constructs play an important role for response behavior. In this method, experts evaluate how relevant each item is for the construct. After that, it is statistically determined whether each item measures the construct of interest. This procedure is well-suited for determining content validity, but the authors argued that it is not possible to determine the representativeness of the items for the target construct, as claimed by Haynes et al. (1995). Furthermore, it involves only purely quantitative analyses and does not utilize the potential of qualitative approaches. For example, Newman et al. (2013) illustrated the advantages of mixed-method procedures for evaluating content validity or even validity in general. They specifically mention two advantages: the possibility to improve validity and reliability of the instrument, and the resulting new insights into the nature of the target construct. They introduced the Table of Specification (ToS), which requires experts, among other things, to assign items to constructs using percentages. Additionally, the experts can estimate the overall representativeness of the whole set of items for the target construct and add comments concerning this topic. The ToS is widely applicable and a good way to evaluate content validity, but it does not allow the evaluation of overlaps between different constructs and does not connect the results to the subsequent psychometric analysis of items. Such a connection not only improves the measurement of content validity, it also allows for theory-based item analysis. That is, expert judgments about item content can be used to derive specific hypotheses about item fit, which can then be tested statistically.

Theory-Based Item Analysis

Scale items are often developed on the basis of theoretical definitions of the construct, and sometimes they are even analyzed for content validity in similar ways as described above, but after this step, item selection is usually purely empirical. A set of items is completed by a sample of participants, and response frequencies and indicators of reliability such as item-total correlations are used to select the best-functioning items. Rossiter (2008) criticized that at the end of such purely empirical item-selection processes, the remaining items often measure only one narrow aspect of the target construct. In such cases, it would perhaps be possible to retain the diversity of the original items by constructing subscales. Some authors, such as Gittler (1986) and Koller et al. (2012) , have long highlighted the importance of theory-based analysis. For example, Koller and Lamm (2015) showed that a theory-based analysis of items measuring cognitive empathy yielded important information concerning scale dimensionality. In this study, the authors derived hypotheses about item-specific violations of the Rasch model from expert judgments about item content. The expert judgements suggested that the perspective-taking subscale could theoretically be subdivided into the dimensions “to understand both sides” and “to put oneself in the position of another person.” Psychometric analysis confirmed this assumption, which is also in accordance with recent findings from social neuroscience (e.g., Singer and Lamm, 2009 ). In sum, approaches from neuropsychology, item response theory, and qualitative item-content analysis were integrated into a more valid assessment of cognitive empathy.

Goals of the Study

The main aim of this paper is to introduce the Content-Scaling-Structure (CSS) procedure, a mixed-methods approach for analyzing and optimizing the content validity of a questionnaire or scale. The approach is suitable for many research questions in the realm of content validity, such as the formulation of a priori hypotheses for a theory-based item analysis, the refinement of the definition of a target construct (including its possible differentiation into subdimensions), the evaluation of the representativeness of items for the target construct, or the investigation of the influence of related constructs on the items of a scale. Thus, the proposed CSS procedure combines the qualitative and quantitative investigation of content validity with an approach for the theory-based psychometric analysis of items. Furthermore, it can lead to a better understanding of the latent construct itself, a better embedding of empirical findings in the research literature, and a higher construct validity of the instrument in general. We present a general description of the procedure (see Table 1) and describe several possible adaptations for examining different types of latent constructs (e.g., competencies vs. traits, less vs. more complex constructs). In summary, the proposed CSS procedure fulfills the demand for a systematic and transparent procedure for estimating content validity, incorporates the advantages of mixed methodologies, and allows researchers not only to evaluate content validity, but also to perform theory-based item analyses. Although the CSS procedure was developed independently of the ToS (Newman et al., 2013), there are several similarities. However, the CSS procedure is not limited to the non-psychometric analysis of content validity; it also includes psychometric analyses and the integration of the non-psychometric and psychometric parts. Accordingly, the first part of the CSS procedure can be viewed as an adaptation and extension of the ToS.


Table 1. General schedule for the Content-Scaling-Structure (CSS) procedure .

To demonstrate our procedure, we use the Adult Self-Transcendence Inventory (ASTI), a self-report scale measuring the complex target construct of wisdom. In the first part of the study, an expert panel analyzed the content of the ASTI items with respect to the underlying constructs in order to investigate dimensionality, identify potential predictors of differential item functioning, and evaluate the appropriateness of the construct definition for the questionnaire. In the second part, data from a sample of 1215 participants were used to evaluate the items with multidimensional item response theory models, building on the results of the first part. It is not mandatory to use item response modeling for such analyses; other psychometric methods, such as exploratory or confirmatory factor analysis, may also be employed, although in our view item response models are particularly well-suited to testing specific hypotheses about item functioning. Finally, the results and the proposed procedure are discussed, and a new definition of the target construct measured by the ASTI is given.

Before we describe the CSS-procedure in detail, we introduce the topic of measuring wisdom and the research questions for the presented study. After that, the CSS procedure is described and illustrated using the ASTI as an example.

Measuring Wisdom with the Adult Self-Transcendence Inventory

Wisdom is a complex and multifaceted construct, and after 30 years of psychological wisdom research, measuring it in a reliable and valid way is still a major challenge (Glück et al., 2013). In this study, we used the ASTI, a self-report measure that conceptualizes wisdom as self-transcendence (Levenson et al., 2005). The idea that self-transcendence is at the core of wisdom was first put forward by the philosopher Trevor Curnow (1999) in an analysis of the common elements of wisdom conceptions across different cultures. Curnow identified four general principles of wisdom: self-knowledge, detachment, integration, and self-transcendence. Levenson et al. (2005) suggested considering these principles as stages in the development of wisdom.

Self-knowledge is awareness of the sources of one's sense of self. The sense of self arises in the context of roles, achievements, relationships, and beliefs. Individuals high in self-knowledge have developed awareness of their own thoughts and feelings, attitudes, and motivations. They are also aware of aspects that do not agree with their ideal of who they would like to be.

Individuals high in non-attachment are aware of the transience and provisional nature of the external things, relationships, roles, and achievements that contribute to our sense of self. They know that these things are not essential parts of the self but observable, passing phenomena. This does not mean that they do not care for their relationships—on the contrary, non-attachment increases openness to and acceptance of others: individuals who are less identified with their own wishes, motives, thoughts, and feelings are better able to perceive, and care about, the needs and feelings of others.

Integration is the dissolution of separate “inner selves.” Different, contradictory self-representations or motives are no longer in conflict with each other or with the person's ideal self, but accepted as part of the person. This means that defense mechanisms that protect self-worth against threats from undesirable self-aspects are no longer needed.

Self-transcendent individuals are detached from external definitions of the self. They know and accept who they are and therefore do not need to focus on self-enhancement in their interaction with others. For this reason, they are able to dissolve rigid boundaries between themselves and others, truly care about others, and feel that they are part of a greater whole. Levenson et al. (2005) argue that self-transcendence in this sense is at the core of wisdom.

Levenson et al. (2005) developed a first version of the ASTI to measure these four dimensions. That original version consisted of 18 items with a response scale that asked for changes over the last 5 years rather than for participants' current status. This approach turned out to be suboptimal, however, and a revised version was developed that consists of 34 items with a more common response scale ranging from 1 (“disagree strongly”) to 4 (“agree strongly”). Some items were reworded from Tornstam's (1994) gerotranscendence scale, others were newly constructed by the test authors in order to broaden the construct. The majority of the items are related to one of the four dimensions identified by Curnow (1999) . They refer to inner peace independent of external things, feelings of unity with others and nature, joy in life, and an integrated sense of self. Ten of the items were included to measure alienation as a potential opposite of wisdom.

In a study including wisdom nominees as well as control participants, Glück et al. (2013) used a German-language version of the revised ASTI. The scale was translated into German and back-translated. One of the items ("Whatever I do to others, I do to myself") was difficult to translate into German because it could have two different meanings (item 34: doing something (good) for others and oneself, or item 35: doing something (bad) to others and oneself). After consultation with the scale author, M. R. Levenson, both translations were retained in the German scale, resulting in a total of 35 items. In the study by Glück et al. (2013), the ASTI had the highest amount of shared variance with three other measures of wisdom, which suggests that it may indeed tap core aspects of wisdom. Reliability was satisfactory (Cronbach's α = 0.83), but factor analyses using promax rotation suggested that the ASTI might measure more than one dimension. The factor loadings did not suggest a clear structure representing Curnow's four dimensions, however. Scree plots suggested either one factor (accounting for only 21.7% of the variance) or three factors (38.5%). The first factor in the three-factor solution comprised five items describing self-transcendence (feeling one with nature, feeling part of a greater whole, engaging in quiet contemplation), but the other two factors included only two or three items each and did not correspond well with the subfacets proposed by the scale authors. Thus, factor-analytically, the structure of the ASTI was not very clear. As the scale had not been systematically constructed to include specific subscales, there was no basis for a confirmatory factor analysis, and given the exploratory results it seems doubtful that such an analysis would have yielded clear results.

In the current study, we used the CSS procedure to gain insights into the structure of the scale, with the goal of identifying possible subscales. We only analyzed the 24 items measuring self-transcendence, excluding the 10 alienation items. Tables 3A–C list all analyzed items. Before the point-by-point description of the CSS procedure, we describe the research questions for the investigation of the ASTI.

Description and Application of the Content-Scaling-Structure Procedure

As the study design is intended to be a template for other studies, we first present a point-by-point description of the steps in Table 1 . In general, detailed descriptions of the procedure for assigning items to scales are important elements for the transparency of studies, especially as they offer the possibility to reanalyze and compare results across different studies (e.g., Johnston et al., 2014 ).

The research questions for the current study were as follows.

• Which items of the ASTI fulfill the requirement of content validity? The theoretical background of the ASTI describes a multidimensional, complex construct, and it is not obvious for all items to which dimension(s) they pertain. In addition, item responses may be influenced by other, related constructs that the ASTI is not intended to tap. Thus, we investigated experts' views on the relation of the items to the dimensions of the ASTI and to related constructs. The results of this analysis were also expected to show whether all dimensions of the ASTI are well-represented by the items, i.e., whether enough items are included to assess each dimension.

• How sufficient is the definition of the four-factor model underlying the ASTI? In earlier studies using content-dimensionality scaling, we have repeatedly found that the exercise of assigning items to dimensions and discussing our assignments has enabled us to rethink the definitions of the constructs themselves, identify conceptual overlaps between dimensions, and find that certain aspects were missing in a definition or did not fit into it. In other cases, we found that a construct was too broadly defined and needed to be divided into sub-constructs. In this vein, it is also important to consider whether other, related constructs may affect the responses to some items.

• Do the expert judgments suggest hypotheses for theory-based item analysis? The expert judgments are interesting not only with respect to the dimensionality of the ASTI. Other relevant psychometric issues include potential predictors of differential item functioning, the comprehensibility of items, and overly strong or imprecise item formulations.

• To what extent are the theory-based assignments of items to dimensions supported by psychometric analyses? The ultimate goal of the study was to identify unidimensional subscales of the ASTI that would include as many of the items as possible and make theoretical as well as empirical sense. In earlier studies, we have repeatedly found that subscales that were identified in a theory-based way remained highly stable in cross-validations with independent samples, whereas subscales determined in a purely empirical way did not.

Development of the Expert Questionnaire: Instruction, Definitions, and Item Booklet

Independent of whether the research question concerns the construction of a new measurement instrument, the evaluation of an existing instrument, or the evaluation of the representativeness of the items for the target construct, the first step is to lay out the definition of the target construct in a sufficiently comprehensive way (e.g., based on a systematic review). That is, all relevant dimensions of the target construct should be defined in such a way that there is no conceptual overlap between them. Because the goal of the current study was to investigate the items and definitions of the ASTI and not the representativeness of the items for the construct of wisdom, we used the four levels of self-transcendence (self-knowledge, non-attachment, integration, and self-transcendence) as defined above.

The second step is the generation of the questionnaire for the expert ratings based on the definitions of the target dimensions and the items. It includes a clear written instruction, the definitions of each subdimension, and an item booklet (see Figure 1 ) with the items in random order (i.e., not ordered by subdimension or content).


Figure 1. A fictitious example of an item booklet.

In the present study, the experts were first instructed to carefully read the definitions of the four dimensions underlying the ASTI shown above. They were then asked to read each item and to use percentages to assign it to the four dimensions: if an item tapped just one dimension, they were to write "100%" into the respective column; if an item tapped more than one dimension, they were to split the 100% accordingly. For example, an item might be judged as measuring 80% self-transcendence and 20% integration. It is also possible to "force" experts to assign each item to only one dimension. This might be useful for re-evaluating an item assignment produced in earlier CSS rounds. As a first step, however, we believe that this kind of forced assignment could lead to a loss of valuable psychometric information about the items and increase the possibility of assignment errors.

If the experts felt that an item was largely measuring something other than the four dimensions, they should not fill out the percentages but make a note in an empty space below each item. They were also asked to note down any other thoughts or comments in that space. In addition, they were asked two specific questions: (1) “Do you think the item could be difficult to understand? If yes, why?,” and (2) “Do you think the item might have a different meaning for certain groups of people (e.g., men vs. women, younger vs. older participants, participants from different professional fields, or levels of education)? If yes, why?” The responses to these last questions allowed us to formulate a-priori hypotheses about differential item functioning (DIF; e.g., Holland and Wainer, 1993 ; see Section Testing Model Fit).
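To make this rating format concrete, the sketch below shows one simple way such percentage assignments could be recorded and checked before aggregation. The item labels, dimension names, and values are purely hypothetical; the actual booklet was an RTF questionnaire filled out by the experts.

```python
# Minimal sketch of recording one expert's percentage assignments
# (hypothetical items and values, not the ASTI booklet used in the study).

DIMENSIONS = ["self-knowledge", "non-attachment", "integration", "self-transcendence"]

expert_ratings = {
    "item_01": {"self-knowledge": 80, "integration": 20},
    "item_02": {"self-transcendence": 100},
    "item_03": {"non-attachment": 60, "self-transcendence": 40},
}

def check_ratings(ratings):
    """Verify that each item's percentages refer to known dimensions and sum to 100."""
    for item, assignment in ratings.items():
        unknown = set(assignment) - set(DIMENSIONS)
        if unknown:
            raise ValueError(f"{item}: unknown dimension(s) {unknown}")
        total = sum(assignment.values())
        if total != 100:
            raise ValueError(f"{item}: percentages sum to {total}, expected 100")

check_ratings(expert_ratings)  # raises if an assignment is malformed
```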

Selection of Experts

We recommend recruiting at least five experts (cf. Haynes et al., 1995): (1) at least two individuals with high levels of scientific and/or applied knowledge and experience concerning the construct of interest and related constructs, and (2) at least two individuals with expertise in test construction and psychometrics, who evaluate the items from a different perspective than the content experts, so that their answers may be particularly helpful for theory-based item analysis. In addition, depending on the construct of interest, it may be useful to include laypersons, especially individuals from the target population (e.g., patients for a measure of a psychiatric construct). A higher number of experts allows for a better evaluation of the consistency of their judgments, increases the reliability and validity of the results, and allows the calculation of content validity indices. However, a higher number of experts can also increase the complexity of the results. In any case, the quality and diversity of the experts may be more important than their number.

In the present study, nine experts (seven women and two men) participated. Because the aim of the study was to validate the items of the ASTI, all experts were psychologists; five were mainly experts in wisdom research and related fields (including the second and third author), and four were mainly experts in test theory and assessment in different research fields (including the first author). All experts worked with the German version of the ASTI, except for the second author, who used the original English version. The experts were invited by email, and the questionnaire was sent to them as an RTF document.

Individual Data Collection with Each Expert

The experts filled out the questionnaire individually and without time limits.

Summary of the Results Based on Predefined Rules

Next, the responses of the experts were summarized according to pre-defined rules. As Table 2 shows, we produced one summarizing table for each item. In this table, the percentages provided by the experts were averaged. In addition, we counted how many experts assigned each item to each dimension. Finally, the notes provided by the experts were sorted into three categories: notes concerning dimension assignment, item or dimension content, and psychometric issues.


Table 2. Example of summarized results for the discussion of final assignments .

Of course, these three categories are only examples; other categories are also possible. This qualitative part of the study can offer theoretical insights about the target construct as well as about the individual items.

The last row of Table 2 contains the average percentages for each dimension across all experts. These values allow for first insights concerning dimension assignments; for example, the mean values might be d1 = 50%, d2 = 30%, d3 = 10%, d4 = 10%. A cutoff value can be used to determine which dimensions are relevant for the item. The choice of the cutoff value depends on the research aims: if the goal is to select items that measure only one subcomponent of a clearly and narrowly defined construct, a cutoff of 80% may be useful. This would mean that the experts agree substantially that each item very clearly taps one and only one of the dimensions involved. Newman et al. (2013), for example, wrote that 80% agreement among experts is a typical rule of thumb for sufficient content validity. If the goal is to inspect the functioning of a scale measuring a broader, more vaguely defined construct, lower cutoff values may be useful. In addition to setting a cutoff, it may also be important to define rules for dealing with items that have relevant percentages on more than one subdimension. Our experience is that simply discarding all such items may mean discarding important aspects of the construct. It may be more helpful to consider alternative ways of dividing the scale into subcomponents. Even if an item has a high average assignment on only one dimension, it may still be worthwhile to check whether some experts assigned it strongly to another dimension. Thus, it is important not only to look at the averages, but to inspect the whole table for each item. In addition, if several rounds of expert evaluations are performed, the cutoff could be raised from round to round.
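As an illustration of these summarizing rules, the following sketch averages the percentage assignments of several hypothetical experts for a single item, counts how many experts gave each dimension at least the cutoff value, and flags the dimensions whose mean assignment reaches the cutoff. The ratings and the 30% cutoff are invented for illustration and are not the study data.

```python
import statistics

CUTOFF = 30  # percent; item selection for a narrow construct might use 80 instead

# Hypothetical assignments of one item by five experts (percentages per dimension).
item_ratings = [
    {"self-knowledge": 50, "integration": 50},
    {"self-knowledge": 70, "integration": 30},
    {"integration": 100},
    {"self-knowledge": 40, "integration": 40, "self-transcendence": 20},
    {"self-knowledge": 60, "integration": 40},
]

dimensions = sorted({d for rating in item_ratings for d in rating})

# Mean percentage per dimension across experts (missing assignments count as 0%).
means = {d: statistics.mean(r.get(d, 0) for r in item_ratings) for d in dimensions}

# Homogeneity: number of experts who gave the dimension at least the cutoff value.
homogeneity = {d: sum(r.get(d, 0) >= CUTOFF for r in item_ratings) for d in dimensions}

# Dimensions whose mean assignment reaches the cutoff are candidates for this item.
candidates = [d for d, m in means.items() if m >= CUTOFF]

print(means)        # {'integration': 52, 'self-knowledge': 44, 'self-transcendence': 4}
print(homogeneity)  # {'integration': 5, 'self-knowledge': 4, 'self-transcendence': 0}
print(candidates)   # ['integration', 'self-knowledge']
```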

In the current case, as the goal was not to select items but to gain information about an existing scale, we used a much lower cutoff of 30% to determine which dimension(s) the experts considered most fitting for each item. We then presented the results, in aggregated form, to a subgroup of the experts with the goal of redefining and perhaps differentiating the dimensions to allow for a clearer assignment of items. The number of experts who assigned the item to the most prominent dimension with a percentage of at least the cutoff value served as an indicator of the homogeneity of the experts' views (see Tables 3A–C, column 4), similar to the classical content validity index (see Polit and Beck, 2006).


Table 3A. Results of the final assignment of the ASTI items .


Table 3B. Results of the final assignment of the ASTI items .


Table 3C. Results of the final assignment of the ASTI items .

Expert Meeting: Discussion of the Results

Next, the experts are invited to discuss the assignments and comments as a group. This discussion is particularly fruitful if the assignments were relatively heterogeneous. It can lead to clarifications and possible modifications of the definitions of the dimensions, removal of items that clearly do not fit the construct, and even generation of additional items. If the original assignments were very heterogeneous, it makes sense to repeat the individual assignment and collective discussion in order to achieve a sufficient level of agreement among the experts. However, this iterative process can become very complex and is not always feasible. In any case, a minimum of two experts from different fields (for example, one content expert and one psychometrician) should make the decisions together.

The results of the analysis and discussion of the experts' assignments can take various forms. Usually, some items are clearly assigned to a specific dimension, others turn out to be so equivocal that they are eliminated. In some cases, however, the conceptualization of the dimensions needs to be reconsidered. For example, as mentioned above, if a number of items are assigned to two dimensions with about equal weight, this may mean that the two dimensions need to be collapsed or that an additional dimension is required that is conceptually located between the two. If the comments of experts provide new insights for possible dimension definitions or labels, these comments can also be included in the formulation of new definitions.

In the present study, it was not possible to discuss the results with all experts. Thus, the third author, an expert on the topic of wisdom and psychometrics, and the first author, a psychometrician not familiar with the concept of wisdom, discussed the results, performed the final assignment of the items, and formulated new names and definitions for the resulting dimensions where they differed from the original ones.

Final Assignments, Modified Definitions, and Possible Associations between Dimensions

Final assignment of the items to the dimensions.

The results based on the assignments and the final discussion of the two experts are given in Tables 3A – C . Only important results are presented here. The third columns show only the final assignments to subdimensions, the fourth columns show only the relevant mean percentages and confidence intervals, and the fifth columns show the number of experts who assigned the item to one dimension with a minimum of 30% (homogeneity of expert judgments). The last columns present the summarized comments without any categorization because the number of comments was generally low. As the tables show, new dimensions such as “Peace of Mind” emerged in the discussion of the commonalities among items that seemed to tap an affective aspect of non-attachment. In the following, we describe the content of the scales that emerged from the final assignments and propose psychometric hypotheses for each subdimension.

“Self-knowledge and self-integration” (SI)

The four items in this scale all describe aspects of knowing and accepting oneself, including possibly diverging aspects and positive and negative sides (see Table 3A). Thus, the overarching theme of this subscale is knowing, accepting, and integrating the aspects of oneself and one's life. One item ("I feel that I know myself") was theoretically assigned to the subdimension of self-knowledge only; the others were assigned to self-knowledge as well as integration, or to integration only. The two subdimensions were merged into one scale based on the rationale that self-knowledge can be considered a precondition for integration. Item 10 ("I have a good sense of humor about myself") was problematic as it was assigned to at least three categories. We eventually assigned it to "self-knowledge and integration" because it reflects a kind of benevolent acceptance of oneself including one's flaws, but made a note to look specifically at the fit of this item with the others in the psychometric analyses.

“Peace of mind” (PM)

All items of this scale are about valuing and maintaining one's tranquility even in the face of reasons to get angry or upset (see Table 3A). Item 22, "I can accept the impermanence of things," seems to deviate somewhat from this pattern, but on closer inspection, the ability to accept that all things are impermanent is about being able to remain calm in the face of losses as well as gains.

“Non-attachment” (NA)

This scale comprises items concerning the individual's independence of external things, namely, other people's opinions, a busy social life, or material possessions, and of other people and things in general (see Table 3B ). Thus, it clearly corresponds to Curnow's (1999) concept of non-attachment. All items in this scale were predominantly assigned to the non-attachment component, but also, with percentages ranging from 29 to 40, to the self-transcendence component. This suggests that the experts considered the individual's independence of external sources of reinforcement as a part or precondition of self-transcendence. One reason may be that our definition of self-transcendence included the statement that self-transcendent individuals are detached from external definitions of self, which was based on the idea that self-transcendence is the last stage of a development through the other stages. As mentioned above, for the independent measurement of the four dimensions, it would seem important to avoid such conceptual overlaps. In any case, the common characteristic of the four items is their reference to non-attachment.

"Self-transcendence" (ST)

All items in this scale were unanimously assigned to the self-transcendence dimension; they refer to individuals experiencing themselves as part of or closely related to something larger than themselves—“a greater whole,” “earlier and future generations,” or nature (see Table 3B ). Item 13, “I feel compassionate even toward people who have been unkind to me,” also suggests that the individual can relate to others on a general level that goes beyond personal relations. The essence of this scale is perhaps best captured by item 02, “I feel that my individual life is a part of a greater whole.”

“Presence in the here-and-now and growth” (PG)

The fifth dimension was labeled “presence in the here-and-now and growth”: its items describe individuals who are able to live in the moment: they find joy in their life and in what they are doing in a given moment, without being fearful of the future or preoccupied with the finitude of life (see Table 3C ). They are aware that things are always changing, oriented toward learning from others, and aware that they have grown through losses.

Definition of Psychometric Hypotheses

A goal of the analyses was to test whether the hypotheses gained from the expert judgments could be used to improve the psychometric functioning of the ASTI. Specifically, we wanted to test whether the ASTI as a whole formed a unidimensional scale and, if not, whether the five subdimensions derived from the expert assignments of the items would form unidimensional scales. We also wanted to test whether single items within each scale diverged from the others. For the theory-based item analysis, we summarized the comments from Tables 3A–C into psychometric categories. These categories are useful not only for the interpretation of non-conforming items, but also for the construction of new additional items. We identified three main categories of expert comments:

• Test fairness (e.g., differential item functioning; see Section Testing Model Fit): For eight items (I02, I05, I12, I15, I17, I20, I21, I22), the experts noted possible context dependencies (e.g., the response may be dependent on life situation, life events, material possessions, health, or culture) and for five items (I04, I16, I17, I21, I23), possible influences of respondent age. For one item (I14), the experts suspected differences between men and women.

• Influences of other constructs: The expert judgments generally suggested that the items of the ASTI are good indicators for the target construct. But for some items, other constructs, such as emotion regulation (I01, I09), extraversion (I03, I06), egoism (I03, I06, I08), empathy (I13), or spirituality (I07, I24, I25) may influence responses.

• Linguistic factors: Linguistic problems (e.g., difficulties in comprehension) were suspected for only five items (I01, I05, I06, I13, I14).

Definition of Possible Associations between Dimensions

Sometimes researchers have theoretical assumptions about relationships between the various dimensions. Item response models can be used to test such hypotheses, e.g., to test predictions about correlations between dimensions or the structural relationships among them. In the current example, we only explored the latent correlations between the dimensions.

Validation Study

In the current study, we used item response models to test the psychometric functioning of the ASTI based on the results of the expert assignments.

Participants

Data were collected individually from 1215 participants in Austria and Germany by trained students as part of their class work. A total of 666 participants were students (431 women, Mdn age = 23, IQR = 5, min = 18, max = 60) and 549 were non-students (346 women, Mdn age = 35, IQR = 23, min = 15, max = 81). Some participants failed to fill out the whole questionnaire, but the frequency of missing values per item was very low ( M = 0.4%, SD = 0.27, min = 0, max = 0.7) and was not associated with external variables, suggesting that the missing values can be treated as occurring randomly.

Participants filled out a set of paper-and-pencil scales and answered demographic questions. Overall, participation took about 25 min on average. The questionnaire included the ASTI and additional scales outside the scope of this paper.

Analytical Procedures

To test the unidimensionality of the new subscales, we used an approach from the family of Rasch models. Rasch models (Rasch, 1960; for an overview see Fischer and Molenaar, 1995) and their extensions for graded response categories are very useful for testing specific hypotheses about the dimensionality of items within a scale. First, they test an assumption that is usually taken for granted when a score is computed by summing up the items of a scale: the sum score is a valid indicator of a construct only if all items measure the same latent dimension (Rasch, 1960). If, for example, some items in our scale measure non-attachment while others measure self-knowledge, and these two constructs can occur independently of each other, then summing up across all items is not informative about a person's actual construct levels; one would need to know their separate scores for the two subdimensions. Only if all items measure the same construct is the raw score a good indicator of a person's level of that construct. The main indicators that Rasch-family models use are item parameters and person parameters, which are placed on the same latent dimension. Persons' positions on the latent dimension, their so-called person parameters, are determined by their raw scores: the higher a person's score, the more likely the person is to agree to the items of the test. Items are represented by monotonically increasing, asymptotic probability curves on the same latent dimension: the probability that a person agrees to an item depends on the relation between the position of the item and the position of the person on the latent dimension. Each item's position is described by its item parameter, i.e., the point on the latent dimension where a person's probability of agreeing to the item is 0.50. For items with graded responses, as in the current case, parameters describe the thresholds between adjacent response categories. Here, we used the Partial Credit Model (PCM; Masters, 1982; Masters and Wright, 1997), which assumes unidimensionality of the items but does not assume that the distances between categories are equal across items.
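For reference, the category probabilities of the PCM (Masters, 1982) can be written in their standard form, with θ_v the person parameter of person v, δ_ik the k-th threshold parameter of item i, and m_i the number of thresholds of item i:

\[
P(X_{vi} = x \mid \theta_v) = \frac{\exp\!\left(\sum_{k=0}^{x} (\theta_v - \delta_{ik})\right)}{\sum_{h=0}^{m_i} \exp\!\left(\sum_{k=0}^{h} (\theta_v - \delta_{ik})\right)}, \qquad x = 0, \ldots, m_i, \quad \text{with } \delta_{i0} \equiv 0.
\]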

Specifically, we used the multidimensional random coefficient multinomial logit model (MRCML model; Adams et al., 1997 ), a generalization of a wide class of Rasch models including the PCM. The PCM is implemented, for example, in the software ConQuest ( Wu et al., 2007 ), in MPlus 7.4 ( Muthén and Muthén, 1998–2015 ), and in the R-package “Test Analysis Modules” (TAM; Kiefer et al., 2015 ), which was used here.

The item parameters were estimated using marginal maximum likelihood (MML) estimation and the person parameters using weighted maximum likelihood estimation (WLE). The item analysis procedure follows Pohl and Carstensen (2012), who outlined an approach to item analysis for use in large-scale assessment settings. We believe that this approach is also useful for smaller-scale studies.

Testing Model Fit

As explained earlier, a main goal of the study was to integrate the proposed dimensions and the experts' hypotheses concerning item fit with a psychometric investigation of the items. As shown in the section "Definition of Psychometric Hypotheses," strong hypotheses concerning dimensionality and candidate predictors of possible item misfit were identified. Accordingly, these predictors were used in the psychometric analysis to test for significant item misfit.

Before starting with the actual analyses, the category frequencies for each item were assessed, because low frequencies can cause estimation problems. If the frequency of a response category was below 100, it was collapsed with the next category (see Pohl and Carstensen, 2012). For 11 items, the two lowest categories were merged, and for two other items, the two highest categories. In the remaining 12 items, all category frequencies were above 100. In the development of new measures, one usually tries to avoid items with very low frequencies in some response categories. With constructs like wisdom, however, which are very positively valued, few participants disagree with positively worded items, and the variance that does exist is mostly located between the middle and the highest category. If such items represent theoretically important aspects of the construct, they may well be kept as part of the scale. In the current case, low frequencies in the lowest categories were particularly typical for the SI and PG subdimensions (four items each), and removing these items would have depleted both scales of important content. In the following, we describe the analyses that were performed.
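The collapsing rule can be stated compactly in code. The sketch below merges sparse outer response categories (observed fewer than 100 times) with their inner neighbor, which corresponds to the merging of the two lowest or two highest categories described above; the frequency counts are invented.

```python
MIN_FREQ = 100  # minimum observed frequency per category (cf. Pohl and Carstensen, 2012)

def collapse_categories(freqs, min_freq=MIN_FREQ):
    """Collapse sparse outer response categories with their inner neighbor.

    freqs: list of observed counts per ordered category, e.g. [43, 57, 412, 703].
    Returns the collapsed frequency list.
    """
    freqs = list(freqs)
    # merge the lowest category into the next one while it is too sparse
    while len(freqs) > 2 and freqs[0] < min_freq:
        freqs[1] += freqs[0]
        del freqs[0]
    # merge the highest category into the previous one while it is too sparse
    while len(freqs) > 2 and freqs[-1] < min_freq:
        freqs[-2] += freqs[-1]
        del freqs[-1]
    return freqs

print(collapse_categories([43, 57, 412, 703]))  # [100, 412, 703] -> two lowest merged
```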

Person-Item-Maps

Person-item-maps display the distribution of the person parameters and the range of item parameters. These plots show whether any participants showed extreme response tendencies, which might lead to particularly high or low raw scores, and how the item parameters are distributed over the latent dimension. Thus, it can be examined whether the items cover the whole spectrum of the latent dimension or cluster in one part of it. If there are few items in a segment of the spectrum, the latent trait cannot be measured well in that segment.
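A basic person-item map of this kind can be produced with a few lines of plotting code; the person and item parameters below are simulated purely for illustration and do not correspond to the ASTI results.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
person_params = rng.normal(0.0, 1.0, size=1000)        # e.g., WLE person parameters
item_params = np.array([-1.4, -0.9, -0.6, -0.2, 0.1])  # e.g., item (threshold) locations

fig, (ax_top, ax_bottom) = plt.subplots(
    2, 1, sharex=True, figsize=(6, 4), gridspec_kw={"height_ratios": [3, 1]}
)
ax_top.hist(person_params, bins=40, color="grey")       # distribution of person parameters
ax_top.set_ylabel("Persons")
ax_bottom.scatter(item_params, np.zeros_like(item_params), marker="|", s=400)
ax_bottom.set_yticks([])
ax_bottom.set_xlabel("Latent dimension (logits)")
ax_bottom.set_ylabel("Items")
plt.tight_layout()
plt.show()
```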

Dimensionality

Up to now, the ASTI has been scored as a unidimensional instrument, although the items were constructed so as to represent the subdimensions described earlier. Based on this theoretical background and the expert judgments, the five-dimensional model in Tables 3A–C was used as the starting point for the following analyses. In order to test whether the five-dimensional model fit better than the one-dimensional model, the two were compared using the Bayesian information criterion (BIC; Schwarz, 1978), as recommended by, e.g., Kang and Chen (2011). Chi-square tests were also computed (see Table 6); however, they may be oversensitive due to the relatively large sample size. The five-dimensional model was estimated using quasi-Monte Carlo integration (Kiefer et al., 2015) with 20,000 nodes (the number of theta parameters used for the simulation). The latent correlations between the five dimensions were also estimated.
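The BIC used for these model comparisons follows the usual definition, BIC = k·ln(n) − 2·ln(L), with k the number of estimated parameters, n the sample size, and L the maximized likelihood. A minimal sketch of the comparison logic, with made-up log-likelihoods and parameter counts rather than the values reported in Table 4, looks as follows.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion (Schwarz, 1978): smaller values indicate better fit."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

n = 1215  # sample size of the validation study

# Made-up values for illustration only (not the values reported in Table 4).
models = {
    "1DIM_1PL (PCM)": {"logL": -30500.0, "npar": 80},
    "5DIM_1PL":       {"logL": -30150.0, "npar": 95},
}

for name, m in models.items():
    print(name, round(bic(m["logL"], m["npar"], n), 1))
# The model with the smaller BIC is preferred.
```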

Rasch Homogeneity

Once the dimensionality of the ASTI is established, we can test the fit of the Rasch model within each subscale, analyzing several indicators of fit for each individual item. First, the assumption of Rasch homogeneity was tested by comparing the PCM against the generalized partial credit model (GPCM; Muraki, 1992), which includes different discrimination parameters across items. Only if the PCM does not fit significantly worse than the GPCM do the assumptions of the Rasch family hold for a scale, and only then is the raw score a sufficient statistic for the person parameter.
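In standard notation (Muraki, 1992), the GPCM differs from the PCM only by an item-specific discrimination parameter α_i:

\[
P(X_{vi} = x \mid \theta_v) = \frac{\exp\!\left(\sum_{k=0}^{x} \alpha_i (\theta_v - \delta_{ik})\right)}{\sum_{h=0}^{m_i} \exp\!\left(\sum_{k=0}^{h} \alpha_i (\theta_v - \delta_{ik})\right)},
\]

so the PCM is obtained as the special case α_i = 1 for all items.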

Additionally, the expected score curves of each item were examined. Figure 3 shows some examples of the results of this analysis. With this kind of graphical display, it is possible to examine whether the observed score curve is different from the expected curve (misfit) and whether the discrimination (slope) of an item is higher or lower than assumed by the PCM.

The fit of individual items was assessed using infit and outfit statistics, i.e., the weighted and unweighted mean square statistics (MNSQ; Wright and Masters, 1990; Wu, 1997), respectively. Following Wright and Linacre (1994), a range between 0.6 and 1.4 was defined as acceptable fit. Generally, a value below 1 indicates overfit (the data are more predictable than expected by the PCM) and a value above 1 indicates underfit (the data are less predictable than expected by the PCM). Overfit (e.g., too high discrimination of items) is less problematic than underfit (e.g., too high guessing probability); thus, underfit should receive more attention in the evaluation of the items.
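This acceptance rule can be expressed as a simple check; the MNSQ values below are invented for illustration.

```python
FIT_RANGE = (0.6, 1.4)  # acceptable MNSQ range following Wright and Linacre (1994)

def classify_fit(mnsq, lower=FIT_RANGE[0], upper=FIT_RANGE[1]):
    """Classify an infit/outfit MNSQ value."""
    if mnsq < lower:
        return "overfit (more predictable than the model expects)"
    if mnsq > upper:
        return "underfit (less predictable than the model expects)"
    return "acceptable"

# Hypothetical infit values for three items
for item, infit in {"I02": 0.95, "I10": 1.52, "I13": 0.55}.items():
    print(item, infit, classify_fit(infit))
```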

Differential Item Functioning (DIF)

Differential item functioning means that the pattern of response probabilities for some items differs between groups of participants with the same level of the latent trait. For example, gender-related DIF would mean that men are more likely than women to agree to some items of a scale, even at the same latent trait level. If that were the case, the scale as a whole would measure a somewhat different construct for men than for women. Here, DIF was tested with respect to gender (N women = 777, N men = 438), age (15–31 years N = 851, 32–81 years N = 364), and students vs. non-students (N students = 666, N non-students = 549). To assess DIF, the fit of two models was compared by means of the BIC: a main-effect model, which allows only for a main effect of the DIF variable across all items, and an interaction model, which additionally includes an interaction between the DIF variable and the items. If the interaction model fits significantly better than the main-effect model, there is a significant amount of DIF; that is, the patterns of item difficulties vary between the levels of the DIF variable. Following Pohl and Carstensen (2012) and the DIF categorization of the Educational Testing Service (Linacre, 2014), we used absolute logit differences to judge the magnitude of DIF at the item level: (A) no relevant DIF, smaller than 0.4; (B) slight to moderate DIF, 0.4–0.6; (C) moderate to large DIF, or noteworthy for further investigation, 0.6–1; and (D) very strong DIF, larger than 1.
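These magnitude categories translate directly into a small helper function; the absolute logit differences in the example are invented.

```python
def dif_category(abs_logit_diff):
    """Classify DIF magnitude from the absolute logit difference between groups."""
    if abs_logit_diff < 0.4:
        return "A: no relevant DIF"
    if abs_logit_diff <= 0.6:
        return "B: slight to moderate DIF"
    if abs_logit_diff <= 1.0:
        return "C: moderate to large DIF (noteworthy)"
    return "D: very strong DIF"

# Hypothetical absolute logit differences from a gender DIF analysis
for item, diff in {"I10": 0.72, "I14": 0.45, "I20": 0.31}.items():
    print(item, diff, dif_category(diff))
```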

The experts' suggestions concerning possible DIF from Section Definition of Psychometric Hypotheses were taken into account in interpreting the results of the DIF analyses.

Further Analyses

Descriptive statistics (M, SD) for each item were calculated for each dimension separately. As an index of internal consistency, we used the EAP reliability coefficient, an indicator for the PCM that is comparable with Cronbach's α. The item parameters of all items (based on the final scale assignment) and the item intercorrelations are reported in the Appendix in the Supplementary Material to this article.

Person-Item-Map

The person-item-map in Figure 2 shows that the item parameters mostly covered the left-hand side of the middle range of the ability parameter distribution. That is, the items were rather “easy,” i.e., participants tended to agree rather than disagree (see also the descriptive statistics in Table 6 ). For the ASTI to also differentiate well among high-scoring individuals, more items should be constructed that participants are less likely to agree with. It is, however, a general problem with self-report measures of wisdom that they tend to elicit high scores, be it due to effects of social desirability or of overly positive self-evaluations ( Glück et al., 2013 ). Performance measures of wisdom, such as the Berlin Wisdom Paradigm ( Baltes and Staudinger, 2000 ) tend to produce far lower average levels of wisdom than self-report measures do.


Figure 2. Person-item-map .

Next, four different models were estimated and compared by means of the BIC. As Table 4 shows, the GPCM (1DIM_2PL) fit the data better than the PCM (1DIM_1PL), indicating that the items differ in discrimination. Furthermore, the comparison between five-dimensional and one-dimensional models suggested that the five-dimensional models generally fit the data better than the unidimensional ones.


Table 4. Comparison of the estimated IRT models .

Table 5 shows the latent correlations among the five dimensions and the EAP reliability indices, as well as Cronbach's α and its 95% confidence interval (Fan and Thompson, 2001). EAP reliabilities were acceptable, whereas Cronbach's α was below 0.50 for PM, NA, and PG. This may be due to the relatively low variance in the item responses. The latent correlations supported the assumption of a five-dimensional structure of the ASTI. Self-knowledge and integration, peace of mind, and presence in the here-and-now and growth were quite highly correlated, which may suggest that they all represent an accepting and appreciative stance toward oneself and the experiences of one's life. Non-attachment and self-transcendence seem to be less closely related to the others (except for the correlation between non-attachment and peace of mind), possibly because both, although in different ways, represent the individual's relationship with the external world: non-attachment describes an independence from other people and material things, and self-transcendence represents a connectedness with others and the world at large. Both may not be part of everyone's experience of inner peace. However, Table 4 showed that the five-dimensional GPCM fit the data best.


Table 5. Latent correlations between the five dimensions, EAP reliability, and Cronbach's α including the 95% confidence interval for each dimension.

Rasch Homogeneity, Item Fit, and Differential Item Functioning

Next, we assessed the items of each dimension separately. In general, the infit and outfit statistics showed no misfit of items (see Table 6). Because of the complexity of the analyses, the results are reported for each dimension in turn. Table 4 shows the overall fit of the GPCM and PCM for each subdimension according to the BIC. Log likelihoods for both models are also reported, although likelihood ratio tests are likely to be somewhat oversensitive due to the large sample size.


Table 6. Descriptive values (M, SD), uncentered PCM item parameters (δi) and category parameters (Ci) estimated for each subscale separately, item fit statistics (outfit, infit), and absolute differences (DIFF) for the three tested external variables gender, age, and group.

Self-Knowledge and Integration (SI)

Table 4 shows that the GPCM fit the data better than the PCM. The score curves suggest that, generally, the observed slopes were steeper than expected; the observed slope of item 10 also showed small deviations from the expected slope (see Figure 3 ). When item 10 was excluded, the difference in fit between the GPCM and PCM became quite negligible (PCM: log likelihood = −3277.90, npar = 7, BIC = 6605.50; GPCM: log likelihood = −3269.56, npar = 9, BIC = 6603.03). Therefore, the PCM was considered to fit the scale sufficiently well when item 10 was excluded. As explained earlier, DIF was assessed with respect to gender, age, and professional group. Men's person parameters ( SD = 1.21, d = 0.30) were, on average, 0.27 logits higher than women's (indicating that men were higher in self-knowledge and integration than women); there were no significant main effects for age ( M difference = 0.06, SD = 1.23, d = 0.05) or students vs. non-students ( M difference = 0.01, SD = 1.23, d = 0.01). However, the model comparisons in Table 7 indicated DIF for age and group. Only item 10 (“I have a good sense of humor about myself”) showed considerable DIF: it was more often agreed to by younger participants and students than by older participants and non-students, respectively. Note that Item 10 had not received an unequivocal assignment by the experts either (see Table 3A ). In addition, item 20 (“I am accepting of myself, including my faults”) was more often agreed to by older participants and non-students than by younger people and students. However, the magnitude of DIF was small and could therefore be ignored. When the analyses were repeated excluding item 10, the PCM fit the data well and there was no considerable DIF for any item.


Figure 3. Expected response curves (score curves).


Table 7. Comparison of the main-effect model against the DIF interaction model.

Peace of Mind (PM)

Table 4 shows that the difference in BIC between the GPCM and PCM is again negligible, suggesting acceptable fit of the PCM for this scale. The DIF analyses indicated that women ( M difference = 0.334, SD = 0.622, d = 0.54), older participants ( M difference = 0.152, SD = 0.644, d = 0.24), and non-students ( M difference = 0.114, SD = 0.643, d = 0.18) received higher person parameters than men, younger participants, and students, respectively, but there was no differential item functioning (see Tables 6 , 7 ).

Non-Attachment (NA)

The GPCM fit the scale better than the PCM, but again, the difference in BIC was small and the score curves showed good agreement between expected and observed response curves and slopes (see Table 4 ). Thus, the PCM was preferred. Again, there was no indication of DIF (see Tables 6 , 7 ), but women ( M difference = 0.076, SD = 0.578, d = 0.13), older participants ( M difference = 0.240, SD = 0.656, d = 0.43), and non-students ( M difference = 0.222, SD = 0.566, d = 0.39) received higher person parameters than men, younger participants, and students, respectively.

Self-Transcendence (ST)

Table 4 shows that the GPCM fit the data substantially better than the PCM. It is somewhat unclear, however, what causes the difference in fit, as the two examples of score curves in Figure 3 represent the general result for all items of this scale, indicating no substantial underfit or overfit. It seems important to reanalyze the self-transcendence scale with new data. The DIF analyses indicated that women ( M difference = 0.292, SD = 0.661, d = 0.60), older participants ( M difference = 0.248, SD = 0.668, d = 0.37), and non-students ( M difference = 0.106, SD = 0.676, d = 0.16), again received higher person parameters than men, younger participants, and students, respectively. As Tables 6 , 7 show, no substantial DIF was found for this subscale.

Presence in the Here-and-Now and Growth (PG)

Table 4 indicates that the GPCM fit the data slightly better than the PCM. The score curves (for an example, see Figure 3, bottom left) showed that the observed slopes were slightly higher than the expected slopes. Non-students (M difference = 0.162, SD = 0.450, d = 0.36) had lower person parameters than students; there were no significant differences for gender or age (gender: M difference = 0.04, SD = 0.455, d = 0.09; age group: M difference = 0.066, SD = 0.458, d = 0.14). Item 14 ("I am not often fearful") showed a small amount of DIF for all three group variables, i.e., it was more difficult to agree to for men, younger participants, and students than for women, older participants, and non-students, respectively. It was also the only negatively worded item in the subdimension (see Table 3C). When item 14 was excluded, the difference in fit between the PCM and GPCM was markedly reduced (PCM: log likelihood = −6241.90, npar = 12, BIC = 12569.04; GPCM: log likelihood = −6208.536, npar = 16, BIC = 12530.71). This subscale should also be reanalyzed once new data are available; a re-analysis without item 14 showed that the PCM fit the data well.

In the following, we first discuss the methodological implications of our research, and then, its substantive implications concerning the use of the ASTI to measure wisdom.

Methodological Conclusions: A Comprehensive Approach for Evaluating Content Validity

This paper introduced the CSS procedure for evaluating content validity and discussed its advantages for the theory-based evaluation of scale items. In our experience, the method provides highly interesting practical and theoretical insights into target constructs. It not only allows for evaluating and validating existing instruments and improving the operationalization of a target construct, but also offers advantages for constructing new items for existing instruments or even for developing entirely new instruments. The procedure can be applied in all subdisciplines of psychology and in other fields, wherever the goal is to measure specific constructs. In addition, it does not matter which kinds of items (e.g., questions, vignettes) and response formats (e.g., dichotomous, graded, open-ended) are used. The in-depth examination of the target construct is likely to increase the validity of any assessment.

We propose to follow certain quality criteria in studies using our approach. First, to optimize replicability, all steps should be carefully documented. A detailed documentation of procedures increases the validity of the study, irrespective of whether the data collection is more quantitative (as in the present study) or more qualitative (e.g., focus group discussions as in Castel et al., 2008 ). Second, the selection of experts is obviously crucial. Objectivity may be compromised if the group of experts is too homogeneous (e.g., if they are all from one research group) or too small. The instructions that the experts receive also need to be carefully written so as to avoid inducing any biases. Third, it is important that the expert judgments are complemented by actual data collected from a sample representative of the actual target population. Our experience is that the data are often astonishingly consistent with the expert ratings; however, experts may also be wrong occasionally, for example, if they assume more complex interpretations of item content than the actual participants use. As we have demonstrated here, item response models may be particularly suited for testing hypotheses about individual items, but factor-analytic approaches are also very useful for testing hypotheses about the structural relationships between subscales. For example, it would be worthwhile to test the current data for a bi-factor structure, i.e., a combination of a significant common factor with subscale-specific factors. Next steps in our work will include the comparison of these different methods of data analysis. Another important future goal is the definition of a quantitative content-validity index based on the current method.

Substantive Conclusions: Measuring Self-Transcendence Using the ASTI

In addition to utilizing the ASTI to demonstrate our approach, we believe that we have gained important insights about the ASTI, as well as about self-transcendence in general, from this study. Through the exercise of assigning and reassigning the items to the dimensions of the construct and discussing the contradictions and difficulties we encountered, we gained a far deeper understanding of the measure itself.

In general, the analyses demonstrated the importance of constructing more “difficult” items, i.e., items with a lower level of agreement. This is a general issue with self-report wisdom scales (see Glück et al., 2013 ): items representing core capacities of wisdom, such as being able to consider different perspectives, to be compassionate even with strangers, or to integrate conflicting aspects of the self may sound appealing even to individuals who are rather low in these capacities in real life. In fact, wiser individuals may even be more critical of themselves and thus less likely to rate themselves highly than other people are ( Aldwin, 2009 ; Glück et al., 2013 ). Some of the ASTI items nicely evade this problem by being difficult to understand for individuals who have not achieved the respective levels of self-transcendence. For example, the item “Whatever I do to others, I do to myself” regularly leaves our student participants dumbfounded. The positive German version of this item had the lowest mean, i.e., the lowest agreement frequencies, of all items in the scale. It may be worthwhile to try to construct more items of this kind.

For now, we have identified five subdimensions that include the 24 positive items (in German, 25) of the ASTI. The 10 negative items measuring alienation were not included in this analysis, as negative items tend to be difficult to assign to the same dimension as positive items. We recommend leaving them in the questionnaire in order to increase the range of item content, but excluding them from score computations. In further applications of the ASTI, should the five subdimensions be scored separately, or should the total score be used? Strong advocates of the Rasch model would certainly argue that using the total score across the subdimensions amounts to mixing apples and oranges. However, other self-report scales of wisdom, such as the 3D-WS (Ardelt, 2003) or the SAWS (Webster, 2007), measure several dimensions of wisdom that are conceptually and empirically related to about the same degree as the subdimensions of the ASTI identified here. Both authors suggest using the mean across the subdimensions as an indicator of wisdom and considering as wise only individuals with a high mean, i.e., those who score high in all subdimensions. The same may be a good idea here: for an individual to be considered as highly wise (in the sense of self-transcendence), he or she would need to have high scores in all five subdimensions, as all of them are considered relevant components of wisdom. For individuals with lower means, we recommend considering their profile across the subdimensions rather than computing a single score.

The Subdimensions of the ASTI

In the following, we describe the final subdimensions of the ASTI that resulted from our analysis and relate them to the theoretical background, thus completing point (8) “Final Definition of the Latent Construct” in the process. The subdimensions are ordered so as to represent a possible developmental order as suggested by Levenson et al. (2005) , with self-knowledge and integration as well as non-attachment preceding presence in the here-and-now and peace of mind, and self-transcendence being the last (and probably rarest) stage. It is important to note that in addition to producing valid and reliable subdimensions, the CSS procedure has also led us to conceptually redefine some of the subdimensions so as to better differentiate them (for example, independence of external sources of well-being was originally included in the definitions of both non-attachment and self-transcendence). We first give definitions for all subdimensions and then discuss their relationships to each other and to age and gender.

Self-Knowledge and Integration

The first subdimension includes items that were originally intended to measure Curnow's (1999) separate dimensions of self-knowledge and integration. It includes items that refer to broad and deep knowledge about as well as acceptance of all aspects of one's own self, including ambivalent or undesirable ones. Thus, the distinction between being aware of certain aspects of the self and accepting them was not supported empirically. The idea that self-knowledge and the acceptance of all aspects of the self is key to wisdom can be found in Erikson's (1980) idea of integrity, i.e., late-life acceptance of one's life as it has been lived (see also Beaumont, 2009 ).

Individuals high in this dimension of the ASTI are aware of the different, sometimes contradictory, facets of their self and their life, and they are able to accept all sides of their personality and integrate the different facets of their life. If the item “I have a good sense of humor about myself,” which was somewhat equivocal among the experts and showed differential item functioning in the quantitative analyses, is excluded, the subdimension includes only three items. Therefore, it seems advisable to add new items that refer to self-knowledge as well as items that differentiate between different kinds of integration (e.g., integration of self aspects, life contexts, and feelings). With a higher number of items, the distinction between knowing and accepting aspects of one's self might also receive more empirical support.

Non-Attachment

Non-attachment describes an individual's awareness of the fundamental independence of his or her internal self from external possessions or evaluations: non-attached individuals' self-esteem does not depend on what others think about them or how many friends they have. The scale comprises four items concerning the individual's independence from external things, such as other people's opinions, a busy social life, or material possessions. It is important to note that non-attachment does not mean that people are not committed to others or to important issues in their current world; the main point is that they do not depend on external sources for self-enhancement. The fact that they are not affected by other people's judgments enables them to lead the life that is right for them and to accept others non-judgmentally. Like other ideas originating from Buddhism, non-attachment as a path to mental health is currently receiving some attention in clinical psychology ( Shonin et al., 2014 ), but it has not yet been investigated in psychological wisdom research.

Presence in the Here-and-Now and Growth

Individuals high in this dimension, which was not part of Curnow's (1999) original conception, are able to live in the moment and enjoy the good times in their life without clinging to them, because they know that everything changes and that change may also foster growth. The items of this subdimension describe individuals who are able to live life mindfully in any given moment: they find joy in their life and in what they are doing. They are aware that things are always changing, oriented toward learning from others, and aware that they have grown through losses, and they have accepted the finitude of life. In a different study, we have found that many wisdom nominees report gratitude for the difficult experiences of their lives, i.e., they appreciate the processes of learning and growth triggered by such events ( Bluck and Glück, 2004 ; König and Glück, 2014 ).

Peace of Mind

Tranquility is a characteristic that many laypeople associate with wisdom ( Bluck and Glück, 2005 ); the related construct of emotion regulation has been proposed to be both a component of wisdom ( Webster, 2003 , 2007 ) and a developmental resource for wisdom ( Glück and Bluck, 2013 ). This subdimension of the ASTI describes individuals who are able to maintain their tranquility in situations where others would get angry or upset, and are at peace with the fundamental impermanence of things in life.

Self-Transcendence

Highly self-transcendent individuals feel that the boundaries between them and others, even humanity at large, are permeable. They feel related to past and future generations, all human beings, and nature. As they do not need to utilize social relationships to enhance their sense of self, they are able to love and accept other individuals as they are. As Levenson et al. (2005) argued, self-transcendence may be at the core of wisdom (cf. Tornstam, 1994 ).

Latent Correlations between the Five Dimensions

There were relatively high latent correlations (around 0.70) between the subdimensions of self-knowledge and integration, peace of mind, and presence in the here-and-now and growth, all of which seem to describe an accepting and appreciative stance toward oneself and one's life. For some purposes, it may make sense to average across these three subdimensions, as their discriminant validity may be limited. At the same time, the manifest correlations between these three subscales are markedly lower than the latent ones: r(SI–PM) = 0.36, r(SI–PG) = 0.41, r(PM–PG) = 0.35; thus, the subscales may well be differentially related to other constructs. Therefore, we recommend treating them separately for most research purposes. Non-attachment and self-transcendence were less closely related to the others and to each other, perhaps because they represent two important and somewhat contrary aspects of wise individuals' relationship with the external world: independence of one's self from external sources and a deep connectedness to others and the world. Our findings suggest that each of these states can exist without the other, and both can be present in an individual without the peace of mind that comes with self-integration and living in the present. A truly wise individual, however, would show high levels of all of these aspects.

Meaningful Individual Differences

The individual differences in our data (see “Differential Item Functioning”) were largely consistent with the literature. It is important to first note that our sample is not well-suited for analyzing older age: we were able to compare only two age groups roughly corresponding to adolescence and young adulthood on the one hand (15–31 years, N = 851; including many students) and early middle to older adulthood on the other hand (32–81 years, N = 364). A further differentiation among the “older” adults was not possible because only 3.3% of the sample were 60 years old or older. Comparing the two age groups, we found meaningful differences for two of the five dimensions. People older than 31 achieved higher scores in non-attachment and self-transcendence than adolescents and young adults. In another study with a mostly older sample, we found no correlation between the ASTI and age ( Glück et al., 2013 ). As has been shown for cognitive aspects of wisdom ( Pasupathi et al., 2001 ; Staudinger and Pasupathi, 2003 ), some facets of wisdom may develop in young adulthood and then stay stable into old age. The other three subdimensions, which represent an appreciative and accepting stance toward life, do not seem to be dependent on age.

Gender differences were found, interestingly, for four of the five subdimensions. Men had higher scores than women in self-knowledge and integration. This finding may suggest that men indeed know and accept themselves more than women do or that women actually tend to be more self-reflective and self-critical. In any case, the effect was small and needs further investigation. In the subdimensions peace of mind, non-attachment, and self-transcendence, women scored higher than men. These findings may, however, be partly determined by societal expectations for women to be less self-centered and more caring than men, which does not necessarily imply true self-transcendence. Thus, the limitations of self-report measures remain somewhat present even in carefully constructed scales like the ASTI.

In sum, we suggest that researchers using the ASTI may gain significant information if they use separate scores for the subdimensions we have identified in addition to, or instead of, the total score. The self-transcendence subdimension may be the purest indicator of actual self-transcendence. Whether the other subdimensions represent important preconditions, correlates, or even outcomes of self-transcendence is largely an empirical issue to be addressed in the future, which may tell us more about the development of wisdom.

Ethics Statement

No formal approval was applied for, as the guidelines of the local Ethics Committee specify that the type of survey study we performed does not require such approval. All participants filled out an informed-consent form and agreed that their data may be used for scientific purposes. No vulnerable populations were involved in this study.

Author Contributions

All three authors meet the four criteria for authorship required in the author guidelines. Each author's main tasks were as follows. IK: Development and application of the CSS procedure, expert in the first part of the study, data analyses, writing the paper. JG: Discussion partner for the CSS procedure, expert in the first part of the study, writing the parts concerning the topic of wisdom (background and results), editing of the manuscript. ML: Construction and provision of the revised (and as yet unpublished) ASTI, discussion of the translation of the items, expert in the first part of the study.

Funding

This research was partly funded by the Austrian Science Fund FWF (grant nr. P25425, PI: JG).

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary Material

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2017.00126/full#supplementary-material

Adams, R. J., Wilson, M. R., and Wang, W. (1997). The multidimensional random coefficients multinomial logit model. Appl. Psychol. Meas. 21, 1–23.

Aldwin, C. (2009). Gender and wisdom: a brief overview. Res. Hum. Dev. 6, 1–8. doi: 10.1080/15427600902779347

American Educational Research Association, American Psychological Association, National Council on Measurement in Education (2014). Standards for Educational and Psychological Testing . Washington, DC: American Educational Research Association.

Ardelt, M. (2003). Empirical assessment of a three-dimensional wisdom scale. Res. Aging 25, 275–324. doi: 10.1177/0164027503251764

Baltes, P. B., and Staudinger, U. M. (2000). Wisdom: a metaheuristic (pragmatic) to orchestrate mind and virtue toward excellence. Am. Psychol. 55, 122–136. doi: 10.1037/0003-066X.55.1.122

Beaumont, S. L. (2009). Identity processing and personal wisdom: an information-oriented identity style predicts self-actualization and self-transcendence. Int. J. Theor. Res. 9, 95–115. doi: 10.1080/15283480802669101

Bluck, S., and Glück, J. (2004). Making things better and learning a lesson: experiencing wisdom across the lifespan. J. Pers. 72, 543–572. doi: 10.1111/j.0022-3506.2004.00272.x

Bluck, S., and Glück, J. (2005). “From the inside out: people's implicit theories of wisdom,” in A Handbook of Wisdom: Psychological Perspectives , eds R. J. Sternberg and J. Jordan (Cambridge: Cambridge University Press), 84–109.

Castel, L. D., Williams, K. A., Bosworth, H. B., Eisen, S. V., Hahn, E. A., Irwin, D. E., et al. (2008). Content validity in the PROMIS social health domain: a qualitative analysis of focus group data. Qual. Life Res. 17, 737–749. doi: 10.1007/s11136-008-9352-3

Curnow, T. (1999). Wisdom, Intuition, and Ethics . Aldershot: Ashgate.

de Von, H. A., Block, M. E., Moyle-Wright, P., Ernst, D. M., Hayden, S. J., Lazzara, D. J., et al. (2007). A psychometric toolbox for testing validity and reliability. J. Nurs. Scholarsh. 39, 155–165. doi: 10.1111/j.1547-5069.2007.00161.x

Educational Testing Service (2014). ETS Standards for Quality and Fairness . Available online at: http://www.ets.org/s/about/pdf/standards.pdf

Erikson, E. H. (1980). Identity and the Life Cycle . New York, NY: Norton.

Fan, X., and Thompson, B. (2001). Confidence intervals about score reliability coefficients, please: an EPM guidelines editorial. Educ. Psychol. Meas. 61, 517–531. doi: 10.1177/00131640121971365

Fischer, G. H., and Molenaar, I. W. (1995). Rasch Models: Foundations, Recent Developments and Applications. New York, NY: Springer.

Gittler, G. (1986). Inhaltliche Aspekte bei der Itemselektion nach dem modell von Rasch [Content issues in the selection of items for the Rasch model]. Z. Exp. Angew. Psychol. 33, 386–412.

Glück, J., and Bluck, S. (2013). “The MORE life experience model: a theory of the development of personal wisdom,” in The Scientific Study of Personal Wisdom , eds M. Ferrari and N. Weststrate (New York, NY: Springer), 75–98.

Glück, J., König, S., Naschenweng, K., Redzanowski, U., Dorner, L., Strasser, I., et al. (2013). How to measure wisdom: content, reliability, and validity of five measures. Front. Psychol. 4:405. doi: 10.3389/fpsyg.2013.00405

Haladyna, T. M., and Rodriguez, M. C. (2013). Developing and Validating Test Items. London: Routledge.

Hasson, F., and Keeney, S. (2011). Enhancing rigor in the Delphi technique research. Technol. Forecast. Soc. Change 78, 1695–1704. doi: 10.1016/j.techfore.2011.04.005

Haynes, S. N., Richard, D. C. S., and Kubany, E. S. (1995). Content validity in psychological assessment: a functional approach to concepts and methods. Psychol. Assess. 7, 238–247.

Holland, P. W., and Wainer, H. (1993). Differential Item Functioning. Hillsdale, NJ: Erlbaum.

Jobst, A., Kirchberger, I., Cleza, A., Stucki, G., and Stucki, A. (2013). Content validity of the comprehensive ICF core set for chronic obstructive pulmonary diseases: an international Delphi study. Open Respir. Med. J. 7, 33–45. doi: 10.2174/1874306401307010033

Johnston, M., Dixon, D., Hart, J., Glidewell, L., Schröder, C., and Pollard, B. (2014). Discriminant content validity: a quantitative methodology for assessing content of theory-based measures, with illustrative applications. Br. J. Health Psychol. 19, 240–257. doi: 10.1111/bjhp.12095

Kang, T., and Chen, T. T. (2011). Performance of the generalized S-X2 item fit index for the graded response model. Asia Pacific Educ. Rev. 12, 89–96. doi: 10.1007/s12564-010-9082-4

Kiefer, T., Robitzsch, A., and Wu, M. (2015). Package ‘TAM’, R Package Version 1.5-2. Available online at: http://CRAN.R-project.org/package=TAM

Kikukawa, M., Stalmeijer, R. E., Emura, S., Roff, S., and Scherpbier, A. J. J. A. (2014). An instrument for the evaluation of clinical teaching in Japan: content validity and cultural sensitivity. BMC Med. Educ. 14:179. doi: 10.1186/1472-6920-14-179

Koller, I., Alexandrowicz, R., and Hatzinger, R. (2012). Das Rasch Modell in der Praxis: Eine Einführung mit eRm [The Rasch Model in Practical Applications: An Introduction with eRm] . Vienna: UTB.

Koller, I., and Lamm, C. (2015). Item response model investigation of the (German) interpersonal reactivity index empathy questionnaire. Eur. J. Psychol. Assess. 31, 211–221. doi: 10.1027/1015-5759/a000227

König, S., and Glück, J. (2014). “Gratitude is with me all the time:” How gratitude relates to wisdom. J. Gerontol. B Psychol. Sci. 69, 655–666. doi: 10.1093/geronb/gbt123

Lane, S., Raymond, M. R., and Haladyna, T. M. (2016). Handbook of Test Development . New York, NY: Routledge.

Lawshe, C. H. (1975). A quantitative approach to content validity. Pers. Psychol. 28, 563–575. doi: 10.1111/j.1744-6570.1975.tb01393.x

Levenson, M. R., Jennings, P. A., Aldwin, C. M., and Shiraishi, R. W. (2005). Self-Transcendence: conceptualization and measurement. Int. J. Aging Hum. Dev. 60, 127–143. doi: 10.2190/XRXM-FYRA-7U0X-GRC0

Linacre, J. M. (2014). A User's Guide to WINSTEPS, MINISTEP: Rasch-Model Computer Programs, Program Manual 3.81.0. Available online at: http://www.winsteps.com

Lynn, M. R. (1986). Determination and quantification of content validity. Nurs. Res. 35, 382–385. doi: 10.1097/00006199-198611000-00017

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika 47, 149–174. doi: 10.1007/BF02296272

Masters, G. N., and Wright, B. D. (1997). “The partial credit model,” in Handbook of Modern Item Response Theory , eds W. J. van der Linden and R. K. Hambleton (New York, NY: Springer), 101—121.

Moore, G. C., and Benbasat, I. (2014). Development of an instrument to measure the perceptions of adopting an information technology innovation. Inform. Syst. Res. 2, 192–222. doi: 10.1287/isre.2.3.192

Muraki, E. (1992). A generalized partial credit model: application of an EM algorithm. Appl. Psychol. Meas. 1992, i–30. doi: 10.1002/j.2333-8504.1992.tb01436.x

Muthén, L. K., and Muthén, B. O. (1998–2015). Mplus User's Guide, 7th Edn. Los Angeles, CA: Muthén & Muthén.

Newman, I., Lim, J., and Pineda, F. (2013). Content validity using a mixed methods approach: its application and development through the use of a table of specifications methodology. J. Mix. Methods Res. 7, 243–260. doi: 10.1177/1558689813476922

Pasupathi, M., Staudinger, U. M., and Baltes, P. B. (2001). Seeds of wisdom: adolescents' knowledge and judgment about difficult life problems. Dev. Psychol. 37, 351–361. doi: 10.1037/0012-1649.37.3.351

Pohl, S., and Carstensen, C. H. (2012). NEPS Technical Report – Scaling the Data of the Competence Tests (NEPS Working Paper No. 14) . Bamberg: Otto-Friedrich-Universität, Nationales Bildungspanel.

Polit, D. F., and Beck, C. T. (2006). The content validity index: are you sure you know what's being reported? Critique and recommendations. Res. Nurs. Health 29, 489–497. doi: 10.1002/nur.20147

Polit, D. F., Beck, C. T., and Owen, S. V. (2007). Is the CVI an acceptable indicator of content validity? Appraisal and recommendations. Res. Nurs. Health 30, 459–467. doi: 10.1002/nur.20199

Rasch, G. (1960). Studies in Mathematical Psychology: I. Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danish Institute for Educational Research.

Rossiter, J. R. (2008). Content validity of measures of abstract constructs in management and organizational research. Br. J. Manage. 19, 380–388. doi: 10.1111/j.1467-8551.2008.00587.x

Schwarz, G. (1978). Estimating the dimension of a model. Ann. Stat. 6, 461–464. doi: 10.2307/2958889

Sackman, H. (1974). Delphi Assessment: Expert Opinion, Forecasting, and Group Process. A Report prepared for United States Air Force Project RAND. Available online at: http://www.rand.org/

Shonin, E., Van Gordon, W., and Griffiths, M. D. (2014). The emerging role of Buddhism in clinical psychology: toward effective integration. Psychol. Relig. Spiritual. 6, 123–137. doi: 10.1037/a0035859

Singer, T., and Lamm, C. (2009). The social neuroscience of empathy. Ann. N.Y. Acad. Sci. 1156, 81–96. doi: 10.1111/j.1749-6632.2009.04418.x

Sireci, S. G. (1998). Gathering and analyzing content validity data. Educ. Assess. 5, 299–321. doi: 10.1207/s15326977ea0504_2

Staudinger, U. M., and Pasupathi, M. (2003). Correlates of wisdom-related performance in adolescence and adulthood: age-graded differences in “paths” toward desirable development. J. Res. Adolesc. 13, 239–268. doi: 10.1111/1532-7795.1303001

Tornstam, L. (1994). “Gerotranscendence: a theoretical and empirical exploration,” in Aging and the Religious Dimension , eds L. E. Thomas and S. A. Eisenhandler (Westport, CT: Greenwood Publishing Group), 203–225.

Webb, N. L. (2006). “Identifying content for student achievement tests,” in Handbook of Test Development , eds S. M. Downing and T. M. Haladyna (Mahwah, NJ: Lawrence Erlbaum Associates), 155–180.

Webster, J. D. (2003). An exploratory analysis of a self-assessed wisdom scale. J. Adult Dev. 10, 13–22. doi: 10.1023/A:1020782619051

Webster, J. D. (2007). Measuring the character strength of wisdom. Int. J. Aging Hum. Dev. 65, 163–183. doi: 10.2190/AG.65.2.d

Wright, B. D., and Linacre, J. M. (1994). Reasonable Mean-Square Fit Values. Rasch Measurement Transaction 8, 2004 . Available online at: http://rasch.org/rmt/rmt83.htm

Wright, B. D., and Masters, G. N. (1990). Computation of OUTFIT and INFIT Statistics. Rasch Measurement Transaction 3, 84-85. Available online at: http://www.rasch.org/rmt/rmt34e.htm

Wu, M., Adams, R. J., Wilson, M., and Haldane, S. (2007). Conquest 2.0 . Camberwell: ACER Press.

Wu, M. L. (1997). The Development and Application of a Fit Test for Use with Marginal Maximum Likelihood Estimation and Generalized Item Response Models . Unpublished Master Dissertation. University of Melbourne.

Zamanzadeh, V., Rassouli, M., Abbaszadeh, A., Alavi-Majd, H., Nikanfar, A.-R., and Ghahramanian, A. (2014). Details of content validity and objectifying it in instrument development. Nurs. Pract. Today 1, 163–171.

Keywords: content validity, mixed-methods, CSS-procedure, wisdom, item response models, partial credit model, theory-based item analysis, Adult Self-Transcendence Inventory (ASTI)

Citation: Koller I, Levenson MR and Glück J (2017) What Do You Think You Are Measuring? A Mixed-Methods Procedure for Assessing the Content Validity of Test Items and Theory-Based Scaling. Front. Psychol . 8:126. doi: 10.3389/fpsyg.2017.00126

Received: 08 July 2016; Accepted: 18 January 2017; Published: 21 February 2017.

Copyright © 2017 Koller, Levenson and Glück. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Ingrid Koller, [email protected]

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.


Design and Implementation Content Validity Study: Development of an instrument for measuring Patient-Centered Communication

Vahid Zamanzadeh

1 Department of Medical-Surgical Nursing, Faculty of Nursing and Midwifery, Tabriz University of Medical Sciences, Tabriz, Iran

Akram Ghahramanian

Maryam Rassouli

2 Department of Pediatrics Nursing, Faculty of Nursing and Midwifery, Shahid Beheshti University of Medical Sciences, Tehran, Iran

Abbas Abbaszadeh

Hamid Alavi-Majd

3 Department of Biostatistics, Faculty of Para Medicine, Shahid Beheshti University of Medical Sciences, Tehran, Iran

Ali-Reza Nikanfar

4 Hematology and Oncology Research Center, Tabriz University of Medical Sciences, Tabriz, Iran

Introduction: The importance of content validity in instrument psychometrics, and its relevance to reliability, make it an essential step in instrument development. This article gives an overview of the content validity process and illustrates its complexity with an example.

Methods: We carried out a methodological study to examine the content validity of a patient-centered communication instrument through a two-step process (development and judgment). In the first step, the content domain was determined, items were generated (sampling), and the instrument was constructed; in the second step, the content validity ratio, content validity index, and modified kappa statistic were computed. Suggestions from the expert panel and item impact scores were used to examine the instrument's face validity.

Results: From a set of 188 items, the content validity process identified seven dimensions: trust building (eight items), informational support (seven items), emotional support (five items), problem solving (seven items), patient activation (10 items), intimacy/friendship (six items), and spirituality strengthening (14 items). The content validity study showed that this instrument has an appropriate level of content validity. The overall content validity index of the instrument using the universal agreement approach was low; however, this can be defended given the high number of content experts, which makes full consensus difficult, and the high value of the S-CVI with the average approach, which was 0.93.

Conclusion: This article illustrates acceptable quantitative indices of content validity for a new instrument and demonstrates them through the design and psychometric evaluation of an instrument measuring patient-centered communication.

Introduction

In most studies, researchers study complex constructs for which valid and reliable instruments are needed. 1 Validity, defined as the ability of an instrument to measure the properties of the construct under study, 2 is a vital factor in selecting or applying an instrument. It is commonly assessed in three forms: content, construct, and criterion-related validity. 3 Since content validity is a prerequisite for the other types of validity, it should receive the highest priority during instrument development. Validity is not a property of an instrument but of the scores achieved by an instrument used for a specific purpose with a specific group of respondents. Therefore, validity evidence should be obtained in each study in which an instrument is used. 4

Content validity, also known as definitional validity and logical validity, 5 can be defined as the ability of the selected items to reflect the variables of the construct in the measure. This type of validity addresses the degree to which the items of an instrument sufficiently represent the content domain, and it answers the question of the extent to which the selected items constitute a comprehensive sample of the content. 1 , 6 - 8 It provides preliminary evidence of the construct validity of an instrument. 9 In addition, it can provide information on the representativeness and clarity of items and help improve an instrument through recommendations from an expert panel. 6 , 10 If an instrument lacks content validity, it is impossible to establish its reliability. 11 On the other hand, although more resources must be spent on a content validity study initially, this reduces the resources needed for future revisions of the instrument during the psychometric process. 1

Despite the fact that content validity is a critical step in instrument development 12 and a trigger mechanism linking abstract concepts to visible and measurable indices, 7 it is often studied superficially and transiently. This problem might be due to the fact that the methods used to assess content validity are not discussed in depth in the medical research literature 12 and that sufficient details on the content validity process have rarely been provided in a single resource. 13 It is possible that students do not realize the complexities of this critical process. 12 Meanwhile, a number of experts have questioned the historical legitimacy of content validity as a real type of validity. 14 - 16 These challenges to the value and merit of content validity have arisen from the lack of distinction between content validity and face validity, unstandardized mechanisms for determining content validity, and its previously unquantified nature. 3 This article aims to discuss the content validity process and to demonstrate how to quantify it using an example instrument, designed to measure patient-centered communication between patients with cancer and nurses, who are key members of the health care team in oncology wards in Iran.

Nurse-patient communication

To improve patients' outcomes, nurses cannot deliver health services such as physical care, emotional support, and information exchange without establishing a relationship with their patients. 17 During recent decades, patient-centered communication has been defined as communication in which patients' viewpoints are actively sought by the treatment team, 18 and as a relationship with patients based on trust, respect, reciprocity, and mutually negotiated goals and expectations, which can be an important support and buffer for cancer patients experiencing distress. 19

Communication serves to build and maintain this relationship, to transmit information, to provide support, and to make treatment decisions. Although patient-centered communication between providers and cancer patients can significantly affect clinical outcomes 20 and is an important element in improving patient satisfaction, treatment compliance, and health outcomes, 21 , 22 recent evidence demonstrates that communication in cancer care is often suboptimal, particularly with regard to the emotional experience of the patient. 23

Despite its public acceptance, there is little consensus on the meaning and operationalization of the concept of patient-centered communication, 19 , 24 and a serious limitation is the lack of standard instruments for reviewing and promoting patient-centeredness in patient–healthcare communication. Part of this issue is related to the broad nature of the patient-centeredness construct, which has led to different and often dissimilar instruments shaped by individual researchers' conceptualizations and psychometric choices. 25 Few instruments provide a comprehensive definition of this concept in cancer care within a single tool. 26 A review of the literature in Iran shows that this concept has never been studied there, and although cancer is one of the national research priorities, 27 no quantitative or qualitative study has been carried out and no instrument has been developed.

It is obvious that evaluating the ability of nurses in oncology wards to establish patient-centered communication, and its consequences, requires a reliable instrument grounded in the context and culture of the target group. 26 When a new instrument is designed, measuring and reporting its content validity is of fundamental importance. 8 Therefore, this study was conducted to design and examine the content validity of an instrument measuring patient-centered communication in oncology wards in northwest Iran.

Materials and methods

This methodological study is part of a larger study carried out with an exploratory mixed-methods design (qualitative–quantitative) to develop and psychometrically evaluate an instrument measuring patient-centered communication in oncology wards in northwest Iran. In the qualitative phase, data were collected using a qualitative content analysis approach through semi-structured in-depth interviews with 10 patients with cancer, three family members, and seven oncology nurses in the Ali-Nasab and Shahid Ayatollah Qazi Tabatabai Hospitals of Tabriz. In the quantitative phase, the qualitative and quantitative viewpoints of 15 experts were collected during a two-step process (design and judgment). 3

Ethical considerations, such as approval by the ethics committee of Tabriz University of Medical Sciences, permission from the administrators of the Ali-Nasab and Shahid Ayatollah Qazi Tabatabai Hospitals, anonymity, informed consent, the right to withdraw from the study, and recording permission, were respected.

Stage 1: Instrument Design

Instrument design is performed through a three-step process: determining the content domain, sampling from the content (item generation), and instrument construction. 11 , 14 The first step is determining the content domain of the construct that the instrument is intended to measure. The content domain is the content area related to the variables being measured. 28 It can be identified through a literature review on the topic being measured, interviews with respondents, and focus groups. Through a precise definition of the attributes and characteristics of the desired construct, a clear image of its boundaries, dimensions, and components is obtained. Qualitative research methods can also be applied to determine the variables and concepts of the relevant construct. 29 Qualitative data collected in interviews with respondents familiar with the concept help enrich and develop what has been identified about it and are an invaluable resource for generating instrument items. 30 To determine the content domain, a literature review can be used for emotional instruments and a table of specifications for cognitive instruments. 3 In practice, a table of specifications examines the alignment of a set of items (placed in rows) with the concepts forming the construct under study (placed in columns) by collecting quantitative and qualitative evidence from experts and analyzing the data. 5 Ridenour and Newman also introduced a mixed (deductive–inductive) approach to conceptualization at the stage of content domain determination and item generation. 31 In any case, generating items requires the preliminary task of determining the content domain of a construct. 32 In addition, a useful approach consists of returning to the research questions and ensuring that the instrument items reflect and are relevant to them. 33

Instrument construction is the third step of instrument design, in which the items are refined and organized in a suitable format and sequence so that the finalized items are collected in a usable form. 3

Stage 2: Judgment

This step entails confirmation by a specific number of experts that the instrument items, and the instrument as a whole, have content validity. For this purpose, an expert panel is appointed. Determining the number of experts has always been somewhat arbitrary. At least five people are recommended in order to have sufficient control over chance agreement. The maximum number of judges has not been established, but it is unusual to use more than 10. In any case, as the number of experts increases, the probability of chance agreement decreases. After appointing the expert panel, their quantitative and qualitative viewpoints on the relevancy (or representativeness), clarity, and comprehensiveness of the items, relative to the construct operationally defined by these items, can be collected and analyzed to ensure the content validity of the instrument. 3 , 7 , 8

Quantification of Content Validity

The content validity of an instrument can be determined using the viewpoints of a panel of experts. This panel consists of content experts and lay experts. Lay experts are potential research subjects, while content experts are professionals who have research experience or work in the field. 34 Using members of the target group as experts ensures that the population for whom the instrument is being developed is represented. 1

In the qualitative content validity method, the recommendations of content experts and the target group are adopted regarding grammar, the use of appropriate and correct words, the correct and proper ordering of words in items, and appropriate scoring. 35 In the quantitative content validity method, confidence that the most important and correct content has been selected is quantified by the content validity ratio (CVR). Here, the experts are asked to specify whether an item is necessary for operationalizing the construct. To this end, they score each item from 1 to 3 on a three-point scale: "not necessary," "useful but not essential," and "essential." The content validity ratio varies between -1 and 1; higher scores indicate greater agreement among panel members on the necessity of an item. The formula is CVR = (N_e - N/2) / (N/2), where N_e is the number of panelists indicating "essential" and N is the total number of panelists. The critical value of the CVR is determined from the Lawshe table. For example, in our study with 15 panelists, an item is accepted at an acceptable level of significance if its CVR is larger than 0.49. 36
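For illustration, here is a minimal sketch of the CVR computation just described; the counts in the example are hypothetical, and the 0.49 critical value is the one quoted above for a 15-member panel (critical values for other panel sizes come from Lawshe's table).

```python
# Minimal sketch of the content validity ratio (CVR) described above.

def content_validity_ratio(n_essential, n_experts):
    """CVR = (N_e - N/2) / (N/2); ranges from -1 to 1."""
    half = n_experts / 2
    return (n_essential - half) / half

CRITICAL_CVR_15_EXPERTS = 0.49  # critical value quoted in the text for N = 15

# Hypothetical example: 13 of 15 experts rate an item "essential".
cvr = content_validity_ratio(13, 15)       # (13 - 7.5) / 7.5 = 0.733
keep_item = cvr > CRITICAL_CVR_15_EXPERTS  # True -> the item is retained
print(round(cvr, 3), keep_item)
```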

In reports of instrument development, the most widely reported approach to content validity is the content validity index. 3 , 34 , 37 Panel members are asked to rate the instrument items in terms of clarity and relevancy to the construct under study, as per the theoretical definitions of the construct and its dimensions, on a 4-point ordinal scale (1 [not relevant], 2 [somewhat relevant], 3 [quite relevant], 4 [highly relevant]). 34 A table like the one shown below ( Table 1 ) was added to the cover letter to guide the experts in scoring.

| Relevancy | Clarity |
| --- | --- |
| 1 [not relevant] | 1 [not clear] |
| 2 [item needs some revision] | 2 [item needs some revision] |
| 3 [relevant but needs minor revision] | 3 [clear but needs minor revision] |
| 4 [very relevant] | 4 [very clear] |

To obtain the content validity index for the relevancy and clarity of each item (I-CVI), the number of experts judging the item as relevant or clear (rating 3 or 4) is divided by the total number of content experts. For relevancy, the content validity index can be calculated both at the item level (I-CVI) and at the scale level (S-CVI). At the item level, the I-CVI is computed as the number of experts giving a rating of 3 or 4 to the relevancy of an item, divided by the total number of experts.

The I-CVI expresses the proportion of agreement on the relevancy of each item, which lies between zero and one, 3 , 38 and the S-CVI is defined as "the proportion of total items judged content valid" 3 or "the proportion of items on an instrument that achieved a rating of 3 or 4 by the content experts." 28

Instrument developers rarely report which method they used to compute the scale-level index (S-CVI). 6 There are two methods for calculating it: one requires universal agreement among experts (S-CVI/UA), while a less conservative method averages the item-level CVIs (S-CVI/Ave). To calculate them, the scale is first dichotomized by combining ratings of 3 and 4 and ratings of 1 and 2, yielding the two response categories "relevant" and "not relevant" for each item. 3 , 34 In the universal agreement approach, the number of items considered relevant by all judges (i.e., items with an I-CVI of 1) is divided by the total number of items. In the average approach, the sum of the I-CVIs is divided by the total number of items. 10 Table 2 illustrates the calculation of the I-CVI and S-CVI by both methods; the data were extracted from our panel's judgments on the relevancy of the items of the trust-building dimension, a subscale of the patient-centered communication construct. As the values obtained from the two methods may differ, instrument developers should state which method they used. 6 Davis proposes that researchers consider 80 percent agreement or higher among judges as acceptable for new instruments. 34 Judgment on each item is then made as follows: if the I-CVI is higher than 79 percent, the item is appropriate; if it is between 70 and 79 percent, it needs revision; if it is less than 70 percent, it is eliminated. 39
| Item | Experts rating 3 or 4 (relevant) | Experts rating 1 or 2 (not relevant) | I-CVI * | Judgment |
| --- | --- | --- | --- | --- |
| 1 | 14 | 0 | 1 | Appropriate |
| 2 | 12 | 2 | 0.857 | Appropriate |
| 3 | 13 | 1 | 0.928 | Appropriate |
| 4 | 12 | 2 | 0.857 | Appropriate |
| 5 | 11 | 3 | 0.785 | Need for revision |
| 6 | 14 | 0 | 1 | Appropriate |
| 7 | 12 | 2 | 0.857 | Appropriate |
| 8 | 8 | 6 | 0.571 | Eliminated |
| 9 | 14 | 0 | 1 | Appropriate |

Number of experts = 14. Number of items considered relevant by all panelists = 3; number of items = 9; S-CVI/Ave *** (average of the I-CVIs) = 0.872; S-CVI/UA ** = 3/9 = 0.333. NOTE: * Item-level content validity index, ** Scale-level content validity index/universal agreement, *** Scale-level content validity index/average. Interpretation of I-CVIs: if the I-CVI is higher than 79 percent, the item is appropriate; if it is between 70 and 79 percent, it needs revision; if it is less than 70 percent, it is eliminated.
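As a worked illustration, the following sketch reproduces the Table 2 calculations (I-CVI per item, S-CVI/UA, and S-CVI/Ave) from the trust-building relevancy counts listed above, applying the decision rule quoted in the text.

```python
# Reproducing the Table 2 calculations for the trust-building subscale
# (14 experts, 9 items; counts taken from the table above).

N_EXPERTS = 14
relevant_counts = [14, 12, 13, 12, 11, 14, 12, 8, 14]  # experts rating 3 or 4

i_cvis = [count / N_EXPERTS for count in relevant_counts]

def judge(i_cvi):
    """Decision rule quoted in the text."""
    if i_cvi > 0.79:
        return "Appropriate"
    if i_cvi >= 0.70:
        return "Needs revision"
    return "Eliminated"

s_cvi_ua = sum(1 for v in i_cvis if v == 1.0) / len(i_cvis)  # 3/9 = 0.333
s_cvi_ave = sum(i_cvis) / len(i_cvis)                        # about 0.872

for count, v in zip(relevant_counts, i_cvis):
    print(count, round(v, 3), judge(v))
print("S-CVI/UA:", round(s_cvi_ua, 3), "S-CVI/Ave:", round(s_cvi_ave, 3))
```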

Although the content validity index is extensively used by researchers to estimate content validity, it does not account for the possibility of inflated values due to chance agreement. Therefore, Wynd et al. propose reporting both the content validity index and a multi-rater kappa statistic in content validity studies because, unlike the CVI, kappa adjusts for chance agreement. Chance agreement is a concern when studying agreement indices among raters, especially when a four-point rating is collapsed into the two classes "relevant" and "not relevant." 7 In other words, the kappa statistic is a consensus index of inter-rater agreement that adjusts for chance agreement 10 and is an important supplement to the CVI because it provides information about the degree of agreement beyond chance. 7 Nevertheless, the content validity index is mostly used by researchers because it is simple to calculate, easy to understand, and provides information about each item, which can be used for the modification or deletion of instrument items. 6 , 10

To calculate the modified kappa statistic, the probability of chance agreement was first calculated for each item by the following formula:

P_C = [N! / (A! (N - A)!)] * 0.5^N

In this formula, N= number of experts in a panel and A= number of panelists who agree that the item is relevant.

After calculating the I-CVI for all instrument items, kappa was computed by entering the probability of chance agreement (P_C) and the content validity index of each item (I-CVI) into the following formula:

K = (I-CVI - P_C) / (1 - P_C)

Kappa values above 0.74 are considered excellent, values between 0.60 and 0.74 good, and values between 0.40 and 0.59 fair. 40

Polit states that, after controlling items by calculating the adjusted kappa, each item with an I-CVI equal to or higher than 0.78 can be considered excellent. Researchers should note that, as the number of experts on the panel increases, the probability of chance agreement diminishes and the values of the I-CVI and kappa converge. 10
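A minimal sketch of the modified kappa adjustment, implementing the P_C and K formulas given above; because of rounding, the resulting values may differ slightly from the entries reported in Table 4.

```python
# Modified kappa for one item, following the formulas above:
# P_c = [N! / (A!(N - A)!)] * 0.5^N and K = (I-CVI - P_c) / (1 - P_c).

from math import comb

def modified_kappa(n_agree, n_experts):
    i_cvi = n_agree / n_experts
    p_chance = comb(n_experts, n_agree) * 0.5 ** n_experts
    kappa = (i_cvi - p_chance) / (1 - p_chance)
    if kappa > 0.74:
        rating = "excellent"
    elif kappa >= 0.60:
        rating = "good"
    elif kappa >= 0.40:
        rating = "fair"
    else:
        rating = "poor"
    return i_cvi, p_chance, kappa, rating

# Example: 12 of 14 experts rate an item as relevant.
print(modified_kappa(12, 14))  # I-CVI = 0.857, small P_c, kappa around 0.86 -> excellent
```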

Asking the panel members to evaluate the instrument in terms of comprehensiveness is the last step of measuring content validity. The panel members are asked to judge whether the instrument items, and each of its dimensions, constitute a complete and comprehensive sample of the content as far as the theoretical definitions of the concept and its dimensions are concerned, and whether any item needs to be eliminated or added. Based on the members' judgments, the proportion of agreement is calculated for the comprehensiveness of each dimension and of the entire instrument by dividing the number of experts who rated the comprehensiveness as favorable by the total number of experts. 3 , 37

Determining face validity of an instrument

Face validity addresses the question of whether an instrument appears valid to subjects, patients, and/or other participants, that is, whether the designed instrument is apparently related to the construct under study and whether participants agree with the items and their wording in relation to the research objectives. Face validity concerns the appearance and apparent attractiveness of an instrument, which may affect its acceptability to respondents. 11 Strictly speaking, face validity is not considered validity in the measurement sense: it does not address what is measured, but only the appearance of the instrument. 9 To determine the face validity of an instrument, researchers use the viewpoints of respondents and experts. In the qualitative method, face-to-face interviews are carried out with some members of the target groups. The difficulty level of the items, their suitability, the relationship between the items and the main objective of the instrument, ambiguity and possible misinterpretation of items, and/or incomprehensibility of the wording are the issues discussed in these interviews. 41

Although content experts play a vital role in content validity, instrument review by a sample of subjects drawn from the target population is another important component of content validation. These individuals are asked to review the instrument items because of their familiarity with the construct through direct personal experience. 37 They are also asked to identify the items they consider most important for them and to grade their importance on a 5-point Likert scale: very important (5), important (4), relatively important (3), slightly important (2), and unimportant (1). In the quantitative method, the item impact score is calculated as follows: first, the proportion of patients who score an item's importance as 4 or 5 (frequency) and the mean importance score of the item (importance) are computed; the item impact score is then obtained by the formula: Item Impact Score = Frequency × Importance.

If the item impact score of an item is equal to or greater than 1.5 (which corresponds to a frequency of 50% and a mean importance of 3 on the 5-point Likert scale), the item is retained in the instrument; otherwise, it is eliminated. 42
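The sketch below illustrates the item impact score computation described above; the ratings are hypothetical, not data from this study.

```python
# Item impact score = frequency x importance, where frequency is the proportion
# of respondents rating the item 4 or 5 and importance is the mean rating.

def item_impact_score(importance_ratings):
    frequency = sum(1 for r in importance_ratings if r >= 4) / len(importance_ratings)
    importance = sum(importance_ratings) / len(importance_ratings)
    return frequency * importance

# Hypothetical ratings from 10 lay experts on the 5-point importance scale.
ratings = [5, 4, 4, 3, 5, 4, 2, 5, 4, 3]
score = item_impact_score(ratings)          # 0.7 * 3.9 = 2.73
print(round(score, 2), "keep" if score >= 1.5 else "eliminate")
```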

Results of stage 1: Designing the patient-centered communication measuring instrument

In the first stage of our research, which was performed through qualitative content analysis based on semi-structured in-depth interviews with ten patients with cancer, three family members, and seven oncology nurses, the results led to identifying the content domain within seven dimensions: trust building, intimacy/friendship, patient activation, problem solving, emotional support, informational support, and spirituality strengthening. Each of these content domains was defined theoretically by combining the qualitative study with a literature review. In the item generation step, 260 items were generated from these dimensions and combined with 32 items obtained from the literature and related instruments. The research group then reviewed the items for overlap and duplication. Finally, 188 items remained for the operational definition of the construct of patient-centered communication, and the preliminary instrument comprised these 188 items (the item pool) within seven dimensions.

Results of stage 2: Judgment of the expert panel on the validity of the patient-centered communication measuring instrument

In the second step, after selecting fifteen content experts, including instrument development experts (four people), cancer research experts (four people), nurse–patient communication experts (three people), and four nurses experienced in cancer care, an expert panel was created for making quantitative and qualitative judgments on the instrument items. The panel members were asked three times to judge the content validity ratio, the content validity index, and the comprehensiveness of the instrument. In each round, they were also asked to judge the face validity of the instrument. In each round of correspondence, via e-mail or in person, a letter of request was presented that included the study objectives, an account of the instrument, the scoring method, and the required instructions for responding. Theoretical definitions of the construct under study, its dimensions, and the items of each dimension were also included in that letter. If no reply was received to the reminder e-mail within a week, a telephone call was made or a meeting was arranged.

In the first round of judgment, 108 of the 188 instrument items were eliminated. These items either had a content validity ratio lower than 0.49 (the critical value from the Lawshe table for our panel of 15 experts) or were merged into remaining items through editing, based on the opinions of the content experts. Table 3 shows a sample of instrument items and the CVR calculation for them.

| Experts rating the item "essential" * | CVR ** | Result |
| --- | --- | --- |
| 9 | 0.0667 | Remained |
| 5 | -0.333 | Eliminated |
| 10 | 0.3333 | Eliminated |
| 15 | 0.8667 | Remained |
| 9 | 0.2 | Eliminated |
| 13 | 0.6 | Remained |
| 7 | -0.2 | Eliminated |
| 7 | -0.067 | Eliminated |
| 13 | 0.6 | Remained |

NOTE: * Number of experts who rated the item "essential". ** CVR (content validity ratio) = (N_e - N/2)/(N/2). With 15 people on the expert panel (N = 15), items with a CVR larger than 0.49 remained in the instrument and the rest were eliminated.

The remaining items were modified according to the recommendations of the panel members in the first round of judgment. In a second round, to determine the content validity index and further modify the instrument, the panel members were asked to rate the relevancy and clarity of the instrument items on the 1-to-4 scale of the Waltz and Bausell content validity index. 38

In the second round, the proportion of agreement among the panel members on the relevancy and clarity of the 80 items remaining from the first round of judgment was calculated.

To obtain the content validity index for each item, the number of experts judging the item as relevant was divided by the number of content experts (N = 14); as one of the 15 panel members had not scored some items, the analyses were based on 14 judges. The same was done for the clarity of the items. The agreement among the judges for the entire instrument was calculated only for relevancy, using both the average and the universal agreement approaches.

In this round, among the 80 instrument items, 4 items with a CVI score lower than 0.70 were eliminated. Eight items with a CVI between 0.70 and 0.79 were modified according to the recommendations of the panel members and the research group. Two items were eliminated despite having favorable CVI scores: one was eliminated for ethical reasons, as some content experts believed that the item "I often think about death but I don't speak about it with my nurses" might cause moral harm to a patient; for the other, "Nurses know how to communicate with me," some experts believed that eliminating it would not harm the definition of the trust-building dimension. Following the experts' suggestions, one item ("Nurses try to ensure that I face no problems during care") was added in this round. After modification, the instrument, now containing 57 items, was sent to the panel members a third time for judgment on the relevancy, clarity, and comprehensiveness of the items in each dimension and on the need to delete or add items. In this round, four items had a CVI lower than 0.70 and were eliminated.

The proportion of agreement among the experts in terms of the comprehensiveness of each dimension of the construct was also calculated in this round. Table 4 shows the calculation of the I-CVI, S-CVI, and modified kappa for the 53 items remaining at the end of the third round of judgment. We also used the panel members' judgments on the clarity of the items, as well as their recommendations for modifying them.
| Item | Experts in agreement | I-CVI * | P_C ** | K *** | Evaluation |
| --- | --- | --- | --- | --- | --- |
| D1: Trust Building | | | | | Comprehensiveness: 14 of 14 (1) |
| D1-1 | 14 | 1 | 6.103 | 1 | Excellent |
| D1-2 | 12 | 0.857 | 0.022 | 0.85 | Excellent |
| D1-3 | 12 | 0.857 | 0.022 | 0.85 | Excellent |
| D1-4 | 12 | 0.857 | 6.103 | 0.85 | Excellent |
| D1-5 | 12 | 0.857 | 0.022 | 0.85 | Excellent |
| D1-6 | 14 | 1 | 6.103 | 1 | Excellent |
| D1-7 | 12 | 0.857 | 6.103 | 0.85 | Excellent |
| D1-8 | 14 | 1 | 6.103 | 1 | Excellent |
| D2: Intimacy/Friendship | | | | | Comprehensiveness: 13 of 14 (0.928) |
| D2-1 | 14 | 1 | 6.103 | 1 | Excellent |
| D2-2 | 14 | 1 | 6.103 | 1 | Excellent |
| D2-3 | 14 | 1 | 6.103 | 1 | Excellent |
| D2-4 | 13 | 0.928 | 0.0008 | 0.928 | Excellent |
| D2-5 | 14 | 1 | 6.103 | 1 | Excellent |
| D2-6 | 14 | 1 | 6.103 | 1 | Excellent |
| D2-7 | 12 | 0.857 | 0.022 | 0.85 | Excellent |
| D3: Patient activation | | | | | Comprehensiveness: 14 of 14 (1) |
| D3-1 | 12 | 0.857 | 0.022 | 0.85 | Excellent |
| D3-2 | 13 | 0.928 | 0 | 0.928 | Excellent |
| D3-3 | 13 | 0.928 | 0 | 0.928 | Excellent |
| D3-4 | 13 | 0.928 | 0 | 0.928 | Excellent |
| D3-5 | 14 | 1 | 6.103 | 1 | Excellent |
| D4: Problem Solving | | | | | Comprehensiveness: 14 of 14 (1) |
| D4-1 | 14 | 1 | 6.103 | 1 | Excellent |
| D4-2 | 14 | 1 | 6.103 | 1 | Excellent |
| D4-3 | 12 | 0.857 | 0.022 | 0.85 | Excellent |
| D4-4 | 12 | 0.857 | 0.022 | 0.85 | Excellent |
| D4-5 | 13 | 0.928 | 0 | 0.928 | Excellent |
| D4-6 | 14 | 1 | 6.103 | 1 | Excellent |
| D4-7 | 14 | 1 | 6.103 | 1 | Excellent |
| D5: Emotional support | | | | | Comprehensiveness: 14 of 14 (1) |
| D5-1 | 13 | 0.928 | 0 | 0.928 | Excellent |
| D5-2 | 14 | 1 | 6.103 | 1 | Excellent |
| D5-3 | 12 | 0.857 | 0.022 | 0.85 | Excellent |
| D5-4 | 14 | 1 | 6.103 | 1 | Excellent |
| D5-5 | 13 | 0.928 | 0 | 0.928 | Excellent |
| D5-6 | 12 | 0.857 | 0.02 | 0.85 | Excellent |
| D6: Informational support | | | | | Comprehensiveness: 13 of 14 (0.928) |
| D6-1 | 14 | 1 | 6.103 | 1 | Excellent |
| D6-2 | 13 | 0.928 | 0 | 0.928 | Excellent |
| D6-3 | 14 | 1 | 6.103 | 1 | Excellent |
| D6-4 | 13 | 0.928 | 0 | 0.928 | Excellent |
| D6-5 | 14 | 1 | 6.103 | 1 | Excellent |
| D6-6 | 14 | 1 | 6.103 | 1 | Excellent |
| D7: Spirituality strengthening | | | | | Comprehensiveness: 14 of 14 (1) |
| D7-1 | 12 | 0.857 | 0.022 | 0.85 | Excellent |
| D7-2 | 12 | 0.857 | 0.022 | 0.85 | Excellent |
| D7-3 | 12 | 0.857 | 0.022 | 0.85 | Excellent |
| D7-4 | 14 | 1 | 6.103 | 1 | Excellent |
| D7-5 | 14 | 1 | 6.103 | 1 | Excellent |
| D7-6 | 12 | 0.857 | 0.022 | 0.85 | Excellent |
| D7-7 | 12 | 0.857 | 0.022 | 0.85 | Excellent |
| D7-8 | 13 | 0.928 | 0 | 0.928 | Excellent |
| D7-9 | 14 | 1 | 6.103 | 1 | Excellent |
| D7-10 | 13 | 0.928 | 0 | 0.928 | Excellent |
| D7-11 | 14 | 1 | 6.103 | 1 | Excellent |
| D7-12 | 13 | 0.928 | 0 | 0.928 | Excellent |
| D7-13 | 13 | 0.928 | 0 | 0.928 | Excellent |
| D7-14 | 13 | 0.928 | 0 | 0.928 | Excellent |

Scale summary (53 items): S-CVI/Ave = 0.939; S-CVI/UA = 0.434; agreement on the comprehensiveness of the entire instrument = 14 of 14 (1).

NOTE: * I-CVI: item-level content validity index. ** P_C (probability of chance agreement) was computed using the formula P_C = [N! / (A!(N - A)!)] * 0.5^N, where N = number of experts and A = number of panelists who agree that the item is relevant; number of experts = 14. *** K (modified kappa) was computed using the formula K = (I-CVI - P_C) / (1 - P_C). Interpretation criteria for kappa, using the guidelines described in Cicchetti and Sparrow (1981): fair = K of 0.40 to 0.59; good = K of 0.60 to 0.74; excellent = K > 0.74.

Face validity results of patient-centered communication measuring instrument

A sample of 10 patients with cancer who had a long history of hospitalization in oncology wards (lay experts) was asked to judge the importance, simplicity, and understandability of the items in an interview with a member of the research team. Based on their opinions, concrete examples were added to some items to make them more understandable. For instance, the item "Nurses try not to cause any problem for me" was changed to "During care (e.g., preparation of an intravenous line), nurses try not to cause any problem for me," and the item "Care decisions are made without paying attention to my needs" was changed to "Nurses didn't ask my opinion about care (e.g., the timing of care or the type of interventions)." In addition, a quantitative analysis was performed by calculating the impact score of each item. Nine items had an item impact score of less than 1.5 and were eliminated from the instrument. Finally, at the end of the content validity and face validity process, our instrument comprised seven dimensions and 44 items, ready for the next steps and the remaining psychometric testing.

The present paper demonstrates quantitative indices of content validity for a new instrument and outlines their use during the design and psychometric evaluation of a patient-centered communication measuring instrument. Validation is a lengthy process: content validity should be examined in the first step, and subsequent analyses should address reliability (through internal consistency and test-retest evaluation), construct validity (through factor analysis), and criterion-related validity. 37

Some limitations of content validity studies should be noted. Experts' feedback is subjective, so the study is exposed to whatever biases exist among the experts. In addition, if the content domain is not well identified, this type of study will not necessarily reveal content that has been omitted from the instrument. However, experts are asked to suggest additional items for the instrument, which may help minimize this limitation. 11

A content validity study is a systematic, subjective, two-stage process. In the first stage, the instrument is designed; in the second, the items are judged and quantified, with content experts examining the correspondence between the theoretical and operational definitions. This process should lead the instrument-development effort so that a content-valid instrument is ready for the preliminary test phase. Validation is a lengthy process, of which content validity is only the first step; subsequent analyses should address reliability (internal consistency and test-retest), construct validity (factor analysis), and criterion-related validity. We also showed that although content validity is a subjective process, it can be made more objective.

Understanding content validity is important for clinicians and researchers because they need to know whether the instruments they use are suitable for the construct, the population under study, and the socio-cultural background in which the study is carried out, or whether new or modified instruments are needed.

Training in content validity helps students, researchers, and clinical staff understand, use, and critique research instruments with a more accurate approach.

In general, the content validity study showed that this instrument has an appropriate level of content validity. The overall content validity index computed with the conservative universal agreement approach was low; however, the instrument can still be supported, given the large number of content experts, which makes full consensus difficult, and the high S-CVI obtained with the average approach, which was 0.93.

Acknowledgments

The researchers appreciate patients, nurses, managers, and administrators of Ali-Nasab and Shahid Ayatollah Qazi Tabatabaee hospitals. Approval to conduct this research with no. 5/74/474 was granted by the Hematology and Oncology Research Center affiliated to Tabriz University of Medical Sciences.

Ethical issues

None to be declared.

Conflict of interest

The authors declare no conflict of interest in this study.

Research-Methodology

Research validity in surveys relates to the extent to which the survey measures the right elements that need to be measured. In simple terms, validity refers to how well an instrument measures what it is intended to measure.

Reliability alone is not enough; measures need to be reliable as well as valid. For example, if a weighing scale is off by 4 kg (it deducts 4 kg from the actual weight), it can be considered reliable, because it displays the same weight every time a given item is measured. However, the scale is not valid, because it does not display the item's actual weight.

Research validity can be divided into two groups: internal and external. It can be specified that "internal validity refers to how the research findings match reality, while external validity refers to the extent to which the research findings can be replicated to other environments" (Pelissier, 2008, p.12).

Moreover, validity can also be divided into five types:

1. Face Validity is the most basic type of validity and is associated with the highest level of subjectivity, because it is not based on any scientific approach. In other words, a test may be judged valid by a researcher simply because it appears valid, without an in-depth scientific justification.

Example: a questionnaire designed for a study that analyses issues of employee performance can be assessed as valid because each individual question appears to address specific and relevant aspects of employee performance.

2. Construct Validity relates to assessing how suitable a measurement tool is for measuring the phenomenon being studied. Establishing construct validity can be effectively facilitated by involving a panel of experts closely familiar with the measure and the phenomenon.

Example: with the application of construct validity, the level of leadership competency in a given organisation can be assessed by devising a questionnaire to be answered by operational-level employees that asks about their motivation to carry out their duties on a daily basis.

3. Criterion-Related Validity involves comparing test results with an outcome. This type of validity correlates the results of an assessment with another criterion of assessment.

Example: the nature of customers' perception of a specific company's brand image can be assessed by organising a focus group. The same issue can also be assessed with a questionnaire administered to current and potential customers of the brand. The higher the correlation between the focus group and questionnaire findings, the higher the level of criterion-related validity.
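As a minimal sketch of this idea, assuming hypothetical brand-image scores and Python 3.10+ (the data and names below are ours, not the author's), the correlation between the two assessments can be computed directly:

```python
import statistics

# Hypothetical brand-image ratings for the same eight respondents from two assessments
focus_group_scores = [3.2, 4.1, 2.8, 4.5, 3.9, 3.0, 4.2, 3.6]
questionnaire_scores = [3.0, 4.3, 2.9, 4.4, 3.7, 3.2, 4.0, 3.5]

# Pearson's r (statistics.correlation requires Python 3.10 or later)
r = statistics.correlation(focus_group_scores, questionnaire_scores)
print(f"criterion-related validity evidence: r = {r:.2f}")
# A high positive correlation between the two assessments supports criterion-related validity.
```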

4. Formative Validity refers to assessing how effectively a measure provides information that can be used to improve specific aspects of the phenomenon.

Example: when developing initiatives to increase the effectiveness of organisational culture, if the measure is able to identify specific weaknesses of the organisational culture, such as employee-manager communication barriers, then its level of formative validity can be assessed as adequate.

5. Sampling Validity (similar to content validity) ensures that the measure covers a broad range of the areas within the concept under study. No measure is able to cover all items and elements within the phenomenon; therefore, important items and elements are selected using a sampling method chosen according to the aims and objectives of the study.

Example: when assessing the leadership style exercised in a specific organisation, an assessment of decision-making style alone would not suffice; other issues related to leadership style, such as organisational culture, the personality of leaders, and the nature of the industry, need to be taken into account as well.


Reliability vs Validity | Examples and Differences

Published on September 27, 2024 by Emily Heffernan, PhD .

When choosing how to measure something, you must ensure that your method is both reliable and valid . Reliability concerns how consistent a test is, and validity (or test validity) concerns its accuracy.

Reliability and validity are especially important in research areas like psychology that study constructs . A construct is a variable that cannot be directly measured, such as happiness or anxiety.

Researchers must carefully operationalize , or define how they will measure, constructs and design instruments to properly capture them. Ensuring the reliability and validity of these instruments is a necessary component of meaningful and reproducible research.

Reliability vs validity examples

Reliability | Validity
Whether a test yields the same results when repeated. | How well a test actually measures what it's supposed to.
Is this measurement consistent? | Is this measurement accurate?
A test can be reliable but not valid; you might get consistent results but be measuring the wrong thing. | A valid test must be reliable; if you are measuring something accurately, your results should be consistent.
A bathroom scale produces a different result each time you step on it, even though your weight hasn't changed. The scale is not reliable or valid. | A bathroom scale gives consistent readings (it's reliable) but all measurements are off by 5 pounds (it's not valid).


Table of contents

  • Understanding reliability and validity
  • Reliability vs validity in research
  • Validity vs reliability examples
  • Frequently asked questions about reliability vs validity

Reliability and validity are closely related but distinct concepts.

What is reliability?

Reliability is how consistent a measure is. A test should provide the same results if it’s administered under the same circumstances using the same methods. Different types of reliability assess different ways in which a test should be consistent.

Types of reliability

Type of reliability | What it assesses | Example
Test-retest reliability | Does a test yield the same results each time it's administered (i.e., is it consistent)? | Personality is considered a stable trait. A questionnaire that measures introversion should yield the same results if the same person repeats it several days or months apart.
Interrater reliability | Are test results consistent across different raters or observers? If two people administer the same test, will they get the same results? | Two teaching assistants grade assignments using a rubric. If they each give the same paper a very different score, the rubric lacks interrater reliability.
Internal consistency | Do parts of a test designed to measure the same thing produce the same results? | Seven questions on a math test are designed to test a student's knowledge of fractions. If these questions all measure the same skill, students should perform similarly on them, supporting the test's internal consistency.
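These forms of reliability are typically quantified with simple statistics. The minimal sketch below (Python with NumPy; the data are hypothetical, and Cronbach's alpha is used as one common internal-consistency coefficient, an assumption rather than something named in the table) illustrates a test-retest correlation and an internal-consistency estimate:

```python
import numpy as np

# Test-retest: six people answer the same introversion questionnaire twice (hypothetical scores)
time1 = np.array([12, 18, 25, 9, 22, 15])
time2 = np.array([13, 17, 24, 10, 21, 16])
test_retest_r = np.corrcoef(time1, time2)[0, 1]   # Pearson correlation between the two administrations

# Internal consistency: six students answer four items meant to tap the same skill (hypothetical scores)
items = np.array([
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [4, 4, 4, 3],
    [3, 3, 2, 3],
])

def cronbach_alpha(scores):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

print(f"test-retest r = {test_retest_r:.2f}")             # a high r suggests test-retest reliability
print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")  # a high alpha (commonly 0.7+) suggests internal consistency
```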

What is validity?

Validity (more specifically, test validity ) concerns the accuracy of a test or  measure—whether it actually measures the thing it’s supposed to. You provide evidence of a measure’s test validity by assessing different types of validity .

Types of test validity

Type of test validity | What it assesses | Example
Construct validity | Does a test actually measure the thing it's supposed to? Construct validity is considered the overarching concern of test validity; other types of validity provide evidence of construct validity. | A researcher designs a game to test young children's self-control. However, the game involves a joystick controller and is actually measuring motor coordination. It lacks construct validity.
Content validity | Does a test measure all aspects of the construct it's been designed for? | A survey on insomnia probes whether the respondent has difficulty falling asleep but not whether they have trouble staying asleep. It thus lacks content validity.
Face validity | Does a test seem to measure what it's supposed to? | A scale that measures test anxiety includes questions about how often students feel stressed when taking exams. It has face validity because it clearly evaluates test-related stress.
Criterion validity | Does a test match a "gold-standard" measure (a criterion) of the same thing? The criterion measure can be taken at the same time (concurrent validity) or in the future (predictive validity). | A questionnaire designed to measure academic success in freshmen is compared to their SAT scores (concurrent validity) and their GPA at the end of the academic year (predictive validity). A strong correlation with either of these measures indicates criterion validity.
Convergent validity | Does a test produce results that are close to other tests of related concepts? | A new measure of empathy correlates strongly with performance on a behavioral task where participants donate money to help others in need. The new test demonstrates convergent validity.
Discriminant validity | Does a test produce results that differ from other tests of unrelated concepts? | A test has been designed to measure spatial reasoning. However, its results strongly correlate with a measure of verbal comprehension skills, which should be unrelated. The test lacks discriminant validity.

Test validity concerns the accuracy of a specific measure or test. When conducting experimental research, it is also important to consider experimental validity: whether a true cause-and-effect relationship exists between your dependent and independent variables (internal validity) and how well your results generalize to the real world (external validity).

In experimental research, you test a hypothesis by manipulating an independent variable and measuring changes in a dependent variable. Different forms of experimental validity concern how well-designed an experiment is. Mitigating threats to internal validity and threats to external validity can help yield results that are meaningful and reproducible.

Types of experimental validity

Type of experimental validity | What it measures | Example
Internal validity | Does a true cause-and-effect relationship exist between the independent and dependent variables? | A researcher evaluates a program to treat anxiety. They compare changes in anxiety for a treatment group that completes the program and a control group that does not. However, some people in the treatment group start taking anti-anxiety medication during the study. It is unclear whether the program or the medication caused decreases in anxiety. Internal validity is low.
External validity | Can findings be generalized to other populations, situations, and contexts? | A survey on smartphone use is administered to a large, randomly selected sample of people from various demographic backgrounds. The survey results have high external validity.
Ecological validity | Does the experiment design mimic real-world settings? This is often considered a subset of external validity. | A research team studies conflict by having couples come into a lab and discuss a scripted conflict scenario while an experimenter takes notes on a clipboard. This design does not mimic the conditions of conflict in relationships and lacks ecological validity.

Though reliability and validity are theoretically distinct, in practice both concepts are intertwined.

Reliability is a necessary condition of validity: a measure that is valid must also be reliable. An instrument that is properly measuring a construct of interest should yield consistent results.

However, a measure can be reliable but not valid. Consider a clock that’s set 5 minutes fast. If checked at noon every day, it will consistently read “12:05.” Though the clock yields reliable results, it is not valid: it does not accurately reflect reality.

Because reliability is a necessary condition of validity, it makes sense to evaluate the reliability of a measure before assessing its validity. In research, validity is more important but harder to measure than reliability. It is relatively straightforward to assess whether a measurement yields consistent results across different contexts, but how can you be certain a measurement of a construct like “happiness” actually measures what you want it to?

Reliability and validity should be considered throughout the research process. Validity is especially important during study design, when you are determining how to measure relevant constructs. Reliability should be considered both when designing your study and when collecting data—careful planning and consistent execution are key.

Reliability and validity are both important when conducting research. Consider the following examples of how a measure may or may not be reliable and valid.

  • Reliability : This measure is not reliable—two observers might count smiles differently when observing the same meeting, leading to inconsistent results. The measure therefore lacks interrater reliability .
  • Validity : Casey’s assumption that smiling signifies teamwork was incorrect. Her measure fails to capture aspects of teamwork like cooperation, communication, and collaboration. It lacks content validity , and its overall construct validity is poor.

Casey can choose a different measure in an attempt to improve the reliability and validity of her study.

  • Reliability : This measure has high test-retest reliability . Provided someone has a consistent schedule, they will respond in the same manner each time they answer this question.
  • Validity : Casey assesses the convergent validity of her new measure by determining the correlation between people’s responses and their teams’ performance metrics over the last quarter. There is no correlation between the two, which does not support the construct validity of this measure. The number of meetings a team has does not seem to capture how well they work as a team.

Even though the measure in the previous example is reliable, it lacks validity. Casey must try a different approach.

  • Reliability : Casey ensures that her questionnaire includes clear, well-defined questions; has participants answer using well-structured Likert scales; and trains team members on how to respond objectively. This helps ensure the interrater reliability and test-retest reliability of her measure.
  • Validity: Casey tests the test validity of her questionnaire using several approaches. She has an experienced manager evaluate whether her instrument addresses various aspects of teamwork to confirm content validity . She also once again compares her questionnaire results to team performance metrics and finds a high correlation between the two, indicating convergent validity . Casey also finds that questionnaire responses do not correlate with individual salaries, demonstrating divergent validity . This all provides evidence of the test validity of her questionnaire.

Psychology and other social sciences often involve the study of constructs —phenomena that cannot be directly measured—such as happiness or stress.

Because we cannot directly measure a construct, we must instead operationalize it, or define how we will approximate it using observable variables. These variables could include behaviors, survey responses, or physiological measures.

Validity is the extent to which a test or instrument actually captures the construct it’s been designed to measure. Researchers must demonstrate that their operationalization properly captures a construct by providing evidence of multiple types of validity , such as face validity , content validity , criterion validity , convergent validity , and discriminant validity .

When you find evidence of different types of validity for an instrument, you’re proving its construct validity —you can be fairly confident it’s measuring the thing it’s supposed to.

In short, validity helps researchers ensure that they’re measuring what they intended to, which is especially important when studying constructs that cannot be directly measured and instead must be operationally defined.

A construct is a phenomenon that cannot be directly measured, such as intelligence, anxiety, or happiness. Researchers must instead approximate constructs using related, measurable variables.

The process of defining how a construct will be measured is called operationalization. Constructs are common in psychology and other social sciences.

To evaluate how well a construct measures what it’s supposed to, researchers determine construct validity . Face validity , content validity , criterion validity , convergent validity , and discriminant validity all provide evidence of construct validity.

Test validity refers to whether a test or measure actually measures the thing it’s supposed to. Construct validity is considered the overarching concern of test validity; other types of validity provide evidence of construct validity and thus the overall test validity of a measure.

Experimental validity concerns whether a true cause-and-effect relationship exists in an experimental design ( internal validity ) and how well findings generalize to the real world ( external validity and ecological validity ).

Verifying that an experiment has both test and experimental validity is imperative to ensuring meaningful and generalizable results.

An experiment is a study that attempts to establish a cause-and-effect relationship between two variables.

In experimental design , the researcher first forms a hypothesis . They then test this hypothesis by manipulating an independent variable while controlling for potential confounds that could influence results. Changes in the dependent variable are recorded, and data are analyzed to determine if the results support the hypothesis.

Nonexperimental research does not involve the manipulation of an independent variable. Nonexperimental studies therefore cannot establish a cause-and-effect relationship. Nonexperimental studies include correlational designs and observational research.



Construct Validity | Definition, Types, & Examples

Published on February 17, 2022 by Pritha Bhandari . Revised on June 22, 2023.

Construct validity is about how well a test measures the concept it was designed to evaluate. It’s crucial to establishing the overall validity of a method.

Assessing construct validity is especially important when you’re researching something that can’t be measured or observed directly, such as intelligence, self-confidence, or happiness. You need multiple observable or measurable indicators to measure those constructs or run the risk of introducing research bias into your work.

  • Content validity : Is the test fully representative of what it aims to measure?
  • Face validity : Does the content of the test appear to be suitable to its aims?
  • Criterion validity : Do the results accurately measure the concrete outcome they are designed to measure?

Table of contents

  • What is a construct?
  • What is construct validity?
  • Types of construct validity
  • How do you measure construct validity?
  • Threats to construct validity
  • Other interesting articles
  • Frequently asked questions about construct validity

A construct is a theoretical concept, theme, or idea based on empirical observations. It’s a variable that’s usually not directly measurable.

Some common constructs include:

  • Self-esteem
  • Logical reasoning
  • Academic motivation
  • Social anxiety

Constructs can range from simple to complex. For example, a concept like hand preference is easily assessed:

  • A simple survey question : Ask participants which hand is their dominant hand.
  • Observations : Ask participants to perform simple tasks, such as picking up an object or drawing a cat, and observe which hand they use to execute the tasks.

A more complex concept, like social anxiety, requires more nuanced measurements, such as psychometric questionnaires and clinical interviews.

Simple constructs tend to be narrowly defined, while complex constructs are broader and made up of dimensions. Dimensions are different parts of a construct that are coherently linked to make it up as a whole.

As a construct, social anxiety is made up of several dimensions.

  • Psychological dimension: Intense fear and anxiety
  • Physiological dimension: Physical stress indicators
  • Behavioral dimension: Avoidance of social settings


Construct validity concerns the extent to which your test or measure accurately assesses what it’s supposed to.

In research, it’s important to operationalize constructs into concrete and measurable characteristics based on your idea of the construct and its dimensions.

Be clear on how you define your construct and how the dimensions relate to each other before you collect or analyze data . This helps you ensure that any measurement method you use accurately assesses the specific construct you’re investigating as a whole and helps avoid biases and mistakes like omitted variable bias or information bias .

For example, a questionnaire designed to measure social anxiety might include items such as the following:

  • How often do you avoid entering a room when everyone else is already seated?
  • Do other people tend to describe you as quiet?
  • When talking to new acquaintances, how often do you worry about saying something foolish?
  • To what extent do you fear giving a talk in front of an audience?
  • How often do you avoid making eye contact with other people?
  • Do you prefer to have a small number of close friends over a big group of friends?

When designing or evaluating a measure, it’s important to consider whether it really targets the construct of interest or whether it assesses separate but related constructs.

It’s crucial to differentiate your construct from related constructs and make sure that every part of your measurement technique is solely focused on your specific construct.

  • Does your questionnaire solely measure social anxiety?
  • Are all aspects of social anxiety covered by the questions?
  • Do your questions avoid measuring other relevant constructs like shyness or introversion?

There are two main types of construct validity.

  • Convergent validity: The extent to which your measure corresponds to measures of related constructs
  • Discriminant validity: The extent to which your measure is unrelated or negatively related to measures of distinct constructs

Convergent validity

Convergent validity is the extent to which measures of the same or similar constructs actually correspond to each other.

In research studies, you expect measures of related constructs to correlate with one another. If you have two related scales, people who score highly on one scale tend to score highly on the other as well.

Discriminant validity

Conversely, discriminant validity means that two measures of unrelated constructs that should be unrelated, very weakly related, or negatively related actually are in practice.

You check for discriminant validity the same way as convergent validity: by comparing results for different measures and assessing whether or how they correlate.

How do you select unrelated constructs? It’s good to pick constructs that are theoretically distinct or opposing concepts within the same category.

For example, if your construct of interest is a personality trait (e.g., introversion), it’s appropriate to pick a completely opposing personality trait (e.g., extroversion). You can expect results for your introversion test to be negatively correlated with results for a measure of extroversion.

Alternatively, you can pick non-opposing unrelated concepts and check there are no correlations (or weak correlations) between measures.
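As a minimal sketch of this correlation-based check, assuming hypothetical scores and NumPy (none of the data, measures, or names below come from the article), a new introversion measure could be compared against an established introversion scale, an opposing extroversion measure, and an unrelated variable:

```python
import numpy as np

# Hypothetical scores for eight participants
new_introversion = np.array([30, 24, 18, 27, 21, 15, 29, 20])
established_introversion = np.array([28, 25, 17, 26, 22, 14, 30, 19])   # related, established measure
extroversion = np.array([10, 16, 25, 12, 20, 27, 11, 22])               # theoretically opposing construct
shoe_size = np.array([42, 38, 39, 43, 44, 42, 40, 40])                  # unrelated concept

r_convergent = np.corrcoef(new_introversion, established_introversion)[0, 1]
r_opposing = np.corrcoef(new_introversion, extroversion)[0, 1]
r_unrelated = np.corrcoef(new_introversion, shoe_size)[0, 1]

print(f"vs established introversion scale: r = {r_convergent:.2f}")  # expect strongly positive (convergent)
print(f"vs extroversion measure:           r = {r_opposing:.2f}")    # expect strongly negative (discriminant)
print(f"vs shoe size:                      r = {r_unrelated:.2f}")   # expect near zero (discriminant)
```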

You often focus on assessing construct validity after developing a new measure. It’s best to test out a new measure with a pilot study, but there are other options.

  • A pilot study is a trial run of your study. You test out your measure with a small sample to check its feasibility, reliability , and validity . This helps you figure out whether you need to tweak or revise your measure to make sure you’re accurately testing your construct.
  • Statistical analyses are often applied to test validity with data from your measures. You test convergent and discriminant validity with correlations to see if results from your test are positively or negatively related to those of other established tests.
  • You can also use regression analyses to assess whether your measure is actually predictive of outcomes that you expect it to predict theoretically. A regression analysis that supports your expectations strengthens your claim of construct validity (a minimal sketch follows this list).
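Such a regression check might look like the following sketch; the data, the predicted outcome, and the expected direction of the relationship are assumptions for illustration, not taken from the article.

```python
import numpy as np

# Hypothetical: does a new test-anxiety measure predict exam performance, as theory suggests it should?
anxiety_score = np.array([12, 25, 18, 30, 8, 22, 15, 27])
exam_score = np.array([82, 64, 75, 58, 90, 70, 80, 62])

slope, intercept = np.polyfit(anxiety_score, exam_score, deg=1)   # simple linear regression
predicted = intercept + slope * anxiety_score
r_squared = 1 - np.sum((exam_score - predicted) ** 2) / np.sum((exam_score - exam_score.mean()) ** 2)

print(f"slope = {slope:.2f}, R^2 = {r_squared:.2f}")
# A slope in the theoretically expected direction with a substantial R^2 supports the claim
# that the measure predicts the outcome it should, strengthening construct validity.
```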


It’s important to recognize and counter threats to construct validity for a robust research design. The most common threats are:

  • Poor operationalization
  • Experimenter expectancies
  • Subject bias

A big threat to construct validity is poor operationalization of the construct.

A good operational definition of a construct helps you measure it accurately and precisely every time. Your measurement protocol is clear and specific, and it can be used under different conditions by other people.

Without a good operational definition, you may have random or systematic error , which compromises your results and can lead to information bias . Your measure may not be able to accurately assess your construct.

Experimenter expectancies about a study can bias your results. It’s best to be aware of this research bias and take steps to avoid it.

To combat this threat, use researcher triangulation and involve people who don’t know the hypothesis in taking measurements in your study. Since they don’t have strong expectations, they are unlikely to bias the results.

When participants hold expectations about the study, their behaviors and responses are sometimes influenced by their own biases. This can threaten your construct validity because you may not be able to accurately measure what you’re interested in.

You can mitigate subject bias by using masking (blinding) to hide the true purpose of the study from participants. By giving them a cover story for your study, you can lower the effect of subject bias on your results, as well as prevent them guessing the point of your research, which can lead to demand characteristics , social desirability bias , and a Hawthorne effect .

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Degrees of freedom
  • Null hypothesis
  • Discourse analysis
  • Control groups
  • Mixed methods research
  • Non-probability sampling
  • Quantitative research
  • Ecological validity

Research bias

  • Rosenthal effect
  • Implicit bias
  • Cognitive bias
  • Selection bias
  • Negativity bias
  • Status quo bias

Construct validity is about how well a test measures the concept it was designed to evaluate. It's one of four types of measurement validity; the others are content validity, face validity, and criterion validity.

There are two subtypes of construct validity.

  • Convergent validity : The extent to which your measure corresponds to measures of related constructs
  • Discriminant validity : The extent to which your measure is unrelated or negatively related to measures of distinct constructs

When designing or evaluating a measure, construct validity helps you ensure you’re actually measuring the construct you’re interested in. If you don’t have construct validity, you may inadvertently measure unrelated or distinct constructs and lose precision in your research.

Construct validity is often considered the overarching type of measurement validity ,  because it covers all of the other types. You need to have face validity , content validity , and criterion validity to achieve construct validity.

Statistical analyses are often applied to test validity with data from your measures. You test convergent validity and discriminant validity with correlations to see if results from your test are positively or negatively related to those of other established tests.

You can also use regression analyses to assess whether your measure is actually predictive of outcomes that you expect it to predict theoretically. A regression analysis that supports your expectations strengthens your claim of construct validity .


