freedom (d.f.)
The d.f. for tests of association based on contingency tables, along with the associated PLINK keywords, are shown for allele and genotype counts in case and control groups comprising N individuals at a bi-allelic locus with alleles a and A.
Tests of association can also be conducted with likelihood ratio (LR) methods in which inference is based on the likelihood of the genotyped data given disease status. The likelihood of the observed data under the proposed model of disease association is compared with the likelihood of the observed data under the null model of no association; a high LR value tends to discredit the null hypothesis. All disease models can be tested using LR methods. In large samples, the χ² and LR methods can be shown to be equivalent under the null hypothesis 14.
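The closeness of the two approaches in large samples can be illustrated with a minimal Python sketch that computes both the Pearson χ² statistic and the LR statistic for a 2 × 2 table of allele counts. The function names and the allele counts used here are hypothetical and chosen only for illustration; the P value assumes a 1 d.f. reference distribution.

```python
import math


def allelic_tests(case_a, case_A, ctrl_a, ctrl_A):
    """Return (Pearson chi-square, LR statistic) for a 2 x 2 table of allele counts."""
    observed = [[case_a, case_A], [ctrl_a, ctrl_A]]
    total = case_a + case_A + ctrl_a + ctrl_A
    row_sums = [case_a + case_A, ctrl_a + ctrl_A]
    col_sums = [case_a + ctrl_a, case_A + ctrl_A]
    chi2 = 0.0
    lr = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the null hypothesis of no association
            expected = row_sums[i] * col_sums[j] / total
            obs = observed[i][j]
            chi2 += (obs - expected) ** 2 / expected
            if obs > 0:  # a zero cell contributes nothing to the LR statistic
                lr += 2.0 * obs * math.log(obs / expected)
    return chi2, lr


def p_value_1df(statistic):
    """Upper-tail probability of a chi-square distribution with 1 d.f."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical allele counts: 1,000 case alleles and 1,000 control alleles
chi2, lr = allelic_tests(600, 400, 500, 500)
print(f"chi2 = {chi2:.2f}, LR = {lr:.2f}, P = {p_value_1df(chi2):.1e}")
```

With counts of this size the two statistics differ only slightly, which is the behavior expected from the large-sample equivalence described above.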
More complicated logistic regression models of association are used when there is a need to include additional covariates to handle complex traits. Examples of this are situations in which we expect disease risk to be modified by environmental effects such as epidemiological risk factors (e.g., smoking and gender), clinical variables (e.g., disease severity and age at onset) and population stratification (e.g., principal components capturing variation due to differential ancestry 3 ), or by the interactive and joint effects of other marker loci. In logistic regression models, the logarithm of the odds of disease is the response variable, with linear (additive) combinations of the explanatory variables (genotype variables and any covariates) entering into the model as its predictors. For suitable linear predictors, the regression coefficients fitted in the logistic regression represent the log of the ORs for disease gene association described above. Linear predictors for genotype variables in a selection of standard disease models are shown in Table 3 .
Linear predictors for genotype variables in a selection of standard disease models.
Genotype | Model | |||
---|---|---|---|---|
Multiplicative | Genotypic | Recessive | Dominant | |
/ | ||||
/ | + | + + | + | |
/ | + 2 | + | + | + |
Interpretation | provides an estimate of the log odds ratio for disease risk associated with each additional allele (also called the haplotype relative risk). If is significant, then there is a multiplicative contribution to disease risk in that the odds ratio for disease risk increases multiplicatively for every additional allele | and provide estimates of the log odds ratio for disease risk in individuals with genotypes and , respectively, relative to an individual with genotype . A likelihood ratio test of whether both and are significant is equivalent to the conventional 2 d.f. test for association in a 2 × 3 contingency table | provides an estimate of the log odds ratio for disease risk in an individual with at least 1 allele (genotype or ) compared with an individual with no alleles (genotype ). A test of whether is significant corresponds to a 1 d.f. test for association in a 2 × 2 contingency table of disease outcome by genotype classified as or not | provides an estimate of the log odds ratio for disease risk in an individual with two alleles (genotype compared with an individual with one or two alleles (genotype or ). A test of whether is significant corresponds to a 1 d.f. test for association in a 2 × 2 contingency table of disease outcome by genotype classified as or not |
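As a rough illustration of how a genotype enters the linear predictor, the following Python sketch codes a genotype as a single explanatory variable under the multiplicative, dominant and recessive models described in the interpretation row above. The names mu (intercept) and beta (genotype coefficient) are notation assumed here, A is assumed to be the allele being counted, and the numeric values are hypothetical.

```python
def genotype_predictor(n_A, model):
    """Code a genotype, given as the number of copies of allele A (0, 1 or 2)."""
    if n_A not in (0, 1, 2):
        raise ValueError("n_A must be 0, 1 or 2")
    if model == "multiplicative":
        return n_A                    # one step per additional A allele
    if model == "dominant":
        return 1 if n_A >= 1 else 0   # at least one A allele versus none
    if model == "recessive":
        return 1 if n_A == 2 else 0   # two A alleles versus fewer than two
    raise ValueError("unknown model: " + model)


def log_odds(n_A, model, mu, beta):
    """Linear predictor (log odds of disease) for one genotype."""
    return mu + beta * genotype_predictor(n_A, model)


# Hypothetical coefficients, for illustration only
for model in ("multiplicative", "dominant", "recessive"):
    print(model, [log_odds(g, model, mu=-1.0, beta=0.5) for g in (0, 1, 2)])
```

The genotypic model is not covered by this single-variable coding because it needs two coefficients, one for each non-reference genotype.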
Controlling for multiple testing to accurately estimate significance thresholds is a very important aspect of studies involving many genetic markers, particularly GWA studies. The type I error rate, also called the significance level or false-positive rate, is the probability of rejecting the null hypothesis when it is true. The significance level indicates the proportion of false positives that an investigator is willing to tolerate in his or her study. The family-wise error rate (FWER) is the probability of making one or more type I errors in a set of tests. Lower FWERs restrict the proportion of false positives at the expense of reducing the power to detect association when it truly exists. A suitable FWER should be specified at the design stage of the analysis 1. It is then important to keep track of the number of statistical comparisons performed and correct the individual SNP-based significance thresholds for multiple testing to maintain the overall FWER. For association tests applied at each of n SNPs, per-test significance levels of α* for a given FWER of α can be simply approximated using Bonferroni (α* = α/n) or Sidak 15, 16 (α* = 1 − (1 − α)^(1/n)) adjustments. When tests are independent, the Sidak correction is exact; however, in GWA studies comprising dense sets of markers, this is unlikely to be true and both corrections are then very conservative. A similar but slightly less-stringent alternative to the Bonferroni correction is given by Holm 17. Alternatives to the FWER approach include false discovery rate (FDR) procedures 18, 19, which control for the expected proportion of false positives among those SNPs declared significant. However, dependence between markers and the small number of expected true positives make FDR procedures problematic for GWA studies.
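The two per-test thresholds can be computed directly, as in the minimal Python sketch below. The function names are hypothetical, and the numbers of tests (37 and 1,000,000) are illustrative choices rather than recommendations.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level alpha* = alpha / n."""
    return alpha / n_tests


def sidak_threshold(alpha, n_tests):
    """Per-test significance level alpha* = 1 - (1 - alpha)^(1/n).

    Written with log1p/expm1 so that precision is kept when n is very large.
    """
    return -math.expm1(math.log1p(-alpha) / n_tests)


for n in (37, 1_000_000):
    print(n, bonferroni_threshold(0.05, n), sidak_threshold(0.05, n))
```

The Sidak threshold is always slightly larger (less stringent) than the Bonferroni threshold for the same α and n, and the difference becomes negligible as n grows.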
Alternatively, permutation approaches aim to render the null hypothesis correct by randomization: essentially, the original P value is compared with the empirical distribution of P values obtained by repeating the original tests while randomly permuting the case-control labels 20. Although Bonferroni and Sidak corrections provide a simple way to adjust for multiple testing by assuming independence between markers, permutation testing is considered to be the ‘gold standard’ for accurate correction 20. Permutation procedures are computationally intensive in the setting of GWA studies and, moreover, apply only to the current genotyped data set; therefore, unless the entire genome is sequenced, they cannot generate truly genome-wide significance thresholds. Bayes factors have also been proposed for the measurement of significance 6. For GWA studies of dense SNPs and resequence data, a standard genome-wide significance threshold of 7.2 × 10⁻⁸ for the UK Caucasian population has been proposed by Dudbridge and Gusnanto 21. Other thresholds for contemporary populations, based on sample size and proposed FWER, have been proposed by Hoggart et al. 22. Informally, some journals have accepted a genome-wide significance threshold of 5 × 10⁻⁷ as strong evidence for association 6; however, most recently, the accepted standard is 5 × 10⁻⁸ (ref. 23). Further, graphical techniques for assessing whether observed P values are consistent with expected values include log quantile-quantile P value plots that highlight loci that deviate from the null hypothesis 24.
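The idea of a family-wise permutation correction can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure implemented in any particular program: it uses the allelic χ² statistic, compares each observed statistic with the maximum statistic across SNPs in each permuted data set (equivalent to comparing with the minimum P value when every test has the same d.f.), and estimates the adjusted P value as (r + 1)/(n + 1), where r is the number of permutations that match or exceed the observed value. All function names and data are hypothetical.

```python
import random


def allelic_chi2(genotypes, is_case):
    """Pearson chi-square for the 2 x 2 allele-count table of one SNP.

    genotypes: number of copies (0, 1 or 2) of one allele per individual.
    is_case:   True for cases, False for controls, in the same order.
    """
    n_case = 2 * sum(is_case)
    n_ctrl = 2 * (len(is_case) - sum(is_case))
    case_alt = sum(g for g, c in zip(genotypes, is_case) if c)
    ctrl_alt = sum(g for g, c in zip(genotypes, is_case) if not c)
    table = [[case_alt, n_case - case_alt], [ctrl_alt, n_ctrl - ctrl_alt]]
    rows = [n_case, n_ctrl]
    total = n_case + n_ctrl
    cols = [case_alt + ctrl_alt, total - case_alt - ctrl_alt]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    return stat


def familywise_permutation_p(genotype_matrix, is_case, n_perm=1000, seed=0):
    """Adjusted P value per SNP from permuting the case-control labels."""
    rng = random.Random(seed)
    observed = [allelic_chi2(snp, is_case) for snp in genotype_matrix]
    labels = list(is_case)
    exceed = [0] * len(observed)
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any genotype-phenotype association
        max_stat = max(allelic_chi2(snp, labels) for snp in genotype_matrix)
        for k, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[k] += 1
    return [(count + 1) / (n_perm + 1) for count in exceed]
```

Because the labels are permuted jointly for all SNPs, the correlation (LD) between markers is preserved in every permuted data set, which is why this approach does not rely on the independence assumption made by the Bonferroni and Sidak corrections.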
A significant result in an association test rarely implies that a SNP is directly influencing disease risk; population association can be direct, indirect or spurious. A direct, or causal, association occurs when different alleles at the marker locus are directly involved in the etiology of the disease through a biological pathway. Such associations are typically only found during follow-up genotyping phases of initial GWA studies, or in focused CG studies in which particular functional polymorphisms are targeted. An indirect, or non-causal, association occurs when the alleles at the marker locus are correlated (in LD) with alleles at a nearby causal locus but do not directly influence disease risk. When a significant finding in a genetic association study is true, it is most likely to be indirect. Spurious associations can occur as a consequence of data quality issues or statistical sampling, or because of confounding by population stratification or admixture. Population stratification occurs when cases and controls are sampled disproportionately from different populations with distinct genetic ancestry. Admixture occurs when there has been genetic mixing of two or more groups in the recent past. For example, genetic admixture is seen in Native American populations in which there has been recent genetic mixing of individuals with both American Indian and Caucasian ancestry 25 . Confounding occurs when a factor exists that is associated with both the exposure (genotype) and the disease but is not a consequence of the exposure. As allele frequencies and disease frequencies are known to vary among populations of different genetic ancestry, population stratification or admixture can confound the association between the disease trait and the genetic marker; it can bias the observed association, or indeed can cause a spurious association. 
Principal component analyses or multidimensional scaling methods are commonly used to identify and remove individuals exhibiting divergent ancestry before association testing. These techniques are described in detail in an earlier protocol 3. To adjust for any residual population structure during association testing, the principal components from principal component analyses or multidimensional scaling methods can be included as covariates in a logistic regression. In addition, the technique of genomic control 26 can be used to detect and compensate for the presence of fine-scale or within-population stratification during association testing. Under genomic control, population stratification is treated as a random effect that causes the distribution of the χ² association test statistics to have an inflated variance and a higher median than would otherwise be observed. The test statistics are assumed to be uniformly affected by an inflation factor λ, the magnitude of which is estimated from a set of selected markers by comparing the median of their observed test statistics with the median of their expected test statistics under an assumption of no population stratification. Under genomic control, if λ > 1, then population stratification is assumed to exist and a correction is applied by dividing the actual association test χ² statistic values by λ. As λ scales with sample size, λ₁₀₀₀, the inflation factor for an equivalent study of 1,000 cases and 1,000 controls calculated by rescaling λ, is often reported 27. In a CG study, λ can only be determined if an additional set of markers specifically designed to indicate population stratification are genotyped. In a GWA study, an unbiased estimation of λ can be determined using all of the genotyped markers; the effect on the inflation factor of potential causal SNPs in such a large set of genomic control markers is assumed to be negligible.
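The median-based estimate of λ and the corresponding correction can be written as a short Python sketch. It assumes 1 d.f. χ² test statistics, for which the median under the null hypothesis is approximately 0.4549; the function name and the example statistics are hypothetical.

```python
import statistics

# Median of a chi-square distribution with 1 d.f. (approximate value)
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(chi2_stats):
    """Return (lambda, corrected statistics) for a list of 1 d.f. chi-square values."""
    lam = statistics.median(chi2_stats) / CHI2_1DF_MEDIAN
    # A correction is applied only when lambda exceeds 1
    divisor = max(lam, 1.0)
    return lam, [stat / divisor for stat in chi2_stats]


lam, corrected = genomic_control([0.2, 0.9098728, 5.0])
print(lam, corrected)
```

The median is used rather than the mean because it is robust to a small number of very large statistics from truly associated markers.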
Replication occurs when a positive association from an initial study is confirmed in a subsequent study involving an independent sample drawn from the same population as the initial study. It is the process by which genetic association results are validated. In theory, a repeated significant association between the same trait and allele in an independent sample is the benchmark for replication. However, in practice, so-called replication studies often comprise findings of association between the same trait and nearby variants in the same gene as the original SNP, or between the same SNP and different high-risk traits. A precise definition of what constitutes replication for any given study is therefore important and should be clearly stated 28 .
In practice, replication studies often involve different investigators with different samples and study designs aiming to independently verify reports of positive association and obtain accurate effect-size estimates, regardless of the designs used to detect effects in the primary study. Two commonly used strategies in such cases are an exact strategy, in which only marker loci indicating a positive association are subsequently genotyped in the replicate sample, and a local strategy, in which additional variants are also included, thus combining replication with fine-mapping objectives. In general, the exact strategy is more balanced in power and efficiency; however, depending on local patterns of LD and the strength of primary association signals, a local strategy can be beneficial 28 .
In the past, multistage designs have been proposed as cost-efficient approaches to allow the possibility of replication within a single overall study. The first stage of a standard two-stage design involves genotyping a large number of markers on a proportion of available samples to identify potential signals of association using a nominal P value threshold. In stage two, the top signals are then followed up by genotyping them on the remaining samples while a joint analysis of data from both stages is conducted 29 , 30 . Significant signals are subsequently tested for replication in a second data set. With the ever-decreasing costs of GWA genotyping, two-stage studies have become less common.
Standard statistical software (such as R ( ref. 31 ) or SPSS) can be used to conduct and visualize all the analyses outlined above. However, many researchers choose to use custom-built GWA software. In this protocol we use PLINK 32 , Haploview 33 and the customized R package car 34 . PLINK is a popular and computationally efficient software program that offers a comprehensive and well-documented set of automated GWA quality control and analysis tools. It is a freely available open source software written in C++, which can be installed on Windows, Mac and Unix machines ( http://pngu.mgh.harvard.edu/~purcell/plink/index.shtml ). Haploview ( http://www.broadinstitute.org/haploview/haploview ) is a convenient tool for visualizing LD; it interfaces directly with PLINK to produce a standard visualization of PLINK association results. Haploview is most easily run through a graphical user interface, which offers many advantages in terms of display functions and ease of use. car ( http://socserv.socsci.mcmaster.ca/jfox/ ) is an R package that contains a variety of functions for graphical diagnostic methods.
The next section describes protocols for the analysis of SNP data and is illustrated by the use of simulated data sets from CG and GWA studies (available as gzipped files from http://www.well.ox.ac.uk/ggeu/NPanalysis/ or .zip files as Supplementary Data 1 and Supplementary Data 2 ). We assume that SNP data for a CG study, typically comprising on the order of thousands of markers, will be available in a standard PED and MAP file format (for an explanation of these file formats, see http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped ) and that SNP data for a GWA study, typically comprising on the order of hundreds of thousands of markers, will be available in a standard binary file format (for an explanation of the binary file format, see http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#bed ). In general, SNP data for either type of study may be available in either format. The statistical analysis described here is for the analysis of one SNP at a time; therefore, apart from the requirement to take potentially differing input file formats into account, it does not differ between CG and GWA studies.
Computer workstation with Unix/Linux operating system and web browser
1 | For SNP data available in standard PED and MAP file formats, as in our CG study, follow option A. For SNP data available in standard binary file format, as in our GWA study, follow option B. The instructions provided here are for unpacking the sample data provided as gzipped files at http://www.well.ox.ac.uk/ggeu/NPanalysis/ . If using the .zip files provided as Supplementary Data 1 or Supplementary Data 2, please proceed directly to step 2.
▲ CRITICAL STEP The format in which genotype data are returned to investigators varies according to genome-wide SNP platforms and genotyping centers. We assume that genotypes have been called by the genotyping center, undergone appropriate quality control filters as described in a previous protocol 3 and returned as clean data in a standard file format.
▲ CRITICAL STEP The simulated data used here have passed standard quality control filters: all individuals have a missing data rate of < 20%, and SNPs with a missing rate of > 5%, a MAF < 1% or an HWE P value < 1 × 10⁻⁴ have already been excluded. These filters were selected in accordance with procedures described elsewhere 3 to minimize the influence of genotype-calling artifacts in a CG study.
▲ CRITICAL STEP We assume that covariate files are available in a standard file format. For an explanation of the standard format for covariate files, see http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#covar .
▲ CRITICAL STEP Optimized binary BED files contain the genotype information and the corresponding BIM/FAM files contain the map and pedigree information. The binary BED file is a compressed file that allows faster processing in PLINK and takes less storage space, thus facilitating the analysis of large-scale data sets 32 .
▲ CRITICAL STEP The simulated data used here have passed standard quality control: all individuals have a missing data rate of < 10%. SNPs with a missing rate > 10%, a MAF < 1% or an HWE P value < 1 × 10⁻⁵ have already been excluded. These filters were selected in accordance with procedures described elsewhere 3 to minimize the influence of genotype-calling artifacts in a GWA study.
? TROUBLESHOOTING
2 | To obtain a summary of MAFs in case and control populations and an estimate of the OR for association between the minor allele (based on the whole sample) and disease in the CG study, type ‘plink --file cg --assoc --out data’. In any of the PLINK commands in this protocol, replace the ‘--file cg’ option with the ‘--bfile gwa’ option to use the binary file format of the GWA data rather than the PED and MAP file format of the CG data.
▲ CRITICAL STEP PLINK always creates a log file called ‘data.log’, which includes details of the implemented commands, the number of cases and controls in the input files, any excluded data and the genotyping rate in the remaining data. This file is very useful for checking that the software is successfully completing commands.
▲ CRITICAL STEP The options in a PLINK command can be specified in any order.
3 | Open the output file ‘data.assoc’. It has one row per SNP containing the chromosome [CHR], the SNP identifier [SNP], the base-pair location [BP], the minor allele [A1], the frequency of the minor allele in the cases [F_A] and controls [F_U], the major allele [A2] and statistical data for an allelic association test including the χ² test statistic [CHISQ], the asymptotic P value [P] and the estimated OR for association between the minor allele and disease [OR].
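For readers who prefer to inspect the output programmatically, the following Python sketch reads a whitespace-delimited file with the column names listed above and returns the rows with the smallest P values. It assumes that the first non-empty line is the header and that undefined values are written as ‘NA’; the function names are hypothetical.

```python
def read_assoc(lines):
    """Parse whitespace-delimited association output into a list of dicts."""
    header = None
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if header is None:
            header = fields  # first non-empty line holds the column names
            continue
        rows.append(dict(zip(header, fields)))
    return rows


def top_hits(rows, n=5):
    """Return the n rows with the smallest asymptotic P value."""
    usable = [row for row in rows if row["P"] != "NA"]
    return sorted(usable, key=lambda row: float(row["P"]))[:n]
```

A typical use would be `with open("data.assoc") as handle: hits = top_hits(read_assoc(handle))`, after which each element of `hits` can be queried by column name, for example `hits[0]["SNP"]`.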
4 | When there are no covariates to consider, carry out simple χ² tests of association by following option A. For inclusion of multiple covariates and covariate interactions, follow option B.
▲ CRITICAL STEP Genotypic, dominant and recessive tests will not be conducted if any one of the cells in the table of case control by genotype counts contains fewer than five observations. This is because the χ² approximation may not be reliable when cell counts are small. For SNPs with MAFs < 5%, a sample of more than 2,000 cases and controls would be required to meet this threshold and more than 50,000 would be required for SNPs with MAF < 1%. To change the threshold, use the ‘--cell’ option. For example, we could lower the threshold to 3 and repeat the χ² tests of association by typing ‘plink --file cg --model --cell 3 --out data’.
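One way to read the sample sizes quoted above is as the point at which the expected number of minor-allele homozygotes in the whole sample reaches five. Under Hardy-Weinberg equilibrium this expected number is N × MAF², so the required N is 5/MAF². The Python sketch below reproduces that arithmetic; this reading is an assumption made here, and the function name is hypothetical.

```python
def min_total_sample(maf, min_cell=5):
    """Sample size N at which N * maf**2 (expected minor homozygotes under HWE) equals min_cell."""
    return min_cell / maf ** 2


for maf in (0.05, 0.01):
    print(maf, round(min_total_sample(maf)))
```

Because the genotype table is split by case-control status, the smallest cell in practice holds only part of these homozygotes, so the figures are best treated as lower bounds.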
▲ CRITICAL STEP To specify a genotypic, dominant or recessive model in place of a multiplicative model, include the model option --genotypic, --dominant or --recessive, respectively. To include sex as a covariate, include the option --sex. To specify interactions between covariates, and between SNPs and covariates, include the option --interaction. Open the output file ‘data.assoc.logistic’. If no model option is specified, the first row for each SNP corresponds to results for a multiplicative test of association. If the ‘--genotypic’ option has been selected, the first row will correspond to a test for additivity and the subsequent row to a separate test for deviation from additivity. If the ‘--dominant’ or ‘--recessive’ model options have been selected, then the first row will correspond to tests for a dominant or recessive model of association, respectively. If covariates have been included, each of these P values is adjusted for the effect of the covariates. The C ≥ 0 subsequent rows for each SNP correspond to separate tests of significance for each of the C covariates included in the regression model. Finally, if the ‘--genotypic’ model option has been selected, there is a final row per SNP corresponding to a 2 d.f. LR test of whether both the additive and the deviation from additivity components of the regression model are significant. 
Each row contains the chromosome [CHR], the SNP identifier [SNP], the base-pair location [BP], the minor allele [A1], the test performed [TEST: ADD (multiplicative model or genotypic model testing additivity), GENO_2DF (genotypic model), DOMDEV (genotypic model testing deviation from additivity), DOM (dominant model) or REC (recessive model)], the number of non-missing individuals included [NMISS], the OR, the coefficient z-statistic [STAT] and the asymptotic P value [P].
▲ CRITICAL STEP ORs for main effects cannot be interpreted directly when interactions are included in the model; their interpretation depends on the exact combination of variables included in the model. Refer to a standard text on logistic regression for more details 36.
5 | To create quantile-quantile plots to compare the observed association test statistics with their expected values under the null hypothesis of no association and so assess the number, magnitude and quality of true associations, follow option A. Note that quantile-quantile plots are only suitable for GWA studies comprising hundreds of thousands of markers. To create a Manhattan plot to display the association test P values as a function of chromosomal location and thus provide a visual summary of association test results that draws immediate attention to any regions of significance, follow option B. To visualize the LD between sets of markers in an LD plot, follow option C. Manhattan and LD plots are suitable for both GWA and CG studies comprising any number of markers. Otherwise, create customized graphics for the visualization of association test output using simple customized R 31 commands 37 (not detailed here).
6 | For CG studies, typically comprising up to thousands of markers, control for multiple testing using Bonferroni’s adjustment (follow option A); Holm, Sidak or FDR (follow option B) methods; or permutation (follow option C). Although Bonferroni, Holm, Sidak and FDR are simple to implement, permutation testing is widely recommended for accurately correcting for multiple testing and should be used when computationally possible. For GWA studies, select an appropriate genome-wide significance threshold (follow option D).
▲ CRITICAL STEP If some of the SNPs are in LD so that there are fewer than 40 independent tests, the Bonferroni correction will be too conservative. Use LD information from HapMap and SNPSpD ( http://genepi.qimr.edu.au/general/daleN/SNPSpD/ ) 35 to estimate the effective number of independent SNPs 1 . Derive the per-test significance rate α* by dividing α by the effective number of independent SNPs.
7 | For CG studies, typically comprising up to thousands of markers, calculate the inflation factor λ (follow option A). For GWA studies, obtain an unbiased evaluation of the inflation factor λ by using all tested SNPs (follow option B).
▲ CRITICAL STEP To assess the inflation factor in CG studies, an additional set of null marker loci, which are common SNPs not associated with the disease and not in LD with CG SNPs, must be available. We do not have any null loci data files available for our CG study.
Open the PLINK log file ‘data.log’ that records the inflation factor.
▲ CRITICAL STEP When the sample size is large, the inflation factor λ₁₀₀₀, for an equivalent study of 1,000 cases and 1,000 controls, can be calculated by rescaling λ according to the following formula
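A rescaling that is commonly used for this purpose is λ₁₀₀₀ = 1 + (λ − 1) × (1/n_cases + 1/n_controls) / (1/1,000 + 1/1,000). The Python sketch below implements that expression; it is offered as an assumption about the intended rescaling rather than as a formula taken from this protocol, and the function name and example values are hypothetical.

```python
def lambda_1000(lam, n_cases, n_controls):
    """Rescale an inflation factor to an equivalent study of 1,000 cases and 1,000 controls."""
    scale = (1.0 / n_cases + 1.0 / n_controls) / (1.0 / 1000 + 1.0 / 1000)
    return 1.0 + (lam - 1.0) * scale


# Hypothetical example: lambda = 1.2 observed with 2,000 cases and 2,000 controls
print(lambda_1000(1.2, 2000, 2000))
```

Under this rescaling, a study that is larger than 1,000 cases and 1,000 controls yields a λ₁₀₀₀ closer to 1 than the observed λ, reflecting the fact that inflation grows with sample size.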
For general help on the programs and websites used in this protocol, refer to the relevant websites:
Step 1: If genotypes are not available in standard PED and MAP or binary file formats, both Goldsurfer2 (Gs2; see refs. 38 , 39 ) and PLINK have the functionality to read other file formats (e.g., HapMap, HapMart, Affymetrix, transposed file sets and long-format file sets) and convert these into PED and MAP or binary file formats.
Steps 2–6: The default missing genotype character is ‘0’. PLINK can recognize a different character as the missing genotype by using the ‘--missing-genotype’ option. For example, specify a missing genotype character of ‘N’ instead of ‘0’ in Step 2 by typing ‘plink --file cg --assoc --missing-genotype N --out data’.
None of the programs used take longer than a few minutes to run. Displaying and interpreting the relevant information are the rate-limiting steps.
Summary of results.
Table 4 shows the unadjusted P value for an allelic test of association in the CG region, as well as corresponding adjusted P values for SNPs with significant P values. Here we have defined a P value to be significant if at least one of the adjusted values is smaller than the threshold required to maintain a FWER of 0.05. The top four SNPs are significant according to every method of adjustment for multiple testing. The last SNP is only significant according to the FDR method of Benjamini and Hochberg, and statements of significance should be made with some caution.
SNPs in the CG study showing the strongest association signals.
Chr | SNP | Unadjusted: allelic test of association | Adjusted: genomic control | Adjusted: Bonferroni | Adjusted: Holm | Adjusted: Sidak single step | Adjusted: Sidak step-down | Adjusted: FDR BH | Adjusted: FDR BY | Adjusted: family-wise permutation |
---|---|---|---|---|---|---|---|---|---|---|
3 | rs1801282 | 3.92E-14 | 2.22E-05 | 1.45E-12 | 1.61E-12 | 1.61E-12 | 1.61E-12 | 1.61E-12 | 6.92E-12 | 9.90E-03 |
3 | rs12636454 | 5.54E-07 | 4.99E-03 | 2.05E-05 | 2.22E-05 | 2.27E-05 | 2.22E-05 | 1.14E-05 | 4.89E-05 | 9.90E-03 |
3 | rs4135247 | 1.27E-05 | 1.44E-02 | 4.71E-04 | 4.96E-04 | 5.21E-04 | 4.96E-04 | 1.64E-04 | 7.05E-04 | 9.90E-03 |
3 | rs2120825 | 1.60E-05 | 1.56E-02 | 5.92E-04 | 6.08E-04 | 6.56E-04 | 6.08E-04 | 1.64E-04 | 7.05E-04 | 9.90E-03 |
3 | rs3856806 | 3.62E-03 | 1.03E-01 | 1.34E-01 | 1.34E-01 | 1.38E-01 | 1.26E-01 | 2.97E-02 | 1.28E-01 | 9.90E-02 |
Shown are adjusted and unadjusted P values for those SNPs with significant P values in an allelic test of association according to at least one method of adjustment for multiple testing. Chr, chromosome; FDR, false discovery rate; BH, Benjamini and Hochberg; BY, Benjamini and Yekutieli.
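To make the step-wise adjustments more concrete, the Python sketch below computes Holm and Benjamini-Hochberg (BH) adjusted P values from a list of unadjusted P values. It follows the textbook definitions of the two procedures and is not claimed to reproduce the exact values in the table above; the function names and the example P values are hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted P values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, idx in enumerate(order):  # rank 0 is the smallest P value
        candidate = min(1.0, (n - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


def bh_adjust(p_values):
    """Benjamini-Hochberg FDR adjusted P values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n - 1, -1, -1):  # walk from the largest P value down
        idx = order[rank]
        candidate = p_values[idx] * n / (rank + 1)
        running_min = min(running_min, candidate)  # enforce monotonicity
        adjusted[idx] = running_min
    return adjusted


# Hypothetical unadjusted P values
print(holm_adjust([0.01, 0.04, 0.03]))
print(bh_adjust([0.01, 0.04, 0.03]))
```

The Holm values control the FWER and are therefore never smaller than the BH values, which control the FDR; this is why a SNP can pass the BH criterion while failing every FWER-based adjustment.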
Figure 1 shows an LD plot based on CG data. Numbers within diamonds indicate the r² values. SNPs with significant P values (P value < 0.05 and listed in Table 4) in the CG study are shown in white boxes. Six haplotype blocks of LD across the region have been identified and are marked in black. The LD plot shows that the five significant SNPs belong to three different haplotype blocks within the region studied: three out of five significantly associated SNPs are located in Block 2, which is a 52-kb block of high LD (r² > 0.34). The two remaining significant SNPs are each located in separate blocks, Block 3 and Block 5. Results indicate possible allelic heterogeneity (the presence of multiple independent risk-associated variants). Further fine mapping would be required to locate the precise causal variants.
LD plot. LD plot showing LD patterns among the 37 SNPs genotyped in the CG study. The LD between the SNPs is measured as r² and shown (× 100) in the diamond at the intersection of the diagonals from each SNP. r² = 0 is shown as white, 0 < r² < 1 is shown in gray and r² = 1 is shown in black. The analysis track at the top shows the SNPs according to chromosomal location. Six haplotype blocks (outlined in bold black line) indicating markers that are in high LD are shown. At the top, the markers with the strongest evidence for association (listed in Table 4) are boxed in white.
Figure 2 shows the quantile-quantile plots for two different tests of association in the GWA data, one based on χ² statistics from a test of allelic association and another based on −log10 P values from a logistic regression under a multiplicative model of association. These plots show only minor deviations from the null distribution, except in the upper tail of the distribution, which corresponds to the SNPs with the strongest evidence for association. Because the majority of the results follow the null distribution and only a handful deviate from it, the plots suggest that there is no population structure unaccounted for in the analysis. These plots thus give confidence in the quality of the data and the robustness of the analysis. Both these plots are included here for illustration purposes only; typically only one (corresponding to the particular test of association) is required.
Quantile-quantile plots. Quantile-quantile plots of the results from the GWA study of ( a ) a simple χ 2 allelic test of association and ( b ) a multiplicative test of association based on logistic regression for all 306,102 SNPs that have passed the standard quality control filters. The solid line indicates the middle of the first and third quartile of the expected distribution of the test statistics. The dashed lines mark the 95% confidence interval of the expected distribution of the test statistics. Both plots show deviation from the null distribution only in the upper tails, which correspond to SNPs with the strongest evidence for association.
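The coordinates of a −log10 P quantile-quantile plot can be generated with the short Python sketch below. It assumes that, under the null hypothesis, the P values are uniformly distributed, and it uses the plotting positions (i − 0.5)/n for the i-th smallest of n P values; other plotting positions are also in common use. The function name is hypothetical.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10 P values, sorted in ascending order."""
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    # Expected quantiles of -log10 P for uniformly distributed P values
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))
```

Points that fall on the diagonal indicate agreement with the null distribution; points rising above it in the upper tail correspond to SNPs with stronger evidence for association than expected by chance.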
Figure 3 shows a Manhattan plot for the allelic test of association in the GWA study. SNPs with significant P values are easy to distinguish, corresponding to those values with large −log10 P values. Three black ellipses mark regions on chromosomes 3, 8 and 16 that reach genome-wide significance (P < 5 × 10⁻⁸). Markers in these regions would then require further scrutiny through replication in an independent sample for confirmation of a true association.
Manhattan plot. Manhattan plot of simple χ² allelic test of association P values from the GWA study. The plot shows −log10 P values for each SNP against chromosomal location. Values for each chromosome (Chr) are shown in different colors for visual effect. Three regions are highlighted where markers have reached genome-wide significance (P value < 5 × 10⁻⁸).
Acknowledgments.
G.M.C. is funded by the Wellcome Trust. F.H.P. is funded by the Wellcome Trust. C.A.A. is funded by the Wellcome Trust (WT91745/Z/10/Z). A.P.M. is supported by a Wellcome Trust Senior Research Fellowship. K.T.Z. is supported by a Wellcome Trust Research Career Development Fellowship.
Note: Supplementary information is available in the HTML version of this article.
COMPETING FINANCIAL INTERESTS The authors declare no competing financial interests.
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/ .
IMAGES
COMMENTS