Published: 13 September 2013

The Human Genome Project: big science transforms biology and medicine

Leroy Hood & Lee Rowen

Genome Medicine volume 5, Article number: 79 (2013)


The Human Genome Project has transformed biology through its integrated big science approach to deciphering a reference human genome sequence along with the complete sequences of key model organisms. The project exemplifies the power, necessity and success of large, integrated, cross-disciplinary efforts - so-called ‘big science’ - directed towards complex major objectives. In this article, we discuss the ways in which this ambitious endeavor led to the development of novel technologies and analytical tools, and how it brought the expertise of engineers, computer scientists and mathematicians together with biologists. It established an open approach to data sharing and open-source software, thereby making the data resulting from the project accessible to all. The genome sequences of microbes, plants and animals have revolutionized many fields of science, including microbiology, virology, infectious disease and plant biology. Moreover, deeper knowledge of human sequence variation has begun to alter the practice of medicine. The Human Genome Project has inspired subsequent large-scale data acquisition initiatives such as the International HapMap Project, 1000 Genomes, and The Cancer Genome Atlas, as well as the recently announced Human Brain Project and the emerging Human Proteome Project.

Origins of the Human Genome Project

The Human Genome Project (HGP) has profoundly changed biology and is rapidly catalyzing a transformation of medicine [ 1 – 3 ]. The idea of the HGP was first publicly advocated by Renato Dulbecco in an article published in 1984, in which he argued that knowing the human genome sequence would facilitate an understanding of cancer [ 4 ]. In May 1985 a meeting focused entirely on the HGP was held, with Robert Sinsheimer, the Chancellor of the University of California, Santa Cruz (UCSC), assembling 12 experts to debate the merits of this potential project [ 5 ]. The meeting concluded that the project was technically possible, although very challenging. However, there was controversy as to whether it was a good idea, with six of those assembled declaring themselves for the project, six against (and those against felt very strongly). The naysayers argued that big science is bad science because it diverts resources from the ‘real’ small science (such as single investigator science); that the genome is mostly junk that would not be worth sequencing; that we were not ready to undertake such a complex project and should wait until the technology was adequate for the task; and that mapping and sequencing the genome was a routine and monotonous task that would not attract appropriate scientific talent. Throughout the early years of advocacy for the HGP (mid- to late 1980s) perhaps 80% of biologists were against it, as was the National Institutes of Health (NIH) [ 6 ]. The US Department of Energy (DOE) initially pushed for the HGP, partly using the argument that knowing the genome sequence would help us understand the radiation effects on the human genome resulting from exposure to atom bombs and other aspects of energy transmission [ 7 ]. This DOE advocacy was critical to stimulating the debate and ultimately the acceptance of the HGP. Curiously, there was more support from the US Congress than from most biologists. 
Those in Congress understood the appeal of international competitiveness in biology and medicine, the potential for industrial spin-offs and economic benefits, and the potential for more effective approaches to dealing with disease. A National Academy of Sciences committee report endorsed the project in 1988 [ 8 ] and the tide of opinion turned: the program was initiated in 1990, and the finished sequence was published in 2004, ahead of schedule and under budget [ 9 ].

What did the Human Genome Project entail?

This 3-billion-dollar, 15-year program evolved considerably as genomics technologies improved. Initially, the HGP set out to determine a human genetic map, then a physical map of the human genome [ 10 ], and finally the sequence map. Throughout, the HGP was instrumental in pushing the development of high-throughput technologies for preparing, mapping and sequencing DNA [ 11 ]. At the inception of the HGP in the early 1990s, there was optimism that the then-prevailing sequencing technology would be replaced. This technology, now called ‘first-generation sequencing’, relied on gel electrophoresis to create sequencing ladders, and radioactive- or fluorescent-based labeling strategies to perform base calling [ 12 ]. It was considered to be too cumbersome and low throughput for efficient genomic sequencing. As it turned out, the initial human genome reference sequence was deciphered using a 96-capillary (highly parallelized) version of first-generation technology. Alternative approaches such as multiplexing [ 13 ] and sequencing by hybridization [ 14 ] were attempted but not effectively scaled up. Meanwhile, thanks to the efforts of biotech companies, successive incremental improvements in the cost, throughput, speed and accuracy of first-generation automated fluorescent-based sequencing strategies were made throughout the duration of the HGP. Because biologists were clamoring for sequence data, the goal of obtaining a full-fledged physical map of the human genome was abandoned in the later stages of the HGP in favor of generating the sequence earlier than originally planned. This push was accelerated by Craig Venter’s bold plan to create a company (Celera) for the purpose of using a whole-genome shotgun approach [ 15 ] to decipher the sequence instead of the piecemeal clone-by-clone approach using bacterial artificial chromosome (BAC) vectors that was being employed by the International Consortium. 
Venter’s initiative prompted government funding agencies to endorse production of a clone-based draft sequence for each chromosome, with the finishing to come in a subsequent phase. These parallel efforts accelerated the timetable for producing a genome sequence of immense value to biologists [ 16 , 17 ].

As a key component of the HGP, it was wisely decided to sequence the smaller genomes of significant experimental model organisms such as yeast, a small flowering plant ( Arabidopsis thaliana ), worm and fruit fly before taking on the far more challenging human genome. The efforts of multiple centers were integrated to produce these reference genome sequences, fostering a culture of cooperation. There were originally 20 centers mapping and sequencing the human genome as part of an international consortium [ 18 ]; in the end five large centers (the Wellcome Trust Sanger Institute, the Broad Institute of MIT and Harvard, The Genome Institute of Washington University in St Louis, the Joint Genome Institute, and the Whole Genome Laboratory at Baylor College of Medicine) emerged from this effort, with these five centers continuing to provide genome sequence and technology development. The HGP also fostered the development of mathematical, computational and statistical tools for handling all the data it generated.

The HGP produced a curated and accurate reference sequence for each human chromosome, with only a small number of gaps, and excluding large heterochromatic regions [ 9 ]. In addition to providing a foundation for subsequent studies in human genomic variation, the reference sequence has proven essential for the development and subsequent widespread use of second-generation sequencing technologies, which began in the mid-2000s. Second-generation cyclic array sequencing platforms produce, in a single run, up to hundreds of millions of short reads (originally approximately 30 to 70 bases, now up to several hundred bases), which are typically mapped to a reference genome at highly redundant coverage [ 19 ]. A variety of cyclic array sequencing strategies (such as RNA-Seq, ChIP-Seq, bisulfite sequencing) have significantly advanced biological studies of transcription and gene regulation as well as genomics, progress for which the HGP paved the way.
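As a rough illustration of what 'highly redundant coverage' means, the expected depth can be estimated as the number of reads multiplied by the read length and divided by the genome length. The sketch below is a back-of-the-envelope calculation rather than anything taken from the cited work; the function name, the run size and the round genome length of 3 billion bases are assumptions chosen for illustration.

```python
def expected_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of times each base is sequenced, assuming reads
    are spread uniformly over the genome (an idealization)."""
    return num_reads * read_length / genome_length


# Hypothetical run: 900 million reads of 100 bases each.
ASSUMED_GENOME_LENGTH = 3_000_000_000  # round figure, assumed for illustration
depth = expected_coverage(900_000_000, 100, ASSUMED_GENOME_LENGTH)
print(f"expected coverage: {depth:.0f}x")
```

Real coverage is uneven, so the average depth is only a planning figure; regions with extreme GC content or repeats will typically fall well below it.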

Impact of the Human Genome Project on biology and technology

First, the human genome sequence initiated the comprehensive discovery and cataloguing of a ‘parts list’ of most human genes [ 16 , 17 ], and by inference most human proteins, along with other important elements such as non-coding regulatory RNAs. Understanding a complex biological system requires knowing the parts, how they are connected, their dynamics and how all of these relate to function [ 20 ]. The parts list has been essential for the emergence of ‘systems biology’, which has transformed our approaches to biology and medicine [ 21 , 22 ].

As an example, the ENCODE (Encyclopedia Of DNA Elements) Project, launched by the NIH in 2003, aims to discover and understand the functional parts of the genome [ 23 ]. Using multiple approaches, many based on second-generation sequencing, the ENCODE Project Consortium has produced voluminous and valuable data related to the regulatory networks that govern the expression of genes [ 24 ]. Large datasets such as those produced by ENCODE raise challenging questions regarding genome functionality. How can a true biological signal be distinguished from the inevitable biological noise produced by large datasets [ 25 , 26 ]? To what extent is the functionality of individual genomic elements only observable (used) in specific contexts (for example, regulatory networks and mRNAs that are operative only during embryogenesis)? It is clear that much work remains to be done before the functions of poorly annotated protein-coding genes will be deciphered, let alone those of the large regions of the non-coding portions of the genome that are transcribed. What is signal and what is noise is a critical question.

Second, the HGP also led to the emergence of proteomics, a discipline focused on identifying and quantifying the proteins present in discrete biological compartments, such as a cellular organelle, an organ or the blood. Proteins - whether they act as signaling devices, molecular machines or structural components - constitute the cell-specific functionality of the parts list of an organism’s genome. The HGP has facilitated the use of a key analytical tool, mass spectrometry, by providing the reference sequences and therefore the predicted masses of all the tryptic peptides in the human proteome - an essential requirement for the analysis of mass-spectrometry-based proteomics [ 27 ]. This mass-spectrometry-based accessibility to proteomes has driven striking new applications such as targeted proteomics [ 28 ]. Proteomics requires extremely sophisticated computational techniques, examples of which are PeptideAtlas [ 29 ] and the Trans-Proteomic Pipeline [ 30 ].
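To illustrate why a reference sequence makes peptide masses predictable, the sketch below digests a protein sequence in silico and sums standard monoisotopic residue masses. It is a simplified illustration and not the method of the tools cited above: the cleavage rule (cut after K or R unless the next residue is P) is a common approximation, missed cleavages and chemical modifications are ignored, and the function names and the toy sequence are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056  # one water per peptide (free N- and C-termini)


def tryptic_digest(protein: str) -> list[str]:
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal remainder
    return peptides


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER_MASS


# Toy sequence, not a real protein.
for pep in tryptic_digest("MKRPAKG"):
    print(pep, round(peptide_mass(pep), 4))
```

Applied to every predicted protein, a table of this kind is what lets observed masses be matched back to candidate peptides.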

Third, our understanding of evolution has been transformed. Since the completion of the HGP, over 4,000 finished or quality draft genome sequences have been produced, mostly from bacterial species but including 183 eukaryotes [ 31 ]. These genomes provide insights into how diverse organisms from microbes to human are connected on the genealogical tree of life - clearly demonstrating that all of the species that exist today descended from a single ancestor [ 32 ]. Questions of longstanding interest with implications for biology and medicine have become approachable. Where do new genes come from? What might be the role of stretches of sequence highly conserved across all metazoa? How much large-scale gene organization is conserved across species and what drives local and global genome reorganization? Which regions of the genome appear to be resistant (or particularly susceptible) to mutation or highly susceptible to recombination? How do regulatory networks evolve and alter patterns of gene expression [ 33 ]? The latter question is of particular interest now that the genomes of several primates and hominids have been or are being sequenced [ 34 , 35 ] in hopes of shedding light on the evolution of distinctively human characteristics. The sequence of the Neanderthal genome [ 36 ] has had fascinating implications for human evolution; namely, that a few percent of Neanderthal DNA and hence the encoded genes are intermixed in the human genome, suggesting that there was some interbreeding while the two species were diverging [ 36 , 37 ].

Fourth, the HGP drove the development of sophisticated computational and mathematical approaches to data and brought computer scientists, mathematicians, engineers and theoretical physicists together with biologists, fostering a more cross-disciplinary culture [ 1 , 21 , 38 ]. It is important to note that the HGP popularized the idea of making data available to the public immediately in user-friendly databases such as GenBank [ 39 ] and the UCSC Genome Browser [ 40 ]. Moreover, the HGP also promoted the idea of open-source software, in which the source code of programs is made available to and can be edited by those interested in extending their reach and improving them [ 41 , 42 ]. The open-source operating system of Linux and the community it has spawned have shown the power of this approach. Data accessibility is a critical concept for the culture and success of biology in the future because the ‘democratization of data’ is critical for attracting available talent to focus on the challenging problems of biological systems with their inherent complexity [ 43 ]. This will be even more critical in medicine, as scientists need access to the data cloud available from each individual human to mine for the predictive medicine of the future - an effort that could transform the health of our children and grandchildren [ 44 ].

Fifth, the HGP, as conceived and implemented, was the first example of ‘big science’ in biology, and it clearly demonstrated both the power and the necessity of this approach for dealing with its integrated biological and technological aims. The HGP was characterized by a clear set of ambitious goals and plans for achieving them; a limited number of funded investigators typically organized around centers or consortia; a commitment to public data/resource release; and a need for significant funding to support project infrastructure and new technology development. Big science and smaller-scope individual-investigator-oriented science are powerfully complementary, in that the former generates resources that are foundational for all researchers while the latter adds detailed experimental clarification of specific questions, and analytical depth and detail to the data produced by big science. There are many levels of complexity in biology and medicine; big science projects are essential to tackle this complexity in a comprehensive and integrative manner [ 45 ].

The HGP benefited biology and medicine by creating a sequence of the human genome; sequencing model organisms; developing high-throughput sequencing technologies; and examining the ethical and social issues implicit in such technologies. It was able to take advantage of economies of scale and the coordinated effort of an international consortium with a limited number of players, which rendered the endeavor vastly more efficient than would have been possible if the genome were sequenced on a gene-by-gene basis in small labs. It is also worth noting that one aspect that attracted governmental support to the HGP was its potential for economic benefits. The Battelle Institute published a report on the economic impact of the HGP [ 46 ]. For an initial investment of approximately $3.5 billion, the return, according to the report, has been about $800 billion - a staggering return on investment.

Even today, as budgets tighten, there is a cry to withdraw support from big science and focus our resources on small science. This would be a drastic mistake. In the wake of the HGP there are further valuable biological resource-generating projects and analyses of biological complexity that require a big science approach, including the HapMap Project to catalogue human genetic variation [ 47 , 48 ], the ENCODE project, the Human Proteome Project (described below) and the European Commission’s Human Brain Project, as well as another brain-mapping project recently announced by President Obama [ 49 ]. As with the HGP, other big science projects now under consideration could yield significant returns on investment if they are done properly. Discretion must nevertheless be employed so that the big science projects chosen are fundamentally important. Funding agencies should clearly maintain a mixed portfolio of big and small science, because the two are synergistic [ 1 , 45 ].

Last, the HGP ignited the imaginations of unusually talented scientists - Jim Watson, Eric Lander, John Sulston, Bob Waterston and Sydney Brenner, to mention only a few. In the end, virtually every argument initially posed by the opponents of the HGP turned out to be wrong. The HGP is a wonderful example of a fundamental paradigm change in biology: initially fiercely resisted, it was ultimately far more transformational than expected by even the most optimistic of its proponents.

Impact of the Human Genome Project on medicine

Since the conclusion of the HGP, several big science projects specifically geared towards a better understanding of human genetic variation and its connection to human health have been initiated. These include the HapMap Project aimed at identifying haplotype blocks of common single nucleotide polymorphisms (SNPs) in different human populations [ 47 , 48 ], and its successor, the 1000 Genomes project, an ongoing endeavor to catalogue common and rare single nucleotide and structural variation in multiple populations [ 50 ]. Data produced by both projects have supported smaller-scale clinical genome-wide association studies (GWAS), which correlate specific genetic variants with disease risk of varying statistical significance based on case–control comparisons. Since 2005, over 1,350 GWAS have been published [ 51 ]. Although GWAS analyses give hints as to where in the genome to look for disease-causing variants, the results can be difficult to interpret because the actual disease-causing variant might be rare, the sample size of the study might be too small, or the disease phenotype might not be well stratified. Moreover, most of the GWAS hits are outside of coding regions - and we do not have effective methods for easily determining whether these hits reflect the mis-functioning of regulatory elements. The question as to what fraction of the thousands of GWAS hits are signal and what fraction are noise is a concern. Pedigree-based whole-genome sequencing offers a powerful alternative approach to identifying potential disease-causing variants [ 52 ].
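A case–control comparison of this kind can be illustrated with a simple allelic test on a 2 x 2 table of allele counts. The following is a minimal sketch with made-up counts, not the method of any study cited here; the function name and the numbers are hypothetical, and published GWAS typically use more elaborate models and correct for the very large number of variants tested.

```python
import math


def allelic_association(case_alt: int, case_ref: int, ctrl_alt: int, ctrl_ref: int):
    """Odds ratio, Pearson chi-square (1 degree of freedom, no continuity
    correction) and p-value for a 2 x 2 table of allele counts."""
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value


# Made-up counts for illustration only: variant allele seen 60 times in
# 100 case chromosomes and 40 times in 100 control chromosomes.
odds_ratio, chi2, p_value = allelic_association(60, 40, 40, 60)
print(f"odds ratio {odds_ratio:.2f}, chi-square {chi2:.1f}, p = {p_value:.4f}")
```

A small sample such as this one shows how an apparently strong odds ratio can still be weak evidence once thousands of variants are tested, which is one reason why sample size matters so much for interpreting GWAS hits.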

Five years ago, a mere handful of personal genomes had been fully sequenced (for example, [ 53 , 54 ]). Now there are thousands of exome and whole-genome sequences (soon to be tens of thousands, and eventually millions), which have been determined with the aim of identifying disease-causing variants and, more broadly, establishing well-founded correlations between sequence variation and specific phenotypes. For example, the International Cancer Genome Consortium [ 55 ] and The Cancer Genome Atlas [ 56 ] are undertaking large-scale genomic data collection and analyses for numerous cancer types (sequencing both the normal and cancer genome for each individual patient), with a commitment to making their resources available to the research community.

We predict that individual genome sequences will soon play a larger role in medical practice. In the ideal scenario, patients or consumers will use the information to improve their own healthcare by taking advantage of prevention or therapeutic strategies that are known to be appropriate for real or potential medical conditions suggested by their individual genome sequence. Physicians will need to educate themselves on how best to advise patients who bring consumer genetic data to their appointments, which may well be a common occurrence in a few years [ 57 ].

In fact, the application of systems approaches to disease has already begun to transform our understanding of human disease and the practice of healthcare and push us towards a medicine that is predictive, preventive, personalized and participatory: P4 medicine. A key assumption of P4 medicine is that in diseased tissues biological networks become perturbed - and change dynamically with the progression of the disease. Hence, knowing how the information encoded by disease-perturbed networks changes provides insights into disease mechanisms, new approaches to diagnosis and new strategies for therapeutics [ 58 , 59 ].

Let us provide some examples. First, pharmacogenomics has identified more than 70 genes for which specific variants cause humans to metabolize drugs ineffectively (too fast or too slow). Second, there are hundreds of ‘actionable gene variants’ - variants that cause disease but whose consequences can be avoided by available medical strategies with knowledge of their presence [ 60 ]. Third, in some cases, cancer-driving mutations in tumors, once identified, can be counteracted by treatments with currently available drugs [ 61 ]. And last, a systems approach to blood protein diagnostics has generated powerful new diagnostic panels for human diseases such as hepatitis [ 62 ] and lung cancer [ 63 ].

These latter examples portend a revolution in blood diagnostics that will lead to early detection of disease, the ability to follow disease progression and responses to treatment, and the ability to stratify a disease type (for instance, breast cancer) into its different subtypes so that each subtype can be matched with effective drugs [ 59 ]. We envision a time in the future when all patients will be surrounded by a virtual cloud of billions of data points, and when we will have the analytical tools to reduce this enormous data dimensionality to simple hypotheses to optimize wellness and minimize disease for each individual [ 58 ].

Impact of the Human Genome Project on society

The HGP challenged biologists to consider the social implications of their research. Indeed, it devoted 5% of its budget to considering the social, ethical and legal aspects of acquiring and understanding the human genome sequence [ 64 ]. That process continues as different societal issues arise, such as genetic privacy, potential discrimination, justice in apportioning the benefits from genomic sequencing, human subject protections, genetic determinism (or not), identity politics, and the philosophical concept of what it means to be human beings who are intrinsically connected to the natural world.

Strikingly, we have learned from the HGP that there are no race-specific genes in humans [ 65 – 68 ]. Rather, an individual’s genome reveals his or her ancestral lineage, which is a function of the migrations and interbreeding among population groups. We are one race and we honor our species’ heritage when we treat each other accordingly, and address issues of concern to us all, such as human rights, education, job opportunities, climate change and global health.

What is to come?

There remain fundamental challenges for fully understanding the human genome. For example, as yet at least 5% of the human genome has not been successfully sequenced or assembled for technical reasons that relate to euchromatic islands being embedded in heterochromatic repeats, copy number variations, and unusually high or low GC content [ 69 ]. The question of what information these regions contain is a fascinating one. In addition, there are highly conserved regions of the human genome whose functions have not yet been identified; presumably they are regulatory, but why they should be strongly conserved over half a billion years of evolution remains a mystery.

There will continue to be advances in genome analysis. Developing improved analytical techniques to identify biological information in genomes and decipher what this information relates to functionally and evolutionarily will be important. Developing the ability to rapidly analyze complete human genomes with regard to actionable gene variants is essential. It is also essential to develop software that can accurately fold genome-predicted proteins into three dimensions, so that their functions can be predicted from structural homologies. Likewise, it will be fascinating to determine whether we can make predictions about the structures of biological networks directly from the information of their cognate genomes. Indeed, the idea that we can decipher the ‘logic of life’ of an organism solely from its genome sequence is intriguing. While we have become relatively proficient at determining static and stable genome sequences, we are still learning how to measure and interpret the dynamic effects of the genome: gene expression and regulation, as well as the dynamics and functioning of non-coding RNAs, metabolites, proteins and other products of genetically encoded information.

The HGP, with its focus on developing the technology to enumerate a parts list, was critical for launching systems biology, with its concomitant focus on high-throughput ‘omics’ data generation and the idea of ‘big data’ in biology [ 21 , 38 ]. The practice of systems biology begins with a complete parts list of the information elements of living organisms (for example, genes, RNAs, proteins and metabolites). The goals of systems biology are comprehensive yet open ended because, as seen with the HGP, the field is experiencing an infusion of talented scientists applying multidisciplinary approaches to a variety of problems. A core feature of systems biology, as we see it, is to integrate many different types of biological information to create the ‘network of networks’ - recognizing that networks operate at the genomic, the molecular, the cellular, the organ, and the social network levels, and that these are integrated in the individual organism in a seamless manner [ 58 ]. Integrating these data allows the creation of models that are predictive and actionable for particular types of organisms and individual patients. These goals require developing new types of high-throughput omic technologies and ever increasingly powerful analytical tools.

The HGP infused a technological capacity into biology that has resulted in enormous increases in the range of research, for both big and small science. Experiments that were inconceivable 20 years ago are now routine, thanks to the proliferation of academic and commercial wet lab and bioinformatics resources geared towards facilitating research. In particular, rapid increases in throughput and accuracy of the massively parallel second-generation sequencing platforms with their correlated decreases in cost of sequencing have resulted in a great wealth of accessible genomic and transcriptional sequence data for myriad microbial, plant and animal genomes. These data in turn have enabled large- and small-scale functional studies that catalyze and enhance further research when the results are provided in publicly accessible databases [ 70 ].

One descendant of the HGP is the Human Proteome Project, which is beginning to gather momentum, although it is still poorly funded. This exciting endeavor has the potential to be enormously beneficial to biology [ 71 – 73 ]. The Human Proteome Project aims to create assays for all human and model organism proteins, including the myriad protein isoforms produced from the RNA splicing and editing of protein-coding genes, chemical modifications of mature proteins, and protein processing. The project also aims to pioneer technologies that will achieve several goals: enable single-cell proteomics; create microfluidic platforms for thousands of protein enzyme-linked immunosorbent assays (ELISAs) for rapid and quantitative analyses of, for example, a fraction of a droplet of blood; develop protein-capture agents that are small, stable, easy to produce and can be targeted to specific protein epitopes and hence avoid extensive cross-reactivity; and develop the software that will enable the ordinary biologist to analyze the massive amounts of proteomics data that are beginning to emerge from human and other organisms.

Newer generations of DNA sequencing platforms will be introduced that will transform how we gather genome information. Third-generation sequencing [ 74 ] will employ nanopores or nanochannels, utilize electronic signals, and sequence single DNA molecules for read lengths of 10,000 to 100,000 bases. Third-generation sequencing will solve many current problems with human genome sequences. First, contemporary short-read sequencing approaches make it impossible to assemble human genome sequences de novo ; hence, they are usually compared against a prototype reference sequence that is itself not fully accurate, especially with respect to variations other than SNPs. This makes it extremely difficult to precisely identify the insertion-deletion and structural variations in the human genome, both for our species as a whole and for any single individual. The long reads of third-generation sequencing will allow for the de novo assembly of human (and other) genomes, and hence delineate all of the individually unique variability: nucleotide substitutions, indels, and structural variations. Second, we do not have global techniques for identifying the 16 different chemical modifications of human DNA (epigenetic marks, reviewed in [ 75 ]). It is increasingly clear that these epigenetic modifications play important roles in gene expression [ 76 ]. Thus, single-molecule analyses should be able to identify all the epigenetic marks on DNA. Third, single-molecule sequencing will facilitate the full-length sequencing of RNAs; thus, for example, enhancing interpretation of the transcriptome by enabling the identification of RNA editing, alternative splice forms with a given transcript, and different start and termination sites. Last, it is exciting to contemplate that the ability to parallelize this process (for example, by generating millions of nanopores that can be used simultaneously) could enable the sequencing of a human genome in 15 minutes or less [ 77 ]. 
The high-throughput nature of this sequencing may eventually lead to human genome costs of $100 or under. The interesting question is how long it will take to make third-generation sequencing a mature technology.
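The 15-minute figure can be sanity-checked with simple arithmetic: the per-pore read rate needed is the total number of bases to be read divided by the number of pores and the time available. The sketch below is a rough estimate only; the function name is hypothetical, and the genome length, coverage and pore count are assumed round numbers, not values taken from the cited work.

```python
def required_rate_per_pore(genome_length: int, coverage: float,
                           num_pores: int, seconds: float) -> float:
    """Bases per second each pore must read, assuming all pores run
    continuously and in parallel with no overhead (an idealization)."""
    return genome_length * coverage / (num_pores * seconds)


rate = required_rate_per_pore(
    genome_length=3_000_000_000,  # assumed round figure for a human genome
    coverage=30,                  # assumed redundancy
    num_pores=1_000_000,          # assumed low end of 'millions of nanopores'
    seconds=15 * 60,              # the 15-minute target
)
print(f"{rate:.0f} bases per second per pore")
```

Under these assumptions each pore would need to sustain about 100 bases per second; halving the pore count or doubling the coverage doubles that requirement.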

The HGP has thus opened many avenues in biology, medicine, technology and computation that we are just beginning to explore.

Abbreviations

BAC: Bacterial artificial chromosome

DOE: Department of Energy

ELISA: Enzyme-linked immunosorbent assay

GWAS: Genome-wide association studies

HGP: Human Genome Project

NIH: National Institutes of Health

SNP: Single nucleotide polymorphism

UCSC: University of California, Santa Cruz.

References

Hood L: Acceptance remarks for Fritz J. and Delores H. Russ Prize. The Bridge. 2011, 41: 46-49.

Collins FS, McKusick VA: Implications of the Human Genome Project for medical science. JAMA. 2001, 285: 540-544. 10.1001/jama.285.5.540.

Green ED, Guyer MS, National Human Genome Research Institute: Charting a course for genomic medicine from base to bedside. Nature. 2011, 470: 204-213. 10.1038/nature09764.

Dulbecco R: A turning point in cancer research: sequencing the human genome. Science. 1984, 231: 1055-1056.

Sinsheimer RL: The Santa Cruz workshop - May 1985. Genomics. 1989, 5: 954-956. 10.1016/0888-7543(89)90142-0.

Cook-Deegan RM: The Gene Wars: Science, Politics and the Human Genome. 1994, New York: WW Norton

Report on the Human Genome Initiative for the Office of Health and Environmental Research. http://www.ornl.gov/sci/techresources/Human_Genome/project/herac2.shtml

National Academy of Science: Report of the Committee on Mapping and Sequencing the Human Genome. 1988, Washington DC: National Academy Press

Human Genome Sequencing Consortium: Finishing the euchromatic sequence of the human genome. Nature. 2004, 431: 931-945. 10.1038/nature03001.

Understanding Our Genetic Inheritance. The United States Human Genome Project, The First Five Years: Fiscal Years 1991–1995. http://www.genome.gov/10001477 ,

Collins FS, Galas D: A new five-year plan for the U.S. Human Genome Program. Science. 1993, 262: 43-46. 10.1126/science.8211127.

Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, Heiner C, Kent SBH, Hood LE: Fluorescence detection in automated DNA sequence analysis. Nature. 1986, 321: 674-679. 10.1038/321674a0.

Church G, Kieffer-Higgins S: Multiplex DNA sequencing. Science. 1988, 240: 185-188. 10.1126/science.3353714.

Strezoska Z, Paunesku T, Radosavljević D, Labat I, Drmanac R, Crkvenjakov R: DNA sequencing by hybridization: 100 bases read by a non-gel-based method. Proc Natl Acad Sci USA. 1991, 88: 10089-10093. 10.1073/pnas.88.22.10089.

Venter JC, Adams MD, Sutton GG, Kerlavage AR, Smith HO, Hunkapiller M: Shotgun sequencing of the human genome. Science. 1998, 280: 1540-1542. 10.1126/science.280.5369.1540.

International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.

Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Miklos GLG, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, et al: The sequence of the human genome. Science. 2001, 291: 1304-1351. 10.1126/science.1058040.

International Human Genome Sequencing Consortium. http://www.genome.gov/11006939 ,

Shendure J, Aiden ER: The expanding scope of DNA sequencing. Nat Biotechnol. 2012, 30: 1084-1094. 10.1038/nbt.2421.

Hood L: A personal journey of discovery: developing technology and changing biology. Annu Rev Anal Chem. 2008, 1: 1-43. 10.1146/annurev.anchem.1.031207.113113.

Committee on a New Biology for the 21st Century: A New Biology for the 21st Century. 2009, Washington DC: The National Academies Press

Ideker T, Galitski T, Hood L: A new approach to decoding life: systems biology. Annu Rev Genomics Hum Genet. 2001, 2: 343-372. 10.1146/annurev.genom.2.1.343.

Encyclopedia of DNA Elements. http://encodeproject.org/ENCODE/ ,

ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome. Nature. 2012, 489: 57-74. 10.1038/nature11247.

Editorial: Form and function. Nature. 2013, 495: 141-142.

ENCODE Project Consortium: A user’s guide to the Encyclopedia of DNA Elements (ENCODE). PLoS Biol. 2011, 9: e1001046-10.1371/journal.pbio.1001046.

Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature. 2003, 422: 198-207. 10.1038/nature01511.

Picotti P, Aebersold R: Selected reaction monitoring-based proteomics: workflows, potential, pitfalls and future directions. Nat Methods. 2012, 9: 555-566. 10.1038/nmeth.2015.

Desiere F, Deutsch EW, King NL, Nesvizhskii AI, Mallick P, Eng J, Chen S, Eddes J, Loevenich SN, Aebersold R: The PeptideAtlas Project. Nucleic Acids Res. 2006, 34: D655-D658. 10.1093/nar/gkj040.

Deutsch EW, Mendoza L, Shteynberg D, Farrah T, Lam H, Tasman N, Sun Z, Nilsson E, Pratt B, Prazen B, Eng JK, Martin DB, Nesvizhskii A, Aebersold R: A guided tour of the Trans-Proteomic Pipeline. Proteomics. 2010, 10: 1150-1159. 10.1002/pmic.200900375.

Genomes Online Database: complete genome projects. http://www.genomesonline.org/cgi-bin/GOLD/index.cgi?page_requested=Complete+Genome+Projects ,

Theobald DL: A formal test of the theory of universal common ancestry. Nature. 2010, 465: 219-222. 10.1038/nature09014.

Wolfe KH, Li W-H: Molecular evolution meets the genomics revolution. Nat Genet. 2003, Suppl 33: 255-265.

Marques-Bonet T, Ryder OA, Eichler EE: Sequencing primate genomes: what have we learned?. Annu Rev Genomics Hum Genet. 2009, 10: 355-386. 10.1146/annurev.genom.9.081307.164420.

Noonan JP: Neanderthal genomics and the evolution of modern humans. Genome Res. 2010, 20: 547-553. 10.1101/gr.076000.108.

Stoneking M, Krause J: Learning about human population history from ancient and modern genomes. Nat Rev Genet. 2011, 12: 603-614.

Sankararaman S, Patterson N, Li H, Paabo S, Reich D: The date of interbreeding between Neanderthals and Modern Humans. PLoS Genet. 2012, 8: e1002947-10.1371/journal.pgen.1002947.

Schatz MC: Computational thinking in the era of big data biology. Genome Biol. 2012, 13: 177-10.1186/gb-2012-13-11-177.

Mizrachi I: GenBank: the Nucleotide Sequence Database. The NCBI Handbook. Edited by: McEntyre J, Ostell J. 2002, Bethesda: National Center for Biotechnology Information

Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12: 996-1006.

SourceForge. http://sourceforge.net/ ,

Bioconductor: open source software for bioinformatics. http://www.bioconductor.org/ ,

Field D, Sansone S-A, Collina A, Booth T, Dukes P, Gregurick SK, Kennedy K, Kolar P, Kolker E, Maxon M, Millard S, Mugabushaka M, Perrin N, Remacle JE, Remington K, Rocca-Serra P, Taylor CF, Thorley M, Tiwari B, Wilbanks J: Omics data sharing. Science. 2009, 326: 234-236. 10.1126/science.1180598.

Knoppers BM, Harris JR, Tasse AM, Budin-Ljosne I, Kaye J, Deschenes M, Zawati M: Towards a data-sharing Code of Conduct for international genomic research. Genome Med. 2011, 3: 46-10.1186/gm262.

Hood L: Biological complexity under attack: a personal view of systems biology and the coming of “big science”. Genet Eng Biotechnol News. 2011, 31: 17-

Tripp S, Grueber M: Economic Impact of the Human Genome Project. 2011, Columbus: Battelle Memorial Institute

International HapMap Consortium: A haplotype map of the human genome. Nature. 2005, 437: 1299-1320. 10.1038/nature04226.

The International HapMap3 Consortium: Integrating common and rare genetic variation in diverse human populations. Nature. 2010, 467: 52-58. 10.1038/nature09298.

Abbott A: Neuroscience: solving the brain. Nature. 2013, 499: 272-274. 10.1038/499272a.

The 1000 Genomes Project Consortium: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012, 491: 56-65. 10.1038/nature11632.

A Catalog of Published Genome-wide Association Studies. http://www.genome.gov/gwastudies/ ,

Roach JC, Glusman G, Smit AF, Huff CD, Hubley R, Shannon PT, Rowen L, Pant KP, Goodman N, Bamshad M, Shendure J, Drmanac R, Jorde LB, Hood L, Galas DJ: Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science. 2010, 328: 636-639. 10.1126/science.1186802.

Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AW, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, et al: The diploid genome sequence of an individual human. PLoS Biol. 2007, 5: e254-10.1371/journal.pbio.0050254.

Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen Y-J, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song X, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM: The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008, 452: 872-876. 10.1038/nature06884.

International Cancer Genome Consortium. http://icgc.org/ ,

The Cancer Genome Atlas. http://cancergenome.nih.gov/ ,

Pandey A: Preparing for the 21st century patient. JAMA. 2013, 309: 1471-1472. 10.1001/jama.2012.116971.

Hood L, Flores M: A personal view on systems medicine and the emergence of proactive P4 medicine: predictive, preventive, personalized and participatory. N Biotechnol. 2012, 29: 613-624.

Price ND, Edelman LB, Lee I, Yoo H, Hwang D, Carlson G, Galas DJ, Heath JR, Hood L: Systems biology and the emergence of systems medicine. Genomic and Personalized Medicine: From Principles to Practice. Volume 1. Edited by: Ginsburg G, Willard H. 2009, Philadelphia: Elsevier, 131-141.

Green RC, Berg JS, Grody WW, Kalia SS, Korf BR, Martin CL, McGuire A, Nussbaum RL, O’Daniel JM, Ormond KE, Rehm HL, Watson MS, Williams MS, Biesecker LG: ACMG Recommendations for Reporting of Incidental Findings in Clinical Exome and Genome Sequencing. 2013, Bethesda: American College of Medical Genetics and Genomics

Meyerson M, Gabriel S, Getz G: Advances in understanding cancer genomes through second-generation sequencing. Nat Rev Genet. 2010, 11: 685-696. 10.1038/nrg2841.

Qin S, Zhou Y, Lok AS, Tsodikov A, Yan X, Gray L, Yuan M, Moritz RL, Galas D, Omenn GS, Hood L: SRM targeted proteomics in search for biomarkers of HCV-induced progression of fibrosis to cirrhosis in HALT-C patients. Proteomics. 2012, 12: 1244-1252. 10.1002/pmic.201100601.

Li X-J, Hayward C, Fong P-Y, Dominguez M, Hunsucker SW, Lee LW, McClean M, Law S, Butler H, Schirm M, Gingras O, Lamontague J, Allard R, Chelsky D, Price ND, Lam S, Massion PP, Pass H, Rom WN, Vachani A, Fang KC, Hood L, Kearney P: A blood-based proteomic classifier for the molecular characterization of pulmonary nodules. Sci Transl Med. in press

Knoppers BM, Thorogood A, Chadwick R: The Human Genome Organisation: towards next-generation ethics. Genome Med. 2013, 5: 38-10.1186/gm442.

Hood L: Who we are: the book of life. Commencement Address. Whitman College Magazine. 2002, 4-7.

Foster MW, Sharp RR: Beyond race: towards a whole-genome perspective on human populations and genetic variation. Nat Rev Genet. 2004, 5: 790-796. 10.1038/nrg1452.

Royal CDM, Dunston GM: Changing the paradigm from ‘race’ to human genetic variation. Nat Genet. 2004, 36: S5-S7. 10.1038/ng1454.

Witherspoon DJ, Wooding S, Rogers AR, Marchani EE, Watkins WS, Batzer MA, Jorde LB: Genetic similarities within and between populations. Genetics. 2007, 176: 351-359. 10.1534/genetics.106.067355.

Genovese G, Handsaker RE, Li H, Altemose N, Lindgren AM, Chambert K, Pasaniuk B, Price AL, Reich D, Morton CC, Pollak MR, Wilson JG, McCarroll SA: Using population admixture to help complete maps of the human genome. Nat Genet. 2013, 45: 406-414. 10.1038/ng.2565.

Fernandez-Suarez XM, Galperin MY: The 2013 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Res. 2013, 41: D1-D7.

Human Proteome Project. http://www.hupo.org/research/hpp/ ,

Hood LE, Omenn GS, Moritz RL, Aebersold R, Yamamoto KR, Amos M, Hunter-Cevera J, Locascio L, Workshop Participants: New and improved proteomics technologies for understanding complex biological systems: addressing a grand challenge in the life sciences. Proteomics. 2012, 12: 2773-2783. 10.1002/pmic.201270086.

Editorial: The call of the human proteome. Nat Methods. 2010, 7: 661-

Schadt E, Turner S, Kasarskis A: A window into third-generation sequencing. Hum Mol Genet. 2010, 19: R227-R240. 10.1093/hmg/ddq416.

Kim JK, Samaranayake M, Pradhan S: Epigenetic mechanisms in mammals. Cell Mol Life Sci. 2009, 66: 596-612. 10.1007/s00018-008-8432-4.

Hon G, Ren B, Wang W: ChromaSig: a probabilistic approach to finding common chromatin signatures in the human genome. PLoS Comput Biol. 2008, 4: e1000201-10.1371/journal.pcbi.1000201.

Hayden EC: Nanopore genome sequencer makes its debut. Nature News. 2012, 10.1038/nature.2012.10051.

Acknowledgements

The authors gratefully acknowledge support from the Luxembourg Centre for Systems Biomedicine and the University of Luxembourg; from the NIH, through award 2P50GM076547-06A; and the US Department of Defense (DOD), through award W911SR-09-C-0062. LH receives support from NIH P01 NS041997; 1U54CA151819-01; and DOD awards W911NF-10-2-0111 and W81XWH-09-1-0107.

Author information

Authors and affiliations.

Institute for Systems Biology, 401 Terry Ave N., Seattle, WA, 98109, USA

Leroy Hood & Lee Rowen

Corresponding authors

Correspondence to Leroy Hood or Lee Rowen .

Additional information

Competing interests.

The authors declare that they have no competing interests.

About this article

Cite this article.

Hood, L., Rowen, L. The Human Genome Project: big science transforms biology and medicine. Genome Med 5 , 79 (2013). https://doi.org/10.1186/gm483

Published : 13 September 2013

DOI : https://doi.org/10.1186/gm483

  • Human Genome Sequence
  • Human Brain Project
  • Small Science
  • Individual Genome Sequence

Genome Medicine

ISSN: 1756-994X



These latter examples portend a revolution in blood diagnostics that will lead to early detection of disease, the ability to follow disease progression and responses to treatment, and the ability to stratify a disease type (for instance, breast cancer) into its different subtypes so that each can be properly matched to effective drugs [ 59 ]. We envision a time in the future when all patients will be surrounded by a virtual cloud of billions of data points, and when we will have the analytical tools to reduce this enormous data dimensionality to simple hypotheses to optimize wellness and minimize disease for each individual [ 58 ].

Impact of the human genome project on society

The HGP challenged biologists to consider the social implications of their research. Indeed, it devoted 5% of its budget to considering the social, ethical and legal aspects of acquiring and understanding the human genome sequence [ 64 ]. That process continues as different societal issues arise, such as genetic privacy, potential discrimination, justice in apportioning the benefits from genomic sequencing, human subject protections, genetic determinism (or not), identity politics, and the philosophical concept of what it means to be human beings who are intrinsically connected to the natural world.

Strikingly, we have learned from the HGP that there are no race-specific genes in humans [ 65 - 68 ]. Rather, an individual’s genome reveals his or her ancestral lineage, which is a function of the migrations and interbreeding among population groups. We are one race and we honor our species’ heritage when we treat each other accordingly, and address issues of concern to us all, such as human rights, education, job opportunities, climate change and global health.

What is to come?

There remain fundamental challenges for fully understanding the human genome. For example, as yet at least 5% of the human genome has not been successfully sequenced or assembled for technical reasons that relate to euchromatic islands being embedded in heterochromatic repeats, copy number variations, and unusually high or low GC content [ 69 ]. The question of what information these regions contain is a fascinating one. In addition, there are highly conserved regions of the human genome whose functions have not yet been identified; presumably they are regulatory, but why they should be strongly conserved over half a billion years of evolution remains a mystery.

There will continue to be advances in genome analysis. Developing improved analytical techniques to identify biological information in genomes and decipher what this information relates to functionally and evolutionarily will be important. Developing the ability to rapidly analyze complete human genomes with regard to actionable gene variants is essential. It is also essential to develop software that can accurately fold genome-predicted proteins into three dimensions, so that their functions can be predicted from structural homologies. Likewise, it will be fascinating to determine whether we can make predictions about the structures of biological networks directly from the information of their cognate genomes. Indeed, the idea that we can decipher the ‘logic of life’ of an organism solely from its genome sequence is intriguing. While we have become relatively proficient at determining static and stable genome sequences, we are still learning how to measure and interpret the dynamic effects of the genome: gene expression and regulation, as well as the dynamics and functioning of non-coding RNAs, metabolites, proteins and other products of genetically encoded information.

The HGP, with its focus on developing the technology to enumerate a parts list, was critical for launching systems biology, with its concomitant focus on high-throughput ‘omics’ data generation and the idea of ‘big data’ in biology [ 21 , 38 ]. The practice of systems biology begins with a complete parts list of the information elements of living organisms (for example, genes, RNAs, proteins and metabolites). The goals of systems biology are comprehensive yet open-ended because, as seen with the HGP, the field is experiencing an infusion of talented scientists applying multidisciplinary approaches to a variety of problems. A core feature of systems biology, as we see it, is to integrate many different types of biological information to create the ‘network of networks’ - recognizing that networks operate at the genomic, the molecular, the cellular, the organ, and the social network levels, and that these are integrated in the individual organism in a seamless manner [ 58 ]. Integrating these data allows the creation of models that are predictive and actionable for particular types of organisms and individual patients. These goals require developing new types of high-throughput omic technologies and increasingly powerful analytical tools.

The HGP infused a technological capacity into biology that has resulted in enormous increases in the range of research, for both big and small science. Experiments that were inconceivable 20 years ago are now routine, thanks to the proliferation of academic and commercial wet lab and bioinformatics resources geared towards facilitating research. In particular, rapid increases in throughput and accuracy of the massively parallel second-generation sequencing platforms with their correlated decreases in cost of sequencing have resulted in a great wealth of accessible genomic and transcriptional sequence data for myriad microbial, plant and animal genomes. These data in turn have enabled large- and small-scale functional studies that catalyze and enhance further research when the results are provided in publicly accessible databases [ 70 ].

One descendant of the HGP is the Human Proteome Project, which is beginning to gather momentum, although it is still poorly funded. This exciting endeavor has the potential to be enormously beneficial to biology [ 71 - 73 ]. The Human Proteome Project aims to create assays for all human and model organism proteins, including the myriad protein isoforms produced from the RNA splicing and editing of protein-coding genes, chemical modifications of mature proteins, and protein processing. The project also aims to pioneer technologies that will achieve several goals: enable single-cell proteomics; create microfluidic platforms for thousands of protein enzyme-linked immunosorbent assays (ELISAs) for rapid and quantitative analyses of, for example, a fraction of a droplet of blood; develop protein-capture agents that are small, stable, easy to produce and can be targeted to specific protein epitopes and hence avoid extensive cross-reactivity; and develop the software that will enable the ordinary biologist to analyze the massive amounts of proteomics data that are beginning to emerge from human and other organisms.

Newer generations of DNA sequencing platforms will be introduced that will transform how we gather genome information. Third-generation sequencing [ 74 ] will employ nanopores or nanochannels, utilize electronic signals, and sequence single DNA molecules for read lengths of 10,000 to 100,000 bases. Third-generation sequencing will solve many current problems with human genome sequences. First, contemporary short-read sequencing approaches make it impossible to assemble human genome sequences de novo; hence, they are usually compared against a prototype reference sequence that is itself not fully accurate, especially with respect to variations other than SNPs. This makes it extremely difficult to precisely identify the insertion-deletion and structural variations in the human genome, both for our species as a whole and for any single individual. The long reads of third-generation sequencing will allow for the de novo assembly of human (and other) genomes, and hence delineate all of the individually unique variability: nucleotide substitutions, indels, and structural variations. Second, we do not have global techniques for identifying the 16 different chemical modifications of human DNA (epigenetic marks, reviewed in [ 75 ]). It is increasingly clear that these epigenetic modifications play important roles in gene expression [ 76 ], and single-molecule analyses should be able to identify all the epigenetic marks on DNA. Third, single-molecule sequencing will facilitate the full-length sequencing of RNAs, thus enhancing interpretation of the transcriptome by enabling the identification of RNA editing, alternative splice forms within a given transcript, and different start and termination sites. Last, it is exciting to contemplate that the ability to parallelize this process (for example, by generating millions of nanopores that can be used simultaneously) could enable the sequencing of a human genome in 15 minutes or less [ 77 ]. The high-throughput nature of this sequencing may eventually bring the cost of a human genome down to $100 or less. The interesting question is how long it will take to make third-generation sequencing a mature technology.
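The link between read length and de novo assembly can be illustrated with a toy calculation. The sketch below is not an assembler; it only counts what fraction of all possible reads of a given length occur exactly once in a sequence, since reads that fall entirely inside a repeated segment cannot be placed unambiguously. The 14-base sequence and the function name `unique_read_fraction` are invented for this example.

```python
from collections import Counter


def unique_read_fraction(genome, read_len):
    """Fraction of all error-free reads of length read_len that occur exactly once."""
    reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(reads)
    return sum(1 for read in reads if counts[read] == 1) / len(reads)


# Toy sequence: the 4-base repeat "CATG" appears twice, separated by unique bases.
toy_genome = "AA" + "CATG" + "TT" + "CATG" + "GG"

short_fraction = unique_read_fraction(toy_genome, 3)  # reads shorter than the repeat
long_fraction = unique_read_fraction(toy_genome, 6)   # reads longer than the repeat
print(short_fraction, long_fraction)
```

With 3-base reads, the reads "CAT" and "ATG" each occur twice, so only 8 of the 12 reads are unique. With 6-base reads, every read extends beyond the repeat into unique flanking sequence, and all 9 reads are unique. Real genomes contain repeats thousands of bases long, which is why read lengths of 10,000 bases or more matter.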

The HGP has thus opened many avenues in biology, medicine, technology and computation that we are just beginning to explore.

Abbreviations

BAC: Bacterial artificial chromosome; DOE: Department of Energy; ELISA: Enzyme-linked immunosorbent assay; GWAS: Genome-wide association studies; HGP: Human Genome Project; NIH: National Institutes of Health; SNP: Single nucleotide polymorphism; UCSC: University of California, Santa Cruz.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

The authors gratefully acknowledge support from the Luxembourg Centre for Systems Biomedicine and the University of Luxembourg; from the NIH, through award 2P50GM076547-06A; and the US Department of Defense (DOD), through award W911SR-09-C-0062. LH receives support from NIH P01 NS041997; 1U54CA151819-01; and DOD awards W911NF-10-2-0111 and W81XWH-09-1-0107.

References

  • Hood L. Acceptance remarks for Fritz J. and Delores H. Russ Prize. The Bridge. 2011; 5 :46–49.
  • Collins FS, McKusick VA. Implications of the Human Genome Project for medical science. JAMA. 2001; 5 :540–544. doi: 10.1001/jama.285.5.540.
  • Green ED, Guyer MS. National Human Genome Research Institute. Charting a course for genomic medicine from base to bedside. Nature. 2011; 5 :204–213. doi: 10.1038/nature09764.
  • Dulbecco R. A turning point in cancer research: sequencing the human genome. Science. 1984; 5 :1055–1056.
  • Sinsheimer RL. The Santa Cruz workshop - May 1985. Genomics. 1989; 5 :954–956. doi: 10.1016/0888-7543(89)90142-0.
  • Cook-Deegan RM. The Gene Wars: Science, Politics and the Human Genome. New York: WW Norton; 1994.
  • Report on the Human Genome Initiative for the Office of Health and Environmental Research. http://www.ornl.gov/sci/techresources/Human_Genome/project/herac2.shtml
  • National Academy of Science. Report of the Committee on Mapping and Sequencing the Human Genome. Washington DC: National Academy Press; 1988.
  • Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004; 5 :931–945. doi: 10.1038/nature03001.
  • Understanding Our Genetic Inheritance. The United States Human Genome Project, The First Five Years: Fiscal Years 1991–1995. http://www.genome.gov/10001477
  • Collins FS, Galas D. A new five-year plan for the U.S. Human Genome Program. Science. 1993; 5 :43–46. doi: 10.1126/science.8211127.
  • Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, Heiner C, Kent SBH, Hood LE. Fluorescence detection in automated DNA sequence analysis. Nature. 1986; 5 :674–679. doi: 10.1038/321674a0.
  • Church G, Kieffer-Higgins S. Multiplex DNA sequencing. Science. 1988; 5 :185–188. doi: 10.1126/science.3353714.
  • Strezoska Z, Paunesku T, Radosavljević D, Labat I, Drmanac R, Crkvenjakov R. DNA sequencing by hybridization: 100 bases read by a non-gel-based method. Proc Natl Acad Sci USA. 1991; 5 :10089–10093. doi: 10.1073/pnas.88.22.10089.
  • Venter JC, Adams MD, Sutton GG, Kerlavage AR, Smith HO, Hunkapiller M. Shotgun sequencing of the human genome. Science. 1998; 5 :1540–1542. doi: 10.1126/science.280.5369.1540.
  • International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001; 5 :860–921. doi: 10.1038/35057062.
  • Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Miklos GLG, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N. et al. The sequence of the human genome. Science. 2001; 5 :1304–1351. doi: 10.1126/science.1058040.
  • International Human Genome Sequencing Consortium. http://www.genome.gov/11006939
  • Shendure J, Aiden ER. The expanding scope of DNA sequencing. Nat Biotechnol. 2012; 5 :1084–1094. doi: 10.1038/nbt.2421.
  • Hood L. A personal journey of discovery: developing technology and changing biology. Annu Rev Anal Chem. 2008; 5 :1–43. doi: 10.1146/annurev.anchem.1.031207.113113.
  • Committee on a New Biology for the 21st Century. A New Biology for the 21st Century. Washington DC: The National Academies Press; 2009.
  • Ideker T, Galitski T, Hood L. A new approach to decoding life: systems biology. Annu Rev Genomics Hum Genet. 2001; 5 :343–372. doi: 10.1146/annurev.genom.2.1.343.
  • Encyclopedia of DNA Elements. http://encodeproject.org/ENCODE/
  • ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 5 :57–74. doi: 10.1038/nature11247.
  • Editorial. Form and function. Nature. 2013; 5 :141–142.
  • ENCODE Project Consortium. A user’s guide to the Encyclopedia of DNA Elements (ENCODE). PLoS Biol. 2011; 5 :e1001046. doi: 10.1371/journal.pbio.1001046.
  • Aebersold R, Mann M. Mass spectrometry-based proteomics. Nature. 2003; 5 :198–207. doi: 10.1038/nature01511.
  • Picotti P, Aebersold R. Selected reaction monitoring-based proteomics: workflows, potential, pitfalls and future directions. Nat Methods. 2012; 5 :555–566. doi: 10.1038/nmeth.2015.
  • Desiere F, Deutsch EW, King NL, Nesvizhskii AI, Mallick P, Eng J, Chen S, Eddes J, Loevenich SN, Aebersold R. The PeptideAtlas Project. Nucleic Acids Res. 2006; 5 :D655–D658. doi: 10.1093/nar/gkj040.
  • Deutsch EW, Mendoza L, Shteynberg D, Farrah T, Lam H, Tasman N, Sun Z, Nilsson E, Pratt B, Prazen B, Eng JK, Martin DB, Nesvizhskii A, Aebersold R. A guided tour of the Trans-Proteomic Pipeline. Proteomics. 2010; 5 :1150–1159. doi: 10.1002/pmic.200900375.
  • Genomes Online Database: complete genome projects. http://www.genomesonline.org/cgi-bin/GOLD/index.cgi?page_requested=Complete+Genome+Projects
  • Theobald DL. A formal test of the theory of universal common ancestry. Nature. 2010; 5 :219–222. doi: 10.1038/nature09014.
  • Wolfe KH, Li W-H. Molecular evolution meets the genomics revolution. Nat Genet. 2003; 5 :255–265.
  • Marques-Bonet T, Ryder OA, Eichler EE. Sequencing primate genomes: what have we learned? Annu Rev Genomics Hum Genet. 2009; 5 :355–386. doi: 10.1146/annurev.genom.9.081307.164420.
  • Noonan JP. Neanderthal genomics and the evolution of modern humans. Genome Res. 2010; 5 :547–553. doi: 10.1101/gr.076000.108.
  • Stoneking M, Krause J. Learning about human population history from ancient and modern genomes. Nat Rev Genet. 2011; 5 :603–614.
  • Sankararaman S, Patterson N, Li H, Paabo S, Reich D. The date of interbreeding between Neanderthals and Modern Humans. PLoS Genet. 2012; 5 :e1002947. doi: 10.1371/journal.pgen.1002947.
  • Schatz MC. Computational thinking in the era of big data biology. Genome Biol. 2012; 5 :177. doi: 10.1186/gb-2012-13-11-177.
  • Mizrachi I. In: The NCBI Handbook. McEntyre J, Ostell J, editors. Bethesda: National Center for Biotechnology Information; 2002. GenBank: the Nucleotide Sequence Database.
  • Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res. 2002; 5 :996–1006.
  • SourceForge. http://sourceforge.net/
  • Bioconductor: open source software for bioinformatics. http://www.bioconductor.org/
  • Field D, Sansone S-A, Collina A, Booth T, Dukes P, Gregurick SK, Kennedy K, Kolar P, Kolker E, Maxon M, Millard S, Mugabushaka M, Perrin N, Remacle JE, Remington K, Rocca-Serra P, Taylor CF, Thorley M, Tiwari B, Wilbanks J. Omics data sharing. Science. 2009; 5 :234–236. doi: 10.1126/science.1180598.
  • Knoppers BM, Harris JR, Tasse AM, Budin-Ljosne I, Kaye J, Deschenes M, Zawati M. Towards a data-sharing Code of Conduct for international genomic research. Genome Med. 2011; 5 :46. doi: 10.1186/gm262.
  • Hood L. Biological complexity under attack: a personal view of systems biology and the coming of “big science”. Genet Eng Biotechnol News. 2011; 5 :17.
  • Tripp S, Grueber M. Economic Impact of the Human Genome Project. Columbus: Battelle Memorial Institute; 2011.
  • International HapMap Consortium. A haplotype map of the human genome. Nature. 2005; 5 :1299–1320. doi: 10.1038/nature04226.
  • The International HapMap3 Consortium. Integrating common and rare genetic variation in diverse human populations. Nature. 2010; 5 :52–58. doi: 10.1038/nature09298.
  • Abbott A. Neuroscience: solving the brain. Nature. 2013; 5 :272–274. doi: 10.1038/499272a.
  • The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012; 5 :56–65. doi: 10.1038/nature11632.
  • A Catalog of Published Genome-wide Association Studies. http://www.genome.gov/gwastudies/
  • Roach JC, Glusman G, Smit AF, Huff CD, Hubley R, Shannon PT, Rowen L, Pant KP, Goodman N, Bamshad M, Shendure J, Drmanac R, Jorde LB, Hood L, Galas DJ. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science. 2010; 5 :636–639. doi: 10.1126/science.1186802.
  • Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AW, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL. et al. The diploid genome sequence of an individual human. PLoS Biol. 2007; 5 :e254. doi: 10.1371/journal.pbio.0050254.
  • Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen Y-J, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song X, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM. The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008; 5 :872–876. doi: 10.1038/nature06884.
  • International Cancer Genome Consortium. http://icgc.org/
  • The Cancer Genome Atlas. http://cancergenome.nih.gov/
  • Pandey A. Preparing for the 21st century patient. JAMA. 2013; 5 :1471–1472. doi: 10.1001/jama.2012.116971.
  • Hood L, Flores M. A personal view on systems medicine and the emergence of proactive P4 medicine: predictive, preventive, personalized and participatory. Nat Biotechnol. 2012; 5 :613–624.
  • Price ND, Edelman LB, Lee I, Yoo H, Hwang D, Carlson G, Galas DJ, Heath JR, Hood L. In: Genomic and Personalized Medicine: From Principles to Practice. Volume 1. Ginsburg G, Willard H, editors. Philadelphia: Elsevier; 2009. Systems biology and the emergence of systems medicine; pp. 131–141.
  • Green RC, Berg JS, Grody WW, Kalia SS, Korf BR, Martin CL, McGuire A, Nussbaum RL, O’Daniel JM, Ormond KE, Rehm HL, Watson MS, Williams MS, Biesecker LG. ACMG Recommendations for Reporting of Incidental Findings in Clinical Exome and Genome Sequencing. Bethesda: American College of Medical Genetics and Genomics; 2013.
  • Meyerson M, Gabriel S, Getz G. Advances in understanding cancer genomes through second-generation sequencing. Nat Rev Genet. 2010; 5 :685–696. doi: 10.1038/nrg2841.
  • Qin S, Zhou Y, Lok AS, Tsodikov A, Yan X, Gray L, Yuan M, Moritz RL, Galas D, Omenn GS, Hood L. SRM targeted proteomics in search for biomarkers of HCV-induced progression of fibrosis to cirrhosis in HALT-C patients. Proteomics. 2012; 5 :1244–1252. doi: 10.1002/pmic.201100601.
  • Li X-J, Hayward C, Fong P-Y, Dominguez M, Hunsucker SW, Lee LW, McClean M, Law S, Butler H, Schirm M, Gingras O, Lamontague J, Allard R, Chelsky D, Price ND, Lam S, Massion PP, Pass H, Rom WN, Vachani A, Fang KC, Hood L, Kearney P. A blood-based proteomic classifier for the molecular characterization of pulmonary nodules. Sci Transl Med. in press.
  • Knoppers BM, Thorogood A, Chadwick R. The Human Genome Organisation: towards next-generation ethics. Genome Med. 2013; 5 :38. doi: 10.1186/gm442.
  • Hood L. Who we are: the book of life. Commencement Address. Whitman College Magazine. 2002. pp. 4–7.
  • Foster MW, Sharp RR. Beyond race: towards a whole-genome perspective on human populations and genetic variation. Nat Rev Genet. 2004; 5 :790–796. doi: 10.1038/nrg1452.
  • Royal CDM, Dunston GM. Changing the paradigm from ‘race’ to human genetic variation. Nat Genet. 2004; 5 :S5–S7. doi: 10.1038/ng1454.
  • Witherspoon DJ, Wooding S, Rogers AR, Marchani EE, Watkins WS, Batzer MA, Jorde LB. Genetic similarities within and between populations. Genetics. 2007; 5 :351–359. doi: 10.1534/genetics.106.067355.
  • Genovese G, Handsaker RE, Li H, Altemose N, Lindgren AM, Chambert K, Pasaniuk B, Price AL, Reich D, Morton CC, Pollak MR, Wilson JG, McCarroll SA. Using population admixture to help complete maps of the human genome. Nat Genet. 2013; 5 :406–414. doi: 10.1038/ng.2565.
  • Fernandez-Suarez XM, Galperin MY. The 2013 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Res. 2013; 5 :D1–D7.
  • Human Proteome Project. http://www.hupo.org/research/hpp/
  • Hood LE, Omenn GS, Moritz RL, Aebersold R, Yamamoto KR, Amos M, Hunter-Cevera J, Locascio L. Workshop Participants. New and improved proteomics technologies for understanding complex biological systems: addressing a grand challenge in the life sciences. Proteomics. 2012; 5 :2773–2783. doi: 10.1002/pmic.201270086.
  • Editorial. The call of the human proteome. Nat Methods. 2010; 5 :661.
  • Schadt E, Turner S, Kasarskis A. A window into third-generation sequencing. Hum Mol Genet. 2010; 5 :R227–R240. doi: 10.1093/hmg/ddq416.
  • Kim JK, Samaranayake M, Pradhan S. Epigenetic mechanisms in mammals. Cell Mol Life Sci. 2009; 5 :596–612. doi: 10.1007/s00018-008-8432-4.
  • Hon G, Ren B, Wang W. ChromaSig: a probabilistic approach to finding common chromatin signatures in the human genome. PLoS Comput Biol. 2008; 5 :e1000201. doi: 10.1371/journal.pcbi.1000201.
  • Hayden EC. Nanopore genome sequencer makes its debut. Nature News. 2012.

The Human Genome Project

The Human Genome Project (HGP) is one of the greatest scientific feats in history. The project was a voyage of biological discovery led by an international group of researchers looking to comprehensively study all of the DNA (known as a genome) of a select set of organisms. Launched in October 1990 and completed in April 2003, the Human Genome Project’s signature accomplishment – generating the first sequence of the human genome – provided fundamental information about the human blueprint, which has since accelerated the study of human biology and improved the practice of medicine.

Learn more about the Human Genome Project through the following resources:

  • A virtual exhibit exploring the 1990 letter-writing campaign to oppose the HGP.
  • A virtual discussion with the leaders of the five genome-sequencing centers that provides the untold story of how they got the HGP across the finish line in 2003.
  • A fact sheet detailing how the project began and how it shaped the future of research and technology.
  • An interactive timeline listing key moments from the history of the project.
  • A downloadable poster containing major scientific landmarks before and throughout the project.
  • Reflections from prominent scientists involved in the project on the lessons learned.
  • Commentary in the journal Nature written by NHGRI leaders discussing the legacies of the project.
  • Lecture-oriented slides telling the story of the project by a front-line participant.

Last updated: May 14, 2024

Economic Benefits

In 2011, Battelle and the Life Technologies Foundation issued a report titled “Economic Impact of the Human Genome Project.” This was the first major report of this type. It offered five conclusions on the impact of the HGP. An update was issued in 2013 with more recent economic figures.

Conclusions on the HGP’s Impact

  • The economic and functional impacts generated by the sequencing of the human genome are already large and widespread. Between 1988 and 2010 the human genome sequencing projects, associated research and industry activity—directly and indirectly—generated an economic (output) impact of $796 billion, personal income exceeding $244 billion, and 3.8 million job-years of employment. In the 2013 update, these numbers increased to economic (output) impact of $965 billion, personal income exceeding $293 billion, and 4.3 million job-years of employment.
  • The federal government invested $3.8 billion in the HGP through its completion in 2003 ($5.6 billion in 2010 dollars). This investment was foundational in generating the economic output of $796 billion above, and thus shows a return on investment (ROI) to the U.S. economy of 141 to 1: every $1 of federal HGP investment has contributed to the generation of $141 in the economy.
  • In 2010 alone, the genomics-enabled industry generated over $3.7 billion in federal taxes and $2.3 billion in U.S. state and local taxes. Thus in one year, revenues returned to government nearly equaled the entire 13-year investment in the HGP.
  • Overall, however, the impacts of the human genome sequencing are just beginning—large scale benefits in human medicine, agriculture, energy, and environment are still in their early stages. The best is truly yet to come.
  • The HGP is arguably the single most influential investment to have been made in modern science and a foundation for progress in the biological sciences moving forward.
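The headline ratios in the list above can be re-derived from the rounded figures it quotes. The short sketch below does only that arithmetic; the variable names are illustrative, and because the inputs are rounded the ratio comes out near 142 rather than the report's stated 141, which was presumably computed from unrounded values.

```python
# Figures quoted in the report summary above, in billions of US dollars.
output_1988_2010 = 796.0        # economic output impact, 1988-2010
investment_2010_dollars = 5.6   # federal HGP investment, expressed in 2010 dollars
federal_taxes_2010 = 3.7        # federal taxes generated in 2010
state_local_taxes_2010 = 2.3    # state and local taxes generated in 2010

# Return on investment: economic output per dollar of federal investment.
roi = output_1988_2010 / investment_2010_dollars

# Total tax revenue returned to government in 2010 alone.
taxes_2010 = federal_taxes_2010 + state_local_taxes_2010

print(f"Output per dollar invested: roughly {roi:.0f} to 1")
print(f"2010 tax revenue: ${taxes_2010:.1f} billion; investment: ${investment_2010_dollars} billion")
```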


The study for the report was done using an ‘input–output’ economic model, which was questioned in a related Nature News item (May 11, 2011).


The structure of forward and backward linkage impacts associated with human genome sequencing.

  • Open access
  • Published: 13 May 2024

A novel approach to exploring the dark genome and its application to mapping of the vertebrate virus fossil record

  • Daniel Blanco-Melo 1,2,
  • Matthew A. Campbell 3,
  • Henan Zhu 4,
  • Tristan P. W. Dennis 4,
  • Sejal Modha 4,
  • Spyros Lytras 4,
  • Joseph Hughes 4,
  • Anna Gatseva 4 &
  • Robert J. Gifford (ORCID: orcid.org/0000-0003-4028-9884) 4,5

Genome Biology volume 25, Article number: 120 (2024)


Genomic regions that remain poorly understood, often referred to as the dark genome, contain a variety of functionally relevant and biologically informative features. These include endogenous viral elements (EVEs)—virus-derived sequences that can dramatically impact host biology and serve as a virus fossil record. In this study, we introduce a database-integrated genome screening (DIGS) approach to investigate the dark genome in silico, focusing on EVEs found within vertebrate genomes.

Using DIGS on 874 vertebrate genomes, we uncover approximately 1.1 million EVE sequences, with over 99% originating from endogenous retroviruses or transposable elements that contain EVE DNA. We show that the remaining 6038 sequences represent over a thousand distinct horizontal gene transfer events across 10 virus families, including some that have not previously been reported as EVEs. We explore the genomic and phylogenetic characteristics of non-retroviral EVEs and determine their rates of acquisition during vertebrate evolution. Our study uncovers novel virus diversity, broadens knowledge of virus distribution among vertebrate hosts, and provides new insights into the ecology and evolution of vertebrate viruses.

Conclusions

We comprehensively catalog and analyze EVEs within 874 vertebrate genomes, shedding light on the distribution, diversity, and long-term evolution of viruses and reveal their extensive impact on vertebrate genome evolution. Our results demonstrate the power of linking a relational database management system to a similarity search-based screening pipeline for in silico exploration of the dark genome.

Introduction

The availability of whole genome sequence (WGS) data from a broad range of species provides unprecedented scope for comparative genomic investigations [ 1 , 2 , 3 ]. However, these investigations rely to a large extent on annotation (the process of identifying and labeling genome features), which usually lags far behind the generation of sequence data. Consequently, most whole genome sequences are comprised of DNA that is incompletely understood in terms of its evolutionary origins and functional significance. The portion of sequenced genome space that lacks annotations is sometimes referred to as the “dark genome” [ 4 ] and contains a wide variety of yet-to-be-characterized genome features. Some of these may have functional roles, such as encoding proteins [ 5 ] or regulating gene expression [ 6 ]. Others, such as non-expressed pseudogenes, may not, but can nonetheless provide valuable insights into genome biology and evolution.

Within the dark genome, endogenous viral elements (EVEs) constitute a particularly intriguing group of genome features. EVEs are virus-derived DNA sequences that become integrated into the germline genome of host species and are stably inherited as host alleles—a form of horizontal gene transfer [ 7 , 8 , 9 , 10 , 11 , 12 , 13 , 14 ]. While once considered genetic “junk”, it has become evident over recent years that EVEs can profoundly impact host biology and genome evolution, with many now known to have physiologically relevant roles [ 15 , 16 , 17 , 18 , 19 ]. In addition, EVE sequences (whether functional or not) provide a rare source of retrospective information about ancient viruses, akin to a viral “fossil record” [ 7 , 20 , 21 , 22 ].

Identifying genome features contained within the dark genome, such as EVEs, often relies on the use of sequence similarity searches, such as those implemented in the Basic Local Alignment Search Tool (BLAST) [ 23 , 24 ], to search WGS databases. Because sequence similarity reflects homology (evolutionary relatedness), novel genome features can often be identified based on their resemblance to ones that have been described previously. One example of this approach is implemented in the PSI-BLAST [ 5 ] and HMMER [ 8 ] programs, in which iterated search strategies are used to progressively increase sensitivity so that novel homologs of previously characterized genes may be detected. A related approach is “systematic in silico genome screening”, which extends the basic concept of a similarity search in two ways: (i) inclusion of multiple query sequences and/or target databases and (ii) similarity-based classification of matching sequences (“hits”) via comparison to a reference sequence library (Fig.  1 a). Hits may also be further investigated using additional comparative or experimental approaches (Fig.  1 b, Table  1 ). Thus, screening can provide one component of a broader analytical pipeline.

figure 1

Exploring the dark genome using in silico screening. a Overview of sequence similarity search-based screening. Screening aims to identify and classify sequences similar to a set of query sequences within a target database (TDb) comprising whole genome sequence assemblies of multiple species. The schematic shows the steps that comprise a single round of screening, as follows: (i) a BLAST search is performed using a probe sequence selected from a curated “reference sequence library” (RSL) and a “target” file is selected from the TDb; (ii) matching sequences (referred to as “hits”) identified in this screen are classified via similarity search-based comparison to the RSL; and (iii) a non-redundant set of classified hits is compiled, incorporating hits from previous rounds of screening. b Comparative analysis of screen output. Sequences recovered via screening can be investigated using a wide range of comparative approaches, as follows: (i) analysis of feature distribution—e.g., annotating host phylogeny to show frequency of occurrence (colored circles); (ii) phylogenetic screening, in which sequences obtained via similarity search-based screening are investigated in phylogenetic reconstructions (e.g., to identify novel lineages not present in the RSL, as shown here); (iii) pairwise sequence comparisons—these can be used to identify differences in sequences obtained via screening, relative to reference sequences; and (iv) comparative phylogenetic analysis—the genetic properties of novel homologs can be inferred via comparative analysis (e.g., pairwise comparisons), while their phenotypic properties can potentially be investigated experimentally (e.g., via transcriptome sequencing)

While straightforward in principle, in silico genome screening is computationally expensive and can be difficult to implement efficiently. Moreover, large-scale screens can produce copious output data that are difficult to manage and interpret without an appropriate analytical framework. To address these issues, we developed a database-oriented approach to in silico screening, called database-integrated genome screening (DIGS). To demonstrate the use of this approach, we first created an open software framework for performing it, then used this framework to search published vertebrate genomes for EVE loci. Besides demonstrating that DIGS provides a powerful, flexible approach for exploring the dark genome, our analysis provides a comprehensive and detailed overview of EVE diversity in vertebrate genomes and reveals new information about the long-term evolutionary relationships between viruses and vertebrate hosts.

A database-integrated approach to exploring the dark genome

We developed a robust, database-integrated approach to systematic in silico genome screening, referred to as database-integrated genome screening (DIGS). This approach integrates a similarity search-based screening pipeline with a relational database management system (RDBMS) to enable efficient exploration of the dark genome. The rationale for this integration is twofold: it not only provides a solid foundation for conducting large-scale, automated screens in an efficient and non-redundant manner but also allows for the structured querying of screening output using SQL, a powerful and well-established tool for database interrogation [ 41 ]. Additionally, an RDBMS offers advantages such as data recoverability, multi-user support, and networked data access.

The DIGS process comprises three key input data components:

Target database (TDb): A collection of whole genome sequence assemblies (or other large sequence datasets such as transcriptomes) that will serve as the target for sequence similarity searches.

Query sequences (Probes): A set of sequences to be used as input for similarity searches of the TDb.

Reference sequence library (RSL): The RSL represents the broad range of genetic diversity associated with the genome feature(s) under investigation. Its composition varies according to the analysis context (see Table  1 ). It should always include sequences representing diversity within the genome feature under investigation. It may also include genetic marker sequences and potentially cross-matching genome features. Probes are typically a subset of sequences contained in the RSL.

As illustrated in Fig.  2 , the DIGS process involves systematic searching of a user-defined TDb with user-defined probes, merging fragmented hits, and classifying merged sequences through BLAST-based comparison to the RSL. The output—a set of non-redundant, defragmented hits—is captured in a project-specific relational database. Importantly, this integration allows database queries to be employed in real time, with SQL queries referencing any information captured by the database schema. SQL-based querying of screening databases facilitates the identification of loci of interest, which can then be explored further using comparative approaches (see Fig.  1 b).

figure 2

The database-integrated genome screening (DIGS) process as implemented in the DIGS tool. (i) Screening. a On initiation of screening, a list of searches, consisting of each query sequence versus each target database (TDb) file, is compiled based on the probe and TDb paths supplied to the DIGS program. Subsequently, screening proceeds systematically as follows: b the status table of the project-associated screening database is queried to determine which searches have yet to be performed. If there are no outstanding searches then screening ends; otherwise, the next outstanding search of the TDb is performed using the selected probe and the appropriate BLAST+ program, and results are recorded in the data processing table (“active set”); c results in the processing table are compared to those (if any) obtained previously to derive a non-redundant set of non-overlapping loci, and an updated set of non-redundant hits is created, with each hit being represented by a single results table row. To create this non-redundant set, hits that overlap, or occur within a given range of one another, are merged to create a single entry. d Nucleotide sequences associated with results table rows are extracted from TDb files and stored in the results table; e extracted sequences are classified via BLAST-based comparison to the RSL using the appropriate BLAST program. f The header-encoded details of the best-matching sequence (species name, gene name) are recorded in the results table. g The status table is updated to create a record of the search having been performed, and the next round of screening is initiated. (ii) Reclassification: hits in the results table can be reclassified following an update to the reference sequence library

It is important to note that screening is usually an iterative discovery process, wherein initial results inform the development of subsequent screens. For instance, novel diversity detected by an initial screen can subsequently be incorporated into the RSL and hits within the screening database can be reclassified using the updated library (Fig.  2 ). Additionally, probe sets used in initial searches can be expanded to incorporate sequences identified during screening, broadening the range of sequences detected in subsequent screens [ 42 ]. However, care must be taken when using this approach, since it can potentially produce misleading results, or generate excessive hits (e.g., if highly repetitive sequences are contained within the new probes). Database integration allows screening results to be observed and interrogated in real time—as they are being generated. This means that configuration issues (e.g., badly composed RSL, inappropriate choice of probes) can be detected early on—potentially saving a significant amount of time and effort. Furthermore, it facilitates the implementation of agile, heuristic screening strategies, in which approaches are adjusted in line with results.

An open software framework for implementing DIGS

We constructed a software framework for implementing DIGS, called “the DIGS tool”. The DIGS tool is implemented in the Perl scripting language. It uses the BLAST+ program suite [ 24 ] to perform similarity searches and the MySQL RDBMS to capture their output. Accessible through a text-based console interface, it simplifies the complex process of large-scale genome screening and provides a versatile basis for implementing screens.

To initiate screening using the DIGS tool, researchers provide a project-specific command file (Additional file 1 : Fig. S1) that serves as the blueprint for the screening process. This command file specifies parameters for BLAST searches, the user-defined name of the screening database, and file paths to the TDb, RSL, and probe sequences. When a screen is initiated, a project-specific database is created. This core schema (Additional file 2 : Fig. S2) can subsequently be extended to include any relevant “side data”—e.g., taxonomic information related to the species and sequences included in the screen—increasing the power of SQL queries to reveal informative patterns (Additional file 3 : Fig. S3).

Systematic screening proceeds automatically until all searches have been completed. If the process is interrupted at any point, or if novel probe/target sequences are incorporated into the project, screening will proceed in a non-redundant way on restarting. Thus, screening projects can expand as required to incorporate new TDb files (e.g., recently published WGS assemblies) or novel probe/reference sequences. The DIGS tool console allows reclassification of sequences held in the results table (e.g., following an RSL update). To increase efficiency, this process can be tailored to specific subsets of database sequences by supplying SQL constraints via the DIGS tool console.
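The restart behavior described above can be illustrated with a minimal Python sketch. It is not the DIGS tool's Perl implementation: the function names, the in-memory `completed` set (standing in for the status table), and the probe and target names are all hypothetical, and the `search` callback stands in for a BLAST+ call.

```python
from itertools import product

def outstanding_searches(probes, targets, completed):
    """Return (probe, target) pairs not yet recorded as completed, in a stable order."""
    return [pair for pair in product(probes, targets) if pair not in completed]

def run_screen(probes, targets, completed, search):
    """Run every outstanding search and record each one once it has finished."""
    for probe, target in outstanding_searches(probes, targets, completed):
        search(probe, target)            # stand-in for a BLAST+ search
        completed.add((probe, target))   # stand-in for a status-table insert

# Hypothetical probe and target names, for illustration only.
probes = ["probe_A", "probe_B"]
targets = ["genome_1.fa", "genome_2.fa"]
completed = {("probe_A", "genome_1.fa")}   # left over from an interrupted run

log = []
run_screen(probes, targets, completed, lambda p, t: log.append((p, t)))

# Adding a new target later triggers only the searches that involve it.
targets.append("genome_3.fa")
run_screen(probes, targets, completed, lambda p, t: log.append((p, t)))
```

Because each search is recorded only after it finishes, an interrupted run resumes with the search that was in progress, and no completed search is repeated.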

BLAST algorithms emphasize local similarity and consequently tend to fragment contiguous matches into several separate hits if similarity across some internal regions of the match is low. The DIGS tool allows screening pipelines to be configured with respect to how overlapping/adjacent hits are handled, so that the screening process can be tailored to the specific needs of diverse projects. The DIGS tool also provides a “consolidation” function that concatenates, rather than merges, adjacent hits and stores concatenated results, along with information about their structure, in a new screening database table.
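One simple way to defragment hits of this kind is interval merging with a configurable gap. The sketch below is an assumption-laden illustration rather than the DIGS tool's own logic: the `Hit` tuple, the `merge_hits` function, and the `max_gap` parameter are hypothetical names, and strand orientation, which a real pipeline would likely take into account, is ignored.

```python
from typing import List, Tuple

Hit = Tuple[str, int, int]  # (scaffold, start, end), coordinates inclusive

def merge_hits(hits: List[Hit], max_gap: int = 0) -> List[Hit]:
    """Merge hits on the same scaffold that overlap or lie within max_gap bases."""
    merged: List[Hit] = []
    for scaffold, start, end in sorted(hits):
        if merged and merged[-1][0] == scaffold and start - merged[-1][2] <= max_gap:
            # Extend the previous entry instead of creating a new one.
            prev = merged[-1]
            merged[-1] = (scaffold, prev[1], max(prev[2], end))
        else:
            merged.append((scaffold, start, end))
    return merged

# Hypothetical hit coordinates, for illustration only.
hits = [("scaf1", 100, 400), ("scaf1", 350, 700), ("scaf1", 760, 900), ("scaf2", 10, 200)]
strict = merge_hits(hits, max_gap=0)      # merges only overlapping hits
relaxed = merge_hits(hits, max_gap=100)   # also bridges the 60-base gap on scaf1
```

A larger `max_gap` favors reconstructing long, degraded loci as single entries, at the cost of occasionally joining distinct neighboring insertions.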

For program validation, we mined mammalian genomes for sequences disclosing similarity to the antiviral restriction factor tetherin [ 43 , 44 ]. Tetherin provides a useful test case as it is a relatively distinctive gene and its evolution, distribution and diversity have previously been examined [ 43 , 44 ]. Results were compared with those provided by two alternative, widely used genome mining pipelines: OrthoDB [ 45 ] and Ensembl [ 46 ] and found to overlap by > 99% (Additional file 4 : Fig. S4).

The DIGS tool provides functionality for exporting FASTA-formatted sequences and managing screening database tables (e.g., add/drop tables, import table data). Further information regarding program installation and usage is provided online, in a repository-associated website [ 47 ]. In the sections below, we illustrate the application of the DIGS tool to cataloging of EVEs in vertebrate genomes, focussing on both high and low copy number elements.

Use of DIGS to catalog RT-encoding endogenous retroviruses

Unusually among vertebrate viruses, retroviruses (family Retroviridae ) integrate their genome into the nuclear genome of infected cells as an obligate part of their life cycle. As a result, retroviruses gain more opportunities to become a permanent part of the host germline. Furthermore, the initial integrated form of a retrovirus genome, called a provirus, is typically replication competent. Endogenous retroviruses (ERVs) can therefore increase their germline copy number through reinfection of germline cells or (after adaptation) by intracellular retrotransposition [ 48 , 49 ]. Accordingly, ERVs are by far the most common type of EVE found in vertebrate genomes [ 7 , 50 ].

Retrovirus genomes contain a pol coding domain that encodes a reverse transcriptase (RT) gene. The RT gene can be used to reconstruct phylogenetic relationships across the entire Retroviridae and hence provides the lynchpin for unraveling the evolutionary history and origins of ERV loci [ 51 , 52 ]. We therefore implemented a screening procedure to detect RT-encoding ERV loci, based on an RSL comprised of previously classified exogenous retrovirus and ERV RT sequences (see “ Materials and methods ”). Screening involved more than 1.5 million discrete tBLASTn searches and resulted in the identification of 1,073,137 ERV RT hits. This set was filtered based on a higher BLAST bitscore cutoff to obtain a high confidence set of 702,167 loci (Table  2 ).

High confidence ERV RT hits were identified in all vertebrate classes. However, the frequency among classes was found to vary dramatically (Fig.  3 ). ERVs occur most frequently in mammals (class Mammalia) and amphibians (class Amphibia), and at relatively similar, intermediate frequencies in the genomes of reptiles (class Reptilia) and birds (class Aves). By contrast, RT-encoding ERVs are relatively rare in the genomes of fish, including ray-finned fish (class Actinopterygii) and jawless fish (class Agnatha). Cartilaginous fish (class Chondrichthyes) represent a possible exception, although only a few genomes were available for this group (Fig.  3 ). These findings are broadly consistent with previous studies, conducted using a smaller number of species genomes [ 50 , 53 , 54 , 55 ].

figure 3

Counts of ERV RT loci identified via database-integrated genome screening of 874 vertebrate species. Box plots show the distribution of endogenous retrovirus (ERV) reverse transcriptase (RT) counts in distinct vertebrate classes. Median and range of values are indicated. Circles indicate counts for individual species. Counts are shown against a log scale. Figure created in R using ggplot2 and geom_boxplot. RT hits identified as likely contaminants are not shown

ERVs have been taxonomically grouped into three clades (I, II, and III) based on their phylogenetic relatedness in the RT gene to the exogenous Gammaretrovirus , Betaretrovirus , and Spumavirus genera, respectively [ 1 , 2 ]. We incorporated into our RT screening database taxonomic information for (i) host species examined in our screen and (ii) RSL RT sequences. We then used an SQL query referencing these tables to summarize the frequency of clade I, II and III ERVs in distinct vertebrate classes (Additional file 3 : Fig. S3). Whereas clade I and III ERVs are found in all vertebrate groups, clade II ERVs appear to have a more restricted distribution, occurring only at low frequency in amphibians, and being completely absent from agnathans and cartilaginous fish (Table  2 ). A few clade II ERVs were identified in ray-finned fish, but these were very closely related to mammalian ERVs and likely represent contamination of WGS builds with mammalian genomic DNA. While RT-encoding ERV copy number is quite high in cartilaginous fish, RT diversity is relatively low, with the majority of ERV RT sequences belonging to clade III.
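A query of this kind can be sketched as a join between the results table and the two taxonomy side tables, grouped by host class and clade. The example below uses Python's built-in `sqlite3` module purely as a stand-in for MySQL; the table names, column names, and rows are hypothetical and do not reproduce the DIGS schema or the study's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hits (id INTEGER PRIMARY KEY, host_species TEXT, best_match TEXT);
CREATE TABLE host_taxonomy (species TEXT PRIMARY KEY, host_class TEXT);
CREATE TABLE rsl_taxonomy (reference TEXT PRIMARY KEY, clade TEXT);
""")

# Toy rows, invented for illustration only.
conn.executemany(
    "INSERT INTO hits (host_species, best_match) VALUES (?, ?)",
    [("sp_mammal", "ref_I"), ("sp_mammal", "ref_II"), ("sp_mammal", "ref_II"),
     ("sp_frog", "ref_III"), ("sp_shark", "ref_III"), ("sp_shark", "ref_I")],
)
conn.executemany(
    "INSERT INTO host_taxonomy VALUES (?, ?)",
    [("sp_mammal", "Mammalia"), ("sp_frog", "Amphibia"), ("sp_shark", "Chondrichthyes")],
)
conn.executemany(
    "INSERT INTO rsl_taxonomy VALUES (?, ?)",
    [("ref_I", "I"), ("ref_II", "II"), ("ref_III", "III")],
)

# Count hits per host class and ERV clade by joining hits to both side tables.
rows = conn.execute("""
    SELECT t.host_class, r.clade, COUNT(*) AS n_hits
    FROM hits AS h
    JOIN host_taxonomy AS t ON t.species = h.host_species
    JOIN rsl_taxonomy AS r ON r.reference = h.best_match
    GROUP BY t.host_class, r.clade
    ORDER BY t.host_class, r.clade
""").fetchall()

for host_class, clade, n_hits in rows:
    print(host_class, clade, n_hits)
```

Keeping taxonomy in side tables means the same screening results can be re-summarized at any taxonomic rank without re-running the searches.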

Use of DIGS to catalog non-retroviral EVEs in vertebrate genomes

To identify non-retroviral EVEs, we first obtained an RSL representing all known viruses [ 56 ]. From this library, a set of representative probes was selected. Probes comprised representative proteomes of all known vertebrate viruses except retroviruses. Screening entailed > 1.5 million discrete tBLASTn searches, and initial results comprised 33,654 hits. However, many of these represented matches to host genes and TEs. We identified these spurious matches by interrogating screening results with a combination of SQL queries, BLAST-based comparisons to curated sequence databases, and ad hoc phylogenetic analysis.

We excluded hits that contained intact coding regions and lacked evidence of integration into host DNA, since these may be derived from contaminating exogenous viruses (Additional file 5 : Table S1). We also excluded other virus-derived DNA sequences that appeared likely to represent diet-related contamination of WGS data. For example, SQL-generated summaries of our initial screen results revealed several sequences disclosing similarity to plant viruses, including geminiviruses (family Geminiviridae ) and potyviruses (family Potyviridae ) (Additional file 3 : Fig. S3). These sequences contained multiple stop codons and frameshifts, suggesting they might represent EVEs embedded within contaminating DNA, particularly since EVEs derived from both these virus groups are known to occur in plant genomes [ 57 , 58 ]. Other unexpected matches to plant virus groups were contained within large contigs and thus could not be dismissed as contaminating DNA. For example, a sequence identified in the genome of the pig-nosed turtle ( Carettochelys insculpta ) disclosed similarity to caulimoviruses (family Caulimoviridae ). However, genomic analysis revealed this sequence in fact represents an unusual ERV (Additional file 6 : Fig. S5).
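As a rough illustration of how degraded coding capacity can be quantified, the sketch below counts in-frame stop codons in a nucleotide sequence. It is a simplification introduced here, not the authors' procedure: it assumes the reading frame is already known (for example from the alignment), handles only the forward strand, does not detect frameshifts, and uses invented example sequences.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def count_stops(seq: str, frame: int = 0) -> int:
    """Count stop codons in one forward reading frame (frame is 0, 1, or 2)."""
    seq = seq.upper()
    # Step through complete codons only; a trailing partial codon is ignored.
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return sum(codon in STOP_CODONS for codon in codons)

# Invented sequences, for illustration only.
intact = "ATGGCTGCTAAA"        # ATG GCT GCT AAA: no stop codons in frame 0
degraded = "ATGTAAGCTTGAAAA"   # ATG TAA GCT TGA AAA: two stop codons in frame 0

intact_stops = count_stops(intact)
degraded_stops = count_stops(degraded)
```

A hit with no in-frame stops and no flanking host sequence would be a candidate for exclusion as exogenous contamination under the criteria above, whereas multiple stops point to an older, non-functional insertion.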

Next we removed matches to recognized transposons that are wholly or partly comprised of virus-derived DNA, such as polintons/mavericks [ 59 , 60 , 61 ] and teratorns [ 62 ] (Additional file 3 : Fig. S3). Once these EVE-like TEs had been removed, results comprised 6038 putative non-retroviral EVE sequences, representing 10 virus families (Table  3 , [ 63 ]). We did not identify any EVEs derived from vertebrate viruses with genomes comprised of double-stranded RNA (e.g., order Reovirales) or circular single-stranded RNA (e.g., genus Deltavirus ). However, all other virus genome “classes” were represented including reverse-transcribing DNA (DNArt) viruses, double-stranded DNA (DNAds) viruses, single-stranded DNA (DNAss) viruses, single-stranded negative sense RNA (RNAss-ve) viruses, and single-stranded positive sense RNA (RNAss+ve) viruses. Plotting the distribution of EVEs and exogenous viruses from distinct virus families and genera across vertebrate phyla revealed that many virus groups have had a broader distribution across vertebrate hosts than recognized on the basis of previously identified exogenous viruses (Fig.  4 ).

figure 4

Exogenous versus endogenous distribution of virus families that have been incorporated into the vertebrate germline. Circles indicate the known presence of exogenous viruses in vertebrate groups, determined through reference to the NCBI virus genomes resource [ 56 ], supplemented with information obtained from recently published papers [ 64 , 65 , 66 , 67 , 68 , 69 , 70 ]. Shaded boxes indicate the presence of endogenous viral elements, as determined in the present study. RT retroviruses, DNArt reverse transcribing DNA viruses, DNAss single-stranded DNA viruses, DNAds double-stranded DNA viruses, RNAds double-stranded RNA viruses, RNAss-ve single-stranded negative sense RNA viruses, RNAss+ve single-stranded positive sense RNA viruses

We examined all EVE loci identified in our study to determine their coding potential. We identified numerous EVE loci encoding open reading frames (ORFs) > 300 amino acids (aa) in length (Additional file 7 : Fig. S6). Among these, four encoded ORFs longer than 1000 aa. One of these, a 1718 aa ORF encoded by an endogenous borna-like L-protein (EBLL) element in bats (EBLL-Cultervirus.29-EptFus), has been reported previously [ 71 ]. However, we also identified an endogenous chuvirus-like L-protein (ECLL) element encoding an ~ 1400 aa ORF in livebearers (subfamily Poeciliinae). This element encodes long ORFs in two distinct livebearer species ( P. formosa and P. latipinna ), indicating its coding capacity has been conserved for > 10 million years [ 72 ]. We also detected herpesvirus and alloherpesvirus EVEs encoding ORFs > 1000 aa, but as discussed below, the integration status of these sequences remains unclear.
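A coding-potential check of this sort can be approximated by finding the longest stop-free stretch across all six reading frames. The sketch below is one simple way to do so and is not necessarily how the ORFs above were identified: it treats an ORF as any run of codons between stops (no start codon required), counts ambiguous bases as non-stop, and uses hypothetical function names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def longest_orf_codons(seq: str) -> int:
    """Length in codons of the longest stop-free run over all six reading frames."""
    best = 0
    for strand in (seq.upper(), reverse_complement(seq)):
        for frame in range(3):
            run = 0
            for i in range(frame, len(strand) - 2, 3):
                if strand[i:i + 3] in STOP_CODONS:
                    run = 0                  # a stop codon ends the current run
                else:
                    run += 1
                    best = max(best, run)
    return best

# Invented sequence: 400 alanine codons with no stop in any frame.
long_open = "GCT" * 400
exceeds_threshold = longest_orf_codons(long_open) > 300
```

Comparing the result against a threshold, as in the last line, mirrors the 300 aa cutoff used above; since one codon corresponds to one amino acid, the codon count can be read directly as an ORF length in aa.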

Diversity of non-retroviral EVEs in vertebrate genomes

EVEs derived from viruses with double-stranded DNA genomes

We detected DNA derived from herpesviruses (family Herpesviridae ) in mammalian and reptilian genomes (Fig.  4 , Table  3 , [ 63 ]). DNA sequences derived from betaherpesviruses (subfamily Betaherpesvirinae ) and gammaherpesviruses (subfamily Gammaherpesvirinae ) have previously been reported in WGS assemblies of the tarsier ( Carlito syrichta ) and aye-aye ( Daubentonia madagascariensis ), respectively [ 73 ]. In addition to these sequences, we detected gammaherpesvirus DNA in WGS data of red squirrels ( Sciurus vulgaris ) and the Amazon river dolphin ( Inia geoffrensis ), while betaherpesvirus DNA was detected in the stoat ( Mustela erminea ) WGS assembly, and DNA derived from an alphaherpesvirus (subfamily Alphaherpesvirinae ) in the Oriximina lizard ( Tretioscincus oriximinensis ) WGS (Additional file 8 : Fig. S7). Germline integration of human betaherpesviruses has been demonstrated [ 74 , 75 ], and the presence of a betaherpesvirus-derived EVE in the tarsier genome has been established [ 73 ]. However, herpesviruses can also establish latent infections and might be considered likely to occur as contaminants of DNA samples used to generate whole genome sequence assemblies. Due to the limitations of the WGS assemblies in which they were identified, it was not possible to confirm that the novel herpesvirus DNA sequences detected here represent EVEs rather than DNA derived from contaminating exogenous viruses.

DNA derived from alloherpesviruses (family Alloherpesviridae ) was detected in fish and amphibians. In ray-finned fish, most of these sequences belonged to the “teratorn” lineage of transposable elements, which have arisen via fusion of alloherpesvirus genomes and piggyBac transposons, and have been intragenomically amplified in the genomes of teleost fish (infraclass Teleostei) [ 62 ]. Additional alloherpesvirus-related elements were identified in three amphibian species and five ray-finned fish species [ 63 ]. One of these elements, identified in the Asiatic toad ( Bufo gargarizans ) occurred within a contig that was significantly larger than a herpesvirus genome, demonstrating that it represents an EVE rather than an exogenous virus. Phylogenetic analysis revealed that alloherpesvirus-like sequences identified in amphibian genomes clustered robustly with amphibian alloherpesviruses, while those identified in fish genomes clustered with fish alloherpesviruses (Additional file 8 : Fig. S7).

EVEs derived from viruses with single-stranded DNA genomes

EVEs derived from parvoviruses (family Parvoviridae ) and circoviruses (family Circoviridae ) are widespread in vertebrate genomes, being found in the majority of vertebrate classes (Fig.  4 ). Both endogenous circoviral elements (ECVs) and endogenous parvoviral elements (EPVs) are only absent in major vertebrate groups represented by a relatively small number of sequenced species genomes (i.e., between 1 and 6). No ECVs or EPVs were identified in the tuatara (order Rhynchocephalia) or in crocodiles (order Crocodilia). EPVs were not identified in agnathans, while ECVs were not identified in cartilaginous fish.

We identified a total of 1192 ECVs, most of which are derived from elements in carnivore (class Mammalia: order Carnivora) genomes that are embedded within non-LTR retrotransposons and have undergone intragenomic amplification (Additional file 9 : Fig. S8). While many of the ECVs identified in our screen have been reported in previous publications [ 7 , 32 , 36 , 42 , 76 ], we also identified novel loci in mammals, reptiles, amphibians, and ray-finned fish [ 63 ]. Phylogenetic analysis (see Additional file 8 : Fig. S7) revealed that a novel ECV locus in turtles groups with avian circoviruses, while amphibian ECV elements grouped with fish circoviruses, though bootstrap support for this relationship was lacking. A circovirus-like sequence detected in the WGS data of Allen’s wood mouse ( Hylomyscus alleni ) grouped robustly with exogenous rodent circoviruses, but integration of this sequence into the H. alleni genome could not be confirmed.

We identified 627 EPVs, representing two distinct subfamilies within the Parvoviridae and five distinct genera (see Fig.  4 ). The majority of these loci have been reported in a previous study [ 32 ] or are orthologs of these loci. However, we identified novel EPVs in reptiles, amphibians and mammals (Table  3 , [ 63 ]). In reptiles the novel elements derived from genus Dependoparvovirus while the amphibian elements were more closely related to viruses in genus Protoparvovirus . Notably, the novel amphibian EPVs clustered basally within a clade of protoparvovirus-related viruses in phylogenetic reconstructions (Additional file 8 : Fig. S7), consistent with previous analyses indicating that protoparvovirus ancestors may have broadly co-diverged with vertebrate phyla [ 32 ].

EVEs derived from reverse-transcribing DNA viruses

EVEs derived from hepadnaviruses (family Hepadnaviridae ), which are reverse-transcribing DNA viruses, were identified in reptiles, birds and amphibians (Table  3 , [ 63 ]). Most of these EVEs, commonly referred to as “endogenous hepatitis B viruses” (eHBVs), have been reported previously [ 35 , 77 ]. However, we identified novel elements in the plateau fence lizard ( Sceloporus tristichus ) and others in vertebrate classes where eHBVs have not been reported previously. These include one element identified in a cartilaginous fish, the Australian ghostshark ( Callorhinchus milii ), and another identified in an amphibian, the common coquí ( Eleutherodactylus coqui ).

Phylogenetic analysis (see Additional file 8 : Fig. S7) revealed that novel eHBV elements identified in lizards (suborder Lacertilia) group robustly with the exogenous skink hepadnavirus (SkHBV), while the amphibian element groups within a clade comprised of the exogenous spiny lizard hepadnavirus (SlHBV), Tibetan frog hepadnavirus (TfHBV) and eHBV elements identified in crocodile genomes. The eHBV identified in sharks was relatively short and not amenable to phylogenetic analysis but nonetheless provides the first evidence that hepadnaviruses infect this host group.

EVEs derived from viruses with single-stranded, negative sense RNA genomes

Screening revealed that vertebrate genomes contain numerous EVEs derived from mononegaviruses (order Mononegavirales ), which are characterized by non-segmented ssRNA-ve genomes. These EVEs derive from four mononegavirus families: bornaviruses (family Bornaviridae ), filoviruses (family Filoviridae ), paramyxoviruses (family Paramyxoviridae ) and chuviruses (family Chuviridae ) (Fig.  4 , Table  3 , [ 63 ]). We did not detect any EVEs derived from other mononegavirus families that infect vertebrates ( Pneumoviridae , Rhabdoviridae , Nyamiviridae , Sunviridae ), nor any EVEs derived from virus families with segmented, negative sense RNA genomes (e.g., Peribunyaviridae , Orthomyxoviridae ).

The majority of mononegavirus EVEs identified in our screen were derived from bornaviruses and filoviruses and have been described in previous reports [ 7 , 32 , 35 , 36 , 78 ]. However, we also identified novel EVEs derived from these groups, as well as previously unreported EVEs derived from paramyxoviruses and chuviruses (Table  3 ).

Germline integration of DNA derived from mononegaviruses can occur if, in an infected germline cell, viral mRNA sequences are reverse transcribed and integrated into the nuclear genome by cellular retroelements [ 79 ]. EVE loci generated in this way preserve the sequences of individual genes of ancient mononegaviruses, but not entire viral genomes. Among mononegavirus-derived EVEs, regardless of which family, elements derived from the nucleoprotein (NP) and large polymerase (L) genes predominate. However, other genes are also represented, including the glycoprotein (GP) genes of filoviruses, bornaviruses, and chuviruses, the VP30 and VP35 genes of filoviruses, and the hemagglutinin-neuraminidase (HA-NM) gene of paramyxoviruses.

Paramyxovirus-like EVEs were identified in ray-finned fish, amphibians, and sharks (Fig.  4 , Table  3 , [ 63 ]). Many of these EVEs were highly divergent and/or degenerated and consequently their evolutionary relationships to contemporary paramyxoviruses were poorly resolved in phylogenetic analysis. However, an L polymerase-derived sequence identified in the pobblebonk frog ( Limnodynastes dumerilii ) genome was found to group robustly with Sunshine Coast virus, a contemporary paramyxovirus of Australian pythons [ 80 ] in phylogenetic trees (Additional file 8 : Fig. S7).

Chuvirus-like sequences were identified in agnathans, ray-finned fish, reptiles, amphibians, and mammals (Fig.  4 , Table  3 , [ 63 ]). The majority of the mammalian elements were identified in marsupials, but we also identified a single chuvirus-like EVE in the genome of a laurasiatherian mammal, the bottlenose dolphin ( Tursiops truncatus ). Phylogenetic trees reconstructed using alignments of NP-derived chuvirus EVEs and NP genes of contemporary chuviruses revealed evidence for the existence of distinct clades specific to particular vertebrate classes (Additional file 8 : Fig. S7). These included a clade containing both a snake EVE and an exogenous chuvirus of snakes, and two clades comprised of EVEs and viruses of teleost fish. In addition, these phylogenies revealed a robustly supported relationship between chuvirus EVEs in the Tibetan frog ( Nanorana parkeri ) and zebrafish ( Danio rerio ) genomes. Taken together, these results provide evidence for the existence of numerous diverse lineages of chuviruses in vertebrates, adding to recent evidence for the presence of exogenous chuviruses in marsupials [ 64 ].

Filovirus-derived EVEs were mainly identified in mammals (Fig.  4 , Table  3 , [ 63 ]). However, we also identified one filovirus-derived EVE in an amphibian, the mimic poison frog ( Ranitomeya imitator ), providing the first evidence that filoviruses infect this vertebrate group (Table  1 ). Among mammals, we identified novel, ancient filovirus EVEs in anteaters (family Myrmecophagidae) and spiny mice (genus Acomys ).

Strikingly, the inclusion of Tapajos virus (TAPV), a snake filovirus, in phylogenetic reconstructions revealed evidence for the existence of two highly distinct filovirus lineages in mammals (Fig.  5 ). These two lineages, which are robustly separated from one another by TAPV, are evident in phylogenies constructed for both the NP and VP35 genes. One lineage (here labeled “Mammal-1”) is comprised of EVEs and all contemporary mammalian filoviruses, whereas the other (“Mammal-2”) is comprised exclusively of EVEs. Notably, within the Mammal-1 group, EVEs identified in host species groups that are indigenous to Southern Hemisphere continents (e.g., marsupials, xenarthrans) cluster basally, whereas EVEs and viruses isolated from “Old World”-associated placental mammals occupy a more derived position.

figure 5

Evolutionary relationships of filoviruses and filovirus-derived EVEs. Bootstrapped maximum likelihood phylogenies showing the evolutionary relationships between filoviruses and filovirus EVEs in the nucleoprotein (NP) and viral protein 35 (VP35) genes. Phylogenies were constructed using maximum likelihood as implemented in RAxML, and codon-aligned nucleotides for each gene. Numbers adjacent to internal nodes indicate bootstrap support (1000 bootstrap replicates). The scale bar indicates evolutionary distance in substitutions per site. Virus taxon names are shown in regular font; EVE names are shown in bold. EVE names follow standardized nomenclature (see “ Materials and methods ”). Brackets to the right of each tree indicate virus genera (italics) and major lineages (bold). Silhouettes indicate host groups following the key. For Ebola virus, Bundibugyo virus, and Tai Forest virus, the main reservoir hosts are unknown. The inset box adjacent to these taxa shows host species from which one or more of these viruses has been isolated [ 81 , 82 ], following the key. *Experimentally investigated locus [ 83 , 84 ]

The “Mammal-2” clade contains filovirus EVEs from rodents, primates, and bats. Because EVEs belonging to this clade were obtained from several distinct lineages, and show conservation across these groups, we can be reasonably confident they represent a bona fide lineage within the Filoviridae, rather than just a set of highly degraded filo-like EVEs that group together due to long branch attraction [ 85 ]. One member of this group (eflp-filo.1-Myotis) encodes an intact VP35 protein, the properties of which have been experimentally investigated in recent studies [ 83 , 84 ]. Interestingly, we found that spiny mice also harbor a filovirus EVE encoding an intact VP35 protein (eflp-filo.3-Acomys); however, this insertion belongs to the “Mammal-1” clade and is relatively closely related to the VP35 proteins found in contemporary mammalian filoviruses (Fig.  5 b).

Bornavirus-like EVEs were identified in all vertebrate classes except Chondrichthyes (Fig.  4 , Table  3 , [ 63 ]). The majority have been reported previously or are orthologs of previously reported EVEs. However, we identified novel bornavirus-like EVEs in the genomes of ray-finned fish and amphibians. The amphibian EVEs grouped robustly with culterviruses in phylogenetic reconstructions (Additional file 8 : Fig. S7).

EVEs derived from viruses with single-stranded, positive sense RNA genomes

EVEs derived from positive sense RNA viruses are rare in vertebrate genomes (Fig.  4 , Table  3 , [ 63 ]). The only examples we identified were a small number of sequences derived from flavivirids (family Flaviviridae ). These include an EVE derived from the Pestivirus genus of flavivirids, identified in the reference genome of the Indochinese shrew ( Crocidura indochinensis ), as reported previously [ 86 ], and EVEs identified in ray-finned fish, also reported previously [ 31 ]. In fish genomes, flavivirid EVEs derive from the proposed “Tamanavirus” genus, and a lineage labeled “X2” that groups as a sister taxon to the proposed “Jingmenvirus” genus. However, jingmenviruses are actually segmented, RNAss-ve viruses whose genomes include flavivirid-derived segments [ 87 ]. Since it is possible that the X2 lineage shares a common RNAss-ve ancestor with jingmenviruses, EVEs belonging to this lineage may in fact be derived from viruses with ssRNA-ve genomes.

Frequency of germline incorporation events across distinct vertebrate phyla

We used the DIGS framework to dissect the history of horizontal gene transfer events involving germline incorporation of DNA derived from non-retroviral viruses. We excluded EVEs derived from Polinton-like viruses (Adintoviruses) and teratorn elements, both of which exhibit relatively high copy numbers due to intragenomic amplification [ 60 , 61 , 62 , 88 ]. For these groups, the large number of insertions, and the fact that amplified lineages appear to have been independently established on multiple occasions, meant that such an analysis would be beyond the scope of this study.

To examine the rate of germline incorporation in the remaining groups of non-retroviral EVEs, we compiled an expanded RSL containing a single reference sequence for each putative (or previously confirmed) ortholog. By classifying our hits against this expanded RSL, we could discriminate novel EVE loci (paralogs) from orthologs of previously described EVE loci. Where novel paralogs were identified, we incorporated these into our RSL and then reclassified related sequences in our screening database against this updated library. By investigating loci in this way, and iteratively reclassifying database sequences, we progressively resolved the various non-retroviral EVEs identified in our screen into sets of putatively orthologous insertions. Via this analysis, we estimated that the non-retroviral EVEs identified in our study (excluding those derived from DNAds viruses) represent ~1137 distinct germline incorporation events (Table  3 ). Using orthology information, we calculated minimum age estimates for all non-retroviral EVEs identified in two or more species [ 63 ]. We applied standardized nomenclature to EVE loci (see “ Materials and methods ”), capturing information about EVE orthology, taxonomy and host distribution [ 63 ].
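The iterative resolution of hits into putatively orthologous sets can be illustrated with a short Python sketch. This is not the DIGS tool's implementation: the `similarity` function is a toy stand-in for a BLAST-based classification against the RSL, and the function names, the 0.8 threshold and the `novel:` naming scheme are assumptions made purely for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Toy similarity score in [0, 1]; a stand-in for a BLAST-based comparison."""
    return SequenceMatcher(None, a, b).ratio()


def resolve_loci(hits, rsl, threshold=0.8):
    """Classify hits against a reference library, growing the library as needed.

    hits: dict mapping hit ID -> sequence
    rsl:  dict mapping locus name -> reference sequence (extended in place)
    Returns a dict mapping hit ID -> locus name.
    """
    assignments = {}
    changed = True
    while changed:  # repeat until a full pass changes nothing
        changed = False
        for hit_id, seq in hits.items():
            # Find the best-matching reference in the current library.
            best_name, best_score = None, 0.0
            for name, ref in rsl.items():
                score = similarity(seq, ref)
                if score > best_score:
                    best_name, best_score = name, score
            if best_score < threshold:
                # No adequate match: treat the hit as a novel locus and add it
                # to the library so that related hits can be reclassified.
                best_name = f"novel:{hit_id}"
                rsl[best_name] = seq
            if assignments.get(hit_id) != best_name:
                assignments[hit_id] = best_name
                changed = True
    return assignments
```

Because the library only grows and each hit matches itself perfectly once added, the loop is guaranteed to terminate; in practice, manual inspection of candidate loci would sit between passes.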

Next, we estimated the rate of germline incorporation for each endogenized virus family, in all vertebrate classes represented by at least ten species (Fig.  6 ). Rates were found to vary dramatically across each of the vertebrate groups examined. Overall, rates were highest in mammals and lowest in reptiles. Fish and amphibians disclosed similar rates with DNAss and ssRNA-ve viruses being incorporated at similar, intermediate rates. Birds were generally similar to reptiles but show a higher rate of DNAss virus incorporations and a markedly elevated rate of hepadnavirus incorporation. Rates of parvovirus, filovirus, and bornavirus infiltration were very high in mammals compared to other vertebrate classes, with bornaviruses being incorporated at a particularly high rate (> 0.03 per million years of species evolution). A relatively high rate of incorporation of RNAss+ve viruses was observed in ray-finned fish, but since the elements in question are closely related to jingmenviruses, as described above, they may in fact reflect incorporation of DNA derived from an RNAss-ve virus group [ 87 ].

figure 6

Comparison of germline infiltration rates in five vertebrate classes. Infiltration rates represent the rate of incorporation and fixation per million years (MY) of species branch length sampled. Rates are shown for each non-retroviral family represented by vertebrate EVEs. Colors indicate reverse transcribing DNA (DNArt) viruses, single-stranded DNA (DNAss) viruses, single-stranded negative sense RNA (RNAss-ve) viruses, and single-stranded positive sense RNA (RNAss+ve) viruses, following the key
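As a worked illustration of the rate definition given in the caption (distinct incorporation events divided by the total species branch length sampled, in MY), the calculation can be sketched in Python as follows. The function name and the data layout are assumptions, and the numbers in the usage note are invented placeholders rather than values from this study.

```python
from collections import Counter


def infiltration_rates(events, branch_length_my):
    """Compute incorporation events per million years of branch length.

    events: iterable of (host_class, virus_family) tuples, one per distinct
            germline incorporation event.
    branch_length_my: dict mapping host_class -> total species branch length
            sampled, in millions of years (MY).
    Returns a dict mapping (host_class, virus_family) -> rate per MY.
    """
    counts = Counter(events)
    rates = {}
    for (host_class, family), n_events in counts.items():
        total_my = branch_length_my[host_class]
        if total_my <= 0:
            raise ValueError(f"branch length for {host_class} must be positive")
        rates[(host_class, family)] = n_events / total_my
    return rates
```

For example, 45 hypothetical events sampled over 1500 MY of branch length would correspond to a rate of 0.03 per MY.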

In addition to estimating the frequency of germline incorporation of non-retroviral viruses, we used our screening data to reconstruct a time-calibrated overview of virus integration throughout vertebrate evolutionary history (Fig.  7 , Additional file 10 : Table S2, Additional file 11 : Fig S9). Among putatively orthologous groups of EVEs for which we were able to estimate minimum dates of integration, the majority were found to have been incorporated in the Cenozoic Era (1-66 Mya). So far, the oldest integration event identified involves a metahepadnavirus (genus Metahepadnavirus )-derived EVE that appears to be orthologous in tuataras and birds, indicating it was incorporated into the saurian germline > 280–300 Mya (see [ 35 ]). Other ancient EVEs include circovirus and herpetohepadnavirus (genus Herpetohepadnavirus )-derived EVEs in turtles (order Testudines) (see [ 77 ]), a circovirus-derived EVE in frogs (order Anura), and bornavirus integrations in placental mammals (see [ 78 ]). Besides revealing the landscape of non-retroviral EVE integration throughout vertebrate history, plotting EVE distribution in this way clearly reveals the main differences in EVE distribution across host groups (Fig.  7 ).
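The logic behind a minimum age estimate can be sketched briefly: an EVE present as an ortholog in several species must have integrated before their most recent common ancestor, and on a time-calibrated (ultrametric) tree the age of that ancestor equals the largest pairwise divergence time among those species. The Python sketch below assumes a hypothetical lookup table of pairwise divergence times; the function name and data layout are illustrative only.

```python
from itertools import combinations


def minimum_age(species, divergence_my):
    """Return the minimum integration age (MY) of an orthologous EVE.

    species: names of the species in which the ortholog was detected.
    divergence_my: dict mapping frozenset({sp1, sp2}) -> divergence time in
            MY, assumed to come from a time-calibrated phylogeny.
    """
    unique = sorted(set(species))
    if len(unique) < 2:
        raise ValueError("orthology-based dating needs at least two species")
    # The oldest pairwise split among the carriers dates their common ancestor.
    return max(divergence_my[frozenset(pair)] for pair in combinations(unique, 2))
```

The estimate is a lower bound because the integration could have occurred at any point on the branch leading to that ancestor.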

figure 7

Overview of germline incorporation in vertebrates. A time-calibrated phylogeny of vertebrate species examined in this study, obtained via TimeTree [ 89 ]. Minimum ages of endogenization events are indicated by diamonds on internal nodes for EVE loci present as orthologs in multiple species. The presence of EVE sequences in each species genome is indicated by circles at phylogeny tips. Circles and diamonds are scaled by the number of sequences detected and color-coded by virus family as indicated in the legend. For circles, scaling indicates the total number of EVE sequences detected within each species genome, including both unique and shared endogenization events

Sequencing of genomes is advancing rapidly but deciphering the complex layers of information they contain is a challenging, long-term endeavor [ 78 , 79 ]. Genomes are not only inherently complex but they also exhibit remarkable dynamism, with phenomena such as recombination, transposition, and horizontal gene transfer contributing to the creation of genomic “churn” that makes feature distribution difficult to map [ 80 ]. These issues, combined with rapid data accumulation, coverage limitations, and assembly errors, make generation of complete and accurate annotations difficult [ 83 , 85 ]. Consequently, labor-intensive manual genome annotation remains important [ 64 , 78 ], and most published whole genome sequences are comprised largely of genomic “dark matter”.

An exciting aspect of these circumstances is that they provide immense scope to make interesting biological discoveries using low-cost approaches. While experimental studies are generally required to characterize genome features at a functional level, approaches based solely on comparative sequence analysis (see Fig.  1 b) can often reveal useful insights into their biology and evolution [ 1 , 90 ]. Furthermore, comparative investigations in silico can often be productively combined with functional genomics or experimental approaches (Fig.  1 b, Table  1 ).

Systematic in silico genome screening is a computational approach that facilitates investigation of the dark genome (Fig.  1 ). However, it can be challenging to implement efficiently. Automated pipelines are generally required to implement large-scale screens [ 91 ], and these can produce copious output data that are difficult to manage and interpret without an appropriate analytical framework. Here, we introduce DIGS, a robust analytical platform for conducting large-scale in silico screens, and describe an open software framework (the DIGS tool) for implementing it.

EVEs constitute one interesting and informative group of genome features that can be found within the dark genome [ 22 ]. They are poorly annotated for several reasons. Firstly, they arise sporadically via horizontal gene transfer, and consequently their distribution is unpredictable [ 7 , 22 ]. Additionally, some uncharacterized EVE loci may be hard to recognize due to their being highly degraded or fragmented or because their exogenous virus counterparts are either unknown or extinct [ 92 , 93 ]. Finally, there are numerous potential sources of confounding or artefactual results that can arise during EVE screening, including host genes that exhibit similarity to virus genes, and contamination of WGS assemblies with DNA derived from other sources, including exogenous viruses.

To illustrate how DIGS facilitates identification and characterization of features hidden within the dark genome, we used the DIGS tool to perform a broad-based investigation of EVE diversity in vertebrates. We first focussed on high-copy number EVEs, which in vertebrate genomes mainly comprise ERVs. We screened 874 vertebrate genomes for RT-encoding ERVs and identified 702,167 high confidence matches. This screen revealed marked differences in ERV RT copy number between vertebrate classes. An in-depth investigation of ERV diversity in vertebrates (for example, examining their composition in finer detail or incorporating insertions that lack RT sequences) was considered beyond the scope of this study. However, the RT dataset generated here provides a robust foundation for further ERV studies that are underpinned by phylogenetic analysis. For example, we have previously used RT data in combination with other in silico approaches for in-depth, phylogenetic characterization of ERVs within discrete mammalian subgroups (e.g., see [ 38 ]).

ERVs constitute an unusual type of EVE, in that they can remain replication-competent following integration and may increase their germline copy number through continued replication as viruses or TEs [ 94 ]. However, the germline copy number of any EVE can potentially increase through interactions with TEs; this has been described for ERVs [ 48 , 95 , 96 ], as well as for EVEs derived from DNAds viruses [ 59 , 61 , 62 ]. In addition, data obtained here and in our previous investigations show that EVEs derived from hepadnaviruses have been amplified in cormorants [ 35 ], while circovirus-derived sequences have been amplified in carnivore genomes [ 36 ], apparently in association with LINE1 activity [ 63 ]. Fusion between EVEs and vertebrate transposons has notably influenced vertebrate genome evolution; it has occurred on multiple independent occasions and involves a diverse range of vertebrate viruses. Interestingly, our investigations of LINE1-associated circovirus EVEs in carnivore genomes suggested that LINE1-like retroelements have also been incorporated into gammaherpesvirus genomes and possibly even into Chikungunya virus (Additional file 9 : Fig. S8). These findings suggest that retroelement-mediated transposition can establish a complex network of horizontal gene transfer events linking virus and transposon genomes with those of their vertebrate hosts.

DIGS is well-suited to exploring the distribution and diversity of high copy number genome features such as ERVs and TEs but can also be used in “beach combing” searches of WGS data sets that aim to identify rare and unusual genome features. These kinds of screens typically require a rigorous filtering process to distinguish genuine from spurious matches, and as shown here, this is facilitated by database integration. DIGS enabled the efficient identification of EVEs derived from non-retroviral viruses (which are relatively rare and diverse) and provided a powerful framework for filtering spurious results (Additional file 3 : Fig. S3).
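To illustrate why database integration helps with filtering, the following Python sketch loads screening hits into a relational table and uses a single SQL query to discard low-scoring and spurious matches while summarizing the remainder. It uses the standard-library `sqlite3` module purely as a self-contained stand-in for a relational database management system; the table layout, column names and the `host_gene` label are hypothetical and do not reflect the DIGS tool's actual schema.

```python
import sqlite3


def build_db(rows):
    """Create an in-memory database holding one row per screening hit.

    rows: iterable of (organism, host_class, assigned_name, bitscore) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits ("
        "organism TEXT, host_class TEXT, assigned_name TEXT, bitscore REAL)"
    )
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?)", rows)
    return conn


def counts_by_class(conn, min_bitscore, exclude_names=()):
    """Count hits per host class, dropping weak hits and excluded labels."""
    sql = "SELECT host_class, COUNT(*) FROM hits WHERE bitscore >= ?"
    params = [min_bitscore]
    if exclude_names:
        placeholders = ",".join("?" for _ in exclude_names)
        sql += f" AND assigned_name NOT IN ({placeholders})"
        params.extend(exclude_names)
    sql += " GROUP BY host_class ORDER BY host_class"
    return dict(conn.execute(sql, params).fetchall())
```

Keeping hits in a queryable table means that filtering criteria can be revised and reapplied without rerunning the underlying similarity searches.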

Via DIGS, we established a broad overview of non-retroviral EVE diversity in vertebrate genomes (Table 1, Figs.  4 and 6 ), shedding new light on virus distribution and diversity in vertebrates. Notably, our findings extend the known host range of important virus families. For example, we identify a filovirus-derived EVE in a frog (order Anura), providing the first evidence for the existence of amphibian filoviruses. In addition, we provide the first evidence for the presence (at least historically) of hepadnaviruses in sharks and chuviruses in placental mammals (Fig.  4 ). In addition, we reveal novel virus diversity. For example, we identify novel lineages of parvoviruses and circoviruses in amphibians, as well as a novel circovirus lineage in turtles and a novel hepadnavirus lineage in frogs. We also identify novel paramyxovirus, chuvirus and bornavirus lineages in fish and amphibians.

Mammalian filoviruses include some of the most lethal viruses in the world [ 97 ], and while the natural reservoirs of some are known, they remain unclear for the highly pathogenic ebolavirus (EBOV) and its closest relatives (Fig.  5 ). EBOV is assumed to have a zoonotic origin, but it has rarely been possible to formally link outbreaks to a given animal reservoir, limiting understanding of its emergence. So far, efforts to identify the true reservoirs of ebolaviruses have tended to focus on bats [ 81 ]. However, the widespread presence of filovirus EVEs in rodents [ 63 ], including some groups that have not been examined as potential EBOV reservoirs, such as spiny mice, suggests that the potential of this group to serve as a reservoir should not be overlooked.

Previous studies have noted that filovirus EVE sequences in the genomes of cricetid rodents (family Cricetidae) robustly split the Ebolavirus and Cuevavirus genera from the Marburgvirus and Dianlovirus genera, demonstrating that these groups diverged > 20 million years ago (Mya) [ 98 ], rather than within the past 10,000 years as suggested by molecular clock-based analysis of contemporary filovirus genomes [ 99 ]. Here, we found that TAPV, an exogenous virus of snakes, robustly separates two clades of mammalian filoviruses in phylogenetic reconstructions. Since transmission of filoviruses between reptiles and mammals is likely quite rare, and both lineages contain ancient EVEs (Fig.  5 , Additional file 10 : Table S2), these findings support the long-term existence of two highly distinct filovirus lineages in mammals, which we labeled “Mammal-1” and “Mammal-2”. Notably, basal taxa within the “Mammal-1” lineage, which also includes all known contemporary filoviruses of mammals, disclose associations with Southern Hemisphere continents (Australia, South America) that were largely isolated throughout extensive periods of the Cenozoic Era. These data suggest that filoviruses were present in ancestral mammals inhabiting Gondwanaland (an ancient supercontinent comprised of South America, Africa, India, and Australia) and diversified into at least two major lineages as mammalian populations became compartmentalized in distinct continental regions during the early to mid-Cenozoic. An interesting question is whether the “Mammal-2” group represents filoviruses that evolved in Northern Hemisphere-associated, boreoeutherian mammals (magnorder Boreoeutheria), while “Mammal-1” represents filoviruses that initially evolved in Southern Hemisphere-associated marsupials (infraclass Marsupialia) and xenarthrans (magnorder Xenarthra) before disseminating throughout the globe (possibly in association with volant mammals, i.e., bats).

While several previous studies have described EVE diversity in vertebrates [ 50 , 53 , 100 ], our investigation is significantly larger in scale and breadth. Furthermore, for non-retroviral viruses, we introduced a higher level of order to EVE data, making use of the DIGS framework to discriminate orthologous versus paralogous EVE loci and to identify intra-genomically amplified EVE lineages. This allowed us to establish a panoramic view of germline incorporation by non-retroviral viruses during vertebrate evolution (Fig.  7 ). Furthermore, discriminating orthologous and paralogous EVEs enabled us to infer the rates of germline infiltration by non-retroviral virus families with greater accuracy than in previous studies (Figs.  6 and 7 ). Notably, we did not find strong evidence for a reduced rate of germline infiltration in avian genomes, as suggested by a previous study [ 101 ]. Incorporation of DNArt viruses is higher in birds than in any other vertebrate class (Fig.  6 ), and while acquisition of EVEs derived from ssRNA-ve viruses does appear to be limited in this group, they closely resemble reptiles in this respect. Avian hosts also appear similar overall to reptiles with regard to ERV RT copy number (Fig.  3 ).

The absence, or near absence, of many virus groups from our catalog of vertebrate EVEs is noteworthy. For example, many distinct families of ssRNA+ve viruses infect vertebrates [ 65 ], but of these, only flaviviruses appear to have generated any EVEs (Fig.  4 ), and these only occur quite rarely compared to other virus groups (Table  3 ). Furthermore, EVEs derived from viruses with circular RNA genomes, or double-stranded RNA genomes, were not detected at all. EVEs derived from all other virus genome types do occur in the vertebrate germline, but their distribution is patchy and limited to a relatively small number of virus families (Figs.  4 and 7 ). For example, among ssRNA-ve viruses, only mononegaviruses were detected, with no evidence for germline integration of segmented ssRNA-ve viruses such as orthomyxoviruses and bunyaviruses. The limited presence of EVEs originating from specific vertebrate virus groups within vertebrate genomes implies that certain aspects of these groups’ biology in vertebrate hosts restrict their ability to be integrated into the germline. These aspects likely include cell tropism (whether germline cells are typically infected) and the site of cellular replication (with viruses that replicate in the nucleus being more likely to be incorporated) [ 21 ]. Additionally, vertebrate germline cells may present strong intrinsic barriers to the replication of certain virus groups.

The most ancient EVE identified in our study predates the divergence of birds and reptiles, nearly 300 Mya. More ancient EVEs will likely be difficult to identify due to sequence degradation. However, it is conceivable that progress in genome sequencing, EVE screening and virus discovery will enable the implementation of more sensitive screens that yield even older EVEs, potentially predating the emergence of vertebrates.

Besides identifying EVEs, our screen identified several sequences that appeared likely to derive from exogenous viruses (Additional file 5 : Table S1). These overwhelmingly represented DNA virus families that contain at least some species that are capable of establishing chronic, latent infections and/or integrating into host cell chromosomes [ 102 , 103 , 104 ]. Potentially, the occurrence of contaminating DNA derived from specific exogenous virus groups in WGS data might serve as an indication of their tendency to establish chronic or latent infections. Our screen also uncovered virus-like sequences that seemed likely to derive from diet-related contamination of WGS data, either by viruses or EVEs (see Additional file 3 : Fig. S3). It is worth noting that, in our data, these sequences stood out as potential contaminants because they derived from virus groups that infect plants, not animals (e.g., Geminiviridae , Potyviridae ). However, similar contaminants might be more difficult to identify if they derived from animal viruses or EVEs, as may be expected to occur in diet-related contamination of WGS assemblies of carnivorous or insectivorous animal species.

The catalog of EVE loci generated here provides a foundation for further investigations in virology, genomics, and human health. From the virology perspective, EVEs provide information about the long-term evolutionary history of viruses, which greatly influences how we understand their biology. As well as enabling future studies of vertebrate “paleoviruses”, the EVE catalog can inform efforts to identify and characterize new viruses (both by providing ecological and evolutionary insights [ 76 ] and by helping identify “false positive” hits arising from genomic DNA) [ 105 ].

From the genomics side, EVEs are of interest due to their important roles in physiology and genome evolution [ 106 ]. These include roles in antiviral immunity [ 11 , 107 , 108 ] as well as a diverse range of other physiological processes [ 18 , 83 , 84 , 109 , 110 , 111 , 112 ]. Notably, we identified numerous non-retroviral EVEs encoding ORFs longer than 300 aa (Additional file 7 : Fig. S6), indicating that their coding capacity has been conserved during vertebrate evolution. One of these—a chuvirus-derived L-protein identified in livebearers—adds to previous evidence that viral RdRp sequences have been co-opted by vertebrate genomes [ 71 ]. Mapping of EVE loci can also inform efforts to develop new medical treatments—in a recent study, EVE loci identified using DIGS were used to identify potential genomic safe harbors for human transgene therapy applications [ 33 ].

The EVE screen performed here has several important limitations. Firstly, it relied on published WGS data generated for extant species. Secondly, our results have likely been influenced by aspects of our screening configuration, such as the composition of the probe set with respect to viral taxa and polypeptide probe length [ 113 , 114 ]. This might mean that we failed to detect some of the potentially recognizable EVE loci present in our TDb. For example, counts of RT-encoding ERV loci were found to be generally lower in ray-finned fish and jawless fish (Fig.  3 ), but previous studies have shown that RT loci related to other families of reverse-transcribing virus, such as metaviruses (family Metaviridae ) [ 115 ] and “lokiretroviruses” [ 116 ] are relatively common in these hosts. These would likely have been missed in our search because they were not included in our RT RSL. Finally, previous studies have indicated that vertebrate genomes contain EVEs that lack any clear homology to extant viruses [ 117 ], and these would not be detected using a sequence similarity-based approach.

As vertebrate genome sequencing progresses, further opportunities to identify novel EVEs will arise, since: (i) any novel genome could in theory contain a lineage-specific EVE and (ii) ongoing characterization of exogenous virus diversity may allow for detection of previously undetectable EVEs, by providing new probe sequences. The DIGS project created here, which is openly available online, can be reused to accommodate newly sequenced vertebrate genomes (TDb expansion) and newly discovered vertebrate virus diversity (RSL/probe set expansion). In addition, similar projects can readily be created to screen for EVEs in other host groups.

The use of DIGS is not limited to investigations of EVEs. DIGS can be used to investigate any sufficiently conserved genome feature lurking within the dark genome, including both coding and non-coding elements (Table  1 ). Many of the most interesting genes have evolved relatively rapidly and are difficult to annotate reliably using automated approaches [ 118 ]. Furthermore, even relatively conserved genes may be incompletely annotated by automated pipelines. DIGS has previously been used to broadly survey the distribution of interferon stimulated genes in mammals [ 30 ] and for in-depth investigation of specific genes and gene families, such as OAS1 [ 27 ] and APOBEC3 [ 28 ]. While DIGS is best suited to investigations of genome features that comprise a single contiguous unit and contain relatively long, easily recognized regions, it can also be used to investigate genome features that are shorter or are comprised of several short sub-components, provided that a careful approach is used. For example, when investigating interferon lambda (IFNL) genes, which are expressed from multiple, short exons, we included conserved flanking features in our RSL and probe set [ 30 ] (Table  1 ). This enabled more confident matching of IFNL exons based on their positional relationships relative to conserved markers. We have also used DIGS in functional genomics studies to investigate the locations of short nucleotide motifs identified in binding assays (e.g., ChIP-seq) relative to other genomic features such as ERVs [ 25 , 26 ].

The framework described here for implementing DIGS could be further developed and improved, for example, by including the option to use other sequence similarity search tools, such as Diamond [ 119 ] and ElasticBLAST [ 120 ], or RNA structure-based search tools such as INFERNAL [ 121 ]. Integrating with functional genomics resources could add further dimensions to the kinds of investigations that may be performed using DIGS [ 122 ].

We demonstrate how a relational database management system can be linked to a similarity search-based screening pipeline to investigate the dark genome in silico. Using this approach, we catalog and analyze EVEs throughout vertebrate genomes, providing a broad range of novel insights into the evolution of ancient viruses and their interactions with host species.

Materials and methods

Whole genome sequence and taxonomic data

Whole genome shotgun (WGS) sequence assemblies of 874 vertebrate species were obtained from the NCBI genomes resource [ 123 ]. Taxonomic data for the vertebrate species included in our screen and the viruses in our reference sequence library were obtained from the NCBI taxonomy database [ 124 ], using PERL scripts included with the DIGS tool package.

Database-integrated screening for RT-encoding ERVs

An RT RSL was collated to represent diversity within the Retroviridae. We included representatives of previously identified ERV lineages and exogenous retrovirus species. A subset of these sequences was used as probes in similarity search-based screens [ 63 ]. For initial screening, we used a bit score cutoff of 60. For comparisons of ERV RT copy number across species, we filtered initial results using a more conservative bit score cutoff of 90. Our previous DIGS-based studies of ERVs have shown that spurious matches (i.e., to sequences other than retroviral RTs) do not arise when this cutoff is applied, although some genuine ERV RT hits may be excluded [ 38 ].
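The two-tier thresholding described above can be illustrated with a minimal Python sketch. The hit records, field names (`organism`, `bitscore`), and species labels are hypothetical placeholders rather than the DIGS tool's actual schema, and the sketch assumes that both cutoffs are inclusive.

```python
from collections import Counter

INITIAL_CUTOFF = 60        # permissive threshold used for initial screening
CONSERVATIVE_CUTOFF = 90   # stricter threshold used for copy-number comparisons


def filter_hits(hits, cutoff):
    """Return the hits whose bit score meets or exceeds the cutoff (assumed inclusive)."""
    return [hit for hit in hits if hit["bitscore"] >= cutoff]


def rt_copy_number(hits, cutoff=CONSERVATIVE_CUTOFF):
    """Count RT hits per species after applying the conservative cutoff."""
    return Counter(hit["organism"] for hit in filter_hits(hits, cutoff))


# Hypothetical screening output, for illustration only.
example_hits = [
    {"organism": "Species_A", "bitscore": 152.0},
    {"organism": "Species_A", "bitscore": 71.5},
    {"organism": "Species_B", "bitscore": 95.0},
    {"organism": "Species_B", "bitscore": 58.0},
]

initial = filter_hits(example_hits, INITIAL_CUTOFF)
counts = rt_copy_number(example_hits)
```

Keeping the permissive set in the database and applying the stricter cutoff only at query time means that borderline hits remain available for inspection without inflating cross-species copy-number comparisons.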

Database-integrated screening for non-retroviral EVEs

We obtained an RSL representing the proteome of eukaryotic viruses from the NCBI virus genomes database [ 56 ]. We supplemented this with sequences likely to cross-match to virus probes during screening. These included the Teratorn transposon found in fish, which contains multiple alloherpesvirus-derived genes [ 125 ]. We included the polypeptide sequences of these genes, obtained from the subtype 1 Teratorn reference (Accession #: LC199500), in our RSL. We also included representatives of the maverick/polinton lineage of transposons, derived from sequences defined in a previous study, since these elements are now recognized to derive from a group of midsize eukaryotic linear dsDNA viruses referred to as “polinton-like viruses” or “adintoviruses” [ 59 , 60 , 61 ]. Probes constituted a subset of 685 sequences contained within our RSL and incorporated polypeptide sequences representing all major protein-coding genes of representative species of all recognized or provisional vertebrate virus families. We used a bit score cutoff of 60 as a threshold for counting non-retroviral EVE loci. This threshold was established through previous experience searching for non-retroviral EVEs using DIGS [ 31 , 32 , 35 , 36 ]. Experience from previous studies had shown that nearly 100% of matches with bit scores ≥ 60 were either virus-derived or represented genuine similarity between virus genes and their cellular orthologs. By contrast, investigation of a subset of 100 hits with bit scores of 40–59 showed that ~ 50% could not be confidently confirmed as having a viral origin (data not shown).

Artefactual hits to host DNA can occur in EVE screens since some virus genomes contain genes that have cellular homologs [ 126 ], and some virus genomes contain captured host DNA [ 127 ]. To distinguish host from virus-derived DNA in these cases, we exported such hits from the screening database and virtually translated them to obtain a polypeptide sequence. We then used the translated sequences as query input to online BLAST searches of GenBank’s non-redundant (nr) database. If searches revealed closer matching to host genes than to known viral genes, the input sequences were assumed to be host derived. Wherever this occurred, we incorporated representatives of the matching host sequences into the RSL, so that they would be recognized as host hits on reclassification. By updating hit classifications in this way, we could progressively filter out host-derived hits from our final screening output.
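The iterative reclassification logic can be sketched as follows. This is an illustration only: the `similarity` function is a toy stand-in built on Python's `difflib`, not the BLAST-based comparison that DIGS performs, and all sequences and names are invented placeholders.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Toy stand-in for a BLAST score (an assumption; DIGS uses BLAST)."""
    return SequenceMatcher(None, a, b).ratio()


def classify(hit_seq, rsl):
    """Assign a hit to the RSL entry it matches most closely."""
    best = max(rsl, key=lambda entry: similarity(hit_seq, entry["seq"]))
    return best["name"], best["origin"]


def reclassify(hits, rsl):
    """Re-run classification for every stored hit against the current RSL."""
    return {hit_id: classify(seq, rsl) for hit_id, seq in hits.items()}


# Hypothetical reference library and hits.
rsl = [{"name": "virus_gene_X", "origin": "virus", "seq": "MKTLLDGSWAVNRE"}]
hits = {"hit1": "MKTLLDGSWAVNRE", "hit2": "MSTPPEGAWQLNKD"}

before = reclassify(hits, rsl)   # both hits can only be labelled as viral

# A follow-up search suggests hit2 is host-derived, so a representative
# host sequence is added to the RSL and all hits are reclassified.
rsl.append({"name": "host_gene_Y", "origin": "host", "seq": "MSTPPEGAWQLNKD"})
after = reclassify(hits, rsl)
```

Because classification always depends on the current RSL, each host sequence added to the library removes an entire class of artefactual hits on the next reclassification pass.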

Filtering sequences derived from exogenous viruses

Sequences derived from exogenous viruses are occasionally incorporated into WGS assemblies. We used SQL queries to identify and exclude these sequences based on hit characteristics. Where hits derived from virus species or species groups that have been sequenced previously, they could be discriminated on the basis of sequence identity (i.e., 98–99% nucleotide-level identity to known viruses). The “extract start” field could be used to identify sequences that lacked flanking genomic sequences, indicating a potential exogenous origin. We also examined the virtually translated sequences to look for evidence of long-term presence in the host germline (e.g., stop codons, frameshifting mutations).
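A query of this kind might look like the sketch below, which uses an in-memory SQLite database from Python. The table layout, column names, and rows are hypothetical and do not reproduce the DIGS screening database schema. The sketch also assumes that an `extract_start` of 1 marks a hit beginning at the first base of its contig (and therefore lacking upstream flanking sequence), and that 98% identity is a suitable threshold.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE hits (
           record_id     INTEGER PRIMARY KEY,
           organism      TEXT,
           assigned_name TEXT,
           identity      REAL,     -- percent nucleotide identity to best reference
           extract_start INTEGER   -- start coordinate of the hit on its contig
       )"""
)

# Hypothetical rows, for illustration only.
rows = [
    ("Species_A", "virus_group_1", 99.1, 5400),    # near-identical to a known virus
    ("Species_A", "virus_group_2", 72.3, 120533),  # divergent, embedded in a contig
    ("Species_B", "virus_group_3", 85.0, 1),       # starts at the contig edge
]
conn.executemany(
    "INSERT INTO hits (organism, assigned_name, identity, extract_start) "
    "VALUES (?, ?, ?, ?)",
    rows,
)

# Flag hits that are candidates for exogenous origin.
QUERY = """
    SELECT record_id, organism, assigned_name
    FROM hits
    WHERE identity >= ? OR extract_start = 1
    ORDER BY record_id
"""
candidates = conn.execute(QUERY, (98.0,)).fetchall()
```

Hits flagged in this way would still need manual follow-up, for example by checking translated sequences for stop codons or frameshifts as described above.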

Filtering of cross-matching retrovirus-derived sequences

Hits that match more closely to virus genomes than to host DNA, and are clearly inserted into host DNA, are most likely bona fide EVE sequences. However, they may not necessarily be non-retroviral EVEs because some filoviruses and arenaviruses (family Arenaviridae) contain glycoprotein genes that are distantly related to those found in certain retroviruses [ 128 , 129 ]. When such hits were investigated and found to correspond to ERVs (established through the presence of proviral genome features adjacent to the hit), we included the putative sequences of glycoproteins encoded by these ERVs into our RSL and reclassified hits, so that spurious matches could be recognized as ERV-derived.

Genomic analysis

Previous studies of presence/absence patterns have shown that non-retroviral EVEs are present in many genomes due to orthology (ancient insertions) rather than paralogy (recent independent insertion) [ 32 , 35 , 36 , 77 ]. To differentiate orthologs of previously described EVEs from newly identified paralogs, we expanded our RSL to include consensus/reference sequences representing unique EVE loci. This set of EVE loci comprised insertions identified in previous studies [ 32 , 35 , 36 , 78 , 130 ], as well as a set of clearly novel EVEs identified in the present screen. For high-copy-number amplified lineages within this set (see Additional file 9 : Fig. S8), we only included a single reference sequence, rather than attempting to represent each individual ortholog, since it was clear that these elements derive from a single germline incorporation event. EVEs were considered novel if: (i) they derived from a virus group not previously reported in the host group in which they were identified or (ii) they occurred in species only distantly related to species in which similar EVEs had been identified previously (e.g., an entirely distinct host class). Whenever novel EVEs were defined, results were reclassified using the updated RSL (see Fig. 2). Orthologs of previously identified EVEs could be inferred by using SQL queries to summarize screening results, as they showed high similarity to these EVE sequences and occurred in host species relatively closely related to the species in which the putatively orthologous EVEs had previously been identified. By contrast, novel paralogs either showed only limited similarity to previously identified EVE sequences or occurred in distantly related host species. This approach to discriminating between paralogs and orthologs has limitations but can guide further investigations that use more reliable approaches (e.g., via investigation of flanking sequences, or phylogeny) to infer orthology [ 35 ].
Se-Al (version 2.0a11) was used to inspect multiple sequence alignments of EVEs and genomic flanking sequences. Minimum age estimates were obtained for orthologous EVEs by using host species divergence time estimates collated in TimeTree [ 89 ]. We identified open reading frames and open coding regions within EVEs using PERL scripts available on request.
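One plausible reading of the minimum age calculation is sketched below: an ortholog shared by several host species must predate their most recent common ancestor, so its minimum age is taken as the oldest pairwise divergence among the hosts that carry it. This reading is an assumption, as is the lookup-table layout, and the divergence times shown are arbitrary placeholders rather than TimeTree estimates.

```python
from itertools import combinations


def minimum_age(species, divergence_mya):
    """Oldest pairwise divergence (in millions of years) among host species
    sharing an orthologous EVE; None if fewer than two species carry it."""
    if len(species) < 2:
        return None  # a single host gives no lower bound by this method
    return max(
        divergence_mya[frozenset(pair)]
        for pair in combinations(sorted(species), 2)
    )


# Placeholder divergence times (NOT real estimates), keyed by species pair.
divergence_mya = {
    frozenset({"Species_A", "Species_B"}): 10.0,
    frozenset({"Species_A", "Species_C"}): 40.0,
    frozenset({"Species_B", "Species_C"}): 40.0,
}

age_ab = minimum_age({"Species_A", "Species_B"}, divergence_mya)
age_abc = minimum_age({"Species_A", "Species_B", "Species_C"}, divergence_mya)
```

The estimate is a lower bound only: the insertion could be considerably older than the divergence of the sampled hosts.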

Phylogenetic analysis

Phylogenies were reconstructed using the maximum likelihood approach implemented in RAxML (version 8.2.12) [ 131 ], with model parameters selected using the IQ-TREE model selection function [ 132 ]. Support for phylogenies was assessed via 1000 non-parametric bootstrap replicates. A time-calibrated vertebrate phylogeny was obtained via TimeTree, an open database of species divergence time estimates [ 89 ]. To determine the germline infiltration rate, we divided the total number of distinct EVE orthologs identified in each vertebrate class by the total amount of branch length sampled for that class (obtained from the time-calibrated phylogeny).
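The rate calculation reduces to a single division per vertebrate class, as in the minimal sketch below. The function name, the units (millions of years of branch length), and the example figures are illustrative assumptions and are not results from this study.

```python
def infiltration_rate(n_orthologs, total_branch_length_my):
    """Distinct EVE orthologs per million years of sampled branch length."""
    if total_branch_length_my <= 0:
        raise ValueError("total branch length must be positive")
    return n_orthologs / total_branch_length_my


# Placeholder inputs for one host class (NOT values from the study):
# 12 distinct orthologs across 3000 million years of summed branch length.
example_rate = infiltration_rate(12, 3000.0)
```

Normalizing by summed branch length rather than by species count accounts for classes that are sampled more densely or that span deeper evolutionary time.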

Application of standardized nomenclature to EVE loci

We assigned each non-retroviral EVE identified in our study a unique identifier (ID), following a convention developed for ERVs [ 133 ]. Each ID is constructed from three components. The first component is a classifier denoting the type of EVE. The second component comprises: (i) the name of the taxonomic group of viruses the element derived from and (ii) a numeric ID that uniquely identifies a specific integration locus or, for multicopy lineages, a unique founding event. The final component denotes the taxonomic distribution of the element. This approach has been applied in several previous studies of vertebrate EVEs [ 31 , 32 , 35 , 78 ] and we maintained consistency with these studies with respect to the numeric ID. Where our study revealed new information about the taxonomic relationship of an EVE to contemporary viruses, or its distribution across taxa, the ID was updated accordingly.
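The three-component structure can be illustrated with a small helper. The separator layout used here (hyphens between components, a period before the numeric ID) is an assumption modelled on the ERV convention, and the component values are hypothetical placeholders rather than IDs assigned in this study.

```python
def build_eve_id(classifier, virus_group, numeric_id, host_taxon):
    """Join the three components: classifier, group.number, taxonomic distribution."""
    if numeric_id < 1:
        raise ValueError("numeric ID must be a positive integer")
    return f"{classifier}-{virus_group}.{numeric_id}-{host_taxon}"


def parse_eve_id(eve_id):
    """Split an ID built by build_eve_id back into its parts.
    Assumes the classifier and virus group name contain no hyphens,
    and the virus group name contains no periods."""
    classifier, middle, host_taxon = eve_id.split("-", 2)
    virus_group, number = middle.rsplit(".", 1)
    return classifier, virus_group, int(number), host_taxon


# Hypothetical components, for illustration only.
example_id = build_eve_id("EVE", "ExampleVirusGroup", 1, "ExampleHostTaxon")
```

Keeping the numeric ID stable while allowing the virus group and host taxon components to be updated reflects the practice described above, in which IDs are revised as new taxonomic information emerges.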

Availability of data and materials

Source code for the DIGS tool is freely available under the GNU AGPL-3.0 license:

GitHub: https://github.com/giffordlabcvr/DIGS-tool [ 134 ]

Zenodo: 10.5281/zenodo.10948938 [ 135 ]

All data generated in this study are openly available via GitHub:

https://github.com/giffordlabcvr/DIGS-for-EVEs [ 136 ]

Margulies EH, Birney E. Approaches to comparative sequence analysis: towards a functional view of vertebrate genomes. Nat Rev Genet. 2008;9(4):303–13.

Cheng JF, Priest JR, Pennacchio LA. Comparative genomics: a tool to functionally annotate human DNA. Methods Mol Biol. 2007;366:229–51.

Nobrega MA, Pennacchio LA. Comparative genomic analysis as a tool for biological discovery. J Physiol. 2004;554(Pt 1):31–9.

Guan D, Lazar MA. Shining light on dark matter in the genome. Proc Natl Acad Sci U S A. 2019;116(50):24919–21.

Wright BW, et al. The dark proteome: translation from noncanonical open reading frames. Trends Cell Biol. 2022;32(3):243–58.

Eisenstein M. Drug hunters uncloak the non-coding ‘hidden’ genome. Nat Biotechnol. 2021;39(10):1169–71.

Katzourakis A, Gifford RJ. Endogenous viral elements in animal genomes. PLoS Genet. 2010;6(11):e1001191.

Chiba S, et al. Widespread endogenization of genome sequences of non-retroviral RNA viruses into plant genomes. PLoS Pathog. 2011;7(7):e1002146.

Diop SI, et al. Tracheophyte genomes keep track of the deep evolution of the Caulimoviridae. Sci Rep. 2018;8(1):572.

Soucy SM, Huang J, Gogarten JP. Horizontal gene transfer: building the web of life. Nat Rev Genet. 2015;16(8):472–82.

Parrish NF, Tomonaga K. Endogenized viral sequences in mammals. Curr Opin Microbiol. 2016;31:176–83.

de Tomás C, Vicient CM. Genome-wide identification of reverse transcriptase domains of recently inserted endogenous plant pararetrovirus (Caulimoviridae). Front Plant Sci. 2022;13:1011565.

Gong Z, Zhang Y, Han GZ. Molecular fossils reveal ancient associations of dsDNA viruses with several phyla of fungi. Virus Evol. 2020;6(1):veaa008.

Bellas C, et al. Large-scale invasion of unicellular eukaryotic genomes by integrating DNA viruses. Proc Natl Acad Sci U S A. 2023;120(16):e2300465120.

Dewannieux M, Heidmann T. Endogenous retroviruses: acquisition, amplification and taming of genome invaders. Curr Opin Virol. 2013;3(6):646–56.

Geis FK, Goff SP. Silencing and transcriptional regulation of endogenous retroviruses: an overview. Viruses. 2020;12(8):884.

Srinivasachar Badarinarayan S, Sauter D. Switching sides: how endogenous retroviruses protect us from viral infections. J Virol. 2021;95(12):e02299–20.

Fujino K, et al. A human endogenous bornavirus-like nucleoprotein encodes a mitochondrial protein associated with cell viability. J Virol. 2021;95(14):e0203020.

Ophinni Y, et al. piRNA-guided CRISPR-like immunity in eukaryotes. Trends Immunol. 2019;40(11):998–1010.

Patel MR, Emerman M, Malik HS. Paleovirology - ghosts and gifts of viruses past. Curr Opin Virol. 2011;1(4):304–9.

Holmes EC. The evolution of endogenous viral elements. Cell Host Microbe. 2011;10(4):368–77.

Feschotte C, Gilbert C. Endogenous viruses: insights into viral evolution and impact on host biology. Nat Rev Genet. 2012;13(4):283–96.

Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.

Camacho C, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421.

Fernandes LP, et al. A satellite DNA array barcodes chromosome 7 and regulates totipotency via ZFP819. Sci Adv. 2022;8(43):eabp8085.

Enriquez-Gasca R, et al. Co-option of endogenous retroviruses through genetic escape from TRIM28 repression. Cell Rep. 2023;42(6):112625.

Wickenhagen A, et al. A prenylated dsRNA sensor protects against severe COVID-19. Science. 2021;374(6567):eabj3624.

Ito J, Gifford RJ, Sato K. Retroviruses drive the rapid evolution of mammalian APOBEC3 genes. Proc Natl Acad Sci U S A. 2020;117(1):610–8.

Shaw AE, et al. Fundamental properties of the mammalian innate immune system revealed by multispecies comparison of type I interferon responses. PLoS Biol. 2017;15(12):e2004086.

Bamford CGG, et al. Partial gene conversion shapes the emergence of functional novelty in the placental mammal interferon lambda system. In: Infectious diseases through an evolutionary lens. London: British Medical Association House; 2023.

Bamford CGG, et al. Comparative analysis of genome-encoded viral sequences reveals the evolutionary history of flavivirids (family Flaviviridae). Virus Evol. 2022;8(2):veac085.

Campbell MA, Loncar S, Kotin RM, Gifford RJ. Comparative analysis reveals the long-term coevolutionary history of parvoviruses and vertebrates. PLoS Biol. 2022;20(11):e3001867. https://doi.org/10.1371/journal.pbio.3001867 .

Quezada-Ramírez MA, et al. Identification of genome safe harbor loci for human gene therapy based on evolutionary biology and comparative genomics. bioRxiv. 2023:2023.09.08.556857.

Callaway HM, et al. Examination and reconstruction of three ancient endogenous parvovirus capsid protein gene remnants found in rodent genomes. J Virol. 2019;93(6):e01542–18.

Lytras S, Arriagada G, Gifford RJ. Ancient evolution of hepadnaviral paleoviruses and their impact on host genomes. Virus Evol. 2021;7(1):veab012.

Dennis TPW, et al. The evolution, distribution and diversity of endogenous circoviral elements in vertebrate genomes. Virus Res. 2019;262:15–23.

Kambol R, Gatseva A, Gifford RJ. An endogenous lentivirus in the germline of a rodent. Retrovirology. 2022;19(1):30.

Zhu H, Gifford RJ, Murcia PR. Distribution, diversity, and evolution of endogenous retroviruses in perissodactyl genomes. J Virol. 2018;92(23):e00927–18.

Blanco-Melo D, Gifford RJ, Bieniasz PD. Co-option of an endogenous retrovirus envelope for host defense in hominid ancestors. Elife. 2017;6:e22519.

Blanco-Melo D, Gifford RJ, Bieniasz PD. Reconstruction of a replication-competent ancestral murine endogenous retrovirus-L. Retrovirology. 2018;15(1):34.

Pearson WR, Mackey AJ. Using SQL databases for sequence similarity searching and analysis. Curr Protoc Bioinformatics. 2017;59:9.4.1–9.4.22.

Belyi VA, Levine AJ, Skalka AM. Sequences from ancestral single-stranded DNA viruses in vertebrate genomes: the Parvoviridae and Circoviridae are more than 40 to 50 million years old. J Virol. 2010;84(23):12458–62.

Heusinger E, et al. Early vertebrate evolution of the host restriction factor tetherin. J Virol. 2015;89(23):12154–65.

Blanco-Melo D, Venkatesh S, Bieniasz PD. Origins and evolution of tetherin, an orphan antiviral gene. Cell Host Microbe. 2016;20(2):189–201.

Waterhouse RM, et al. OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs. Nucleic Acids Res. 2013;41(Database issue):D358–65.

Cunningham F, et al. Ensembl 2015. Nucleic Acids Res. 2015;43(Database issue):D662–9.

Gifford RJ. Database-integrated genome screening (DIGS) tool. 2022. Available from: https://giffordlabcvr.github.io/DIGS-tool/ .

Belshaw R, et al. High copy number in human endogenous retrovirus families is associated with copying mechanisms in addition to reinfection. Mol Biol Evol. 2005;22(4):814–7.

Johnson WE. Origins and evolutionary consequences of ancient endogenous retroviruses. Nat Rev Microbiol. 2019;17(6):355–70.

Hayward A, Grabherr M, Jern P. Broad-scale phylogenomics provides insights into retrovirus-host evolution. Proc Natl Acad Sci U S A. 2013;110(50):20146–51.

Xiong Y, Eickbush TH. Origin and evolution of retroelements based upon their reverse transcriptase sequences. EMBO J. 1990;9(10):3353–62.

Tristem M. Identification and characterization of novel human endogenous retrovirus families by phylogenetic screening of the human genome mapping project database. J Virol. 2000;74(8):3715–30.

Hayward A, Cornwallis CK, Jern P. Pan-vertebrate comparative genomics unmasks retrovirus macroevolution. Proc Natl Acad Sci U S A. 2015;112(2):464–9.

Han GZ. Extensive retroviral diversity in shark. Retrovirology. 2015;12:34.

Xu X, et al. Endogenous retroviruses of non-avian/mammalian vertebrates illuminate diversity and deep history of retroviruses. PLoS Pathog. 2018;14(6):e1007072.

Brister JR, et al. NCBI viral genomes resource. Nucleic Acids Res. 2015;43(Database issue):D571–7.

Sharma V, et al. Large-scale survey reveals pervasiveness and potential function of endogenous geminiviral sequences in plants. Virus Evol. 2020;6(2):veaa071.

Tanne E, Sela I. Occurrence of a DNA sequence of a non-retro RNA virus in a host plant genome and its expression: evidence for recombination between viral and host RNAs. Virology. 2005;332(2):614–22.

Koonin EV, Krupovic M, Yutin N. Evolution of double-stranded DNA viruses of eukaryotes: from bacteriophages to transposons to giant viruses. Ann N Y Acad Sci. 2015;1341(1):10–24.

Barreat JGN, Katzourakis A. Phylogenomics of the Maverick virus-like mobile genetic elements of vertebrates. Mol Biol Evol. 2021;38(5):1731–43.

Starrett GJ, et al. Adintoviruses: a proposed animal-tropic family of midsize eukaryotic linear dsDNA (MELD) viruses. Virus Evol. 2021;7(1):veaa055.

Inoue Y, Takeda H. Teratorn and its relatives - a cross-point of distinct mobile elements, transposons and viruses. Front Vet Sci. 2023;10:1158023.

Gifford RJ. DIGS-for-EVEs. 2023. Available from: https://github.com/giffordlabcvr/DIGS-for-EVEs .

Harvey E, et al. Divergent hepaciviruses, delta-like viruses and a chu-like virus in Australian marsupial carnivores (dasyurids). Virus Evol. 2023;9(2):vead061.

Harvey E, Holmes EC. Diversity and evolution of the animal virome. Nat Rev Microbiol. 2022;20(6):321–34.

Ariel E. Viruses in reptiles. Vet Res. 2011;42(1):100.

Waller SJ, et al. Cloacal virome of an ancient host lineage - the tuatara (Sphenodon punctatus) - reveals abundant and diverse diet-related viruses. Virology. 2022;575:43–53.

Soto E, et al. First isolation of a novel aquatic flavivirus from Chinook Salmon (Oncorhynchus tshawytscha) and its in vivo replication in a piscine animal model. J Virol. 2020;94(15):e00337–20.

Koda SA, et al. Complete genome sequences of infectious spleen and kidney necrosis virus isolated from farmed albino rainbow sharks Epalzeorhynchos frenatum in the United States. Virus Genes. 2021;57(5):448–52.

Harding EF, et al. Revealing the uncharacterised diversity of amphibian and reptile viruses. ISME Commun. 2022;2(1):95.

Horie M, et al. An RNA-dependent RNA polymerase gene in bat genomes derived from an ancient negative-strand RNA virus. Sci Rep. 2016;6(1):25873.

Ho ALFC, Pruett CL, Lin J. Phylogeny and biogeography of Poecilia (Cyprinodontiformes: Poeciliinae) across Central and South America based on mitochondrial and nuclear DNA markers. Mol Phylogenet Evol. 2016;101:32–45.

Aswad A, Katzourakis A. The first endogenous herpesvirus, identified in the tarsier genome, and novel sequences from primate rhadinoviruses and lymphocryptoviruses. PLoS Genet. 2014;10(6):e1004332.

Aswad A, et al. Evolutionary history of endogenous human herpesvirus 6 reflects human migration out of Africa. Mol Biol Evol. 2021;38(1):96–107.

Liu X, et al. Endogenization and excision of human herpesvirus 6 in human genomes. PLoS Genet. 2020;16(8):e1008915.

Dennis TPW, et al. Insights into circovirus host range from the genomic fossil record. J Virol. 2018;92(16):e00145–18.

Suh A, et al. Early Mesozoic coexistence of amniotes and Hepadnaviridae. PLoS Genet. 2014;10(12):e1004559.

Kawasaki J, et al. 100-My history of bornavirus infections hidden in vertebrate genomes. Proc Natl Acad Sci U S A. 2021;118(20):e2026235118.

Horie M, et al. Endogenous non-retroviral RNA virus elements in mammalian genomes. Nature. 2010;463(7277):84–7.

Hyndman TH, et al. Isolation and molecular identification of Sunshine virus, a novel paramyxovirus found in Australian snakes. Infect Genet Evol. 2012;12(7):1436–46.

Mari Saez A, et al. Investigating the zoonotic origin of the West African Ebola epidemic. EMBO Mol Med. 2015;7(1):17–23.

Leroy EM, et al. Multiple Ebola virus transmission events and rapid decline of central African wildlife. Science. 2004;303(5656):387–90.

Edwards MR, et al. Conservation of structure and immune antagonist functions of filoviral VP35 homologs present in microbat genomes. Cell Rep. 2018;24(4):861–872.e6.

Kondoh T, et al. Putative endogenous filovirus VP35-like protein potentially functions as an IFN antagonist but not a polymerase cofactor. PLoS One. 2017;12(10):e0186450.

Gorbalenya AE, Lauber C. Phylogeny of viruses. In: Reference module in biomedical sciences. 2017.

Li Y, et al. Endogenous viral elements in shrew genomes provide insights into pestivirus ancient history. Mol Biol Evol. 2022;39(10):msac190.

Qin XC, et al. A tick-borne segmented RNA virus contains genome segments derived from unsegmented viral ancestors. Proc Natl Acad Sci U S A. 2014;111(18):6744–9.

Krupovic M, Koonin EV. Polintons: a hotbed of eukaryotic virus, transposon and plasmid evolution. Nat Rev Microbiol. 2015;13(2):105–15.

Kumar S, et al. TimeTree 5: an expanded resource for species divergence times. Mol Biol Evol. 2022;39(8):msac174.

Birney E, et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007;447(7146):799–816.

Schattner P. Automated querying of genome databases. PLoS Comput Biol. 2007;3(1):e1.

Obbard DJ. Expansion of the metazoan virosphere: progress, pitfalls, and prospects. Curr Opin Virol. 2018;31:17–23.

Zhang YZ, Shi M, Holmes EC. Using metagenomics to characterize an expanding virosphere. Cell. 2018;172(6):1168–72.

Koonin EV, Dolja VV. Virus world as an evolutionary network of viruses and capsidless selfish elements. Microbiol Mol Biol Rev. 2014;78(2):278–303.

Reus K, et al. HERV-K(OLD): ancestor sequences of the human endogenous retrovirus family HERV-K(HML-2). J Virol. 2001;75(19):8917–26.

Pavlícek A, et al. Processed pseudogenes of human endogenous retroviruses generated by LINEs: their integration, stability, and distribution. Genome Res. 2002;12(3):391–9.

Mahanty S, Bray M. Pathogenesis of filoviral haemorrhagic fevers. Lancet Infect Dis. 2004;4(8):487–98.

Taylor DJ, et al. Evidence that ebolaviruses and cuevaviruses have been diverging from marburgviruses since the Miocene. PeerJ. 2014;2:e556.

Carroll SA, et al. Molecular evolution of viruses of the family Filoviridae based on 97 whole-genome sequences. J Virol. 2013;87(5):2608–16.

Kryukov K, et al. Systematic survey of non-retroviral virus-like elements in eukaryotic genomes. Virus Res. 2019;262:30–6.

Cui J, et al. Low frequency of paleoviral infiltration across the avian phylogeny. Genome Biol. 2014;15(12):539.

Osterrieder N, Wallaschek N, Kaufer BB. Herpesvirus genome integration into telomeric repeats of host cell chromosomes. Annu Rev Virol. 2014;1(1):215–35.

McBride AA, Warburton A. The role of integration in oncogenic progression of HPV-associated cancers. PLoS Pathog. 2017;13(4):e1006211.

Janovitz T, et al. Parvovirus B19 integration into human CD36+ erythroid progenitor cells. Virology. 2017;511:40–8.

Brait N, et al. A tale of caution: how endogenous viral elements affect virus discovery in transcriptomic data. Virus Evol. 2023;10(1):vead088.

Frank JA, Feschotte C. Co-option of endogenous viral sequences for host cell function. Curr Opin Virol. 2017;25:81–9.

Aswad A, Katzourakis A. Paleovirology and virally derived immunity. Trends Ecol Evol. 2012;27(11):627–36.

Bravo A, et al. Antiviral activity of an endogenous parvoviral element. Viruses. 2023;15(7):1420.

Lavialle C, et al. Paleovirology of ‘syncytins’, retroviral env genes exapted for a role in placentation. Philos Trans R Soc Lond B Biol Sci. 2013;368(1626):20120507.

Valencia-Herrera I, et al. Molecular properties and evolutionary origins of a parvovirus-derived myosin fusion gene in guinea pigs. J Virol. 2019;93(17):e00404–19.

Pastuzyn ED, et al. The neuronal gene arc encodes a repurposed retrotransposon gag protein that mediates intercellular RNA transfer. Cell. 2018;172(1–2):275–288.e18.

Koonin EV, Krupovic M. The depths of virus exaptation. Curr Opin Virol. 2018;31:1–8.

Hu G, Kurgan L. Sequence similarity searching. Curr Protoc Protein Sci. 2019;95(1):e71.

Pearson WR. An introduction to sequence similarity (“homology”) searching. Curr Protoc Bioinformatics. 2013;Chapter 3:Unit3.1.

Miller K, et al. Identification of multiple Gypsy LTR-retrotransposon lineages in vertebrate genomes. J Mol Evol. 1999;49(3):358–66.

Wang J, Han GZ. A sister lineage of sampled retroviruses corroborates the complex evolution of retroviruses. Mol Biol Evol. 2021;38(3):1031–9.

Kojima S, et al. Virus-like insertions with sequence signatures similar to those of endogenous nonretroviral RNA viruses in the human genome. Proc Natl Acad Sci U S A. 2021;118(5):e2010758118.

Bruno M, Mahgoub M, Macfarlan TS. The arms race between KRAB-zinc finger proteins and endogenous retroelements and its impact on mammals. Annu Rev Genet. 2019;53:393–416.

Buchfink B, Reuter K, Drost HG. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 2021;18(4):366–8.

Camacho C, et al. ElasticBLAST: accelerating sequence search via cloud computing. BMC Bioinformatics. 2023;24(1):117.

Nawrocki EP, Kolbe DL, Eddy SR. Infernal 1.0: inference of RNA alignments. Bioinformatics. 2009;25(10):1335–7.

Grabowski P, Rappsilber J. A primer on data analytics in functional genomics: how to move from data to insight? Trends Biochem Sci. 2019;44(1):21–32.

Kitts PA, et al. Assembly: a resource for assembled genomes at NCBI. Nucleic Acids Res. 2016;44(D1):D73–80.

Schoch CL, et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database (Oxford). 2020;2020:baaa062.

Inoue Y, et al. Fusion of piggyBac-like transposons and herpesviruses occurs frequently in teleosts. Zoological Lett. 2018;4:6.

Koonin EV. On the origin of cells and viruses: primordial virus world scenario. Ann N Y Acad Sci. 2009;1178(1):47–64.

Becher P, Tautz N. RNA recombination in pestiviruses: cellular RNA sequences in viral genomes highlight the role of host factors for viral persistence and lethal disease. RNA Biol. 2011;8(2):216–24.

Benit L, Dessen P, Heidmann T. Identification, phylogeny, and evolution of retroviral elements based on their envelope genes. J Virol. 2001;75(23):11709–19.

Gallaher WR, DiSimone C, Buchmeier MJ. The viral transmembrane superfamily: possible divergence of Arenavirus and Filovirus glycoproteins from a common RNA virus ancestor. BMC Microbiol. 2001;1:1.

Hildebrandt E, et al. Evolution of dependoparvoviruses across geological timescales – implications for design of AAV-based gene therapy vectors. Virus Evol. 2020;6(2):veaa043.

Stamatakis A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006;22(21):2688–90.

Minh BQ, et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evol. 2020;37(5):1530–4.

Gifford RJ, et al. Nomenclature for endogenous retrovirus (ERV) loci. Retrovirology. 2018;15(1):59.

Blanco-Melo D, et al. DIGS-tool: database-integrated genome screening. Github; 2023. https://github.com/giffordlabcvr/DIGS-tool .

Blanco-Melo D, et al. DIGS-tool version 1.0.4. Zenodo; 2024. https://zenodo.org/records/10948938 .

Blanco-Melo D, et al. DIGS datasets. Github; 2023. https://github.com/giffordlabcvr/DIGS-for-EVEs .

Acknowledgements

We thank Connor Bamford, Paul Bieniasz, and Jamie Henzy for helpful discussions. Additional thanks to Ade Tukuru (Aaron Diamond AIDS Research Centre) and Scott Arkinson (MRC-University of Glasgow Centre for Virus Research) for bioinformatics support.

Peer review information

Kevin Pang was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Review history

The review history is available as Additional file 12 .

Funding

DB-M is supported by the V Foundation for Cancer Research and the Searle Scholars Program. TD, SL, SM, JH, and RJG were funded by the Medical Research Council of the United Kingdom (MC_UU_12014/12).

Author information

Daniel Blanco-Melo and Matthew A. Campbell contributed equally to this work.

Authors and Affiliations

Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Center, 1100 Fairview Ave N, Seattle, WA, 98109, USA

Daniel Blanco-Melo

Herbold Computational Biology Program, Public Health Sciences Division, Fred Hutchinson Cancer Center, 1100 Fairview Ave N, Seattle, WA, 98109, USA

University of California, Davis, 1 Shields Ave, Davis, CA, 95616, USA

Matthew A. Campbell

MRC-University of Glasgow Centre for Virus Research, 464 Bearsden Rd, Bearsden, Glasgow, G61 1QH, UK

Henan Zhu, Tristan P. W. Dennis, Sejal Modha, Spyros Lytras, Joseph Hughes, Anna Gatseva & Robert J. Gifford

Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa

Robert J. Gifford

Contributions

Conceptualization, RJG. Methodology, RJG. Investigation, RJG, DBM, and MAC. Data curation, RJG, DBM, and SL. Formal analysis, RJG, DBM, MAC, TD, HZ, AG, SM, and JH. Visualization, RJG, MAC, and AG. Writing, RJG. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Robert J. Gifford .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

None declared.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Figure S1.

An annotated example of a DIGS tool control file.

Additional file 2: Figure S2.

The DIGS tool framework for in silico genome screening.

Additional file 3: Figure S3.

Examples of SQL-based querying of DIGS results.

Additional file 4: Figure S4.

Validation of the DIGS tool.

Additional file 5: Table S1.

Putatively exogenous viruses identified in WGS data.

Additional file 6: Figure S5.

Genomic analysis of a superficially caulimovirus-like EVE.

Additional file 7: Figure S6.

Summary of vertebrate EVE coding potential.

Additional file 8: Figure S7.

Evolutionary relationships of vertebrate EVEs and viruses.

Additional file 9: Figure S8.

Amplified lineages of endogenous viral elements.

Additional file 10: Table S2.

Minimum ages of EVEs.

Additional file 11: Figure S9.

Germline incorporation through time shown separately for each virus family.

Additional file 12.

Review history.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Blanco-Melo, D., Campbell, M.A., Zhu, H. et al. A novel approach to exploring the dark genome and its application to mapping of the vertebrate virus fossil record. Genome Biol 25 , 120 (2024). https://doi.org/10.1186/s13059-024-03258-y

Received: 03 November 2023

Accepted: 22 April 2024

Published: 13 May 2024

DOI: https://doi.org/10.1186/s13059-024-03258-y

Genome Biology

ISSN: 1474-760X

  • Published: 17 August 2023

Genetic insights into human cortical organization and development through genome-wide analyses of 2,347 neuroimaging phenotypes

  • Varun Warrier   ORCID: orcid.org/0000-0003-4532-8571 1 , 2 ,
  • Eva-Maria Stauffer 1 ,
  • Qin Qin Huang   ORCID: orcid.org/0000-0003-3073-717X 3 ,
  • Emilie M. Wigdor 3 ,
  • Eric A. W. Slob   ORCID: orcid.org/0000-0002-1644-0735 4 , 5 , 6 ,
  • Jakob Seidlitz   ORCID: orcid.org/0000-0002-8164-7476 7 , 8 , 9 ,
  • Lisa Ronan 1 ,
  • Sofie L. Valk 10 , 11 , 12 ,
  • Travis T. Mallard 13 , 14 ,
  • Andrew D. Grotzinger   ORCID: orcid.org/0000-0001-7852-9244 15 , 16 ,
  • Rafael Romero-Garcia   ORCID: orcid.org/0000-0002-5199-4573 1 , 17 ,
  • Simon Baron-Cohen   ORCID: orcid.org/0000-0001-9217-2544 1 , 2 ,
  • Daniel H. Geschwind   ORCID: orcid.org/0000-0003-2896-3450 18 , 19 , 20 , 21 ,
  • Madeline A. Lancaster   ORCID: orcid.org/0000-0003-2324-8853 22 ,
  • Graham K. Murray   ORCID: orcid.org/0000-0001-8296-1742 1 , 23 , 24 ,
  • Michael J. Gandal   ORCID: orcid.org/0000-0001-5800-5128 7 , 8 ,
  • Aaron Alexander-Bloch   ORCID: orcid.org/0000-0001-6554-1893 7 , 8 , 9 ,
  • Hyejung Won   ORCID: orcid.org/0000-0003-3651-0566 25   na1 ,
  • Hilary C. Martin   ORCID: orcid.org/0000-0002-4454-9084 3   na1 ,
  • Edward T. Bullmore 1 , 23   na1 &
  • Richard A. I. Bethlehem   ORCID: orcid.org/0000-0002-0714-0685 1 , 2  

Nature Genetics volume 55, pages 1483–1493 (2023)

10k Accesses

11 Citations

319 Altmetric

  • Genome-wide association studies
  • Magnetic resonance imaging

Our understanding of the genetics of the human cerebral cortex is limited both in terms of the diversity and the anatomical granularity of brain structural phenotypes. Here we conducted a genome-wide association meta-analysis of 13 structural and diffusion magnetic resonance imaging-derived cortical phenotypes, measured globally and at 180 bilaterally averaged regions in 36,663 individuals and identified 4,349 experiment-wide significant loci. These phenotypes include cortical thickness, surface area, gray matter volume, measures of folding, neurite density and water diffusion. We identified four genetic latent structures and causal relationships between surface area and some measures of cortical folding. These latent structures partly relate to different underlying gene expression trajectories during development and are enriched for different cell types. We also identified differential enrichment for neurodevelopmental and constrained genes and demonstrate that common genetic variants associated with cortical expansion are associated with cephalic disorders. Finally, we identified complex interphenotype and inter-regional genetic relationships among the 13 phenotypes, reflecting the developmental differences among them. Together, these analyses identify distinct genetic organizational principles of the cortex and their correlates with neurodevelopment.

Data availability.

All summary statistics for the GWAS meta-analyses are available at https://portal.ide-cam.org.uk/overview/483. To prevent potential misuse, the data are available under controlled access, after approval by the research team, for educational and research purposes only. Data from the UKB and ABCD can be applied for and accessed by approved researchers. GWAS summary statistics for other brain imaging phenotypes can be obtained from the Oxford Brain Imaging Genetics PheWeb (ox.ac.uk), the GWAS Catalog (ebi.ac.uk), the GWAS ATLAS (ctglab.nl) and the Brain Imaging Genetics Knowledge Portal. The SPARK dataset can be obtained by application to SFARI Base. The DDD dataset can be obtained via the EGA (European Genome-Phenome Archive; ega-archive.org).

Code availability

The code used is available at https://github.com/ucam-department-of-psychiatry/UKB (ref. 136), https://github.com/ucam-department-of-psychiatry/ABCD (ref. 137), vwarrier/ABCD_geneticQC (github.com; ref. 138) and vwarrier/Imaging_genetics_analyses (github.com; ref. 139).

Bethlehem, R. A. I. et al. Brain charts for the human lifespan. Nature 604 , 525–533 (2022).

Thompson, P. M. et al. ENIGMA and global neuroscience: a decade of large-scale studies of the brain in health and disease across more than 40 countries. Transl. Psychiatry 10 , 100 (2020).

Gilmore, J. H., Knickmeyer, R. C. & Gao, W. Imaging structural and functional brain development in early childhood. Nat. Rev. Neurosci. 19 , 123–137 (2018).

Paus, T., Keshavan, M. & Giedd, J. N. Why do many psychiatric disorders emerge during adolescence? Nat. Rev. Neurosci. 9 , 947–957 (2008).

Elliott, L. T. et al. Genome-wide association studies of brain imaging phenotypes in UK Biobank. Nature 562 , 210–216 (2018).

Grasby, K. L. et al. The genetic architecture of the human cerebral cortex. Science 367 , eaay6690 (2020).

Stein, J. L. et al. Identification of common variants associated with human hippocampal and intracranial volumes. Nat. Genet. 44 , 552–561 (2012).

Zhao, B. et al. Common genetic variation influencing human white matter microstructure. Science 372 , eabf3736 (2021).

Makowski, C. et al. Discovery of genomic loci of the human cerebral cortex using genetically informed brain atlases. Science 375 , 522–528 (2022).

Jansen, P. R. et al. Genome-wide meta-analysis of brain volume identifies genomic loci and genes shared with intelligence. Nat. Commun. 11 , 5606 (2020).

Smith, S. M. et al. An expanded set of genome-wide association studies of brain imaging phenotypes in UK Biobank. Nat. Neurosci. 24 , 737–745 (2021).

Naqvi, S. et al. Shared heritability of human face and brain shape. Nat. Genet. 53 , 830–839 (2021).

Jayaraman, D., Bae, B.-I. & Walsh, C. A. The genetics of primary microcephaly. Annu. Rev. Genomics Hum. Genet. 19 , 177–200 (2018).

Li, M. et al. Integrative functional genomic analysis of human brain development and neuropsychiatric risks. Science 362 , eaat7615 (2018).

Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562 , 203–209 (2018).

Casey, B. J. et al. The Adolescent Brain Cognitive Development (ABCD) study: imaging acquisition across 21 sites. Dev. Cogn. Neurosci. 32 , 43–54 (2018).

Glasser, M. F. et al. A multi-modal parcellation of human cerebral cortex. Nature 536 , 171–178 (2016).

Bulik-Sullivan, B. K. et al. An atlas of genetic correlations across human diseases and traits. Nat. Genet. 47 , 1236–1241 (2015).

Hedman, A. M., van Haren, N. E. M., Schnack, H. G., Kahn, R. S. & Hulshoff Pol, H. E. Human brain changes across the life span: a review of 56 longitudinal magnetic resonance imaging studies. Hum. Brain Mapp. 33 , 1987–2002 (2012).

Brouwer, R. M. et al. Genetic variants associated with longitudinal changes in brain structure across the lifespan. Nat. Neurosci. 25 , 421–432 (2022).

Willer, C. J., Li, Y. & Abecasis, G. R. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26 , 2190–2191 (2010).

Bulik-Sullivan, B. K. et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47 , 291–295 (2015).

Sodini, S. M., Kemper, K. E., Wray, N. R. & Trzaskowski, M. Comparison of genotypic and phenotypic correlations: Cheverud’s conjecture in humans. Genetics 209 , 941–948 (2018).

Grotzinger, A. D. et al. Genomic structural equation modelling provides insights into the multivariate genetic architecture of complex traits. Nat. Hum. Behav. 3 , 513–525 (2019).

Sanderson, E. et al. Mendelian randomization. Nat. Rev. Methods Primers 2 , 6 (2022).

Rakic, P. Specification of cerebral cortical areas. Science 241 , 170–176 (1988).

Ronan, L. et al. Differential tangential expansion as a mechanism for cortical gyrification. Cereb. Cortex 24 , 2219–2228 (2014).

Garcia, K. E., Kroenke, C. D. & Bayly, P. V. Mechanics of cortical folding: stress, growth and stability. Philos. Trans. R. Soc. Lond. B Biol. Sci. 373 , 20170321 (2018).

Richman, D. P., Stewart, R. M., Hutchinson, J. W. & Caviness, V. S. Jr. Mechanical model of brain convolutional development. Science 189 , 18–21 (1975).

Tallinen, T., Chung, J. Y., Biggins, J. S. & Mahadevan, L. Gyrification from constrained cortical expansion. Proc. Natl Acad. Sci. USA 111 , 12667–12672 (2014).

Reillo, I., de Juan Romero, C., García-Cabezas, M. Á. & Borrell, V. A role for intermediate radial glia in the tangential expansion of the mammalian cerebral cortex. Cereb. Cortex 21 , 1674–1694 (2011).

Kriegstein, A., Noctor, S. & Martínez-Cerdeño, V. Patterns of neural stem and progenitor cell division may underlie evolutionary cortical expansion. Nat. Rev. Neurosci. 7 , 883–890 (2006).

De Leeuw, C. A., Mooij, J. M., Heskes, T. & Posthuma, D. MAGMA: generalized gene-set analysis of GWAS data. PLoS Comput. Biol. 11 , e1004219 (2015).

Sey, N. Y. A. et al. A computational tool (H-MAGMA) for improved prediction of brain-disorder risk genes by incorporating brain chromatin interaction profiles. Nat. Neurosci. 23 , 583–593 (2020).

Akbarian, S. et al. The PsychENCODE project. Nat. Neurosci. 18 , 1707–1712 (2015).

Eze, U. C., Bhaduri, A., Haeussler, M., Nowakowski, T. J. & Kriegstein, A. R. Single-cell atlas of early human brain development highlights heterogeneity of human neuroepithelial cells and early radial glia. Nat. Neurosci. 24 , 584–594 (2021).

Polioudakis, D. et al. A single-cell transcriptomic atlas of human neocortical development during mid-gestation. Neuron 103 , 785–801 (2019).

Ziffra, R. S. et al. Single-cell epigenomics reveals mechanisms of human cortical development. Nature 598 , 205–213 (2021).

Florio, M. & Huttner, W. B. Neural progenitors, neurogenesis and the evolution of the neocortex. Development 141 , 2182–2194 (2014).

Geschwind, D. H. & Rakic, P. Cortical evolution: judge the brain by its cover. Neuron 80 , 633–647 (2013).

Gertz, C. C., Lui, J. H., LaMonica, B. E., Wang, X. & Kriegstein, A. R. Diverse behaviors of outer radial glia in developing ferret and human cortex. J. Neurosci. 34 , 2559–2570 (2014).

Nott, A. et al. Brain cell type-specific enhancer-promoter interactome maps and disease-risk association. Science 366 , 1134–1139 (2019).

Fukutomi, H. et al. Neurite imaging reveals microstructural variations in human cerebral cortical gray matter. Neuroimage 182 , 488–499 (2018).

Zeng, J. et al. Widespread signatures of natural selection across human complex traits and functional genomic categories. Nat. Commun. 12 , 1164 (2021).

Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581 , 434–443 (2020).

Fu, J. M. et al. Rare coding variation provides insight into the genetic architecture and phenotypic context of autism. Nat. Genet. 54 , 1320–1331 (2022).

Prevalence and architecture of de novo mutations in developmental disorders. Nature 542 , 433–438 (2017).

Niemi, M. E. K. et al. Common genetic variants contribute to risk of rare severe neurodevelopmental disorders. Nature 562 , 268–271 (2018).

SPARK Consortium. SPARK: a US cohort of 50,000 families to accelerate autism research. Neuron 97 , 488–493 (2018).

Weissbrod, O. et al. Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nat. Genet. 52 , 1355–1363 (2020).

Kabeche, L., Nguyen, H. D., Buisson, R. & Zou, L. A mitosis-specific and R loop-driven ATR pathway promotes faithful chromosome segregation. Science 359 , 108–114 (2018).

Kaczmarczyk, A. & Sullivan, K. F. CENP-W plays a role in maintaining bipolar spindle structure. PLoS ONE 9 , e106464 (2014).

Koolen, D. A. et al. The Koolen-de Vries syndrome: a phenotypic comparison of patients with a 17q21.31 microdeletion versus a KANSL1 sequence variant. Eur. J. Hum. Genet. 24 , 652–659 (2016).

Zhou, X. et al. Cellular and molecular properties of neural progenitors in the developing mammalian hypothalamus. Nat. Commun. 11 , 4063 (2020).

Kuwayama, N. et al. A role for Hmga2 in the early-stage transition of neural stem-progenitor cell properties during mouse neocortical development. Preprint at bioRxiv https://doi.org/10.1101/2020.05.14.086330 (2021).

De Crescenzo, A. et al. A splicing mutation of the HMGA2 gene is associated with Silver–Russell syndrome phenotype. J. Hum. Genet. 60 , 287–293 (2015).

Chenn, A. & Walsh, C. A. Regulation of cerebral cortical size by control of cell cycle exit in neural precursors. Science 297 , 365–369 (2002).

Xiang, Y.-Y. et al. Versican G3 domain regulates neurite growth and synaptic transmission of hippocampal neurons by activation of epidermal growth factor receptor. J. Biol. Chem. 281 , 19358–19368 (2006).

Dobyns, W. B. et al. MACF1 mutations encoding highly conserved zinc-binding residues of the GAR domain cause defects in neuronal migration and axon guidance. Am. J. Hum. Genet. 103 , 1009–1021 (2018).

Aschard, H., Vilhjálmsson, B. J., Joshi, A. D., Price, A. L. & Kraft, P. Adjusting for heritable covariates can bias effect estimates in genome-wide association studies. Am. J. Hum. Genet. 96 , 329–339 (2015).

Chen, S. et al. A genome-wide mutational constraint map quantified from variation in 76,156 human genomes. Preprint at bioRxiv https://doi.org/10.1101/2022.03.20.485034 (2022).

Demange, P. A. et al. Investigating the genetic architecture of noncognitive skills using GWAS-by-subtraction. Nat. Genet. 53 , 35–44 (2021).

Bhaduri, A. et al. An atlas of cortical arealization identifies dynamic molecular signatures. Nature 598 , 200–204 (2021).

Yeo, B. T. T. et al. The organization of the human cerebral cortex estimated by intrinsic functional connectivity. J. Neurophysiol. 106 , 1125–1165 (2011).

Mesulam, M. M. From sensation to cognition. Brain 121 , 1013–1052 (1998).

Alexander-Bloch, A. F. et al. On testing for spatial correspondence between maps of human brain structure and function. Neuroimage 178 , 540–551 (2018).

Sha, Z. et al. The genetic architecture of structural left–right asymmetry of the human brain. Nat. Hum. Behav. 5 , 1226–1239 (2021).

Rubenstein, J. L. & Rakic, P. Genetic control of cortical development. Cereb. Cortex 9 , 521–523 (1999).

Cox, S. R. et al. Ageing and brain white matter structure in 3,513 UK Biobank participants. Nat. Commun. 7 , 13629 (2016).

Sexton, C. E. et al. Accelerated changes in white matter microstructure during aging: a longitudinal diffusion tensor imaging study. J. Neurosci. 34 , 15425–15436 (2014).

Pletikos, M. et al. Temporal specification and bilaterality of human neocortical topographic gene expression. Neuron 81 , 321–332 (2014).

Zhu, Y. et al. Spatiotemporal transcriptomic divergence across human and macaque brain development. Science 362 , eaat8077 (2018).

Yoon, B., Shim, Y.-S., Lee, K.-S., Shon, Y.-M. & Yang, D.-W. Region-specific changes of cerebral white matter during normal aging: a diffusion-tensor analysis. Arch. Gerontol. Geriatr. 47 , 129–138 (2008).

Shi, Y. et al. Diffusion tensor imaging-based characterization of brain neurodevelopment in primates. Cereb. Cortex 23 , 36–48 (2012).

Coalson, T. S., Van Essen, D. C. & Glasser, M. F. The impact of traditional neuroimaging methods on the spatial localization of cortical areas. Proc. Natl Acad. Sci. USA 115 , E6356–E6365 (2018).

Kharabian Masouleh, S. et al. Influence of processing pipeline on cortical thickness measurement. Cereb. Cortex 30 , 5014–5027 (2020).

Alfaro-Almagro, F. et al. Confound modelling in UK Biobank brain imaging. NeuroImage 224 , 117002 (2021).

Barch, D. M. et al. Demographic, physical and mental health assessments in the adolescent brain and cognitive development study: rationale and description. Dev. Cogn. Neurosci. 32 , 55–66 (2018).

Fischl, B. et al. Automatically parcellating the human cerebral cortex. Cereb. Cortex 14 , 11–22 (2004).

Van Essen, D. C., Glasser, M. F., Dierker, D. L., Harwell, J. & Coalson, T. Parcellations and hemispheric asymmetries of human cerebral cortex analyzed on surface-based atlases. Cereb. Cortex 22 , 2241–2262 (2012).

Rosen, A. F. G. et al. Quantitative assessment of structural image quality. Neuroimage 169 , 407–418 (2018).

Alfaro-Almagro, F. et al. Image processing and quality control for the first 10,000 brain imaging datasets from UK Biobank. Neuroimage 166 , 400–424 (2018).

Hagler, D. J. Jr et al. Image processing and analysis methods for the Adolescent Brain Cognitive Development Study. Neuroimage 202 , 116091 (2019).

Daducci, A. et al. Accelerated microstructure imaging via convex optimization (AMICO) from diffusion MRI data. Neuroimage 105 , 32–44 (2015).

Schaer, M. et al. How to measure cortical folding from MR images: a step-by-step tutorial to compute local gyrification index. J. Vis. Exp. 2 , e3417 (2012).

Knussmann, G. N. et al. Test-retest reliability of FreeSurfer-derived volume, area and cortical thickness from MPRAGE and MP2RAGE brain MRI images. Neuroimage Rep. 2 , 100086 (2022).

Haddad, E. et al. Multisite test-retest reliability and compatibility of brain metrics derived from FreeSurfer versions 7.1, 6.0, and 5.3. Hum. Brain Mapp. 44 , 1515–1532 (2023).

Hedges, E. P. et al. Reliability of structural MRI measurements: the effects of scan session, head tilt, inter-scan interval, acquisition sequence, FreeSurfer version and processing stream. Neuroimage 246 , 118751 (2022).

Madan, C. R. & Kensinger, E. A. Test-retest reliability of brain morphology estimates. Brain Inform. 4 , 107–121 (2017).

Duff, E. et al. Reliability of multi-site UK Biobank MRI brain phenotypes for the assessment of neuropsychiatric complications of SARS-CoV-2 infection: the COVID-CNS travelling heads study. PLoS ONE 17 , e0273704 (2022).

O’Donnell, L. J. & Westin, C.-F. An introduction to diffusion tensor image analysis. Neurosurg. Clin. N. Am. 22 , 185–196 (2011).

Zhang, H., Schneider, T., Wheeler-Kingshott, C. A. & Alexander, D. C. NODDI: practical in vivo neurite orientation dispersion and density imaging of the human brain. Neuroimage 61 , 1000–1016 (2012).

Tariq, M., Schneider, T., Alexander, D. C., Gandini Wheeler-Kingshott, C. A. & Zhang, H. Bingham–NODDI: mapping anisotropic orientation dispersion of neurites using diffusion MRI. Neuroimage 133 , 207–223 (2016).

Andica, C. et al. Scan–rescan and inter-vendor reproducibility of neurite orientation dispersion and density imaging metrics. Neuroradiology 62 , 483–494 (2020).

Kong, X.-Z. et al. Mapping cortical brain asymmetry in 17,141 healthy individuals worldwide via the ENIGMA Consortium. Proc. Natl Acad. Sci. USA 115 , E5154–E5163 (2018).

Kurth, F., Gaser, C. & Luders, E. A 12-step user guide for analyzing voxel-wise gray matter asymmetries in statistical parametric mapping (SPM). Nat. Protoc. 10 , 293–304 (2015).

Leroy, F. et al. New human-specific brain landmark: the depth asymmetry of superior temporal sulcus. Proc. Natl Acad. Sci. USA 112 , 1208–1213 (2015).

1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 526 , 68–74 (2015).

Gogarten, S. M. et al. Genetic association testing using the GENESIS R/bioconductor package. Bioinformatics 35 , 5346–5348 (2019).

Manichaikul, A. et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26 , 2867–2873 (2010).

Jiang, L. et al. A resource-efficient tool for mixed model association analysis of large-scale data. Nat. Genet. 51 , 1749–1755 (2019).

Day, F. R., Loh, P.-R., Scott, R. A., Ong, K. K. & Perry, J. R. B. A robust example of collider bias in a genetic association study. Am. J. Hum. Genet. 98 , 392–393 (2016).

Hartwig, F. P., Tilling, K., Davey Smith, G., Lawlor, D. A. & Borges, M. C. Bias in two-sample Mendelian randomization when using heritable covariable-adjusted summary associations. Int. J. Epidemiol. 50 , 1639–1650 (2021).

Zhu, Z. et al. Causal associations between risk factors and common diseases inferred from GWAS summary data. Nat. Commun. 9 , 224 (2018).

Burgess, S. & Thompson, S. G. Multivariable Mendelian randomization: the use of pleiotropic genetic variants to estimate causal effects. Am. J. Epidemiol. 181 , 251–260 (2015).

Grotzinger, A. D. et al. Multivariate genomic architecture of cortical thickness and surface area at multiple levels of analysis. Nat. Commun. 14 , 946 (2023).

Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81 , 559–575 (2007).

Loh, P.-R., Kichaev, G., Gazal, S., Schoech, A. P. & Price, A. L. Mixed-model association for biobank-scale datasets. Nat. Genet. 50 , 906–908 (2018).

Zheng, J. et al. PhenoSpD: an integrated toolkit for phenotypic correlation estimation and multiple testing correction using GWAS summary statistics. Gigascience 7 , giy090 (2018).

Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42 , 565–569 (2010).

Dahlke, J. A. & Wiernik, B. M. psychmeta: an R package for psychometric meta-analysis. Appl. Psychol. Meas. 43 , 415–416 (2019).

Foley, C. N. et al. A fast and efficient colocalization algorithm for identifying shared genetic risk factors across multiple traits. Nat. Commun. 12 , 764 (2021).

Berisa, T. & Pickrell, J. K. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics 32 , 283–285 (2016).

Bowden, J., Smith, G. D., Haycock, P. C. & Burgess, S. Consistent estimation in Mendelian randomization with some invalid instruments using a weighted median estimator. Genet. Epidemiol. 40 , 304–314 (2016).

Bowden, J., Davey Smith, G. & Burgess, S. Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. Int. J. Epidemiol. 44 , 512–525 (2015).

Verbanck, M., Chen, C.-Y., Neale, B. & Do, R. Publisher correction: detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nat. Genet. 50 , 1196 (2018).

Morrison, J., Knoblauch, N., Marcus, J. H., Stephens, M. & He, X. Mendelian randomization accounting for correlated and uncorrelated pleiotropic effects using genome-wide summary statistics. Nat. Genet. 52 , 740–747 (2020).

Hemani, G., Tilling, K. & Smith, G. D. Orienting the causal relationship between imprecisely measured traits using GWAS summary data. PLoS Genet. 13 , e1007081 (2017).

Hemani, G. et al. The MR-base platform supports systematic causal inference across the human phenome. eLife 7 , e34408 (2018).

Burgess, S. Sample size and power calculations in Mendelian randomization with a single instrumental variable and a binary outcome. Int. J. Epidemiol. 43 , 922–929 (2014).

Bryois, J. et al. Genetic identification of cell types underlying brain complex traits yields novel insights into the etiology of Parkinson’s disease. Nat. Genet. 52 , 482–493 (2020).

Won, H. et al. Chromosome conformation elucidates regulatory relationships in developing human brain. Nature 538 , 523–527 (2016).

Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47 , 1228–1235 (2015).

Finucane, H. K. et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nat. Genet. 50 , 621–629 (2018).

Ge, T., Chen, C.-Y., Ni, Y., Feng, Y.-C. A. & Smoller, J. W. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat. Commun. 10 , 1776 (2019).

Warrier, V. et al. Gene–environment correlations and causal effects of childhood maltreatment on physical and mental health: a genetically informed approach. Lancet Psychiatry 8 , 373–386 (2021).

Warrier, V. et al. Genetic correlates of phenotypic heterogeneity in autism. Nat. Genet. 54 , 1293–1304 (2022).

Wright, C. F. et al. Optimising diagnostic yield in highly penetrant genomic disease. N. Engl. J. Med. 388 , 1559–1571 (2023).

Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. Series B Stat. Methodol. 82 , 1273–1300 (2020).

Hu, B. et al. Neuronal and glial 3D chromatin architecture informs the cellular etiology of brain disorders. Nat. Commun. 12 , 3968 (2021).

McLaren, W. et al. The Ensembl variant effect predictor. Genome Biol. 17 , 122 (2016).

Zhu, Z. et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 48 , 481–487 (2016).

O’Brien, H. E. et al. Expression quantitative trait loci in the developing human brain and their enrichment in neuropsychiatric disorders. Genome Biol. 19 , 194 (2018).

Yang, J., Qi, T., Wu, Y., Zhang, F. & Zeng, J. Genetic control of RNA splicing and its distinctive role in complex trait variation. Nat. Genet. 54 , 1355–1363 (2022).

Qi, T. et al. Identifying gene targets for brain-related traits using transcriptomic and methylomic data from blood. Nat. Commun. 9 , 2282 (2018).

Bethlehem, R. A. I. & Romero-Garcia, R. ucam-department-of-psychiatry/UKB: V1. Zenodo. https://doi.org/10.5281/zenodo.8051797 (2023).

Bethlehem, R. A. I. & Romero-Garcia, R. ucam-department-of-psychiatry/ABCD: V1. Zenodo. https://doi.org/10.5281/zenodo.8051799 (2023).

Warrier, V. vwarrier/ABCD_geneticQC: v1. Zenodo. https://doi.org/10.5281/zenodo.8050609 (2023).

Warrier, V. vwarrier/Imaging_genetics_analyses: v1. Zenodo. https://doi.org/10.5281/zenodo.8050589 (2023).

Acknowledgements

V.W. is supported by St. Catharine’s College Cambridge, funding from the Wellcome Trust (214322\Z\18\Z) and UKRI (10063472). E.-M.S. is supported by a Ph.D. studentship awarded by the Friends of Peterhouse. E.A.W.S. is supported by the National Institute for Health Research (NIHR) Cambridge Biomedical Research Center (BRC-1215-20014). The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care. R.A.I.B. is supported by the Autism Research Trust. S.B.C. received funding from the Wellcome Trust (214322\Z\18\Z). S.B.C. also received funding from the Autism Center of Excellence, SFARI, the Templeton World Charitable Fund, the MRC and the NIHR Cambridge Biomedical Research Center. The research was supported by the NIHR Applied Research Collaboration East of England. J.S. was supported by NIMH (T32MH019112-29 and K08MH120564). E.T.B. was supported by an NIHR Senior Investigator award and the Wellcome Trust collaborative award for the Neuroscience in Psychiatry Network. A.F.A.-B. was supported by NIMH (K08MH120564). R.R.G. was supported by the EMERGIA Junta de Andalucía program (EMERGIA20_00139). S.L.V. was supported by Max Planck Gesellschaft (Otto Hahn Award) and the Helmholtz Association’s Initiative and Networking Fund under the Helmholtz International Lab grant agreement InterLabs-0015, and the Canada First Research Excellence Fund (CFREF Competition 2, 2015–2016) awarded to the Healthy Brains, Healthy Lives initiative at McGill University, through the Helmholtz International BigBrain Analytics and Learning Laboratory (HIBALL). G.K.M. was supported by MRC (MR/W020025/1). For the purpose of open access, the authors have applied a CC BY license to any author-accepted manuscript version arising from this submission. We thank L.K. Abraham and J. Asimit for their helpful discussions. Additional acknowledgments are provided in the Supplementary Information.

Author information

These authors contributed equally: Hyejung Won, Hilary C. Martin, Edward T. Bullmore.

Authors and Affiliations

Department of Psychiatry, University of Cambridge, Cambridge, UK

Varun Warrier, Eva-Maria Stauffer, Lisa Ronan, Rafael Romero-Garcia, Simon Baron-Cohen, Graham K. Murray, Edward T. Bullmore & Richard A. I. Bethlehem

Department of Psychology, University of Cambridge, Cambridge, UK

Varun Warrier, Simon Baron-Cohen & Richard A. I. Bethlehem

Wellcome Trust Sanger Institute, Hinxton, UK

Qin Qin Huang, Emilie M. Wigdor & Hilary C. Martin

Medical Research Council Biostatistics Unit, University of Cambridge, Cambridge, UK

Eric A. W. Slob

Department of Applied Economics, Erasmus School of Economics, Erasmus University Rotterdam, Rotterdam, the Netherlands

Erasmus University Rotterdam Institute for Behavior and Biology, Erasmus University Rotterdam, Rotterdam, the Netherlands

Department of Child and Adolescent Psychiatry and Behavioral Science, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA

Jakob Seidlitz, Michael J. Gandal & Aaron Alexander-Bloch

Lifespan Brain Institute, The Children’s Hospital of Philadelphia and Penn Medicine, Philadelphia, PA, USA

Department of Psychiatry, University of Pennsylvania, Philadelphia, PA, USA

Jakob Seidlitz & Aaron Alexander-Bloch

Institute of Neuroscience and Medicine, Brain & Behaviour (INM-7), Research Centre Jülich, FZ Jülich, Jülich, Germany

Sofie L. Valk

Institute of Systems Neuroscience, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany

Otto Hahn Group Cognitive Neurogenetics, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany

Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA

Travis T. Mallard

Department of Psychiatry, Harvard Medical School, Boston, MA, USA

Department of Psychology and Neuroscience, University of Colorado at Boulder, Boulder, CO, USA

Andrew D. Grotzinger

Institute for Behavioral Genetics, University of Colorado at Boulder, Boulder, CO, USA

Instituto de Biomedicina de Sevilla (IBiS) HUVR/CSIC/Universidad de Sevilla/CIBERSAM, ISCIII, Dpto. de Fisiología Médica y Biofísica, Seville, Spain

Rafael Romero-Garcia

Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA

Daniel H. Geschwind

Program in Neurogenetics, Department of Neurology, University of California, Los Angeles, CA, USA

Center for Autism Research and Treatment, Jane and Terry Semel Institute for Neuroscience and Human Behavior, University of California, Los Angeles, CA, USA

Institute of Precision Health, University of California, Los Angeles, CA, USA

MRC Laboratory of Molecular Biology, Cambridge Biomedical Campus, Francis Crick Avenue, Cambridge, UK

Madeline A. Lancaster

Cambridgeshire and Peterborough NHS Trust, Cambridge, UK

Graham K. Murray & Edward T. Bullmore

Institute for Molecular Bioscience, University of Queensland, Brisbane, Queensland, Australia

Graham K. Murray

Department of Genetics and the Neuroscience Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

Hyejung Won


Contributions

V.W. and R.A.I.B. designed the study, wrote the first draft of the manuscript and carried out revisions. V.W., R.A.I.B., H.C.M., E.T.B. and H.W. supervised the work. V.W., R.A.I.B., E.S., Q.Q.H., E.M.W., E.A.W.S., J.S. and R.R.G. analyzed the data. T.T.M. and A.D.G. advised on SEM. L.R. and S.V. advised on cortical structure and organization. S.B.C., D.H.G., M.L., G.K.M., M.J.G. and A.B. provided input to various analytical methods and helped interpret the data. All authors edited the manuscript and contributed to critical revisions of the manuscript.

Corresponding authors

Correspondence to Varun Warrier or Richard A. I. Bethlehem.

Ethics declarations

Competing interests

A.A.-B. receives consulting income from Octave Biosciences. E.T.B. serves as a consultant for Sosei Heptares, Boehringer Ingelheim, GlaxoSmithKline, Monument Therapeutics and SR One. M.J.G. receives grant support from Mitsubishi Tanabe Pharma, unrelated to the current manuscript. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Matthew F. Glasser and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Consistency in genetic effects between ABCD and UKB.

(a) Correlation in effect size (regression beta from GWAS) between ABCD and UKB cohorts for all 237 genome-wide significant SNPs in the UKB: Pearson’s correlation coefficient, r = 0.54 with 95% confidence interval 0.45–0.63. (b) Genetic correlation (central point) and 95% confidence intervals (error bars) for 12 global phenotypes in the UKB and ABCD cohorts. Given the relatively small size of ABCD, the intercept has been constrained as there is no participant overlap between the UKB (Nmax = 31,797) and ABCD (Nmax = 4,866) and there is no inflation in test statistics due to uncontrolled population stratification. Consequently, estimates of genetic correlation can be above 1.

Extended Data Fig. 2 Mendelian randomization analysis for causal relationships between genetic effects on global brain phenotypes.

Scatter plots for the bidirectional MR effects between surface area and folding index, intrinsic curvature index, and local gyrification index. Slopes of the lines (MR regression coefficients) indicate the estimated MR effect by method. Panels a, b and c are scatter plots where surface area is the exposure, and panels d, e and f are scatter plots where surface area is the outcome. All scatter plots are for MR analyses conducted by splitting the UKB into two samples of similar size. All estimates were statistically significant in panels a, b and c. Inverse-variance weighted MR failed to reach statistical significance in panels d, e and f. The number of SNPs included in each MR analysis is provided in Supplementary Table 9. Error bars represent standard errors of the effect size (point estimates).

Extended Data Fig. 3 Forest plots and leave-one-out plots.

Forest plots (a–c) and leave-one-out plots (d–f) for the MR analyses between surface area and folding index (FI; a and d), intrinsic curvature index (ICI; b and e), and local gyrification index (LGI; c and f). The number of SNPs included in each MR analysis is provided in Supplementary Table 9. Error bars indicate 95% confidence intervals of the effect (point estimates).

Extended Data Fig. 4 Regional heritability.

a. The distribution of the SNP heritability for the 180 bilaterally averaged regional phenotypes of the 13 neuroimaging modalities. Maximum GWAS sample size = 36,663. Box plots indicate the median value (central line) and the interquartile range; the whiskers indicate the minimum and maximum. b. The cortical spatial topology of SNP heritability for the 13 neuroimaging modalities.

Extended Data Fig. 5 Asymmetry indices and SNP heritability of asymmetry indices for the 13 phenotypes.

a. Asymmetry indices for the 13 phenotypes. Positive values indicate leftward asymmetry. b. SNP heritabilities for asymmetry indices by region and phenotype. SNP heritability was calculated using GCTA–GREML for approximately 9,650 unrelated individuals from the UK Biobank. Significant regions were identified after FDR correction within each of the 13 phenotypes.

Extended Data Fig. 6 Topography of the first phenotypic principal components.

Color scales depict the relative eigenvector, ranging from −20 to +29; in all plots the midpoint is defined as 0. It should be noted that the sign is somewhat ambiguous and that the magnitude is relative to its own scaling (in this case within each phenotype for which the PCA is performed). Thus, in this context, the color scale indicates to what extent regions show more homogeneous similarity (that is, regions with more similar color have more similar covariance).

Supplementary information

Supplementary information.

Supplementary Figs. 1–8, Notes 1–4 and associated figures, and additional acknowledgments.

Reporting Summary

Peer Review File

Supplementary Tables

Supplementary Tables 1–34.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article.

Warrier, V., Stauffer, EM., Huang, Q.Q. et al. Genetic insights into human cortical organization and development through genome-wide analyses of 2,347 neuroimaging phenotypes. Nat Genet 55, 1483–1493 (2023). https://doi.org/10.1038/s41588-023-01475-y


Received: 12 October 2022

Accepted: 13 July 2023

Published: 17 August 2023

Issue Date: September 2023

DOI: https://doi.org/10.1038/s41588-023-01475-y


