Overview
- Approximately 8% of the human genome—roughly 98,000 ERV elements and fragments plus their associated solo LTRs—consists of endogenous retroviral sequences: the remnants of ancient retroviral infections that became permanently integrated into the germline DNA of our ancestors.
- Thousands of these ERV insertions are found at precisely the same chromosomal positions in humans, chimpanzees, gorillas, and other primates, a pattern explicable only by inheritance from shared common ancestors.
- Some ERVs have been co-opted for essential biological functions, most notably the syncytin genes, which are derived from retroviral envelope genes and play essential roles in placental development in primates, mice, and several other mammalian lineages.
Every cell in the human body carries within its DNA the remnants of ancient viral infections—thousands of retroviral genomes that integrated into the chromosomes of our ancestors millions of years ago and have been passed down through every generation since. These endogenous retroviruses, or ERVs, constitute approximately 8% of the human genome, a proportion that exceeds the roughly 1.5% devoted to protein-coding genes.1, 2 The presence of these viral fossils in our DNA is remarkable in itself, but their true significance for understanding human origins lies in a striking pattern: many of the same ERV insertions appear at exactly the same chromosomal locations in humans and other primates. Because retroviral integration into a host genome is essentially random with respect to position, the probability of independent insertions occurring at the same site in different species is vanishingly small. Shared ERV insertions at orthologous loci therefore constitute powerful molecular evidence that humans and other primates inherited these sequences from common ancestors.3, 4
What endogenous retroviruses are
Retroviruses are a class of RNA viruses that replicate by reverse-transcribing their RNA genome into DNA and then integrating that DNA copy—called a provirus—into a host cell's chromosomes. Well-known exogenous (infectious) retroviruses include human immunodeficiency virus (HIV) and human T-lymphotropic virus (HTLV). When a retrovirus infects a somatic cell, the proviral insertion affects only that cell and its descendants. However, if a retrovirus infects a germline cell—a sperm, egg, or their precursors—the integrated provirus can be transmitted vertically to all of the host's offspring, becoming a permanent part of the species' genome.2, 5
A complete retroviral provirus contains at least three genes: gag (encoding structural proteins), pol (encoding viral enzymes including reverse transcriptase and integrase), and env (encoding the envelope glycoprotein that mediates cell entry). These coding regions are flanked on both ends by long terminal repeats (LTRs), regulatory sequences that drive transcription of the viral genome.2, 5 Over millions of years, most endogenous retroviruses accumulate mutations—point mutations, frameshifts, and deletions—that render them incapable of producing infectious viral particles. In many cases, recombination between the two flanking LTRs deletes the entire internal coding region, leaving behind a solitary LTR as the only trace of the original infection.2, 6
The human genome contains an estimated 98,000 ERV elements and fragments, grouped into approximately 30 to 40 distinct families based on sequence similarity. Together with their associated solo LTRs, these elements account for roughly 8% of the total genomic sequence—approximately 240 megabases of DNA.1, 2, 6 The oldest human ERV families integrated into primate genomes tens of millions of years ago, while the youngest—the HERV-K (HML-2) family—includes members that inserted as recently as a few hundred thousand years ago, some of which remain polymorphic in human populations today.7, 8
Shared insertions at identical loci
The central observation that makes ERVs such compelling evidence for common ancestry is the sharing of specific ERV insertions at orthologous (corresponding) chromosomal positions across multiple primate species. When a retrovirus integrates into a host genome, the site of insertion is determined by a complex interplay of factors, but with respect to the approximately 3.2 billion base pairs of a primate genome, integration at any particular nucleotide position is extraordinarily unlikely to occur twice independently.3, 4 The probability of two independent retroviral insertions landing at precisely the same nucleotide in the genomes of two different species has been estimated as less than one in several billion.3
When researchers compare the genomes of humans and chimpanzees, they find thousands of ERV insertions present at identical chromosomal positions in both species, with identical flanking host DNA sequences and identical target site duplications—the short duplications of host DNA created at the moment of retroviral integration.4, 9 The same pattern extends to gorillas, orangutans, and Old World monkeys, with the distribution of shared insertions tracking the known phylogenetic relationships among these species. ERVs shared by humans and chimpanzees but absent from gorillas must have integrated after the gorilla lineage diverged but before the human-chimpanzee split. ERVs shared by all great apes but absent from Old World monkeys must have integrated into the genome of the common ancestor of great apes, and so on.3, 4, 10
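What it means for a locus to be occupied or empty can be made concrete with a toy comparison. An unoccupied locus carries a single copy of the target site between the host flanks, while an occupied locus carries the provirus bracketed by two copies of it (the target site duplication). The sketch below is illustrative only: the flanking sequences, the 5 bp target site, the placeholder provirus, and the function name `classify_locus` are all invented for the example and describe no real locus. It also assumes exact string matches, whereas real comparisons work from alignments that tolerate substitutions.

```python
def classify_locus(sequence, left_flank, target_site, right_flank):
    """Classify a toy orthologous locus as 'empty', 'occupied', or 'unclear'.

    empty    : left_flank + target_site + right_flank (one copy of the target site)
    occupied : left_flank + target_site + INSERT + target_site + right_flank
    """
    if left_flank + target_site + right_flank in sequence:
        return "empty"
    start = sequence.find(left_flank + target_site)
    end = sequence.rfind(target_site + right_flank)
    if start != -1 and end != -1:
        insert_start = start + len(left_flank) + len(target_site)
        if end > insert_start:  # something sits between the two target-site copies
            return "occupied"
    return "unclear"


# Hypothetical sequences, invented purely for illustration.
LEFT, TSD, RIGHT = "GATTACA", "CTGAG", "TTGCCA"
PROVIRUS = "TG" + "N" * 20 + "CA"  # placeholder standing in for LTR-gag-pol-env-LTR

loci = {
    "human": LEFT + TSD + PROVIRUS + TSD + RIGHT,
    "chimpanzee": LEFT + TSD + PROVIRUS + TSD + RIGHT,
    "gorilla": LEFT + TSD + RIGHT,
}

for species, seq in loci.items():
    print(species, classify_locus(seq, LEFT, TSD, RIGHT))
```

In this toy data the human and chimpanzee loci share the same flanks and the same duplicated target site around the insert, while the gorilla locus shows the uninterrupted host sequence, which is the pattern the paragraph above describes for an insertion that postdates the gorilla divergence.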
The alternative explanation—that each species was independently infected by the same retrovirus at the same genomic position—requires not one but thousands of astronomically improbable coincidences. By contrast, the hypothesis of common descent requires only a single integration event in a shared ancestor, followed by ordinary inheritance. This is why ERV insertions are sometimes called "molecular fossils": like physical fossils, they record events in evolutionary history, but with the added precision of exact genomic coordinates.3, 4
HERV-K: the youngest human ERV family
The HERV-K family, and in particular its HML-2 subgroup, is the most recently active endogenous retrovirus lineage in the human genome and has been studied more intensively than any other HERV family. HERV-K(HML-2) is the only human ERV group known to have produced proviruses with intact open reading frames for all retroviral genes, and some of its members can produce viral proteins and even virus-like particles in certain cell types.5, 7
In 1999, Barbulescu and colleagues cloned ten full-length HERV-K proviruses from the human genome and tested whether each was present at the orthologous position in the genomes of chimpanzees, gorillas, and orangutans. They found that eight of the ten proviruses were unique to humans: the other great apes possessed intact, empty preintegration sites at those loci, meaning the retroviral DNA had never been inserted there. The remaining two proviruses were shared with chimpanzees, indicating that those integration events predated the human-chimpanzee divergence.7 Conversely, at least one HERV-K provirus has been identified at an orthologous position in the genomes of chimpanzees, bonobos, and gorillas but not humans, with humans retaining the empty preintegration site. Because chimpanzees and bonobos are more closely related to humans than to gorillas, this distribution cannot reflect a single integration on a branch that excludes humans; it is most readily explained by the provirus and the empty site persisting as alternative alleles in the ancestral population, with the empty allele becoming fixed in the human lineage.11
The 2005 chimpanzee genome sequence revealed further details. The Chimpanzee Sequencing and Analysis Consortium identified at least 73 human-specific HERV-K insertions (7 full-length proviruses and 66 solo LTRs) and at least 45 chimpanzee-specific insertions (1 full-length provirus and 44 solo LTRs).9 More recent analyses using data from the 1000 Genomes Project have identified 36 additional HERV-K(HML-2) insertions not present in the human reference genome, some of which are polymorphic across human populations—meaning they are present in some individuals but absent in others, a hallmark of very recent integration.8
Distribution of HERV-K(HML-2) proviral insertions across primate species7, 9, 11
HERV-W and other families
HERV-K is far from the only ERV family shared among primates. The HERV-W family, first characterized in the late 1990s, includes elements found at orthologous positions across Old World primates. Phylogenetic analysis of HERV-W sequences in humans, great apes, and Old World monkeys has shown that the family integrated into the primate genome at least 25 to 40 million years ago, before the divergence of catarrhine primates (apes and Old World monkeys).12 The most famous member of the HERV-W family is the locus on chromosome 7q21.2 that encodes syncytin-1, a functional gene derived from the retroviral env gene, which plays an essential role in placental development (discussed below).13
Other well-characterized families include HERV-H, which is among the most abundant ERV families in the human genome and has been implicated in regulating gene expression in embryonic stem cells, and HERV-E, elements of which have been found at orthologous positions in catarrhine primates.6, 14 The sheer number of ERV families, each with its own characteristic distribution across primate taxa, provides multiple independent lines of evidence for the branching pattern of primate evolution. Every ERV family tells the same story: species that share more recent common ancestors share more ERV insertions at orthologous loci.3, 4, 10
ERV phylogenies match species phylogenies
In a landmark 1999 study, Welkin Johnson and John Coffin demonstrated that phylogenetic trees constructed from ERV sequences accurately recapitulate the known evolutionary relationships among primates. They analyzed the LTR sequences of several endogenous retrovirus loci across multiple primate species, reasoning that because the two LTRs flanking a provirus are identical at the time of integration but evolve independently afterward, each orthologous ERV locus provides two independent estimates of the host species' phylogeny.3
Johnson and Coffin found that trees built from ERV LTR sequences were fully consistent with the established phylogeny of Old World primates, including the relationships among Old World monkeys, lesser apes, and great apes, as well as the branching order of gorillas, chimpanzees, and humans.3 This result is significant because ERV sequences evolve under different constraints than the protein-coding genes or morphological characters traditionally used to construct phylogenies. The concordance of ERV-based trees with trees derived from anatomy, nuclear DNA, and mitochondrial DNA provides strong independent confirmation of the primate evolutionary tree.3, 10
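The logic of using the two LTRs as paired estimates can be illustrated with a toy example. The sketch below is not the authors' analysis: the 20-base starting sequence, every substitution, and the helper names (`mutate`, `hamming`, `nearest`) are invented for illustration, and a simple count of differences stands in for real phylogenetic inference. It starts from two identical LTRs, lets each accumulate its own changes before and after a speciation event, and then asks which sequence is closest to which.

```python
def mutate(seq, changes):
    """Return seq with the given {position: new_base} substitutions applied."""
    bases = list(seq)
    for position, new_base in changes.items():
        if bases[position] == new_base:
            raise ValueError("substitution must change the base")
        bases[position] = new_base
    return "".join(bases)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Both LTRs are identical at the moment of integration (hypothetical sequence).
LTR_AT_INTEGRATION = "ACGTACGTACGTACGTACGT"

# Before speciation, each LTR picks up its own substitutions in the ancestor.
ancestral_5 = mutate(LTR_AT_INTEGRATION, {1: "T", 5: "A"})
ancestral_3 = mutate(LTR_AT_INTEGRATION, {9: "G", 13: "T"})

# After speciation, each copy changes independently in each descendant species.
ltrs = {
    "human_5prime": mutate(ancestral_5, {17: "A"}),
    "chimp_5prime": mutate(ancestral_5, {19: "G"}),
    "human_3prime": mutate(ancestral_3, {3: "C"}),
    "chimp_3prime": mutate(ancestral_3, {7: "A"}),
}


def nearest(name):
    """Name of the LTR with the fewest differences from the named LTR."""
    others = [other for other in ltrs if other != name]
    return min(others, key=lambda other: hamming(ltrs[name], ltrs[other]))


for name in ltrs:
    print(name, "is closest to", nearest(name))
```

Under these assumptions each 5' LTR groups with the 5' LTR of the other species rather than with the 3' LTR of its own provirus, so the 5' and 3' copies form two separate clusters that each reflect the same host relationship.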
Subsequent studies have extended this approach. Belshaw and colleagues showed in 2004 that the HERV-K(HML-2) family has been reinfecting the human genome over a long evolutionary timescale, with the pattern of accumulation consistent with the known primate phylogeny.4 The analysis of ERV insertion polymorphisms—loci where some individuals or species have an ERV and others have an empty preintegration site—provides particularly clear phylogenetic signal, because the character states (present vs. absent) are unambiguous and the ancestral state (absent) is known.3, 10
ERV-based phylogenetic predictions vs. observed species relationships3, 9, 10
| ERV distribution pattern | Predicted relationship | Confirmed by |
|---|---|---|
| ERV at same locus in human + chimp, absent in gorilla | Human-chimp clade excludes gorilla | Morphology, nuclear DNA, mtDNA |
| ERV shared by all great apes, absent in gibbons | Great apes form a clade excluding gibbons | Morphology, nuclear DNA, mtDNA |
| ERV shared by catarrhines, absent in New World monkeys | Old World primates form a clade | Morphology, nuclear DNA, mtDNA |
| Human-specific ERV, empty site in all other apes | Integration after human-chimp divergence | Molecular clock estimates |
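The reasoning in the table can be sketched as a small procedure: given the set of species that carry an insertion, find the most recent common ancestor of the carriers and check whether every descendant of that ancestor also carries it. The code below is a minimal illustration under stated assumptions: the tree is a simplified primate phylogeny with one representative per group, the clade labels and function names are chosen for this example, and presence or absence is treated as known without error.

```python
# Simplified primate tree as child -> parent links (labels are illustrative).
PARENT = {
    "human": "human-chimp",
    "chimpanzee": "human-chimp",
    "human-chimp": "African apes",
    "gorilla": "African apes",
    "African apes": "great apes",
    "orangutan": "great apes",
    "great apes": "apes",
    "gibbon": "apes",
    "apes": "catarrhines",
    "Old World monkey": "catarrhines",
    "catarrhines": "anthropoids",
    "New World monkey": "anthropoids",
}
SPECIES = set(PARENT) - set(PARENT.values())  # the leaves of the tree


def ancestors(node):
    """Path from a node up to the root, starting with the node itself."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def mrca(carriers):
    """Most recent common ancestor of the carrier species."""
    carriers = list(carriers)
    shared = set(ancestors(carriers[0]))
    for species in carriers[1:]:
        shared &= set(ancestors(species))
    return next(node for node in ancestors(carriers[0]) if node in shared)


def leaves_under(node):
    """All species descended from (or equal to) the given node."""
    return {species for species in SPECIES if node in ancestors(species)}


def integration_branch(carriers):
    """Return (clade, consistent): the clade whose stem branch would host a
    single integration, and whether every member of that clade carries it."""
    clade = mrca(carriers)
    return clade, leaves_under(clade) == set(carriers)


print(integration_branch({"human", "chimpanzee"}))
print(integration_branch({"human", "chimpanzee", "gorilla", "orangutan"}))
print(integration_branch({"human"}))
print(integration_branch({"chimpanzee", "gorilla"}))
```

The first three patterns correspond to rows of the table and are each explained by one integration on one branch. The last pattern is flagged as inconsistent with a single integration followed by simple inheritance, so it would need an additional explanation such as a later deletion or the sorting of an ancestral polymorphism.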
Co-option: from virus to vital function
Perhaps the most striking chapter in the story of endogenous retroviruses is the discovery that some ERV genes have been "domesticated" by their hosts—co-opted from their original viral function to serve essential roles in mammalian biology. The best-documented example involves the syncytin genes, which are derived from the env (envelope) genes of ancient retroviruses and now play indispensable roles in the formation of the placenta.13, 15
In 2000, Mi and colleagues identified syncytin-1 as the envelope protein of a defective HERV-W provirus located on human chromosome 7. They showed that syncytin-1 is highly expressed in the placental syncytiotrophoblast—the multinucleated cell layer that forms the interface between maternal and fetal blood—and that recombinant syncytin-1 protein can induce cell-cell fusion in vitro. Antibodies against syncytin-1 blocked fusion of a trophoblastic cell line, indicating that this co-opted viral protein mediates the cell fusion events that are essential for normal placental morphogenesis.13
A second syncytin gene, syncytin-2, was subsequently identified as the envelope protein of the HERV-FRD provirus. Syncytin-2 is also expressed specifically in the placenta and possesses fusogenic (cell-fusing) activity, and additionally functions as an immunosuppressive factor that may help prevent the maternal immune system from rejecting the genetically foreign fetus.15 The critical importance of syncytin genes was demonstrated in 2009 when Dupressoir and colleagues generated syncytin-A knockout mice (syncytin-A is the murine functional equivalent of human syncytin-1, derived from an independent ERV capture). Homozygous syncytin-A-null embryos failed to form a functional syncytiotrophoblast layer, suffered severe placental defects, and died in utero between 11.5 and 13.5 days of gestation.16
What makes the syncytin story particularly illuminating is that different lineages of placental mammals have independently co-opted different ERV envelope genes for the same essential function. Primates use syncytin-1 (from HERV-W) and syncytin-2 (from HERV-FRD); mice use syncytin-A and syncytin-B (from murine ERVs unrelated to the primate ones); rabbits, dogs, and ruminants each use yet other captured retroviral envelope genes.16, 17 This pattern of convergent co-option, in which different mammals independently recruited different retroviral genes for the same function, illustrates both the evolutionary creativity of natural selection and the deep entanglement of retroviral sequences with mammalian biology.17
The probability argument
The argument from shared ERV insertion sites is sometimes expressed in probabilistic terms, and the numbers involved are striking. A retroviral integrase inserts the proviral DNA into the host chromosome at a position determined by a combination of factors including local chromatin structure and sequence preferences, but across the genome as a whole, the number of potential insertion sites runs into the billions.2, 5 For any single ERV, the probability of an independent insertion event targeting the exact same nucleotide position in two different species' genomes is therefore on the order of one in 10^9 or less.3
When one considers that humans and chimpanzees share not one but thousands of ERV insertions at orthologous positions, the combined probability of this pattern arising by coincidence rather than common descent becomes effectively zero. Even accounting for known integration site biases—some retroviruses preferentially target active genes, transcription start sites, or particular chromatin states—the specificity of shared orthologous insertions extends to the exact nucleotide position and the identical target site duplication, a level of precision that integration biases cannot explain.3, 4, 9
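The scale of the combined probability can be sketched with a few lines of arithmetic. The calculation below is a rough illustration, not a published estimate: it assumes the coincidences at different loci are independent, uses 1,000 loci as a stand-in for "thousands", and takes the number of effectively available sites as an adjustable parameter so that integration bias can be explored. The function name and the bias figure of 1,000 sites are hypothetical choices made for this example.

```python
import math


def joint_log10_probability(effective_sites, n_loci):
    """log10 of the chance that n_loci independent insertions each land on one
    specific site out of effective_sites equally likely sites."""
    log10_p_single = -math.log10(effective_sites)
    return n_loci * log10_p_single


N_LOCI = 1_000  # hypothetical stand-in for "thousands" of shared insertions

# Per-locus chance of about one in 10^9, as quoted above.
unbiased = joint_log10_probability(10**9, N_LOCI)

# Deliberately extreme bias: pretend only 1,000 sites in the whole genome
# are ever available to the integrase (an assumption, far stronger than
# any bias described in the text).
extreme_bias = joint_log10_probability(10**3, N_LOCI)

print(f"unbiased:     about 10^{unbiased:.0f}")
print(f"extreme bias: about 10^{extreme_bias:.0f}")
```

Working in logarithms avoids numerical underflow, since the probabilities themselves are far too small to represent as ordinary floating-point numbers. Even under the exaggerated bias, the joint probability remains thousands of orders of magnitude below anything that could plausibly occur by chance.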
Furthermore, the hierarchical nesting of shared ERV insertions precisely mirrors the predicted pattern of common descent. If primates share common ancestors at different time depths, one would expect to find some ERVs shared by all primates, some shared only by catarrhines, some shared only by great apes, some shared only by African apes, and some unique to individual species—and this is exactly what is observed.3, 10 The pattern is not merely consistent with common descent; it is predicted by common descent and would be extraordinarily difficult to explain under any alternative hypothesis.3
ERVs and genome composition
The initial sequencing of the human genome in 2001 revealed that transposable elements of all kinds—including ERVs, LINEs (long interspersed nuclear elements), SINEs (short interspersed nuclear elements), and DNA transposons—constitute approximately 45% of the human genome. Within this fraction, the LTR retrotransposons, which include ERVs and related elements, account for roughly 8% of the total genome sequence.1 This means that viral-derived DNA in the human genome outweighs protein-coding DNA by a factor of more than five.1, 2
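The proportions quoted here can be turned into approximate sequence lengths with a short calculation. The sketch below uses the percentages given in this article, including the roughly 1.5% protein-coding figure quoted earlier; the genome size of 3.0 billion base pairs is an assumption made for the example, and the result scales directly with whatever size is used.

```python
GENOME_BP_ASSUMED = 3.0e9  # assumption: round figure for the sequenced genome

# Fractions of the genome as quoted in the article.
FRACTIONS = {
    "all transposable elements": 0.45,
    "LTR elements including ERVs": 0.08,
    "protein-coding sequence": 0.015,
}


def megabases(fraction, genome_bp=GENOME_BP_ASSUMED):
    """Approximate length in megabases for a given fraction of the genome."""
    return fraction * genome_bp / 1e6


for label, fraction in FRACTIONS.items():
    print(f"{label}: about {megabases(fraction):.0f} Mb")

# How many times more ERV-related sequence than protein-coding sequence?
erv_to_coding = (
    FRACTIONS["LTR elements including ERVs"] / FRACTIONS["protein-coding sequence"]
)
print(f"ERV-related to protein-coding ratio: about {erv_to_coding:.1f}")
```

With a 3.0-gigabase genome the 8% figure works out to about 240 megabases, and with a 3.2-gigabase genome it would be about 256 megabases; the ratio of ERV-related to protein-coding sequence does not depend on the genome size chosen.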
The Griffiths review of 2001, published shortly after the release of the human genome draft, catalogued the known HERV families and noted that although HERVs retain structural similarity to infectious retroviruses, the vast majority have been so extensively mutated over evolutionary time that they are incapable of producing functional proteins, let alone infectious viral particles. The exceptions are notable: a small number of HERV-K(HML-2) proviruses retain intact or nearly intact open reading frames and can express viral proteins under certain conditions, including in some cancers and in placental tissue.6, 7
The comparative analysis of the chimpanzee genome in 2005 provided a genome-wide perspective on ERV evolution in the two most closely related hominid species. While the vast majority of ERV loci are shared at orthologous positions—consistent with their integration before the human-chimpanzee divergence roughly 6 to 7 million years ago—each species also carries lineage-specific insertions reflecting ongoing retroviral activity after the split. Notably, the chimpanzee genome contains two active ERV lineages (PtERV1 and PtERV2) that have no counterpart in the human genome, representing retroviral infections that occurred exclusively in the chimpanzee lineage.9
Composition of transposable elements in the human genome1
Significance for common descent
Endogenous retroviruses occupy a unique position among the multiple independent lines of evidence for primate common ancestry. Unlike morphological comparisons, which can be complicated by convergent evolution, or protein-coding gene comparisons, which are subject to natural selection, the great majority of ERV insertions behave as selectively neutral markers that record discrete, irreversible historical events.3, 10 An ERV insertion is a one-time event: once a provirus integrates at a particular genomic position, it remains there indefinitely (barring rare deletion events), and it is inherited by all descendants of the organism in whose germline the insertion occurred. There is no known mechanism by which an ERV can be precisely excised from one species' genome and inserted at the identical position in an unrelated species' genome.3
The evidence from ERVs thus operates at multiple levels. At the simplest level, the sharing of individual ERV insertions at orthologous positions in two or more species is evidence that those species share a common ancestor. At a deeper level, the hierarchical, nested distribution of shared ERVs across the primate order—with progressively fewer shared insertions as one compares more distantly related species—is evidence for the branching pattern of primate evolution. And at the most detailed level, the phylogenetic trees constructed from ERV sequences match trees constructed from entirely independent data sources, providing a powerful test of the evolutionary hypothesis that has been passed repeatedly.3, 4, 10
The co-option of ERV genes for host functions such as placentation adds another dimension. It demonstrates that the evolutionary process is not merely a passive accumulation of genetic debris but an active repurposing of available genetic material—turning ancient parasites into essential components of mammalian biology.13, 16, 17 The study of endogenous retroviruses reveals a genome that is not a pristine blueprint but a palimpsest, overwritten again and again by the molecular events of evolutionary history, with each layer of viral inscription recording the ancestry of its host.2, 6
References
Phylogeny of a novel family of human endogenous retrovirus sequences, HERV-W, in humans and other primates
Syncytin is a captive retroviral envelope protein involved in human placental morphogenesis
Human endogenous retroviruses are ancient acquired elements still shaping innate immune responses
An envelope glycoprotein of the human endogenous retrovirus HERV-W is expressed in the human placenta and fuses cells expressing the type D mammalian retrovirus receptor
Syncytin-A knockout mice demonstrate the critical role in placentation of a fusogenic, endogenous retrovirus-derived, envelope gene
A pair of co-opted retroviral envelope syncytin genes is required for formation of the two-layered murine placental syncytiotrophoblast