Reference genome
A reference genome (also known as a reference assembly) is a digital nucleic acid sequence database, assembled by scientists as a representative example of the set of genes in one idealized individual organism of a species. Because they are assembled from the sequenced DNA of a number of individual donors, reference genomes do not accurately represent the set of genes of any single individual organism. Instead, a reference provides a haploid mosaic of different DNA sequences from each donor. There are reference genomes for multiple species of viruses, bacteria, fungi, plants, and animals.
For example, the human reference genome, GRCh38, from the Genome Reference Consortium is derived from thirteen anonymous volunteers.[1]
As the cost of DNA sequencing falls and new full genome sequencing technologies emerge, more genome sequences continue to be generated. Reference genomes are typically used as a guide on which new genomes are built, enabling them to be assembled much more quickly and cheaply than in the initial Human Genome Project. Most individuals who have had their entire genome sequenced, such as James D. Watson, had their genome assembled in this manner.[2][3] For much of a genome, the reference provides a good approximation of the DNA of any single individual. But in regions with high allelic diversity, such as the major histocompatibility complex in humans and the major urinary proteins of mice, the reference genome may differ significantly from other individuals.[4][5][6] Comparison between the reference (build 36) and Watson's genome revealed 3.3 million single-nucleotide polymorphism differences, while about 1.4 percent of his DNA could not be matched to the reference genome at all.[7][2] For regions known to have large-scale variation, sets of alternate loci are assembled alongside the reference locus.
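The kind of comparison described above can be illustrated with a toy example. The following is a minimal sketch, assuming the two sequences have already been aligned to the same length; real comparisons rely on read alignment and variant calling rather than a position-by-position scan. The function name and the short sequences are hypothetical and are not taken from any cited study.

```python
def count_single_base_differences(reference: str, sample: str) -> int:
    """Count positions where two pre-aligned, equal-length sequences differ."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    differences = 0
    for ref_base, sample_base in zip(reference.upper(), sample.upper()):
        # Skip positions where either sequence is unknown (N) or gapped (-).
        if ref_base in "N-" or sample_base in "N-":
            continue
        if ref_base != sample_base:
            differences += 1
    return differences


# Hypothetical toy sequences: they differ at two positions.
toy_reference = "ACGTACGTAC"
toy_sample = "ACGTTCGTAA"
print(count_single_base_differences(toy_reference, toy_sample))  # prints 2
```

Positions that cannot be compared are skipped rather than counted, which loosely mirrors the distinction drawn above between single-base differences and DNA that could not be matched to the reference at all.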
Reference genomes can be accessed online at several locations, using dedicated browsers such as Ensembl or the UCSC Genome Browser.[8]
Properties of reference genomes
Measures of length
The length of a genome can be measured in multiple different ways.
A simple way to measure genome length is to count the number of base pairs in the assembly.[9]
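As an illustration, base pairs can be counted directly from a FASTA-formatted assembly. The sketch below is a minimal example under stated assumptions: the function name, the `count_unknown` parameter, and the toy records are hypothetical, and treating `N` characters as placeholders for unknown bases is a convention assumed here rather than something specified by the sources cited in this section.

```python
import io


def assembly_length(fasta_handle, count_unknown=True):
    """Sum the sequence lengths of all records in a FASTA-formatted text stream."""
    total = 0
    for line in fasta_handle:
        line = line.strip()
        # Header lines start with '>' and carry no sequence.
        if not line or line.startswith(">"):
            continue
        if count_unknown:
            total += len(line)
        else:
            # Exclude N characters, assumed here to mark unknown bases.
            total += sum(1 for base in line if base.upper() != "N")
    return total


# Hypothetical two-record assembly used only for demonstration.
toy_fasta = ">chrA\nACGTNNNNAC\nGT\n>chrB\nGGCC\n"
print(assembly_length(io.StringIO(toy_fasta)))                       # prints 16
print(assembly_length(io.StringIO(toy_fasta), count_unknown=False))  # prints 12
```

For a real assembly, the same function could be given an open file handle instead of the in-memory string.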
The golden path is an alternative measure of length that omits redundant regions such as haplotypes and pseudoautosomal regions.[10][11] It is usually constructed by layering sequencing information over a physical map to combine scaffold information. It is a 'best estimate' of what the genome will look like and typically includes gaps, making it longer than the typical base pair assembly.[12]
Mammalian genomes
The human and mouse reference genomes are maintained and improved by the Genome Reference Consortium (GRC), a group of fewer than 20 scientists from a number of genome research institutes, including the European Bioinformatics Institute, the National Center for Biotechnology Information, the Sanger Institute, and the McDonnell Genome Institute at Washington University in St. Louis. The GRC continues to improve reference genomes by building new alignments that contain fewer gaps and by fixing misrepresentations in the sequence.
Human reference genome
The human reference genome GRCh38 was released by the Genome Reference Consortium on 17 December 2013.[13] This build contained around 250 gaps, whereas the first version had roughly 150,000 gaps.[1] The GRCh38 assembly saw the closure or reduction of more than 100 gaps. Nanopore sequencing with ultra-long reads has closed 12 gaps in the GRCh38 reference assembly.[14]
The human reference genome is derived from thirteen anonymous volunteers from Buffalo, New York. Donors were recruited by advertisement in The Buffalo News on Sunday, March 23, 1997. The first ten male and ten female volunteers were invited to make an appointment with the project's genetic counselors and donate blood, from which DNA was extracted. As a result of how the DNA samples were processed, about 80 percent of the reference genome came from eight people, and one male, designated RP11, accounts for 66 percent of the total. The ABO blood group system differs among humans, but the human reference genome contains only an O allele (although the others are annotated).[15][1][16][17][7]
There are limitations to the human reference genome due to the fact that it is a single, distinct sequence; it is named a "reference" for this reason. Its main purpose is to serve as an index, or locator, for genetic features. The 1000 Genomes Project is creating a database to provide information about the variations in genomes across the human population.[18]
Recent genome assemblies are as follows:[19]
Release name | Date of release | Equivalent UCSC version |
---|---|---|
GRCh38 | Dec 2013 | hg38 |
GRCh37 | Feb 2009 | hg19 |
NCBI Build 36.1 | Mar 2006 | hg18 |
NCBI Build 35 | May 2004 | hg17 |
NCBI Build 34 | Jul 2003 | hg16 |
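Because the same assembly is known by different names in different resources, a small lookup table can help translate between them. The following is a minimal sketch built only from the rows of the table above; the dictionary and function names are hypothetical, and no external service or library is assumed.

```python
# Release names and their equivalent UCSC versions, copied from the table above.
RELEASE_TO_UCSC = {
    "GRCh38": "hg38",
    "GRCh37": "hg19",
    "NCBI Build 36.1": "hg18",
    "NCBI Build 35": "hg17",
    "NCBI Build 34": "hg16",
}

# Reverse mapping, so lookups work in either direction.
UCSC_TO_RELEASE = {ucsc: release for release, ucsc in RELEASE_TO_UCSC.items()}


def equivalent_name(name: str) -> str:
    """Return the UCSC version for a release name, or the release name for a UCSC version."""
    if name in RELEASE_TO_UCSC:
        return RELEASE_TO_UCSC[name]
    if name in UCSC_TO_RELEASE:
        return UCSC_TO_RELEASE[name]
    raise KeyError(f"unknown assembly name: {name!r}")


print(equivalent_name("GRCh37"))  # prints hg19
print(equivalent_name("hg38"))    # prints GRCh38
```

Names are matched exactly, so a caller would need to normalize capitalization or spacing before the lookup.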
Mouse reference genome
Recent mouse genome assemblies are as follows:[19]
Release name | Date of release | Equivalent UCSC version |
---|---|---|
GRCm38 | Dec 2011 | mm10 |
NCBI Build 37 | Jul 2007 | mm9 |
NCBI Build 36 | Feb 2006 | mm8 |
NCBI Build 35 | Aug 2005 | mm7 |
NCBI Build 34 | Mar 2005 | mm6 |
References
- Editorial (May 2010). "E pluribus unum". Nature Methods. 7 (5): 331. doi:10.1038/nmeth0510-331. PMC 415373. PMID 331.
- Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song XZ, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM (2008). "The complete genome of an individual by massively parallel DNA sequencing". Nature. 452 (7189): 872–6. Bibcode:2008Natur.452..872W. doi:10.1038/nature06884. PMID 18421352.
- The exception to this is J. Craig Venter whose DNA was sequenced and assembled using shotgun sequencing methods.
- MHC Sequencing Consortium (1999). "Complete sequence and gene map of a human major histocompatibility complex". Nature. 401 (6756): 921–923. Bibcode:1999Natur.401..921T. doi:10.1038/44853. PMID 10553908. S2CID 186243515.
- Logan DW, Marton TF, Stowers L (2008). Vosshall LB (ed.). "Species specificity in major urinary proteins by parallel evolution". PLOS ONE. 3 (9): e3280. Bibcode:2008PLoSO...3.3280L. doi:10.1371/journal.pone.0003280. PMC 2533699. PMID 18815613.
- Hurst J, Beynon RJ, Roberts SC, Wyatt TD (October 2007). Urinary Lipocalins in Rodenta: is there a Generic Model?. Chemical Signals in Vertebrates 11. Springer New York. ISBN 978-0-387-73944-1.
- Wade, Nicholas (May 31, 2007). "Genome of DNA Pioneer Is Deciphered". New York Times. Retrieved February 21, 2009.
- Flicek P, Aken BL, Beal K, et al. (January 2008). "Ensembl 2008". Nucleic Acids Res. 36 (Database issue): D707–14. doi:10.1093/nar/gkm988. PMC 2238821. PMID 18000006.
- "Help - Glossary - Homo sapiens - Ensembl genome browser 87". www.ensembl.org.
- "Golden path length | VectorBase". www.vectorbase.org. Retrieved 2016-12-12.
- "Help - Glossary - Homo sapiens - Ensembl genome browser 87". www.ensembl.org.
- "Whole assembly vs Golden path length in Ensembl? - SEQanswers". seqanswers.com. Retrieved 2016-12-12.
- NCBI. "GRCh38 - hg38 - Genome - Assembly - NCBI". ncbi.nlm.nih.gov. Retrieved 2019-03-15.
- Jain, Miten; Koren, Sergey; Miga, Karen H; Quick, Josh; Rand, Arthur C; Sasani, Thomas A; Tyson, John R; Beggs, Andrew D; Dilthey, Alexander T (2018-01-29). "Nanopore sequencing and assembly of a human genome with ultra-long reads". Nature Biotechnology. 36 (4): 338–345. doi:10.1038/nbt.4060. ISSN 1546-1696. PMC 5889714. PMID 29431738.
- Scherer, Stewart (2008). A short guide to the human genome. CSHL Press. p. 135. ISBN 978-0-87969-791-4.
- Ballouz, Sara; Dobin, Alexander; Gillis, Jesse A. (9 August 2019). "Is it time to change the reference genome?". Genome Biology. 20 (1). doi:10.1186/s13059-019-1774-4. PMID 31399121.
- Rosenfeld, Jeffrey A.; Mason, Christopher E.; Smith, Todd M.; Seo, Jeong-Sun (11 July 2012). "Limitations of the Human Reference Genome for Personalized Genomics". PLOS ONE. 7 (7): e40294. Bibcode:2012PLoSO...740294R. doi:10.1371/journal.pone.0040294. PMC 3394790. PMID 22811759.
- https://www.internationalgenome.org/home
- "UCSC Genome Bioinformatics: FAQ". genome.ucsc.edu. Retrieved 2016-08-18.