Population structure (genetics)
Population structure (or population stratification) is the presence of a systematic difference in allele frequencies between subpopulations in a population as a result of non-random mating between individuals. It can be informative about genetic ancestry, and in the context of medical genetics it is an important confounding variable in genome-wide association studies (GWAS).
Causes
The basic cause of population structure in sexually reproducing species is non-random mating between groups: if all individuals within a population mate randomly, then the frequencies of alleles between groups should be similar. Population structure commonly arises from physical separation by distance or barriers, like mountains and rivers, followed by genetic drift. Other causes include gene flow from migrations, population bottlenecks and expansions, founder effects, evolutionary pressure, random chance, and (in humans) cultural factors.[1][2]
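The role of separation followed by genetic drift can be illustrated with a small simulation. The sketch below assumes a simple Wright–Fisher model (an assumption made here for illustration; the text names no specific model), and the function name `drift` and all parameter values are hypothetical. Two isolated subpopulations start at the same allele frequency and each resamples its own allele pool every generation with no gene flow, so their frequencies tend to wander apart.

```python
import random


def drift(p0, n_alleles, generations, rng):
    """Return the allele-frequency trajectory of one isolated subpopulation.

    Assumed model: Wright-Fisher drift with no selection, mutation or
    migration. p0 is the starting frequency, n_alleles the number of gene
    copies per generation.
    """
    freqs = [p0]
    p = p0
    for _ in range(generations):
        # Each gene copy in the next generation is drawn independently
        # from the current generation's allele pool.
        count = sum(1 for _ in range(n_alleles) if rng.random() < p)
        p = count / n_alleles
        freqs.append(p)
    return freqs


rng = random.Random(1)
pop_a = drift(0.5, 200, 100, rng)  # subpopulation on one side of a barrier
pop_b = drift(0.5, 200, 100, rng)  # subpopulation on the other side
print(f"after 100 generations: A={pop_a[-1]:.2f}, B={pop_b[-1]:.2f}")
```

Because the two trajectories are independent, any difference between the final frequencies arises from chance alone; smaller values of `n_alleles` would typically make the divergence faster.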
Association studies
Population structure can be a problem for association studies, such as case-control studies, where an association may be found because of the underlying structure of the population rather than a disease-associated locus. By analogy, one might imagine a scenario in which certain small beads are made of a unique type of foam, and children tend to choke on these beads; one might wrongly conclude that the foam causes choking when in fact the small size of the beads is responsible. Conversely, the real disease-causing locus might be missed if it is less prevalent in the population from which the case subjects are drawn. For this reason, it was common in the 1990s to use family-based data, where the effect of population structure can easily be controlled for using methods such as the transmission disequilibrium test (TDT). If the structure is known, or a putative structure is found, there are a number of ways to incorporate it into the association analysis and thus compensate for any population bias. Most contemporary genome-wide association studies take the view that the problem of population structure is manageable,[3] and that the logistical advantages of using unrelated cases and controls make these studies preferable to family-based association studies.
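The confounding described above can be reproduced with a short simulation. This is a minimal sketch under stated assumptions: two subpopulations that differ in both allele frequency and disease prevalence, and a marker with no effect on disease in either one. The helper names `allelic_chi2` and `simulate` and all numeric values are hypothetical and chosen only for illustration.

```python
import random


def allelic_chi2(case_a, case_total, ctrl_a, ctrl_total):
    """Pearson chi-square (1 df) for a 2x2 table of allele counts."""
    n = case_total + ctrl_total
    col_a = case_a + ctrl_a
    col_other = n - col_a
    if 0 in (case_total, ctrl_total, col_a, col_other):
        return 0.0  # degenerate table: nothing to test
    diff = case_a * n - case_total * col_a  # equals ad - bc for the table
    return n * diff ** 2 / (case_total * ctrl_total * col_a * col_other)


def simulate(n_people, allele_freq, prevalence, rng):
    """Return [case_a, case_total, ctrl_a, ctrl_total] allele counts."""
    counts = [0, 0, 0, 0]
    for _ in range(n_people):
        # Number of copies of allele A carried by a diploid individual.
        a_copies = (rng.random() < allele_freq) + (rng.random() < allele_freq)
        # Disease status is drawn independently of genotype: no causal effect.
        is_case = rng.random() < prevalence
        offset = 0 if is_case else 2
        counts[offset] += a_copies
        counts[offset + 1] += 2
    return counts


rng = random.Random(42)
pop1 = simulate(5000, allele_freq=0.2, prevalence=0.1, rng=rng)
pop2 = simulate(5000, allele_freq=0.6, prevalence=0.4, rng=rng)
pooled = [x + y for x, y in zip(pop1, pop2)]

chi2_pop1 = allelic_chi2(*pop1)
chi2_pop2 = allelic_chi2(*pop2)
chi2_pooled = allelic_chi2(*pooled)
print(f"within pop 1: {chi2_pop1:.2f}, within pop 2: {chi2_pop2:.2f}")
print(f"pooled, ignoring structure: {chi2_pooled:.2f}")
```

Within each subpopulation the statistic behaves like a null chi-square value, while the pooled analysis produces a very large statistic, because cases are drawn disproportionately from the subpopulation in which the allele happens to be common.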
Two widely used approaches to this problem are genomic control, which is a relatively nonparametric method for controlling the inflation of test statistics,[4] and structured association methods,[5] which use genetic information to estimate and control for population structure. Principal component analysis was shown to be effective by Alkes Price and colleagues.[6] It is also possible to correct for structure and for confounding from cryptic relatedness by deriving a kinship matrix and including it in a linear mixed model.[7][8]
Genomic control
The assumption of population homogeneity in association studies, especially case-control studies, can easily be violated and can lead to both type I and type II errors. It is therefore important for the models used in the study to compensate for population structure. The problem in case-control studies is that, if there is a genetic involvement in the disease, individuals in the case population are more likely to be related to one another than individuals in the control population. This means that the assumption of independence of observations is violated. Often this leads to an overestimation of the significance of an association, but the effect depends on the way the sample was chosen. If, coincidentally, an allele has a higher frequency in a subpopulation of the cases, that allele will appear associated with any trait that is more prevalent in the case population.[9] This kind of spurious association increases as the sample grows, so the problem is of special concern in large-scale association studies in which loci have only relatively small effects on the trait. A method that can compensate for these problems in some cases was developed by Devlin and Roeder (1999).[4] It uses both a frequentist and a Bayesian approach (the latter being appropriate when dealing with a large number of candidate genes).
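The frequentist way of correcting for population structure uses markers that are not linked to the trait in question to correct for any inflation of the test statistic caused by population structure. The method was first developed for binary traits but has since been generalized to quantitative ones.[10] For binary traits, where the goal is to find genetic differences between the case and control populations, Devlin and Roeder (1999) use Armitage's trend test
$$Y^2 = \frac{N\left(N(r_1 + 2r_2) - R(n_1 + 2n_2)\right)^2}{R(N-R)\left(N(n_1 + 4n_2) - (n_1 + 2n_2)^2\right)}$$
and the $\chi^2$ test for allelic frequencies
$$X_A^2 = \frac{2N\left(2N(r_1 + 2r_2) - 2R(n_1 + 2n_2)\right)^2}{4R(N-R)\left(2N(n_1 + 2n_2) - (n_1 + 2n_2)^2\right)}$$
where the counts $r_i$, $s_i$, $n_i$, $R$, $S$ and $N$ are defined in the following table: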
Genotype | aa | Aa | AA | total |
---|---|---|---|---|
Case | r0 | r1 | r2 | R |
Control | s0 | s1 | s2 | S |
total | n0 | n1 | n2 | N |
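Both statistics can be computed directly from the counts in the table. The sketch below assumes genotype scores 0, 1 and 2 for aa, Aa and AA in the trend test and treats the allelic test as a Pearson chi-square on the 2×2 table of allele counts; the function names `trend_test` and `allelic_test` and the example counts are hypothetical.

```python
def trend_test(r, s):
    """Armitage trend statistic Y^2 with genotype scores 0, 1, 2.

    r and s are (aa, Aa, AA) genotype counts for cases and controls.
    Assumes both groups are non-empty and both alleles are observed.
    """
    R, S = sum(r), sum(s)
    N = R + S
    n = [ri + si for ri, si in zip(r, s)]
    t = N * (r[1] + 2 * r[2]) - R * (n[1] + 2 * n[2])
    denom = R * S * (N * (n[1] + 4 * n[2]) - (n[1] + 2 * n[2]) ** 2)
    return N * t ** 2 / denom


def allelic_test(r, s):
    """Chi-square statistic X_A^2 comparing allele counts in cases and controls."""
    R, S = sum(r), sum(s)
    N = R + S
    n = [ri + si for ri, si in zip(r, s)]
    a_total = n[1] + 2 * n[2]  # copies of allele A in the whole sample
    t = 2 * N * (r[1] + 2 * r[2]) - 2 * R * a_total
    denom = 4 * R * S * (2 * N * a_total - a_total ** 2)
    return 2 * N * t ** 2 / denom


cases = (20, 50, 30)     # r0, r1, r2
controls = (40, 45, 15)  # s0, s1, s2
print(f"Y^2   = {trend_test(cases, controls):.3f}")
print(f"X_A^2 = {allelic_test(cases, controls):.3f}")
```

For these example counts the two values are close but not identical, since the pooled genotype counts deviate slightly from Hardy–Weinberg proportions.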
If the population is in Hardy–Weinberg equilibrium, the two statistics are approximately equal. Under the null hypothesis of no population stratification, the trend statistic $Y^2$ is asymptotically $\chi^2$ distributed with one degree of freedom. The idea is that stratification inflates the statistic by a factor $\lambda$, so that $Y^2 \sim \lambda \chi_1^2$, where $\lambda$ depends on the effect of stratification. The method rests on the assumption that the inflation factor $\lambda$ is constant across loci, which means that the loci should have roughly equal mutation rates, should not be under different selection in the two populations, and should not differ in the amount of Hardy–Weinberg disequilibrium, as measured by Wright's coefficient of inbreeding F. The last of these is of greatest concern. If the effect of stratification is similar across loci, $\lambda$ can be estimated from the unlinked markers as
$$\hat{\lambda} = \frac{\operatorname{median}(Y_1^2, Y_2^2, \ldots, Y_L^2)}{0.456}$$
where L is the number of unlinked markers. The denominator, approximately the median of the $\chi_1^2$ distribution, is derived from the gamma distribution and makes $\hat{\lambda}$ a robust estimator of $\lambda$. Other estimators have been suggested; for example, Reich and Goldstein[11] suggested using the mean of the statistics instead. According to Bacanu et al.,[12] the median-based estimate remains appropriate even if some of the unlinked markers are actually in disequilibrium with a disease-causing locus or are themselves associated with the disease. Under the null hypothesis, and when correcting for stratification using L unlinked markers, $Y^2/\hat{\lambda}$ is approximately $\chi_1^2$ distributed. With this correction the overall type I error rate should be approximately equal to $\alpha$ even when the population is stratified. Devlin and Roeder (1999)[4] mostly considered the situation where $\alpha = 0.05$, which gives a 95% confidence level, and not smaller p-values. Marchini et al. (2004)[13] demonstrated by simulation that genomic control can lead to an anti-conservative p-value if this value is very small and the two populations (case and control) are extremely distinct. This was especially a problem when the number of unlinked markers was on the order of 50–100, and it can result in false positives (at that significance level).
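The correction step can be sketched in a few lines: estimate the inflation factor as the median of the trend statistics at the unlinked markers divided by 0.456, then divide the statistic at the candidate marker by that estimate and refer it to a chi-square distribution with one degree of freedom. The function names, the example values and the option to floor the estimate at 1 are assumptions made for illustration; the floor in particular is a convention that is not stated in the text.

```python
import math
import statistics

CHI2_1_MEDIAN = 0.456  # divisor quoted from Devlin and Roeder (1999)


def inflation_factor(null_stats, floor_at_one=True):
    """Median-based estimate of lambda from statistics at unlinked markers."""
    lam = statistics.median(null_stats) / CHI2_1_MEDIAN
    # Assumption: never deflate, so an estimate below 1 is replaced by 1.
    return max(lam, 1.0) if floor_at_one else lam


def chi2_1_pvalue(x):
    """Upper-tail p-value of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def genomic_control(stat, null_stats):
    """Return (lambda estimate, corrected statistic, corrected p-value)."""
    lam = inflation_factor(null_stats)
    corrected = stat / lam
    return lam, corrected, chi2_1_pvalue(corrected)


# Hypothetical trend statistics at five unlinked markers and one candidate.
null_stats = [0.1, 0.5, 0.912, 1.8, 4.2]
lam, corrected, p = genomic_control(10.0, null_stats)
print(f"lambda = {lam:.2f}, corrected Y^2 = {corrected:.2f}, p = {p:.4f}")
```

In practice L would be far larger than five; as noted above, Marchini et al. found the correction unreliable for very small p-values when only 50–100 unlinked markers were used.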
Demographic inference
Population structure is an important aspect of evolutionary and population genetics. Events like migrations and interactions between groups leave a genetic imprint on populations. Admixed populations will have haplotype chunks from their ancestral groups, which gradually shrink over time because of recombination. By exploiting this fact and matching shared haplotype chunks from individuals within a genetic dataset, researchers may trace and date the origins of population admixture and reconstruct historic events such as the rise and fall of empires, slave trades, colonialism, and population expansions.[14]
Population structure can be inferred from genetic data using a variety of methods, such as dimensionality reduction and cluster analysis,[15][16] or by assuming a statistical model for the data and estimating its parameters using maximum likelihood estimation.[17]
Many statistical methods rely on simple population models in order to infer historical demographic changes, such as population bottlenecks, admixture events or population divergence times. Often these methods rely on the assumption of panmixia, or homogeneity in an ancestral population. Misspecification of such models, for instance by not taking into account the existence of structure in an ancestral population, can give rise to heavily biased parameter estimates.[18] Simulation studies show that historical population structure can produce genetic signals that are easily misinterpreted as historical changes in population size or as admixture events, even when no such events occurred.[19]
References
- Cardon LR, Palmer LJ (February 2003). "Population stratification and spurious allelic association". Lancet. 361 (9357): 598–604. doi:10.1016/S0140-6736(03)12520-2. PMID 12598158. S2CID 14255234.
- Gil McVean (2001). "Population Structure" (PDF). Archived from the original (PDF) on 2018-11-23. Retrieved 2020-11-14.
- Pritchard JK, Rosenberg NA (July 1999). "Use of unlinked genetic markers to detect population stratification in association studies". American Journal of Human Genetics. 65 (1): 220–8. doi:10.1086/302449. PMC 1378093. PMID 10364535.
- Devlin B, Roeder K (December 1999). "Genomic control for association studies". Biometrics. 55 (4): 997–1004. doi:10.1111/j.0006-341X.1999.00997.x. PMID 11315092.
- Pritchard JK, Stephens M, Rosenberg NA, Donnelly P (July 2000). "Association mapping in structured populations". American Journal of Human Genetics. 67 (1): 170–81. doi:10.1086/302959. PMC 1287075. PMID 10827107.
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (August 2006). "Principal components analysis corrects for stratification in genome-wide association studies". Nature Genetics. 38 (8): 904–9. doi:10.1038/ng1847. PMID 16862161. S2CID 8127858.
- Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF, et al. (February 2006). "A unified mixed-model method for association mapping that accounts for multiple levels of relatedness". Nature Genetics. 38 (2): 203–8. doi:10.1038/ng1702. PMID 16380716. S2CID 8507433.
- Loh PR, Tucker G, Bulik-Sullivan BK, Vilhjálmsson BJ, Finucane HK, Salem RM, et al. (March 2015). "Efficient Bayesian mixed-model analysis increases association power in large cohorts". Nature Genetics. 47 (3): 284–90. doi:10.1038/ng.3190. PMC 4342297. PMID 25642633.
- Lander ES, Schork NJ (September 1994). "Genetic dissection of complex traits". Science. 265 (5181): 2037–48. doi:10.1126/science.8091226. PMID 8091226.
- Bacanu SA, Devlin B, Roeder K (January 2002). "Association studies for quantitative traits in structured populations". Genetic Epidemiology. 22 (1): 78–93. doi:10.1002/gepi.1045. PMID 11754475.
- Reich DE, Goldstein DB (January 2001). "Detecting association in a case-control study while correcting for population stratification". Genetic Epidemiology. 20 (1): 4–16. doi:10.1002/1098-2272(200101)20:1<4::AID-GEPI2>3.0.CO;2-T. PMID 11119293.
- Bacanu SA, Devlin B, Roeder K (June 2000). "The power of genomic control". American Journal of Human Genetics. 66 (6): 1933–44. doi:10.1086/302929. PMC 1378064. PMID 10801388.
- Marchini J, Cardon LR, Phillips MS, Donnelly P (May 2004). "The effects of human population structure on large genetic association studies". Nature Genetics. 36 (5): 512–7. doi:10.1038/ng1337. PMID 15052271. S2CID 11694537.
- Hellenthal G, Busby GB, Band G, Wilson JF, Capelli C, Falush D, Myers S (February 2014). "A genetic atlas of human admixture history". Science. 343 (6172): 747–751. doi:10.1126/science.1243518. PMC 4209567. PMID 24531965.
- Patterson N, Price AL, Reich D (December 2006). "Population structure and eigenanalysis". PLoS Genetics. 2 (12): e190. doi:10.1371/journal.pgen.0020190. PMC 1713260. PMID 17194218.
- Frichot E, Mathieu F, Trouillon T, Bouchard G, François O (April 2014). "Fast and efficient estimation of individual ancestry coefficients". Genetics. 196 (4): 973–83. doi:10.1534/genetics.113.160572. PMC 3982712. PMID 24496008.
- Alexander DH, Novembre J, Lange K (September 2009). "Fast model-based estimation of ancestry in unrelated individuals". Genome Research. 19 (9): 1655–64. doi:10.1101/gr.094052.109. PMC 2752134. PMID 19648217.
- Scerri EM, Thomas MG, Manica A, Gunz P, Stock JT, Stringer C, et al. (August 2018). "Did Our Species Evolve in Subdivided Populations across Africa, and Why Does It Matter?". Trends in Ecology & Evolution. 33 (8): 582–594. doi:10.1016/j.tree.2018.05.005. PMC 6092560. PMID 30007846.
- Rodríguez W, Mazet O, Grusea S, Arredondo A, Corujo JM, Boitard S, Chikhi L (December 2018). "The IICR and the non-stationary structured coalescent: towards demographic inference with arbitrary changes in population structure". Heredity. 121 (6): 663–678. doi:10.1038/s41437-018-0148-0. PMC 6221895. PMID 30293985.