ANCESTRY INFERENCE William Zeng
Motivation • Analyze genotype data in recently admixed populations • African-Americans, Latinos, Mozabites • Disease mapping • Historical insight
Terminology • Admixed populations • Genes from two distinct groups of ancestors • At each locus, 0, 1, or 2 alleles inherited from one group of ancestors • Haplotype • A group of SNPs normally inherited together • Genetic marker • A DNA sequence at a known location, used to differentiate between individuals
Terminology Continued • Linkage Disequilibrium • Non-random associations between alleles of different loci • Li and Stephens • Population genetic model, recombination’s role in LD, HMM • Hidden Markov Models • Formal foundation for probabilistic models • Building block of computational sequence analysis
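As a small illustration of "non-random associations between alleles of different loci", the sketch below computes two common LD summaries, D and r², from phased biallelic haplotypes. The function name and the toy haplotypes are made up for this example and are not taken from the slides.

```python
def linkage_disequilibrium(haplotypes, i, j):
    """Return (D, r_squared) between biallelic loci i and j.

    haplotypes: list of sequences of 0/1 alleles, one per phased haplotype.
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n   # frequency of allele 1 at locus i
    p_b = sum(h[j] for h in haplotypes) / n   # frequency of allele 1 at locus j
    # frequency of haplotypes carrying allele 1 at both loci
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    d = p_ab - p_a * p_b                      # zero when the loci are independent
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, r_squared


# Toy data (hypothetical): the two loci are perfectly correlated.
toy_haplotypes = [(1, 1), (1, 1), (0, 0), (0, 0)]
d, r_squared = linkage_disequilibrium(toy_haplotypes, 0, 1)
print(d, r_squared)
```

A value of r² near 1 means one locus almost determines the other; a value near 0 means the loci behave as if unlinked.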
Ancestry Inference Models • Price et al. 2009 • HAPMIX • Maples et al. 2013 • RFMix • Baran et al. 2012 • LAMP-LD and LAMP-HAP • Yang et al. 2012 • Spatial Ancestry Analysis
HAPMIX • Haplotype-based method • Extension of Li-Stephens population genetic model • Phase data from reference populations • Model ancestral populations • Obtained from HapMap • Use HMM to estimate ancestry of each locus • Average ancestry inferences over all phase solutions • Doesn’t assume any one solution for admixed data is correct • Two-way admixture • Analyzed African-Americans and Mozabites of North Africa
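To make "use HMM to estimate ancestry of each locus" concrete, here is a deliberately simplified sketch. It is not HAPMIX: HAPMIX copies from phased reference haplotypes under a Li-Stephens-style model, whereas this toy treats each locus as emitting an allele from a per-ancestry allele frequency and uses a single switch probability between adjacent loci. The function name, `switch_prob`, and all numbers are assumptions chosen for illustration.

```python
def local_ancestry_posteriors(alleles, freqs, switch_prob=0.05, prior=(0.5, 0.5)):
    """Posterior probability of each ancestry at each locus (forward-backward).

    alleles: 0/1 alleles along one haplotype.
    freqs[k][l]: assumed frequency of allele 1 at locus l in ancestry k.
    """
    n_states, n_loci = len(freqs), len(alleles)

    def emit(k, l):
        p = freqs[k][l]
        return p if alleles[l] == 1 else 1.0 - p

    def trans(a, b):
        # Stay in the same ancestry, or switch uniformly to another one.
        return 1.0 - switch_prob if a == b else switch_prob / (n_states - 1)

    def normalise(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward pass, rescaled at every locus to avoid underflow.
    fwd = [normalise([prior[k] * emit(k, 0) for k in range(n_states)])]
    for l in range(1, n_loci):
        prev = fwd[-1]
        fwd.append(normalise([
            emit(k, l) * sum(prev[a] * trans(a, k) for a in range(n_states))
            for k in range(n_states)
        ]))

    # Backward pass, also rescaled.
    bwd = [[1.0] * n_states for _ in range(n_loci)]
    for l in range(n_loci - 2, -1, -1):
        bwd[l] = normalise([
            sum(trans(k, b) * emit(b, l + 1) * bwd[l + 1][b] for b in range(n_states))
            for k in range(n_states)
        ])

    # Combine both passes into per-locus posteriors.
    return [normalise([fwd[l][k] * bwd[l][k] for k in range(n_states)])
            for l in range(n_loci)]


# Toy example (hypothetical numbers): allele 1 is common in ancestry 0, rare in ancestry 1.
toy_freqs = [[0.9] * 6, [0.1] * 6]
toy_alleles = [1, 1, 1, 0, 0, 0]
posteriors = local_ancestry_posteriors(toy_alleles, toy_freqs)
for locus, post in enumerate(posteriors):
    print(locus, [round(p, 3) for p in post])
```

In this toy run the posterior favours ancestry 0 on the left half of the haplotype and ancestry 1 on the right half, which is the kind of per-locus call an ancestry HMM produces.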
HAPMIX vs Other Models • ANCESTRYMAP • HMM on unlinked SNPs • LAMP • Majority vote of ancestry information from windows of SNPs • Neither utilizes the rich information available in haplotypes • HAPAA • Also uses an HMM to model LD • Doesn’t allow for rate of miscopying from wrong population • Doesn’t allow unphased data
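The "majority vote ... from windows of SNPs" idea can be sketched as follows. This shows only the voting step, assuming each overlapping window has already been assigned a single ancestry label by some other procedure; the function name, window layout, and labels are hypothetical and not LAMP's actual implementation.

```python
from collections import Counter


def majority_vote_per_snp(n_snps, window_calls):
    """Assign each SNP the most frequent label among the windows covering it.

    window_calls: list of (start, end, label) tuples, with end exclusive.
    SNPs covered by no window get None.
    """
    votes = [Counter() for _ in range(n_snps)]
    for start, end, label in window_calls:
        for snp in range(max(0, start), min(end, n_snps)):
            votes[snp][label] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]


# Toy example: windows of 6 SNPs, sliding by 2 SNPs, with one label per window.
toy_calls = [(0, 6, "EUR"), (2, 8, "EUR"), (4, 10, "AFR"), (6, 12, "AFR"), (8, 14, "AFR")]
per_snp = majority_vote_per_snp(14, toy_calls)
print(per_snp)
```

Because overlapping windows disagree only near the true breakpoint, the vote localises the ancestry switch more finely than any single window does.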
HAPMIX Results • Very accurate compared to other techniques • Allows inference about ancestral populations and date of admixture • Mozabites • 78% European, 22% sub-Saharan African • Admixture at least 2800 years ago
Pros and Cons of HAPMIX • Pros • Model LD explicitly • Accurate • Cons • Can only consider two ancestral populations • Requires large reference population • Slow
RFMix • Powerful enough to detect sub-continental level ancestry • Europeans genetically heterogeneous, as are Latinos • ~30x faster than other methods • Conditional random field parameterized by random forests trained on reference panels. • Learn from admixed population • Incorporate its ancestry information with reference population’s • Autocorrects phasing errors • Jointly models phasing and ancestry
RFMix Model
RFMix Continued • Discriminative modeling approach instead of generative • Model P(Y|X) instead of estimating P(Y,X) and using Bayes’ rule to estimate P(Y|X) • Discriminative models have less asymptotic error • Generative models better with sparse data • Amount of available human genomic data expected to increase
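The two routes to P(Y|X) can be written out as below, taking Y to be the ancestry labels and X the observed haplotype data (this reading of the symbols is an assumption; the slide does not define them).

```latex
% Generative route: model the joint distribution, then apply Bayes' rule.
P(Y \mid X) \;=\; \frac{P(X, Y)}{P(X)}
           \;=\; \frac{P(X \mid Y)\, P(Y)}{\sum_{y'} P(X \mid y')\, P(y')}

% Discriminative route: fit the conditional directly with parameters \theta,
% without ever modelling P(X).
\hat{\theta} \;=\; \arg\max_{\theta} \sum_{n} \log P_{\theta}\!\left(y^{(n)} \mid x^{(n)}\right)
```

The generative route must describe how the data X arise, which helps when data are sparse; the discriminative route spends all of its capacity on the labels, which tends to pay off as training data grow.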
RFMix Results • African Americans contain Native American ancestry (0.4%) • RFMix able to detect low-occurrence ancestry • RFMix more accurate than LAMP-HAP (95.6% vs 93.7%) • 33 times faster than LAMP-HAP, 1.7 times faster than SupportMix • Discrete window approach allows parallelization and multithreading
Pros and Cons of RFMix • Pros • Fast • Discriminative rather than generative • Works with more than two ancestral populations • Does not require large proxy reference population • Utilizes haplotype information • Cons • Requires more data
LAMP-LD and LAMP-HAP • Other methods only model two-way admixtures • Latinos: Admix of Europeans, Native Americans, Africans • HMM of haplotype information in novel window-based framework • No ancestry switch in windows • Relaxed in post-processing step • HMM does not depend on number of ancestral populations, but rather a constant • Approximates Li-Stephens model • Other models for multi-way admixture • WINPOP doesn’t utilize haplotype information, more error • GEDI-ADMX uses different framework for inferring ancestry
LAMP-LD and LAMP-HAP Continued • LAMP-LD: Models LD in ancestral populations • Similar to HAPMIX • LAMP-HAP: Accounts for Mendelian segregation in nuclear family trios • Four alleles: paternal transmitted, paternal un-transmitted, maternal transmitted, maternal un-transmitted • East Asian haplotypes used as a proxy to model Native Americans • Less accurate • 56 min to construct reference panels, 6.5 s to process each genotype • 89 s per genotype for HAPMIX
LAMP Results • Larger reference set & more accurate proxy population -> superior accuracy • Presence of European gene flow in Native Americans • If smaller than 6%, not statistically significant
Pros and Cons of LAMP • Pros • Can model more than two ancestral populations • Moderately fast • LAMP-HAP utilizes family information in inferring ancestry • Cons • Requires large reference populations • Not as accurate as HAPMIX
Spatial Ancestry Analysis (SPA) • Model-based approach for spatial structure in 2D/3D space • Allele frequency distribution as function over geographic space • Principal component analysis (PCA) is not model-based and would place an admixed individual’s estimated location at the halfway point • Iteratively estimate slope of allele frequency function, then update positions of individuals • SNPs with large gradients in allele frequency are candidate regions under selection
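The "allele frequency as a function over geographic space" idea can be sketched with a logistic slope function, f(x) = 1 / (1 + exp(-(a·x + b))). The logistic form, the function names, and the grid search are assumptions made for this example. The sketch covers only the "update positions" half of the iteration: slopes are held fixed and one individual's location is chosen by maximum likelihood over a coarse 2D grid.

```python
import math


def allele_freq(pos, slope, intercept):
    """Assumed logistic allele frequency at geographic position pos."""
    z = sum(a * x for a, x in zip(slope, pos)) + intercept
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(pos, genotypes, snp_params):
    """Log-likelihood of genotypes (0/1/2 allele copies) at position pos.

    Each genotype is treated as Binomial(2, f); the constant binomial
    coefficient is dropped because it does not depend on pos.
    """
    total = 0.0
    for g, (slope, intercept) in zip(genotypes, snp_params):
        f = allele_freq(pos, slope, intercept)
        total += g * math.log(f) + (2 - g) * math.log(1.0 - f)
    return total


def locate(genotypes, snp_params, grid):
    """Return the grid position with the highest log-likelihood."""
    return max(grid, key=lambda pos: log_likelihood(pos, genotypes, snp_params))


# Toy setup (hypothetical): SNP 0 rises in frequency eastward, SNP 1 northward.
toy_params = [((4.0, 0.0), 0.0), ((0.0, 4.0), 0.0)]
toy_grid = [(x / 2, y / 2) for x in range(-2, 3) for y in range(-2, 3)]
# Two copies of the "eastern" allele and none of the "northern" allele.
best = locate([2, 0], toy_params, toy_grid)
print(best)
```

In the full method the slopes would be re-estimated from the updated positions and the two steps alternated. A SNP whose fitted slope vector is large has a steep frequency gradient across space, which is what flags it as a candidate for selection.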
2D Modeling of Europe
3D Modeling of the Globe
Allele SPA Scores
SPA Results • Geographic location ⇔ Allele frequency functions • Some SNPs have large gradients in allele frequency • Candidate regions under selection (LCT, FOXP2) • Found by other methods that detect positive selection • Other models available besides slope functions • Able to find more signals than other methods • FST, iHS, Bayenv
Summary • RFMix outperforms HAPMIX and LAMP in terms of speed, accuracy, scaling, and ability to sample multiple ancestral populations • SPA able to infer geographic location of individuals