ANCESTRY INFERENCE William Zeng

Page created by Dwight Ramos
 
Motivation
• Analyze genotype data in recently admixed populations
  • African-Americans, Latinos, Mozabites
• Disease mapping
• Historical insight
Terminology
• Admixed populations
  • Genes from two distinct groups of ancestors
  • At each locus, 0, 1, or 2 alleles inherited from one group of ancestors
• Haplotype
  • A group of SNPs normally inherited together
• Genetic marker
  • A DNA sequence at a known location to differentiate between
    individuals
Terminology Continued
• Linkage Disequilibrium
  • Non-random associations between alleles of different loci
• Li and Stephens
  • Population genetic model of recombination’s role in LD, formulated as an HMM
• Hidden Markov Models
  • Formal foundation for probabilistic models
  • Building block of computational sequence analysis
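
As a concrete illustration of linkage disequilibrium, the sketch below computes two standard summary statistics, D and r², for a pair of biallelic loci from phased haplotypes. The function name and the 0/1 allele encoding are assumptions made for this example, not something taken from the slides.

```python
def linkage_disequilibrium(haplotypes):
    """Return (D, r^2) for two biallelic loci.

    haplotypes: list of (a, b) pairs, where a and b are 0/1 alleles
    observed on the same phased haplotype (hypothetical encoding).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n    # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n    # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    # D measures the departure from independence between the two loci.
    d = p_ab - p_a * p_b

    # r^2 rescales D^2 by the allele-frequency variances; 0 if a locus is monomorphic.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, r2
```

Two loci whose alleles always travel together give r² = 1; loci whose alleles combine at random give D = 0.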
Ancestry Inference Models
• Price et al. 2009
  • HAPMIX
• Maples et al. 2013
  • RFMix
• Baran et al. 2012
  • LAMP-LD and LAMP-HAP
• Yang et al. 2012
  • Spatial Ancestry Analysis
HAPMIX
• Haplotype-based method
• Extension of Li-Stephens population genetic model
• Phase data from reference populations
  • Model ancestral populations
  • Obtained from HapMap
• Use HMM to estimate ancestry of each locus
• Average ancestry inferences over all phase solutions
  • Doesn’t assume any one solution for admixed data is correct
• Two-way admixture
  • Analyzed African-Americans and Mozabites of North Africa
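
To make "use HMM to estimate ancestry of each locus" concrete, here is a minimal forward-backward sketch for two-way admixture on a single phased haplotype. It is not the HAPMIX algorithm: it assumes known per-population allele frequencies and independent emissions given ancestry, whereas HAPMIX copies from reference haplotypes in the Li-Stephens manner. All names and the `switch_prob` parameter are hypothetical.

```python
def posterior_ancestry(alleles, freqs, switch_prob=0.01, prior=(0.5, 0.5)):
    """Posterior probability that each locus derives from population 1.

    alleles: list of 0/1 alleles on one phased haplotype.
    freqs:   freqs[k][i] = frequency of allele 1 at locus i in population k;
             values must lie strictly between 0 and 1.
    switch_prob: assumed per-step probability of an ancestry switch.
    """
    n = len(alleles)

    def emit(k, i):
        f = freqs[k][i]
        return f if alleles[i] == 1 else 1.0 - f

    def normalize(v):
        s = sum(v)
        return [x / s for x in v]

    # Forward pass, rescaled at every locus to avoid underflow.
    fwd = [normalize([prior[k] * emit(k, 0) for k in (0, 1)])]
    for i in range(1, n):
        prev = fwd[-1]
        cur = [(prev[k] * (1 - switch_prob) + prev[1 - k] * switch_prob) * emit(k, i)
               for k in (0, 1)]
        fwd.append(normalize(cur))

    # Backward pass, also rescaled.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        nxt = bwd[i + 1]
        cur = [(1 - switch_prob) * emit(k, i + 1) * nxt[k]
               + switch_prob * emit(1 - k, i + 1) * nxt[1 - k]
               for k in (0, 1)]
        bwd[i] = normalize(cur)

    # Combine both passes into per-locus posteriors for population 1.
    return [normalize([fwd[i][k] * bwd[i][k] for k in (0, 1)])[1] for i in range(n)]
```

On a haplotype whose first half matches population 0 and second half matches population 1, the posterior moves from near 0 to near 1 across the switch point.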
HAPMIX vs Other Models
• ANCESTRYMAP
  • HMM on unlinked SNPs
• LAMP
  • Majority vote of ancestry information from windows of SNPs
• Neither utilizes the rich information available in haplotypes
• HAPAA
  • Also used HMM to model LD
  • Doesn’t allow for rate of miscopying from wrong population
  • Doesn’t allow unphased data
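
The "majority vote over windows" idea attributed to LAMP above can be sketched as follows, assuming each overlapping window has already been assigned a single ancestry label by some upstream step. The function name and the `(start, end, label)` window format are hypothetical, and the per-window labeling itself is not shown.

```python
from collections import Counter

def majority_vote(n_snps, window_calls):
    """Assign each SNP the ancestry label chosen by most windows covering it.

    window_calls: list of (start, end, label) tuples with end exclusive.
    SNPs covered by no window get None. Ties go to the label seen first.
    """
    votes = [Counter() for _ in range(n_snps)]
    for start, end, label in window_calls:
        for i in range(start, min(end, n_snps)):
            votes[i][label] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]
```

Because every SNP is decided from window-level labels alone, a scheme like this ignores the ordering of alleles along a haplotype, which is the limitation the slide points out.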
HAPMIX Results
• Very accurate compared to other techniques
• Supports inference about ancestral populations and the date of admixture
• Mozabites
  • 78% European, 22% sub-Saharan African
  • Admixture at least 2800 years ago
Pros and Cons of HAPMIX
• Pros
  • Model LD explicitly
  • Accurate
• Cons
  • Can only consider two ancestral populations
  • Requires large reference population
  • Slow
RFMix
• Powerful enough to detect sub-continental level ancestry
  • Europeans genetically heterogeneous, as are Latinos
• ~30x faster than other methods
• Conditional random field parameterized by random forests
  trained on reference panels.
• Learn from admixed population
  • Incorporate its ancestry information with reference population’s
• Autocorrects phasing errors
  • Jointly models phasing and ancestry
RFMix Model
RFMix Continued
• Discriminative modeling approach instead of generative
• Model P(Y|X) instead of estimating P(Y,X) and using
  Bayes’ rule to estimate P(Y|X)
• Discriminative models have lower asymptotic error
• Generative models perform better with sparse data
  • Amount of available human genomic data is expected to increase
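
For contrast with the discriminative route, the sketch below shows the generative route on a toy discrete problem: a class prior P(Y) and a likelihood P(X|Y) are combined through Bayes' rule to obtain P(Y|X). The function name and the dictionary layout are assumptions for illustration. A discriminative model would instead fit P(Y|X) directly, as RFMix does with random forests.

```python
def posterior_from_generative(prior, likelihood, x):
    """Compute P(Y | X = x) from P(Y) and P(X | Y) via Bayes' rule.

    prior:      dict mapping each label y to P(Y = y).
    likelihood: dict mapping each label y to a dict {x: P(X = x | Y = y)}.
    """
    # Joint P(Y = y, X = x) for every label.
    joint = {y: prior[y] * likelihood[y][x] for y in prior}
    # Marginal P(X = x) is the sum of the joint over labels.
    evidence = sum(joint.values())
    return {y: joint[y] / evidence for y in joint}
```

The generative model has to get both the prior and the likelihood right, which is one intuition for why it can do well on sparse data yet carry more asymptotic error when its assumptions are off.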
RFMix Results
• African Americans contain Native American ancestry
 (0.4%)
 • RFMix able to detect low-occurrence ancestry
• RFMix more accurate than LAMP-HAP (95.6% vs 93.7%)
• 33 times faster than LAMP-HAP, 1.7 times faster than
 SupportMix
 • Discrete window approach allows parallelization and multithreading
Pros and Cons of RFMix
• Pros
  • Fast
  • Discriminative rather than generative
  • Works with more than two ancestral populations
  • Does not require large proxy reference population
  • Utilizes haplotype information
• Cons
  • Requires more data
LAMP-LD and LAMP-HAP
• Other methods only model two-way admixtures
  • Latinos: Admix of Europeans, Native Americans, Africans
• HMM of haplotype information in novel window-based
 framework
  • No ancestry switch in windows
  • Relaxed in post-processing step
• HMM size does not depend on the number of ancestral
  populations; it is set by a constant
• Approximates Li-Stephens model
• Other models for multi-way admixture
  • WINPOP doesn’t utilize haplotype information, more error
  • GEDI-ADMX uses different framework for inferring ancestry
LAMP-LD and LAMP-HAP Continued
• LAMP-LD: Models LD in ancestral populations
  • Similar to HAPMIX
• LAMP-HAP: Accounts for Mendelian segregation in
 nuclear family trios
 • Four alleles: paternal transmitted, paternal un-transmitted,
                maternal transmitted, maternal un-transmitted
• East Asian haplotypes used as a proxy for Native Americans
  • Less accurate
• 56 min to construct reference panels, then 6.5 s to process
 each genotype
 • 89 s per genotype for HAPMIX
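
Taking the timings above at face value, a quick calculation shows when the one-time panel construction pays for itself. This sketch assumes the baseline (HAPMIX here) has no setup cost of its own, which the slide does not state. The function name is hypothetical.

```python
import math

def break_even_genotypes(setup_s, per_genotype_s, baseline_per_genotype_s):
    """Smallest number of genotypes at which setup + per-genotype cost
    becomes strictly cheaper than a baseline with no setup cost.

    Returns None if the per-genotype cost is not lower than the baseline.
    """
    saving = baseline_per_genotype_s - per_genotype_s
    if saving <= 0:
        return None
    # Need setup_s + n * per_genotype_s < n * baseline_per_genotype_s,
    # i.e. n > setup_s / saving.
    return math.floor(setup_s / saving) + 1

# 56 min setup, 6.5 s per genotype, versus 89 s per genotype.
n = break_even_genotypes(56 * 60, 6.5, 89.0)
```

With these figures `n` is 41, so under that assumption the window-based approach is ahead once more than about 40 genotypes are processed.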
LAMP Results
• Larger reference set & more accurate proxy population ->
  superior accuracy
• Presence of European gene flow in Native Americans
 • If smaller than 6%, not statistically significant
Pros and Cons of LAMP
• Pros
  • Can model more than two ancestral populations
  • Moderately fast
  • LAMP-HAP utilizes family information in inferring ancestry
• Cons
  • Requires large reference populations
  • Not as accurate as HAPMIX
Spatial Ancestry Analysis (SPA)
• Model-based approach for spatial structure in 2D/3D
  space
• Allele frequency distribution as function over geographic
  space
• Principal component analysis (PCA) is not model-based;
  it would estimate an admixed individual’s location at the
  halfway point
• Iteratively estimate slope of allele function, then update
  positions of individuals
• SNPs with large gradients in allele frequency are
  candidate regions under selection
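
A minimal sketch of the idea of allele frequency as a function over geographic space, assuming a logistic slope function f(x) = 1 / (1 + exp(-(a·x + b))) for each SNP. It shows only the position-update half of the iteration, with slopes held fixed, and uses a coarse grid search in place of a proper optimizer. All function names and the grid are hypothetical.

```python
import math
from itertools import product

def allele_frequency(position, slope, intercept):
    """Logistic allele frequency at a location (assumed functional form)."""
    z = sum(a * x for a, x in zip(slope, position)) + intercept
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(position, genotypes, slopes, intercepts):
    """Binomial log-likelihood of 0/1/2 genotypes at a location,
    dropping the constant binomial coefficient."""
    total = 0.0
    for g, a, b in zip(genotypes, slopes, intercepts):
        f = allele_frequency(position, a, b)
        total += g * math.log(f) + (2 - g) * math.log(1.0 - f)
    return total

def locate_individual(genotypes, slopes, intercepts, grid):
    """Pick the grid point that maximizes the likelihood (slopes fixed)."""
    candidates = product(grid, repeat=len(slopes[0]))
    return max(candidates,
               key=lambda pos: log_likelihood(pos, genotypes, slopes, intercepts))

def slope_magnitude(slope):
    """Euclidean norm of a SNP's slope vector: steeper means a larger gradient."""
    return math.sqrt(sum(a * a for a in slope))
```

For example, with one SNP whose frequency rises along the first axis and another rising along the second, an individual homozygous for the first allele and lacking the second is placed at the high end of axis one and the low end of axis two. The grid must stay small enough that the frequencies do not round to exactly 0 or 1.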
2D Modeling of Europe
3D Modeling of the Globe
Allele SPA Scores
SPA Results
• Geographic location <-> allele frequency functions
• Some SNPs have large gradients in allele frequency
  • Candidate regions under selection (LCT, FOXP2)
  • Found by other methods that detect positive selection
• Other models available besides slope functions
• Able to find more signals than other methods
  • FST, iHS, Bayenv
Summary
• RFMix outperforms HAPMIX and LAMP in terms of speed,
  accuracy, scaling, and ability to handle multiple ancestral
  populations
• SPA able to infer geographic location of individuals