Targeted next generation sequencing for population genomics and phylogenomics in Ambystomatid salamanders
Eric M. O’Neill
David W. Weisrock
Photograph by Stephen Dalton/Animals Animals - Earth Scenes
Preliminary Results
Ambystoma tigrinum complex
Coalescent Processes
• Stochastic
• Incomplete lineage sorting
• Gene tree incongruence
• Capture variance
• Many loci
Degnan and Rosenberg, 2006 PLOS Genetics
Goals
• Sequence >100 independent loci from 100s of samples
– both alleles
• Population genetics
• Species delimitation
• Gene phylogenies
• Species phylogeny
Jeremiah Smith
Past Option
• Sanger Sequencing
– expensive
– cloning or computational phasing alleles
– low throughput
454 (Roche) Next Generation Sequencing
1 million reads × 400 bp each = 400 Million bp
Meyer et al. 2008 Nature Protocols
Barcoding
Methods
• Screened ~250 EST loci across 16 representative samples
• Found >100 variable loci that amplify well at the same temperature
• Amplified 95 loci for one individual in one plate
• 94 individuals
– 8930 amplicons
• Pooled across 95 loci for each individual
• Barcoded 94 individuals and pooled
• UKY-AGTC: 454 Libraries, emPCR, 454 sequencing
Preliminary Results
• Two test runs: 1/8th picotiter plate
– 65K + 20K sequences
• One final run: 1/4th picotiter plate
– 225K sequences
• Total ~ 300K sequences
• Coverage of about 34X per sample per locus
• Sorted >95%
1664 seqs / 95 loci = 18X coverage
96% loci have sequence
45 loci had >10X coverage
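The coverage figures above follow from simple arithmetic. A minimal sketch, using only the numbers on the slides (the variable names are illustrative):

```python
# Numbers taken from the Methods and Preliminary Results slides.
individuals = 94
loci = 95
total_reads = 300_000          # "Total ~ 300K sequences"

# One amplicon per individual per locus.
amplicons = individuals * loci

# Mean reads per sample per locus if reads were spread evenly.
mean_coverage = total_reads / amplicons

# Single-sample example from the slide: 1664 reads over 95 loci.
sample_coverage = 1664 / loci

print(amplicons, round(mean_coverage), round(sample_coverage))
```

Real coverage will be less even than this, since loci and individuals rarely amplify equally; that is consistent with only 45 loci exceeding 10X in the example sample.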
Genotyping
• Clonal amplification through emPCR
• Each sequence is derived from a single DNA strand
• Identify both alleles without bacterial cloning
Errors
• Homopolymer regions
• Single nucleotide mismatches
Automated Statistical Genotyping
Hohenlohe et al., 2010 PLOS Genetics
Genotyping
• Let n be the total number of reads per site
• Let n = n1 + n2 + n3 + n4, where ni is the read count for each possible nucleotide at the site
• For a diploid, there are 10 possible genotypes
– 4 homozygous (AA, TT, GG, CC)
– 6 heterozygous (AT, AG, AC, TG, TC, GC)
• Calculate the likelihood of each possible genotype using a multinomial sampling distribution, which gives the probability of observing a set of read counts (n1,n2,n3,n4)
Likelihood of a Homozygote
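Following the multinomial model of Hohenlohe et al. (2010), the homozygote likelihood can be written as below. This assumes the counts are ordered so that n1 belongs to the most frequent nucleotide, and it introduces ε for the per-nucleotide sequencing error rate, a symbol these slides do not define:

```latex
\[
L(1/1) = P(n_1, n_2, n_3, n_4 \mid 1/1)
       = \frac{n!}{n_1!\, n_2!\, n_3!\, n_4!}
         \left(1 - \frac{3\varepsilon}{4}\right)^{n_1}
         \left(\frac{\varepsilon}{4}\right)^{n_2 + n_3 + n_4}
\]
```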
Likelihood of a Heterozygote
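Under the same multinomial model (Hohenlohe et al., 2010), the heterozygote likelihood for the two most frequent nucleotides can be written as below. This assumes the counts are ordered so that n1 ≥ n2 ≥ n3 ≥ n4, with ε again standing for an assumed per-nucleotide sequencing error rate:

```latex
\[
L(1/2) = P(n_1, n_2, n_3, n_4 \mid 1/2)
       = \frac{n!}{n_1!\, n_2!\, n_3!\, n_4!}
         \left(\frac{1}{2} - \frac{\varepsilon}{4}\right)^{n_1 + n_2}
         \left(\frac{\varepsilon}{4}\right)^{n_3 + n_4}
\]
```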
Assigning Genotypes
• The 2 equations give the likelihoods of the two most likely hypotheses out of 10
• Use a likelihood ratio test (LRT) to compare the homozygote vs. heterozygote hypotheses (df=1)
• If the test is significant, we assign the most likely genotype at that site for that individual
• If the test is not significant, we do not assign a genotype
• This process tests for each SNP independently, but we want to genotype the entire sequence
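The per-site procedure above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the error rate `eps` is fixed at an arbitrary 0.01 (Hohenlohe et al. estimate it by maximum likelihood instead), and the significance level is assumed to be 0.05.

```python
import math

def genotype_log_likelihoods(counts, eps):
    """Log-likelihoods of the best homozygote and best heterozygote.

    counts: four read counts, one per nucleotide (any order).
    eps: assumed per-nucleotide sequencing error rate.
    """
    n1, n2, n3, n4 = sorted(counts, reverse=True)
    n = n1 + n2 + n3 + n4
    # Log of the multinomial coefficient n! / (n1! n2! n3! n4!).
    coeff = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in (n1, n2, n3, n4))
    ll_hom = coeff + n1 * math.log(1 - 3 * eps / 4) + (n2 + n3 + n4) * math.log(eps / 4)
    ll_het = coeff + (n1 + n2) * math.log(0.5 - eps / 4) + (n3 + n4) * math.log(eps / 4)
    return ll_hom, ll_het

def call_genotype(counts, eps=0.01, alpha=0.05):
    """Return a genotype such as 'AA' or 'AG', or None if the LRT is not significant.

    counts: dict mapping each of A, C, G, T to its read count at one site.
    """
    ranked = sorted(counts, key=counts.get, reverse=True)
    ll_hom, ll_het = genotype_log_likelihoods(list(counts.values()), eps)
    stat = 2 * abs(ll_hom - ll_het)
    # Chi-square survival function with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    if p_value >= alpha:
        return None  # not significant: leave the site uncalled
    if ll_hom > ll_het:
        return ranked[0] * 2
    return "".join(sorted(ranked[:2]))

print(call_genotype({"A": 18, "C": 0, "G": 0, "T": 0}))
print(call_genotype({"A": 9, "C": 0, "G": 9, "T": 0}))
print(call_genotype({"A": 2, "C": 0, "G": 0, "T": 0}))
```

With these assumed settings, 18 identical reads give a homozygous call, a 9/9 split gives a heterozygous call, and a site covered by only 2 reads is left unassigned because the test is not significant.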
8 ways to be Het at 3 SNPs:
C—T—C   G—T—C
C—C—C   G—C—C
C—T—T   G—T—T
C—C—T   G—C—T
We need to retain the phase information that links the alleles across SNPs.
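Because each 454 read comes from a single DNA strand, one way to keep phase is to treat the combination of alleles seen in each read as a haplotype and count those combinations directly. A minimal sketch, assuming reads have already been reduced to their alleles at the variable sites; the function names and the `min_reads` threshold are hypothetical:

```python
from collections import Counter
from itertools import product

def possible_haplotypes(site_alleles):
    """Enumerate every allele combination across the heterozygous sites."""
    return ["".join(combo) for combo in product(*site_alleles)]

def phase_from_reads(read_haplotypes, min_reads=2):
    """Pick the supported haplotypes from per-read allele strings.

    Returns a pair of haplotypes, or None if no haplotype reaches min_reads.
    A single supported haplotype is reported as a homozygous pair.
    """
    counts = Counter(read_haplotypes)
    supported = [h for h, c in counts.most_common(2) if c >= min_reads]
    if not supported:
        return None
    if len(supported) == 1:
        return (supported[0], supported[0])
    return tuple(supported)

# Three SNPs with alleles C/G, T/C and C/T give 2**3 = 8 candidate haplotypes.
candidates = possible_haplotypes([("C", "G"), ("T", "C"), ("C", "T")])
print(len(candidates))

# Toy reads: two well-supported haplotypes plus one likely error read.
reads = ["CTC"] * 9 + ["GCT"] * 8 + ["CCT"]
print(phase_from_reads(reads))
```

Counting whole-read haplotypes sidesteps the per-SNP independence problem, at the cost of needing reads that span all the variable sites in a locus.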
Desired Workflow
• 454 data received as FASTA files
• Sort by barcode
– Tommy has some code for this
• Assemble by locus (alignments)
– Currently in Geneious, what other options?
• Genotype (phase the alleles)
– Need to implement automated method
– Quality scores
• Export data as sequences for phylogenetic analysis
• Export data as alleles for population genetic analysis
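The "sort by barcode" step of this workflow could look roughly like the sketch below. It is not the code mentioned on the slide. It assumes each barcode is an exact prefix of the read and that no barcode is a prefix of another, and the barcode sequences, sample names, and function names are all made up for illustration.

```python
from collections import defaultdict

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def sort_by_barcode(records, barcodes):
    """Bin reads by leading barcode and trim the barcode off.

    barcodes: dict mapping barcode sequence to sample name (hypothetical values).
    Reads matching no barcode go to the 'unassigned' bin untrimmed.
    """
    bins = defaultdict(list)
    for header, seq in records:
        for barcode, sample in barcodes.items():
            if seq.upper().startswith(barcode):
                bins[sample].append((header, seq[len(barcode):]))
                break
        else:
            bins["unassigned"].append((header, seq))
    return dict(bins)

# Toy input standing in for a 454 FASTA file.
fasta_lines = [">r1", "ACGTTTGACC", ">r2", "TGCAGGCATT", ">r3", "NNNNAAAA"]
barcodes = {"ACGT": "ind01", "TGCA": "ind02"}
bins = sort_by_barcode(read_fasta(fasta_lines), barcodes)
print(bins)
```

Exact prefix matching is the simplest choice; given the homopolymer and mismatch errors noted earlier, a real pipeline would probably need to tolerate at least one error in the barcode.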