SNPs: the HapMap and 1000 Genomes Projects
Joseph Replogle, Cavalcanti Lab Group
5/25/2012
Understanding Human Genetic Variation Within and Among Populations
Types of Human Genetic Variation
• Individual: de novo and rare variations
• Population: variations which have become established within a population
  – Single Nucleotide Polymorphisms (SNPs): base pair substitutions
    • Transition: purine <-> purine (A<->G), pyrimidine <-> pyrimidine (C<->T)
    • Transversion: purine <-> pyrimidine
    • Common: ~1-5% minor allele frequency (MAF) in major populations
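The transition/transversion distinction and the MAF threshold can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `classify_snp` and `minor_allele_frequency` are hypothetical, and the MAF helper assumes a biallelic site with simple allele counts.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # Both bases in the same chemical class means a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def minor_allele_frequency(ref_count: int, alt_count: int) -> float:
    """MAF of a biallelic site (assumption: exactly two alleles observed)."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("no alleles observed")
    return min(ref_count, alt_count) / total
```

For example, `classify_snp("A", "G")` is a transition, `classify_snp("A", "C")` is a transversion, and a site with 95 reference and 5 alternate alleles has a MAF of 0.05.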
Types of Human Genetic Variation (cont.)
  – Copy-Number Variations (CNVs):
    • insertions, deletions, duplications of DNA segments (>1kb)
  – Other Variations:
    • Structural: inversions
    • Repeats: microsatellites (STRs), minisatellites (VNTRs)
    • Frameshift mutations
SNP Distribution throughout the Genome
• Genetic variability throughout the genome reflects function (among other factors)
[Figure, HLA region labeled: Sachidanandam et al. 2001]
Factors Affecting SNP Distribution
• Intrinsic, Structural: mutation clusters due to recombination events and sequence context-specific effects [3,4]
  – a) Time to Most Recent Common Ancestor of genes in population influences SNPs (older genes -> more SNPs in population)
  – b) base composition, local recombination, gene density, chromatin structure, nucleosome position, replication timing
Lercher and Hurst 2002
Factors Affecting SNP Distribution (cont.)
• Functional: mutation clusters due to natural selection (examples include immunoglobulin genes)
  a) balancing selection increases diversity
  b) purifying and directional selection decrease diversity
  c) transcriptional activity
• Ascertainment bias: better characterization of SNPs around genes of interest [5]
Effects of Genetic Variation
• Pathogenic and non-pathogenic heritable traits
• Genetic variation reveals millions of years of human history
  – “One can think of selective pressures as natural, in vivo human experiments in which we can measure the response of human populations to unknown perturbations, and these alterations can inform the function of genes within a given locus.” Raj et al. 2012
  – Understand the history of mutation, selection and recombination within the human genome
Potential Uses of SNP data
Ultimately, synergy of genomics and functional work will allow us to understand human traits and disease.
• Association Mapping: Genome Wide Association (GWA) studies, Pharmacogenomics
• Modeling Mendelian and Complex diseases
• eQTL and functional genomics
• Selection!
Selection: EHH and iHS
• Extended Haplotype Homozygosity (EHH)
• Integrated Haplotype Score (iHS)
[Figure: Chromosome 2. Voight et al. 2006]
Selection of Lassa Fever Susceptibility Genes in YRI populations
Andersen et al (2012)
eQTL
Positive Selection
SLE susceptibility locus (rs11755393; GWAS p = 2.20 x 10^-8)
Slide from Replogle and Raj
International HapMap Project
• “to identify and catalog genetic similarities and differences in human beings”
• Haplotype Map: SNPs (genotypes) at separate loci whose alleles are statistically associated due to limited genetic recombination
HapMap Project
Linkage Disequilibrium (LD)
• Alleles at different loci are not independent due to
[Figure: the four haplotypes AB, Ab, aB, ab and the allele frequencies f_A, f_a, f_B, f_b, shown under linkage equilibrium and under linkage disequilibrium. Image by Gil McVean]
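A minimal sketch of how LD between two biallelic loci could be quantified from the four haplotype frequencies (AB, Ab, aB, ab). The statistics D and r^2 are the standard textbook measures and are not named on the slide; the function name `ld_stats` is hypothetical.

```python
def ld_stats(f_AB: float, f_Ab: float, f_aB: float, f_ab: float):
    """Return (D, r_squared) from the four haplotype frequencies.

    D   = f_AB - f_A * f_B  (zero under linkage equilibrium)
    r^2 = D^2 / (f_A * f_a * f_B * f_b)
    """
    total = f_AB + f_Ab + f_aB + f_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    f_A = f_AB + f_Ab  # frequency of allele A at the first locus
    f_B = f_AB + f_aB  # frequency of allele B at the second locus
    d = f_AB - f_A * f_B
    denom = f_A * (1 - f_A) * f_B * (1 - f_B)
    # r^2 is undefined when either locus is monomorphic.
    r_squared = d * d / denom if denom > 0 else float("nan")
    return d, r_squared
```

With all four haplotypes at frequency 0.25 the loci are independent (D = 0, r^2 = 0); with only AB and ab present at 0.5 each, the loci are perfectly associated (D = 0.25, r^2 = 1).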
Origin of LD
The mutation arises on a particular genetic background
If the mutation increases in frequency, the associated haplotype will also increase in frequency.
Factors Increasing LD:
1) Genetic Drift (stochastic sampling)
2) Selection
3) Non-Random Mating
4) Population Structure
Over time the association between the new mutation and linked mutations will decay by recombination
Recombination is the only factor which decreases LD.
Image modified from Gil McVean
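The decay of LD by recombination can be illustrated with the standard idealized model D_t = D_0 (1 - c)^t, where c is the recombination fraction between the two loci per generation. This is a sketch under stated assumptions (random mating, no drift, no selection), which the slide does not spell out; the function names are hypothetical.

```python
import math


def ld_after(d0: float, recomb_fraction: float, generations: int) -> float:
    """Expected D after the given number of generations: D_0 * (1 - c)^t."""
    if not 0.0 <= recomb_fraction <= 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5]")
    return d0 * (1.0 - recomb_fraction) ** generations


def ld_half_life(recomb_fraction: float) -> float:
    """Generations needed for D to halve under the same model."""
    if not 0.0 < recomb_fraction <= 0.5:
        raise ValueError("recombination fraction must be in (0, 0.5]")
    return math.log(0.5) / math.log(1.0 - recomb_fraction)
```

Unlinked loci (c = 0.5) lose half of their LD every generation, while tightly linked loci keep it far longer: with c = 0.01 the half-life is roughly 69 generations.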
Haplotype
• ~10^7 common (MAF >1%) SNPs in the human genome
• ‘tag SNPs’ allow for identification of an individual’s haplotypes
• Estimated 300,000-600,000 tag SNPs in genome
• Genotyping: testing tag SNPs
• Sequencing: whole genome sequence
HapMap Populations
• 270 total DNA samples
• Yoruba in Ibadan, Nigeria (YRI)
• Japanese in Tokyo, Japan (JPT)
• Han Chinese in Beijing, China (CHB)
• CEPH (Utah residents with ancestry from northern and western Europe) (CEU)
HapMap Methodology
• Genotype individuals for several million SNPs
  – 1 SNP per 5kb or less
  – MAF >1% as estimated by TSC project, JSNP, dbSNP, and initial SNP map
  – Random shotgun sequencing to obtain additional SNPs
  – Coding and noncoding SNPs
• Data analysis to identify LD and Haplotype maps
• Tag SNPs are useful with haplotype and recombination map
• Data available online in multiple formats
  http://hapmap.ncbi.nlm.nih.gov/downloads/index.html.en
• Phase III data released 2009
Reference Genome?
• Mosaic haploid DNA sequence
• GRCh37
1000 Genomes
• “to find most genetic variants that have frequencies of at least 1% in the populations studied”
• Low coverage sequencing of >2000 individuals, exome sequencing, trios
• Characterization of SNPs, INDELs, and Structural Variants
1000 Genomes Populations
• Yoruba in Ibadan, Nigeria (YRI)
• Japanese in Tokyo, Japan (JPT)
• Han Chinese in Beijing, China (CHB)
• CEPH (Utah residents with ancestry from northern and western Europe) (CEU)
• Luhya in Webuye, Kenya (LWK)
• Toscani in Italy (TSI)
• Peruvians in Lima, Peru (PEL)
• Mexican ancestry in Los Angeles, CA (MXL)
• And many more!
“Low-Coverage” Sequencing
• Sequencing:
  1) DNA copies broken into short pieces
  2) Each piece is sequenced (random pieces means most of genome is covered)
  3) Sequenced fragments are aligned and joined to determine complete genome
• 28X sequencing coverage necessary for complete genome
• Low-coverage sequencing (4X coverage): many pieces of individual genomes are missed
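Why 4X coverage misses parts of a genome can be sketched with a simple Poisson model of read depth. This is an idealization that is not on the slide: it assumes reads land uniformly at random, so the depth at any base is Poisson-distributed with mean equal to the coverage. The function name `prob_depth_below` is hypothetical.

```python
import math


def prob_depth_below(coverage: float, min_depth: int) -> float:
    """P(depth < min_depth) when depth ~ Poisson(coverage)."""
    if coverage < 0 or min_depth < 0:
        raise ValueError("coverage and min_depth must be non-negative")
    # Sum the Poisson probabilities for depths 0 .. min_depth - 1.
    return sum(
        math.exp(-coverage) * coverage ** k / math.factorial(k)
        for k in range(min_depth)
    )
```

Under this model, `prob_depth_below(4, 1)` equals e^-4, so about 1.8% of bases receive no reads at all at 4X, and many more are seen too few times to call both alleles reliably. At 28X the chance of an uncovered base is negligible.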
1000 Genomes Data
• Latest release:
  – 1092 samples
  – SNP, indel, and large deletion
  – Autosomes and chrX
  – ~38.2 M SNPs from low coverage and exome sequencing
• 1000genomes site has a link to an NCBI FTP with their latest data
VCF file format
• Variant Call Format 4.1: meta-info followed by header and data
• tab-delimited text file
• Compressed (.gz); to keep only the header lines and the SNP lines:
  zcat file.vcf.gz | grep -e '^#' -e SNP | bgzip -c > snps.vcf.gz
• http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
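The same header-plus-SNP filter can be sketched in Python using only the standard library. This is a rough equivalent under stated assumptions, not a replacement for the command line: it mirrors the grep step (keep lines starting with "#" or containing the text "SNP"), it writes plain gzip rather than bgzip output, so the result cannot be tabix-indexed, and the function names and file paths are hypothetical.

```python
import gzip


def filter_snp_lines(lines):
    """Yield header lines ('#...') and any line containing the text 'SNP'."""
    for line in lines:
        if line.startswith("#") or "SNP" in line:
            yield line


def filter_vcf(in_path: str, out_path: str) -> None:
    """Stream a gzipped VCF through the filter into a new gzipped file."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        dst.writelines(filter_snp_lines(src))
```

Usage would look like `filter_vcf("file.vcf.gz", "snps.vcf.gz")`. Streaming line by line keeps memory use flat even for whole-chromosome files.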
Columns in VCF format
• CHROM: chromosome (no colons)
• POS: numerical reference position, with the 1st base having position 1 (some variants have multiple pos records)
• ID: semicolon-separated list of unique identifiers where available (ex. dbSNP rs number)
• REF: reference base(s) A,C,G,T,N (case insensitive) for a given variant
• ALT: comma-separated list of alternate non-reference alleles called on at least one of the samples
• QUAL: phred-scaled quality score for the assertion made in ALT, i.e. -10 log_10 prob(call in ALT is wrong)
• FILTER: another quality measure; PASS if this position has passed all filters
• INFO: semicolon-separated additional info; ex. AF (allele frequency), DB (dbSNP membership), VALIDATED
Durbin et al. 2010
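The eight fixed columns listed above can be read with a short parser. This is a minimal sketch, assuming a tab-delimited data line with at least eight fields; the names `VcfRecord`, `parse_record`, `parse_info`, and `phred_to_error_prob` are hypothetical, and real projects would more likely use an existing VCF library.

```python
from typing import NamedTuple, Optional


class VcfRecord(NamedTuple):
    chrom: str
    pos: int
    ids: list
    ref: str
    alts: list
    qual: Optional[float]
    filter: str
    info: dict


def parse_info(text: str) -> dict:
    """Split 'KEY=VALUE;FLAG' into a dict; flags such as DB map to True."""
    info = {}
    if text == ".":
        return info
    for item in text.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def parse_record(line: str) -> VcfRecord:
    """Parse the eight fixed columns of one VCF data line."""
    chrom, pos, ids, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    return VcfRecord(
        chrom=chrom,
        pos=int(pos),  # 1-based position
        ids=[] if ids == "." else ids.split(";"),
        ref=ref.upper(),
        alts=[] if alt == "." else alt.split(","),
        qual=None if qual == "." else float(qual),
        filter=flt,
        info=parse_info(info),
    )


def phred_to_error_prob(qual: float) -> float:
    """Invert QUAL = -10 * log10(p): returns p = 10 ** (-QUAL / 10)."""
    return 10 ** (-qual / 10)
```

A QUAL of 30 therefore corresponds to a 1 in 1000 chance that the call in ALT is wrong.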
Interested?
• Get Prof. Cavalcanti to buy Human Evolutionary Genetics: Origins, Peoples and Disease
References
1. Sachidanandam R et al. (2001) A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409: 928-933.
2. Lercher MJ and Hurst LD (2002) Human SNP variability and mutation rate are higher in regions of high recombination. Trends Genet 18: 337-340.
3. Rogozin IB and Pavlov YI (2003) Theoretical analysis of mutational hotspots and their DNA sequence context specificity. Mutat Res 544(1): 65-85.
4. Ma X et al. (2012) Mutation Hot Spots in Yeast Caused by Long-Range Clustering of Homopolymeric Sequences. Cell Reports 1(1): 36-42.
5. Clark AG et al. (2005) Ascertainment bias in studies of human genome-wide polymorphism. Genome Res 15: 1496-1502.
6. Raj T et al. (2012) Alzheimer Disease Susceptibility Loci: Evidence for a Protein Network under Natural Selection. AJHG 90: 720-726.
7. Voight BF et al. (2006) A Map of Recent Positive Selection in the Human Genome. PLoS Biology 4(3): e72.
8. Andersen KG et al. (2012) Genome-wide scans provide evidence for positive selection of genes implicated in Lassa fever. Philos Trans R Soc Lond B Biol Sci 367(1590): 868-877.
9. Hapmap.org
10. McVean G (2004) Population Genetics of the Human Genome. Oxford Human Genome Lecture Series.
11. Gibbs RA et al. (2003) The International HapMap Project. Nature 426: 789-796.
12. 1000genomes.org
13. Durbin RM et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467(7319): 1061-1073.