Sequence Variations

Baxevanis and Ouellette, Chapter 7 - Sequence Polymorphisms

NCBI SNP Primer: http://www.ncbi.nlm.nih.gov/About/primer/snps.html

A. Genetic Variationcschweikert/cisc4020/SNPs.pdf · – Genetic variation in populations SNPs as genetic markers – “Classical” genetic diseases – Multi-factorial diseases



Overview

Mutation and Alleles
– Linkage
– Genetic variation in populations

SNPs as genetic markers
– “Classical” genetic diseases
– Multi-factorial diseases and risk factors
– Genome scans (genotyping)


A review of some basic genetics

Alleles
• An allele is a particular DNA sequence for a gene.
• Some gene alleles are responsible for ordinary phenotypes like blue/brown eyes.
• Others lead to classic genetic diseases like cystic fibrosis or Huntington’s disease.


Changes occur in DNA sequences = mutations

Many Causes of Mutations

• Somatic vs. reproductive cells
• Radiation and/or chemical damage to DNA
• Random errors of the replication machinery
• Normal biological processes - methylation

Mutations Create Alleles

• Mutations occur randomly throughout DNA.
• Most have no phenotypic effect (non-coding regions, equivalent codons, similar amino acids).
• Some damage the function of a protein or regulatory element.
• A very few provide an evolutionary advantage.

Population Genetics
• Chromosome pairs segregate and recombine in every generation.
• Every allele of every gene has its own independent evolutionary history (and future).
• Frequencies of various alleles differ in different sub-populations of people.

Human Alleles
• The OMIM (Online Mendelian Inheritance in Man) database at the NCBI tracks all human mutations with known phenotypes.
• It contains a total of about 2,000 genetic diseases [and another ~11,000 genetic loci with known phenotypes, but not necessarily known gene sequences].
• It is designed for use by physicians:
  – can search by disease name
  – contains summaries from clinical studies


OMIM Morbid Map: Cytogenetic map location of disease genes.

Variation Makes Life Interesting

• The Human Genome has been sequenced; what’s next?
• Much of what makes us unique individuals is represented by the differences in our DNA sequence from other people.
• There are rare and common forms (alleles) of every gene.
• For most genes, probably only 3-4 alleles are present in 95% of the population, but there are many rare mutations.


SNPs are Mutations

SNPs
• A mutation that causes a single base change is known as a Single Nucleotide Polymorphism (SNP).
• Other kinds of mutations include insertions and deletions.
• Large breaks and rearrangements of chromosomes also occur (translocations).

GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG
          ^
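The two sequences on this slide differ at a single base. As a minimal sketch (the function name `single_base_differences` is made up for illustration, and the sequences are assumed to be already aligned and of equal length), the difference can be located position by position:

```python
def single_base_differences(seq_a, seq_b):
    """Return (1-based position, base in seq_a, base in seq_b) for each mismatch.

    Assumes the two sequences are already aligned and have equal length,
    so insertions and deletions are out of scope for this sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


# The two alleles shown on the slide.
allele_1 = "GATTTAGATCGCGATAGAG"
allele_2 = "GATTTAGATCTCGATAGAG"

print(single_base_differences(allele_1, allele_2))  # [(11, 'G', 'T')]
```

The single reported position is the one marked with the caret on the slide.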

SNPs are Very Common
• SNPs are very common in the human population.
• Between any two people, there is an average of one SNP every ~1250 bases.
• Most of these have no phenotypic effect.
  – Fewer than 1% of all human SNPs affect protein function (most fall in non-coding regions).
  – There is selection against missense mutations (think about what would happen to dominant lethal mutations).
• Some are alleles of genes.
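To put the "one SNP every ~1250 bases" figure in perspective, here is a back-of-the-envelope calculation. The genome size is not given on the slide; the value of roughly 3.2 billion bases for a haploid human genome is an assumption added here for illustration.

```python
# Assumption (not from the slide): approximate haploid human genome size in bases.
GENOME_SIZE_BP = 3_200_000_000

# From the slide: about one SNP every ~1250 bases between any two people.
BASES_PER_SNP = 1250

expected_snps = GENOME_SIZE_BP // BASES_PER_SNP
print(f"Expected differences between two genomes: about {expected_snps:,}")
```

Under that assumption, two people would differ at a few million single-base positions.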

Genome Sequencing finds SNPs
• The Human Genome Project involves sequencing DNA cloned from a number of different people. [The Celera sequence comes from 5 people.]
• Even within one person’s DNA, the homologous chromosomes have SNPs.
• This inevitably leads to the discovery of SNPs: any single-base sequence difference.
• These SNPs can be valuable as the basis for diagnostic tests.


We describe a map of 1.42 million single nucleotide polymorphisms (SNPs) distributed throughout the human genome, providing an average density on available sequence of one SNP every 1.9 kilobases. These SNPs were primarily discovered by two projects: The SNP Consortium and the analysis of clone overlaps by the International Human Genome Sequencing Consortium. The map integrates all publicly available SNPs with described genes and other genomic features. We estimate that 60,000 SNPs fall within exon (coding and untranslated regions), and 85% of exons are within 5 kb of the nearest SNP. Nucleotide diversity varies greatly across the genome, in a manner broadly consistent with a standard population genetic model of human history. This high-density SNP map provides a public resource for defining haplotype variation across the genome, and should help to identify biomedically important genes for diagnosis and therapy.


http://www.ncbi.nlm.nih.gov/snp


SNP Discovery: dbSNP database

Search dbSNP with BLAST

“As of June 2008, dbSNP has 12.8 million SNPs in the human genome”

• It is possible to search dbSNP by BLAST comparisons to a target sequence.

>gnl|dbSNP|rs1042574_allelePos=51 total len = 101 |taxid = 9606|snpClass = 1
Length = 101

Score = 149 bits (75), Expect = 3e-33
Identities = 79/81 (97%)
Strand = Plus / Plus

Query: 1489 ccctcttccctgacctcccaactctaaagccaagcactttatatttttctcttagatatt 1548
            ||||||||||||||||||||||||||||||||||||||||||||||| || |||||||||
Sbjct: 1    ccctcttccctgacctcccaactctaaagccaagcactttatattttcctyttagatatt 60

Query: 1549 cactaaggacttaaaataaaa 1569
            |||||||||||||||||||||
Sbjct: 61   cactaaggacttaaaataaaa 81

If a matching SNP is found, then it can be directly located on the Genome map.
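The two non-identical columns in this alignment can be inspected programmatically. The sketch below copies the query and subject strings from the alignment and uses a small table of the standard IUPAC nucleotide ambiguity codes (only the subset needed here); the function name `mismatches` is made up for illustration. In the dbSNP entry the variable site is written with an ambiguity code, so a "mismatch" at that column can still be compatible with the query.

```python
# Subset of the standard IUPAC nucleotide codes: symbol -> bases it can stand for.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "ag", "y": "ct", "s": "cg", "w": "at", "k": "gt", "m": "ac",
    "n": "acgt",
}

# Aligned sequences copied from the BLAST hit (both blocks joined, no gaps).
query = ("ccctcttccctgacctcccaactctaaagccaagcactttatatttttctcttagatatt"
         "cactaaggacttaaaataaaa")
sbjct = ("ccctcttccctgacctcccaactctaaagccaagcactttatattttcctyttagatatt"
         "cactaaggacttaaaataaaa")


def mismatches(query, sbjct):
    """Yield (subject position, query base, subject symbol, compatible) per mismatch.

    'compatible' is True when the subject symbol is an ambiguity code
    that includes the query base.
    """
    for pos, (q, s) in enumerate(zip(query, sbjct), start=1):
        if q != s:
            yield pos, q, s, q in IUPAC.get(s, "")


for pos, q, s, compatible in mismatches(query, sbjct):
    note = "covered by ambiguity code" if compatible else "plain mismatch"
    print(f"subject position {pos}: query {q}, subject {s} ({note})")
```

The two reported columns account for the 79/81 identities. The column carrying `y` (C or T) falls at subject position 51, which agrees with `allelePos=51` in the hit's header line.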

Uses for SNPs
• Diagnostic tests for disease alleles
• Markers to aid in cloning of interesting genes (disease genes)
• Pharmacogenomics - genetics of response to drugs (effectiveness and side effects)

DNA Diagnostic Testing
• Hereditary diseases - potential parents, prenatal, late onset diseases.
• Genes that predispose to disease (risk factors).
• Genotyping of infectious agents (bacterial & viral).
• Forensics - using DNA testing to establish identity.

Clinical Manifestations of Genetic Variation

(All disease has a genetic component)
• Susceptibility vs. resistance
• Variations in disease severity or symptoms
• Reaction to drugs (pharmacogenetics)
• Variable disease course and prognosis

SNPs can be found that are linked to all of these traits.

Finding Disease Genes
• Virtually all diseases have a genetic component.
• Start with DNA samples from families that show inheritance of the disease.
• Use STS markers to map the gene or genes involved (linkage analysis).
• Find SNPs in the genetic region(s) that are likely candidates for involvement in that disease.
• Get the gene from a genomic sub-clone.

Some Diseases Involve Many Genes

• There are a number of classic “genetic diseases” caused by mutations of a single gene.
  – Huntington’s, Cystic Fibrosis, Tay-Sachs, PKU, etc.
• There are also many diseases that are the result of the interactions of many genes:
  – asthma, heart disease, cancer
• Each of these genes may be considered to be a risk factor for the disease.
• Groups of genetic markers (SNPs) may be associated with a disease without determining a mechanism.


Multiple Causes

• Some diseases may actually be caused by any of a group of different genes (multiple causes), but all show the same symptoms.

• SNP linkage analysis can identify these sub-populations more efficiently than classical molecular genetic approaches.

• Machine learning, genetic algorithms, SVMs

“The study of the distribution of genetic variants, including SNPs, lies within the domain of population genetics, and the study of the relationship between SNPs and phenotypic variation lies in the domain of quantitative genetics.”

Gibson & Muse

Quantitative Trait Locus Mapping

[Figure: homozygous parents (a b c / a b c and A B C / A B C) are crossed to give F1 individuals (A B C / a b c). Crossing F1 individuals yields progeny that carry recombinant chromosomes. Progeny are grouped by genotype at marker B (BB, Bb, bb), and HEIGHT is plotted against GENOTYPE.]

Knott et al. (1997) TAG 84:810-820

Association Mapping

[Figure: ancestral chromosomes, recombination through evolutionary history, and the resulting present-day chromosomes in a natural population, drawn as marker haplotypes such as * T G, * T A, C G, and C A.]


SNP Discovery Methods

•  Pairwise Sequence Comparison from databases, eSNP

•  Deep Resequencing


SNP Analysis Agenda

Sequence-Based SNP Identification

Common Bioinformatic Solutions: Phred, Phrap, Consed, Polyphred, and Polybayes

High-Throughput SNP Identification Solution

•  Overlapping PCR amplicons across entire gene
•  Make no assumptions about sequence function
•  Sequence diversity and genetic structure for each gene is different
•  Proper association studies can only be designed in this context
•  Complete resequencing facilitates population genetics methods

Sequence-based SNP Identification

[Figure: workflow. Sequence: amplify DNA and sequence each end of the fragment. Phred: base-calling and quality determination. Phrap: contig assembly and final quality determination. PolyPhred/Polybayes: polymorphism detection. Consed: sequence viewing and polymorphism tagging. Analysis: polymorphism reporting, individual genotyping, and phylogenetic analysis. Example traces: homozygotes ATAGACG and ATACACG; heterozygote ATAGACG/ATACACG.]

Phred, Phrap, Consed, Polyphred, Polybayes

•  phred: Base calling and quality assignments

•  phrap: Contig formation and new quality assignments

•  consed: Visual X-Windows graphic interface, to view and edit alignments and contigs, and to view the original traces

•  polyphred: find polymorphisms in phrap contigs, quality calls, add data to phrap files to permit consed finding and visualization of polymorphisms.

•  polybayes: Fully probabilistic SNP detection algorithm that calculates the probability (SNP score) that discrepancies at a given location of a multiple alignment represent true sequence variations as opposed to sequencing errors.
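The slide does not spell out the quality scale. Phred quality values are conventionally defined as Q = -10 log10(P), where P is the estimated probability that the base call is wrong. A short sketch of that conversion follows; the helper names are made up for illustration.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality value Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P to a Phred quality value Q = -10*log10(P)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: about 1 miscalled base in {round(1 / p):,}")
```

On this scale, each step of 10 in quality corresponds to a tenfold drop in the chance of a miscalled base, which is why downstream SNP callers weight discrepancies by quality.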


Figure 1. Application of the POLYBAYES procedure to EST data. a, Regions of known human repeats in a genomic sequence are masked. b, Matching human ESTs are retrieved from dbEST and traces are re-called. c, Paralogous ESTs are identified and discarded. d, Alignments of native EST reads are screened for candidate variable sites. e, An STS is designed for the verification of a candidate SNP. f, The uniqueness of the genomic location is determined by sequencing the STS in CHM1 (homozygous DNA). g, The presence of a SNP is analysed by sequencing the STS from pooled DNA samples.

Nature Genetics 23, 452 - 456 (1999)

A general approach to single-nucleotide polymorphism discovery

Gabor T. Marth, Ian Korf, Mark D. Yandell, Raymond T. Yeh, Zhijie Gu, Hamideh Zakeri, Nathan O. Stitziel, LaDeana Hillier, Pui-Yan Kwok & Warren R. Gish

PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing

Deborah A. Nickerson*, Vincent O. Tobe and Scott L. Taylor

Nucleic Acids Research 1997; 25:2745

[Figure: SNP calling examples, showing one correct call and three false positives.]

Trace File

High quality region – no ambiguities

Trace File

Medium quality region – some ambiguities

Trace File

Poor quality region – low confidence

Using PolyPhred to Visualize SNPs

• Compares sequences across traces obtained from different individuals to identify sites for SNPs.
• Will occasionally miscall genotypes; the frequency of such mistakes depends on the sequencing chemistry used to generate the trace.
• To reduce the number of miscalled sites, ignores regions of poor quality and the ends of the trace.
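The last point, skipping poor-quality regions and trace ends, can be illustrated with a simple filter. This is only a sketch of the idea, not PolyPhred's actual logic: the function name, the quality threshold of 20, the end margin of 2 bases, and the example quality values are all hypothetical.

```python
def usable_positions(qualities, min_quality=20, end_margin=2):
    """Return 0-based indices of bases worth examining for SNPs.

    Drops 'end_margin' bases from each end of the read and any base whose
    quality is below 'min_quality'. Both thresholds are illustrative only.
    """
    last = len(qualities) - end_margin
    return [
        i for i, q in enumerate(qualities)
        if end_margin <= i < last and q >= min_quality
    ]


# Hypothetical per-base quality values for one short read.
quals = [8, 12, 35, 40, 15, 38, 42, 30, 11, 9]
print(usable_positions(quals))  # [2, 3, 5, 6, 7]
```

Index 4 is dropped for low quality, while indices 0, 1, 8, and 9 are dropped because they sit at the ends of the read.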

Polyphred
–  Reads the ACE file to obtain the consensus sequence and the names of the trace (chromat) files used in the assembly.
–  Reads the PHD files associated with each trace.
–  During the SNP search phase, PolyPhred combines information from all of the sequence traces to derive a genotype and a score for each sequence.
–  The score indicates how well the trace at the site matches the expected pattern for a SNP.
–  Updates the ACE and PHD files by adding tags that mark the positions of the sites. The tagged sites can then be examined using Consed.

Polybayes

Bayesian statistical model takes into account:
- depth of coverage
- base quality values of the sequences

Polybayes calculations are aided with information on major/minor allele frequencies as well as polymorphism rates within the species under investigation.

**Can also integrate into the poly files for viewing with Consed
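To show how depth of coverage, base quality, allele frequency, and polymorphism rate can enter one calculation, here is a deliberately simplified two-hypothesis Bayesian sketch. It is not the Polybayes algorithm: the function name, the prior of 0.001, the minor allele frequency of 0.2, and the example columns are all assumptions chosen for illustration. It compares "the site is monomorphic and every discrepancy is a sequencing error" with "the site is a two-allele SNP".

```python
def snp_posterior(observations, major, minor, prior_snp=0.001, minor_freq=0.2):
    """Posterior probability that one alignment column is a two-allele SNP.

    observations: list of (base, phred_quality), one entry per read.
    prior_snp:    assumed polymorphism rate (illustrative value).
    minor_freq:   assumed minor allele frequency (illustrative value).
    """
    like_mono = 1.0  # likelihood if every read truly carries the major base
    like_snp = 1.0   # likelihood if reads are drawn from major/minor alleles
    for base, q in observations:
        e = 10 ** (-q / 10)  # error probability from the Phred quality
        if base == major:
            p_mono = 1 - e
            p_snp = (1 - minor_freq) * (1 - e) + minor_freq * e / 3
        elif base == minor:
            p_mono = e / 3
            p_snp = minor_freq * (1 - e) + (1 - minor_freq) * e / 3
        else:
            p_mono = p_snp = e / 3  # a third base is an error under both models
        like_mono *= p_mono
        like_snp *= p_snp
    numerator = prior_snp * like_snp
    return numerator / (numerator + (1 - prior_snp) * like_mono)


# Hypothetical columns: two high-quality discrepancies vs. one low-quality one.
well_supported = [("G", 40)] * 6 + [("C", 40)] * 2
weakly_supported = [("G", 40)] * 7 + [("C", 10)]

print(snp_posterior(well_supported, major="G", minor="C"))
print(snp_posterior(weakly_supported, major="G", minor="C"))
```

Two high-quality discrepant reads give a posterior close to 1, while a single low-quality discrepancy is far more plausibly a sequencing error, so its posterior stays near the prior.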

Alignment and SNP Calling Pipeline
Challenges in High-Throughput SNP Identification

•  Alignment: critical in the automation of base calls
  –  The commonly used Phrap (from PhredPhrap) is an assembler and is NOT ideal for alignments
  –  Many commonly used aligners work best with protein sequences or with a reference sequence
  –  Preservation of quality scores for input into SNP identification programs
  –  Speed for high-throughput programs

•  Automated SNP Calls
  -  Reference sequence required
  -  Traditional approaches without reference sequence include “eSNPs” (human, maize, and pine)
     - Very little redundancy outside of abundant genes
     - Overall high number of false positives (single pass reads)
  -  Not specific to frequencies observed in different organisms
  -  High number of false positives in currently accepted methods (Polybayes & Polyphred)

[Figure: gene structure legend showing 5’ UTR, exon, intron, and 3’ UTR.]

4-Coumarate CoA Ligase (4CL)

[Figure: 4CL gene map on a 0 to 2500 bp scale, showing amplicon primers (F2 to F6 and R1A, R3, R4, R6), the AMP-binding region (743-781, bound_moiety="AMP"), proposed active sites (2396-2417), and polymorphic sites labeled s1 to s11. Below the map, 4CL haplotype frequencies are shown as ten-base haplotypes, one row per sampled chromosome; the most frequent haplotype is G T A G T C G G G C.]