Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao Department of Psychiatry and Center for the Study of Biological Complexity June 28, 2004 Email: [email protected]

Page 1: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Introduction to Single Nucleotide Polymorphisms (SNPs)

Zhongming Zhao, Department of Psychiatry and Center for the Study of Biological Complexity

June 28, 2004

Email: [email protected]

Page 2: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Organization

Introduction to single nucleotide polymorphism (SNPs)

An overview of mammalian genome projects

Online resource of SNPs and genome sequences

Page 3: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

SNPs

SNPs are DNA sequence variations that occur when a single nucleotide (A, T, C, or G) is altered (a single base variation).

Page 4: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Single Nucleotide Polymorphism

[Figure: illustration of a single nucleotide polymorphism with alleles G/A]

Page 5: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Sequence Alignment

Alignment of 16 SARS genome sequences by program Clustal W

Page 6: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

SNPs in Substitution Types

[Table: substitution matrix, From (A, C, G, T) by To (A, C, G, T)]

R: A/G

Y: C/T

M: A/C

K: G/T

W: A/T

S: C/G
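The six codes above can be turned into a small lookup. The following is a minimal sketch, not part of the original slides: the function names `iupac_code` and `substitution_class` are hypothetical, and the transition/transversion split relies on the standard definition (purine to purine or pyrimidine to pyrimidine is a transition).

```python
# Two-base IUPAC ambiguity codes, as listed on the slide.
IUPAC_PAIRS = {
    "R": {"A", "G"},
    "Y": {"C", "T"},
    "M": {"A", "C"},
    "K": {"G", "T"},
    "W": {"A", "T"},
    "S": {"C", "G"},
}

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def iupac_code(allele_a, allele_b):
    """Return the IUPAC code for a biallelic SNP (hypothetical helper)."""
    pair = {allele_a.upper(), allele_b.upper()}
    for code, bases in IUPAC_PAIRS.items():
        if bases == pair:
            return code
    raise ValueError(f"not a biallelic SNP over A/C/G/T: {allele_a}/{allele_b}")


def substitution_class(allele_a, allele_b):
    """Classify a SNP as a transition (R or Y) or a transversion (M, K, W, S)."""
    iupac_code(allele_a, allele_b)  # raises ValueError on invalid input
    pair = {allele_a.upper(), allele_b.upper()}
    if pair == PURINES or pair == PYRIMIDINES:
        return "transition"
    return "transversion"
```

For example, `iupac_code("G", "A")` gives the code R used for the G/A example earlier in the deck, and that pair is a transition.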

Page 7: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Distribution of Substitutions

Data          A/G (%)  C/T (%)  A/C (%)  G/T (%)  A/T (%)  C/G (%)  Ts (%)  Ts/Tv
Mouse dbSNP   34.11    33.94    8.63     8.60     8.39     6.32     68.05   2.13
Mouse Celera  33.35    33.33    9.13     9.08     8.83     6.29     66.67   2.00
Human         33.12    33.15    8.74     8.77     7.42     8.80     66.28   1.97

[Figure: bar chart of the proportion (%) of each substitution type (A/G, C/T, A/C, G/T, A/T, C/G) for Mouse dbSNP, Mouse Celera, and Human; vertical axis from 0 to 40]
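The Ts (%) and Ts/Tv columns follow from the six per-type percentages: transitions are A/G plus C/T, and transversions are the remaining four types. The sketch below recomputes the ratio from the slide's percentages; the dictionary layout and the name `ts_tv_ratio` are illustrative assumptions, and because the inputs are already rounded the recomputed ratios agree with the slide only to within rounding.

```python
# Per-type percentages copied from the slide's table.
SUBSTITUTION_PERCENT = {
    "Mouse dbSNP": {"A/G": 34.11, "C/T": 33.94, "A/C": 8.63,
                    "G/T": 8.60, "A/T": 8.39, "C/G": 6.32},
    "Mouse Celera": {"A/G": 33.35, "C/T": 33.33, "A/C": 9.13,
                     "G/T": 9.08, "A/T": 8.83, "C/G": 6.29},
    "Human": {"A/G": 33.12, "C/T": 33.15, "A/C": 8.74,
              "G/T": 8.77, "A/T": 7.42, "C/G": 8.80},
}

TRANSITIONS = ("A/G", "C/T")


def ts_tv_ratio(percentages):
    """Ratio of transition percentage to transversion percentage."""
    ts = sum(percentages[kind] for kind in TRANSITIONS)
    tv = sum(value for kind, value in percentages.items()
             if kind not in TRANSITIONS)
    return ts / tv


for dataset, percentages in SUBSTITUTION_PERCENT.items():
    print(f"{dataset}: Ts/Tv = {ts_tv_ratio(percentages):.2f}")
```

With four possible transversion types and only two transition types, a ratio near 2 means each transition type is roughly four times as frequent as each transversion type.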

Page 8: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

SNPs are Valuable Tools in Genetic Analysis

Disease Studies
− Causes of genetic diseases
− Association studies of complex diseases

Population Studies
− Population structures and history
− Haplotype analysis

Functional Analysis
− Pharmacogenomics

Genome Mapping
− Dense/fine marker set
− Haplotype map

Comparative Genomics
− Genome evolution
− Mechanism of molecular evolution

Page 9: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

SNP Databases

Public
− NCBI dbSNP
− TSC
− Whitehead Institute SNP Database
− HGMD
− HGBase (now HGVD)
− UCSC Genome Browser
− Ensembl
− Mouse Phenome Database

Private
− Celera RefSNP
− Sequenom RealSNP
− Incyte SNP Program

Page 10: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

SNP Databases

NCBI dbSNP
− Launched in Sept. 1998
− Data are deposited by various sources
− rs: grouping of identical, independent submissions of variation
− Recomputed in builds based on incremental freezes
− 24 species
− Over 19 million submissions

Celera RefSNP
− Celera CgsSNP: identified by the computational method from five individuals’ genomic sequences
− Most SNPs are mapped
− dbSNP, HGMD, HGBase
− 5.0 million human SNPs
− 3.1 million mouse SNPs

Page 11: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

NCBI dbSNP

Page 12: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

dbSNP & genome build cycle

[Diagram labels: Locus Link; data dump; MSSQL; FASTA; submission; RefSNP docsum set; asn.1 + XML; link; calculation & annotation; MapView; RefSeq; genome sequence; rs set; new ss accessions set; recalculation & mapping; denormalization]

• Rs ID anchors links back to dbSNP

• Checkpoint for data synchronization

• Synchronized with NCBI genome assembly pipelines

Page 13: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

dbSNP growth: human data, 1998-2003

2.1M SNPs in first comprehensive map: Nature 2001

First TSC submission towards their goal of 200K SNPs

Computational mining from genome clone seq. ramps up

HapMap begins additional 6x shotgun coverage

June 2004: 9.8M refSNPs

2005: Perlegen+NHGRI+??: 12-15M

Page 14: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Human Variations in dbSNP Build 121

Total submissions (all ss#): 19,888,389

Total non-redundant submissions: 9,856,125

‘SNP’ class: 9,170,759

Uniquely mapped (ref only): 8,549,864

Unique + SNP: 7,946,976

Page 15: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Mapping SNPs to the Genome

• Format the flanking sequences of SNPs (e.g. 50 bp on each side)

• Align them to the genome with BLAST or BLAT, using the following criteria:

− 0 gaps in the aligned region

− The SNP position is within the aligned region

− Aligned region at least 100 bp in length

− Only 1 ambiguous letter matches

− No more than 1% sequence mismatches in the aligned region
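The criteria above can be sketched as a filter over alignment hits. This is an illustrative sketch only: the `Hit` record and its field names are hypothetical summaries of a BLAST or BLAT alignment, not the output format of either program, and "only 1 ambiguous letter matches" is read here as "at most one ambiguous (IUPAC) letter in the aligned region", which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of one alignment of a SNP flanking-sequence query."""
    query_start: int        # 1-based, inclusive, in query coordinates
    query_end: int          # 1-based, inclusive
    gaps: int               # gap count in the aligned region
    mismatches: int         # mismatched bases in the aligned region
    ambiguous_letters: int  # ambiguous IUPAC letters in the aligned region


def passes_mapping_criteria(hit, snp_pos=51, min_length=100,
                            max_mismatch_fraction=0.01):
    """Apply the slide's five criteria to one hit.

    snp_pos=51 assumes 50 bp of flanking sequence on each side of the SNP.
    """
    aligned_length = hit.query_end - hit.query_start + 1
    return (
        hit.gaps == 0
        and hit.query_start <= snp_pos <= hit.query_end
        and aligned_length >= min_length
        and hit.ambiguous_letters <= 1
        and hit.mismatches <= max_mismatch_fraction * aligned_length
    )


def maps_uniquely(hits):
    """True when exactly one hit satisfies all criteria."""
    return sum(passes_mapping_criteria(hit) for hit in hits) == 1
```

A SNP with a single passing hit would count as mapped once; several passing hits would correspond to mapping twice or more.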

Page 16: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Most SNPs Map Uniquely during Genome Annotation

[Figure: bar chart (log scale, 100 to 10,000,000) of the number of SNPs by hits to genome (Once, Twice, 3 - 10, 11+, Masked) for Human, Mouse, Rat, and Mosquito. Data labels on the chart: 71,503; 1,661; 473,215; 5,088; 38,124; 4,899,650; 87,155; 430,839; 6,524]

Page 17: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

FASTA Format and Data Structure for a rs Record

Define lines for FASTA records start with ">". Fields of the define line, in order:

− object-type=general (gnl)
− database name (dbSNP)
− rs#
− offset of the variation (allelePos)
− length (totallen)
− taxID (taxid)
− SNP class (snpClass)
− list of alleles (alleles)

define:>gnl|dbSNP|rs271_allelePos=51totallen=101|taxid=9606|snpClass=1|alleles='G/A'

5' sequence: CTGCATCACA TGTACTGATT CTGTCCATTG GAACAGAGAT GATGACTGGT

variation: R

3' sequence: TTACTAAACC CTGAGCCCTG GTGTTTCTGT TGATAGGGGG TTGCATTGAT

http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=rs271
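As an illustration of the record structure, the sketch below pulls the fields out of the define line shown on this slide and checks them against the flanking sequences. The parser is a hypothetical helper written for this one example line; it is not an NCBI tool, and other dbSNP builds may format the define line differently.

```python
import re

# Integer-valued fields and the patterns that locate them in the define line.
INTEGER_FIELDS = {
    "rs_id": r"rs(\d+)",
    "allele_pos": r"allelePos=(\d+)",
    "total_len": r"totallen=(\d+)",
    "taxid": r"taxid=(\d+)",
    "snp_class": r"snpClass=(\d+)",
}


def parse_rs_defline(defline):
    """Parse the slide's example define line into a dict (hypothetical helper)."""
    record = {}
    for name, pattern in INTEGER_FIELDS.items():
        match = re.search(pattern, defline)
        if match is None:
            raise ValueError(f"field {name} not found in define line")
        record[name] = int(match.group(1))
    alleles = re.search(r"alleles='([^']+)'", defline)
    if alleles is None:
        raise ValueError("alleles not found in define line")
    record["alleles"] = alleles.group(1).split("/")
    return record


# Example data copied from the slide.
DEFLINE = (">gnl|dbSNP|rs271_allelePos=51totallen=101"
           "|taxid=9606|snpClass=1|alleles='G/A'")
FIVE_PRIME = "CTGCATCACA TGTACTGATT CTGTCCATTG GAACAGAGAT GATGACTGGT".replace(" ", "")
THREE_PRIME = "TTACTAAACC CTGAGCCCTG GTGTTTCTGT TGATAGGGGG TTGCATTGAT".replace(" ", "")

record = parse_rs_defline(DEFLINE)
# The variation sits immediately after the 5' flank.
offset_is_consistent = len(FIVE_PRIME) + 1 == record["allele_pos"]
# 5' flank + one variant position + 3' flank should equal the total length.
length_is_consistent = len(FIVE_PRIME) + 1 + len(THREE_PRIME) == record["total_len"]
```

The two consistency checks reflect the layout on the slide: 50 bp of 5' sequence, the variation (R, for G/A) at position 51, and 50 bp of 3' sequence.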

Page 18: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

The SNP Consortium (TSC)

Page 19: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

The SNP Consortium (TSC)

• The SNP Consortium (TSC) is a public/private collaboration that has to date discovered and characterized nearly 1.8 million SNPs

• The TSC was funded by 11 corporate members and the Wellcome Trust.

• Started in April 1999 with the mission of developing up to 300,000 SNPs distributed evenly throughout the human genome. It finished in 2001 with 1.5 million SNPs.

• Well designed. Good quality of SNP data and allele frequencies.

Page 20: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Celera CDS

Page 21: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Sequenom’s RealSNP

• Aims to develop assays for Sequenom’s Mass Spec Genotyping machine.

• Most candidate SNPs were obtained from dbSNP; some came from Incyte’s proprietary SNPs

• Started in 2002

• Over 5.4M designed SNP assays

• Over 400,000 working assays

• Over 220,000 confirmed polymorphic SNPs

Page 22: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Distribution of Heterozygosity: 1.42 million SNP Map

• The genome was divided into contiguous bins of 200,000 bp. A histogram was generated of the distribution of heterozygosity values across all such bins.

• Heterozygosity was calculated across contiguous 200,000-bp bins on Chromosome 6. The blue lines represent the values within which 95% of regions fall: 2.0 x 10^-4 to 15.8 x 10^-4. Red, bins falling outside this range. The extended region of unusually high heterozygosity centred at 34 Mb corresponds to the HLA.

• Correlation of nucleotide diversity with GC content of each read (autosomes only). Higher GC content, higher nucleotide diversity.

• Nature 2001 409:928-933

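The binning described above can be sketched as follows. The slide does not give the heterozygosity formula, so this sketch assumes it is the number of observed differences divided by the number of bases compared within each bin; the input tuples and function names are hypothetical. The 200,000-bp bin size and the 95% range are taken from the slide.

```python
from collections import defaultdict

BIN_SIZE = 200_000             # bin width from the slide
LOW, HIGH = 2.0e-4, 15.8e-4    # 95% range quoted on the slide


def heterozygosity_by_bin(observations, bin_size=BIN_SIZE):
    """Aggregate (position, differences, bases_compared) tuples per bin.

    Assumption: heterozygosity = differences / bases compared.
    """
    differences = defaultdict(int)
    compared = defaultdict(int)
    for position, n_diff, n_bases in observations:
        index = position // bin_size
        differences[index] += n_diff
        compared[index] += n_bases
    return {index: differences[index] / compared[index]
            for index in sorted(compared) if compared[index] > 0}


def bins_outside_range(per_bin, low=LOW, high=HIGH):
    """Bin indexes whose heterozygosity falls outside [low, high]."""
    return [index for index, value in per_bin.items()
            if value < low or value > high]


# Made-up observations, for illustration only.
observations = [
    (10, 1, 1000),
    (150_000, 1, 1000),
    (250_000, 5, 1000),
    (450_000, 0, 10_000),
]
per_bin = heterozygosity_by_bin(observations)
outliers = bins_outside_range(per_bin)
```

Bins flagged this way correspond to the red bins in the chromosome 6 plot, such as the HLA region.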

Page 23: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

• To develop a haplotype map of the human genome

• To describe the common patterns of human DNA sequence variation

• U.S.A., Japan, the U.K., Canada, China, and Nigeria

• A total of 270 people

− Yoruba, Nigeria (30 both-parent-and-adult-child trios)

− Japanese (45 unrelated individuals)

− Han Chinese (45 unrelated individuals)

− CEPH (30 trios)

• Genotyped for at least 1 million SNPs evenly across the human genome

Page 24: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

The Human Genome & Variation

Science February 2001 Nature February 2001

Page 25: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

The Rodent Genome & Variation

Nature December 5, 2002; Nature April 1, 2004

Page 26: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Human Genome Sequencing Project

International Human Genome Sequencing Consortium (IHGSC)
− A collaboration of 20 groups from the USA, the United Kingdom, Japan, France, Germany, and China
− Goals: DNA sequence, genetic map, physical map, genetic variation, functional analysis, etc.
− A 15-year $3 billion project (planned 1990-2005; draft sequence published 2001)
− Hierarchical shotgun sequencing strategy

Celera Human Genome Project
− Competed with the IHGSC from the biotech industry
− Whole-genome shotgun sequencing (WGS) strategy
− DNA samples from five individuals, mainly from Craig Venter

Many follow-up studies
− Chromosomes 6, 7, 9, 10, 13, 14, 16, 19, 20, 21, 22
− Comparative genomics

Nature 2001 409:860-921

Science 2001 291:1304-1351

Science 2003 300:286-290

Page 27: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

The Automatic Production Line at the Whitehead Genome Sequencing Center

Page 28: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

The Largest Government Projects Since 1990

Proposed Project                Projected cost ($ billion)  Target completion date  Estimated life-span (years)
Space Station Freedom           30.0                        1999                    30
Earth Observing System          17.0                        2000                    15
Superconducting Super Collider  11.0                        1999                    30
Human Genome Project            3.0                         2005                    Perpetual
Hubble Space Telescope          1.5                         1990                    15-20

Science 2003 300:286-290

Page 29: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Mouse Genome Sequencing Project

Mouse Genome Sequencing Consortium (MGSC)
− Whitehead/MIT Genome Center
− Washington University Genome Sequencing Center
− Wellcome Trust Sanger Institute
− Ensembl

Hybrid Sequencing Strategy (WGS and hierarchical shotgun)

Single mouse strain C57BL/6J (female)

SNPs generated by WGS sequencing: 79,269 SNPs from four strains (C57BL/6J, 129S1/SvImJ, C3H/HeJ, BALB/cByJ)

Nature 2002 420:520

Page 30: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Nature 2002 420:574-578

Page 31: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Rat Genome Sequencing Project

Rat Genome Sequencing Consortium (RGSC)
− Led by Baylor Genome Sequencing Center (BCM-HGSC)
− International collaboration including Celera Genomics

Combined Strategy: WGS and BAC Sequencing

Brown Norway rat (most sequences from two females)

The rat genome (2.75 Gb) is smaller than the human (2.9 Gb) but larger than the mouse (2.5 Gb?)

These three genomes encode similar numbers of genes

Almost all human genes known to be associated with disease have orthologues in the rat genome

About a billion nucleotides (~40% of the euchromatic rat genome) are in the orthologous alignment among human/mouse/rat.

Nature 2004 428:493-521

Page 32: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Hypermutability of CpG

CG -> TG
GC -> AC (complementary strand)

      Mouse (32)  Human (34)
CG    -3.52%      -3.19%
TG    +1.38%      +1.21%
CA    +1.38%      +1.21%

• 30,000 to 45,000 CpG islands in the human genome (Science 2001)

• 45,000 and 37,000 in the human and mouse genomes (PNAS 1993, 90:11995)

• 27,000 and 15,500 in the human and mouse genomes (Nature 2002)

Page 33: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Neighboring Nucleotide Bias of SNPs

[Figure: two panels (Mouse, Human) plotting the percentage of bias (%) for A, C, G, and T against position (bp), from -6 to +6. Data labels on the plots: -4.44, +4.91, -4.63, +5.05, -4.44, +2.58, -3.55]

Page 34: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Map of Conserved Synteny between Human, Mouse, and Rat Genomes

Page 35: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Infer the Mutation Direction

• We have human SNPs with outgroup chimpanzee sequences (divergence time is about 4-6 million years, sequence difference is about 1.2%)

• We have mouse SNPs with outgroup rat sequences (divergence time is about 12-24 million years, sequence diversity is unknown)

Page 36: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Infer the Mutation Direction

[Figure: human SNP alleles (Hum SNPs) compared with chimpanzee (Chimp) and orangutan (Oran) outgroup sequences; the two examples shown give the directions A->C and C->A]
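The inference on this slide can be sketched in a few lines. This is a minimal sketch under a simple parsimony assumption: the allele that matches the outgroup base is taken as ancestral and the other as derived. The function name `infer_direction` is hypothetical, and the sketch uses a single outgroup base, whereas the figure shows two outgroups.

```python
def infer_direction(snp_alleles, outgroup_base):
    """Infer the mutation direction of a biallelic SNP from one outgroup base.

    Assumption: the allele shared with the outgroup is ancestral.
    Returns a string such as "A->C", or None when the outgroup base
    matches neither allele and the direction cannot be inferred.
    """
    first, second = (allele.upper() for allele in snp_alleles)
    outgroup = outgroup_base.upper()
    if first == second:
        raise ValueError("a SNP needs two different alleles")
    if outgroup == first:
        return f"{first}->{second}"
    if outgroup == second:
        return f"{second}->{first}"
    return None
```

For an A/C SNP, an outgroup base of A gives `"A->C"` and an outgroup base of C gives `"C->A"`, matching the two directions shown on the slide. A second, more distant outgroup could be used to check the first, at the cost of leaving more sites unresolved.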

Page 37: Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Web Resources

NCBI dbSNP: www.ncbi.nlm.nih.gov/SNP and ftp.ncbi.nlm.nih.gov/snp

Celera Genomics: www.celera.com

The SNP Consortium (TSC): http://snp.cshl.org

UCSC Genome Browser: http://genome.ucsc.edu/

The Human Gene Mutation Database (HGMD): http://archive.uwcm.ac.uk/uwcm/mg/hgmd0.html

Human Genome Variation Database (HGVD): http://hgvbase.cgb.ki.se/

MIT SNP database, Human: http://www.broad.mit.edu/snp/human/

MIT SNP database, Mouse: http://www.broad.mit.edu/snp/mouse/

Sequenom RealSNP: https://www.realsnp.com/default.asp

Ensembl Genome Browser: http://www.ensembl.org/

The HapMap Project: http://www.hapmap.org/

Mouse Phenome Database: http://aretha.jax.org/pub-cgi/phenome/mpdcgi?rtn=projects/details&sym=Mpd1