GRC Workshop ASHG 22 Oct 2013


Page 1: GRC Workshop

GRC Workshop
ASHG

22 Oct 2013

Page 2: GRC Workshop

Outline
• Reference Assembly Basics
• GRC: Assembly management and dataflow
• GRCh38
• Accessing the assembly and data

http://genomereference.org

Page 3: GRC Workshop

What is the Reference Assembly?

Reference Assembly Basics

Page 4: GRC Workshop
Page 5: GRC Workshop
Page 6: GRC Workshop

An assembly is a MODEL of the genome

Page 7: GRC Workshop
Page 8: GRC Workshop

Lander and Waterman (1988) Genomics

Assumptions:
• Reads are randomly distributed
• Overlap between reads does not vary

Variables:
G = haploid genome length in bp
L = sequence read length in bp
N = number of reads sequenced
T = amount of overlap needed for detection in bp
C = coverage (C = LN/G)

Poisson distribution:
$P(Y = y) = \dfrac{\lambda^{y} e^{-\lambda}}{y!}$
y = number of events in an interval
λ = mean number of events in an interval

For sequence calculations, coverage can be viewed as λ.

Reference Assembly Basics

Using this equation, you can calculate the probability that a base has been sequenced y times.

By manipulating this formula, you can estimate the number of gaps for any given level of coverage.
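The calculation described above can be sketched in a few lines of Python, taking the mean of the Poisson distribution to be the coverage C. The function names are illustrative only. The expected-contig formula, N·e^(−Cσ) with σ = 1 − T/L, is the standard Lander-Waterman result and is an assumption here, since the slide does not spell it out; the number of gaps is roughly the number of contigs minus one.

```python
import math


def prob_covered_y_times(coverage: float, y: int) -> float:
    """Poisson probability that a base is sequenced exactly y times."""
    return (coverage ** y) * math.exp(-coverage) / math.factorial(y)


def fraction_not_sequenced(coverage: float) -> float:
    """Probability that a base is never sequenced, i.e. P(Y = 0) = e^-C."""
    return prob_covered_y_times(coverage, 0)


def expected_contigs(n_reads: int, read_len: int, genome_len: int,
                     min_overlap: int) -> float:
    """Lander-Waterman expected number of contigs: N * exp(-C * sigma)."""
    coverage = read_len * n_reads / genome_len   # C = LN/G
    sigma = 1 - min_overlap / read_len           # sigma = 1 - T/L
    return n_reads * math.exp(-coverage * sigma)


if __name__ == "__main__":
    for c in (1, 5, 10):
        missing = fraction_not_sequenced(c)
        print(f"{c}X coverage: {missing:.5%} not sequenced, "
              f"{1 - missing:.5%} sequenced")
```

At 1X coverage this gives e^(−1), about 37% of bases not sequenced.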

Page 9: GRC Workshop

Coverage        Not sequenced   Sequenced
1X Coverage     37%             63%
5X Coverage     0.6%            99.4%
10X Coverage    0.005%          99.995%

Reference Assembly Basics

Page 10: GRC Workshop

2009 Sanger cost:
shotgun sequence ~ $0.01/base
finished sequence ~ $0.03/base

This clone:
Shotgun = $1500
Finish = $3000

Reference Assembly Basics

Page 11: GRC Workshop

Reference Assembly Basics

Page 12: GRC Workshop

Sequence Gaps: Uncaptured vs. Total

[Figure: bar chart of gaps per BAC (axis label "Gap Ave. per BAC", scale 0 to 10) by species, split into uncaptured gaps and captured gaps. Species shown: tetraodon, muntjak_indian, zebrafinch, zebrafish, macaque, alligator, chicken, sheep, monodelphis, orangutan, gorilla, vervet, cpbat, chimp, owl_monkey, cat, pig, dusky_titi, cow, elephant, fugu, baboon, dog, hedgehog, shrew, armadillo, opossum, squirrel_monkey, rabbit, galago, lemur, rfbat, rat, mouse, marmoset, wallaby, colobus_monkey, platypus.]

Captured gap = no sequence, but a sub-clone spans the gap
Uncaptured gap = no sequence, no sub-clone spanning gap

Bob Blakesley, NISC

Reference Assembly Basics

Page 13: GRC Workshop

Biology
Repetitive sequence (interspersed repeats, segmental duplications)
Variation (regions of high diversity, structural variation)

Kidd et al., 2008

Reference Assembly Basics

Page 14: GRC Workshop

Reference Assembly Basics

Eugene Yaschenko, NCBI

Page 15: GRC Workshop

[Figure: Enrichment, Observed vs. Expected, by PANTHER category (axis scales -5 to 5 and 60 to 0 to 60). Categories shown: Major histocompatibility complex antigen; Chemokine; Tumor necrosis factor receptor; Other cytokine receptor; Cysteine protease inhibitor; CAM family adhesion molecule; Apolipoprotein; KRAB box transcription factor; Intermediate filament; Immunoglobulin receptor family member; Other cell adhesion molecule; Zinc finger transcription factor; Defense/immunity protein; Structural protein; Cysteine protease; Cytokine receptor; Oxygenase; Cell adhesion molecule; Transcription factor; Miscellaneous function; Signaling molecule; Oxidoreductase; Unclassified; Nucleic acid binding; Select regulatory molecule; Kinase; Hydrolase; Ribosomal protein; Protein kinase; G-protein modulator; Extracellular matrix; Other transcription factor.]

Human- PANTHER classifications (biological process)

Evan Eichler, University of Washington

Reference Assembly Basics

Page 16: GRC Workshop

Technology
Read length: long reads vs. short reads
Mate lengths: distribution of insert sizes
Read accuracy: error model for your technology
Read depth: coverage at each base
Genome distribution: reads covering entire genome equally

Ajay et al., 2011

Page 17: GRC Workshop

Genome Research, May, 1997

Reference Assembly Basics

Page 18: GRC Workshop

Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb

End-sequence all clones and retain pairing information (“mate-pairs”)

Find sequence overlaps

Each end sequence is referred to as a read

WGS contig

tails

WGS: Sanger Reads

Scaffold

Reference Assembly Basics

Page 19: GRC Workshop

Genome Vocabulary

Contig: a sequence constructed from smaller, overlapping sequences, which contains no gaps.
Typically built from reads, but also from sequences in GenBank/EMBL/DDBJ

Scaffold: a sequence constructed from smaller sequences, which may contain gaps.
Typically built from sequences in GenBank/EMBL/DDBJ
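To make the contig/scaffold distinction concrete, here is a minimal sketch that splits a scaffold into its gap-free contigs. It assumes gaps are written as runs of N, as is conventional in FASTA files; the function name and the `min_gap` parameter are hypothetical.

```python
import re


def split_scaffold(scaffold: str, min_gap: int = 1):
    """Split a scaffold into contigs at runs of N.

    Returns (contigs, gaps), where contigs is a list of
    (start, end, sequence) and gaps is a list of (start, end),
    using 0-based, end-exclusive coordinates on the scaffold.
    """
    contigs, gaps = [], []
    pos = 0
    for match in re.finditer(r"[Nn]{%d,}" % min_gap, scaffold):
        if match.start() > pos:
            contigs.append((pos, match.start(), scaffold[pos:match.start()]))
        gaps.append((match.start(), match.end()))
        pos = match.end()
    if pos < len(scaffold):
        contigs.append((pos, len(scaffold), scaffold[pos:]))
    return contigs, gaps


if __name__ == "__main__":
    contigs, gaps = split_scaffold("ACGTNNNNNGGCC")
    print(contigs)
    print(gaps)
```

A scaffold with no N runs comes back as a single contig and an empty gap list.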

Reference Assembly Basics

Page 20: GRC Workshop

Schatz et al, 2010

Reference Assembly Basics

Page 21: GRC Workshop

A T T T T C C C T T C T G A A A T G A T G A A A G A G T C

Reference Assembly Basics

Page 22: GRC Workshop

BAC insert
BAC vector

Shotgun sequence

Assemble

Fold sequence

Gaps

Deeper sequence coverage rarely resolves all gaps

GAPS

“Finishers” go in to manually fill the gaps, often by PCR

Clone based assemblies

Reference Assembly Basics

Page 23: GRC Workshop

Ideally…

[Figure: sequence contigs (ABCD, FGH, KL, ON) compared against a non-sequence based map (A, BCD, EFGH, IJKLMNO); one contig is marked "(flip)".]

Reference Assembly Basics

Page 24: GRC Workshop

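More like…

[Figure: the same map (A, BCD, EFGH, IJKLMNO) compared against conflicting sequence pieces (A, BC, ZYX, W, HJ, M, V, N, O and AB, HIJ, CDY, LMNO); the placement AB, HIJ, LMNO is marked with "?".]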

Reference Assembly Basics

Page 25: GRC Workshop

Sequence vs. Non-sequence based maps
Mmu7

WI Genetic
WI/MRC RH

Page 26: GRC Workshop

Human assemblies available in the NCBI assembly database

http://www.ncbi.nlm.nih.gov/assembly

Reference Assembly Basics

Page 27: GRC Workshop

Reference Assembly Basics

Page 28: GRC Workshop

Reference Assembly Basics

N50: Measure of continuity. Half of the bases in the assembly are in contigs of this length or greater.
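A common way to compute N50 is to sort the contig (or scaffold) lengths from longest to shortest and add them up until the running total reaches half of all assembled bases. The sketch below is illustrative, and the function name is an assumption.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("n50() needs at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    print(n50([80, 70, 50, 40, 30, 20]))
```

For lengths 80, 70, 50, 40, 30 and 20 (290 bases in total), the two longest contigs already hold 150 bases, so the N50 is 70.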

Page 29: GRC Workshop

Reference Assembly Basics

Fragmented genomes tend to have more partial models
Fragmented genomes have fewer frameshifts

Alexander Souvorov, NCBI

Page 30: GRC Workshop

Outline
• Reference Assembly Basics
• GRC: Assembly management and dataflow
• GRCh38
• Accessing the assembly and data

http://genomereference.org

Page 31: GRC Workshop

http://genomereference.org

Page 32: GRC Workshop

Distributed data

Genome not in INSDC Database

Old Assembly Model

GRC Assembly Management

Human Genome Project (HGP)

Page 33: GRC Workshop

GRC Assembly Management

Page 34: GRC Workshop

GRC Assembly Management

Page 35: GRC Workshop

Distributed data

Genome not in INSDC Database

Old Assembly Model

Centralized Data

GRC Assembly Management

Page 36: GRC Workshop

Issue tracking system (based on JIRA)

GRC Assembly Management

http://genomereference.org

Page 37: GRC Workshop

GRC Assembly Management

Page 38: GRC Workshop

GRC Assembly Management

5 July 2011

Page 39: GRC Workshop

GRC Assembly Management

Page 40: GRC Workshop

GRC Assembly Management

Page 41: GRC Workshop

Tiling Path File (TPF)

ACCESSION   NAME          CONTIG
GAP         Telomere      10000
AP006221    XX-190A2      Hschr1_ctg1
AL627309    RP11-34P13    Hschr1_ctg1
GAP         type-3
AC114498    RP5-857K21    Hschr1_ctg3
AL669831    RP11-206L10   Hschr1_ctg3
AL645608    RP11-54O7     Hschr1_ctg3
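A rough sketch of reading the rows shown on this slide: component rows carry an accession, a clone name and a contig name, while GAP rows carry a gap type and, optionally, a length. This covers only the simplified layout shown here and is not a parser for the full TPF specification; the function names are hypothetical.

```python
TPF_EXAMPLE = """\
ACCESSION NAME CONTIG
GAP Telomere 10000
AP006221 XX-190A2 Hschr1_ctg1
AL627309 RP11-34P13 Hschr1_ctg1
GAP type-3
AC114498 RP5-857K21 Hschr1_ctg3
AL669831 RP11-206L10 Hschr1_ctg3
AL645608 RP11-54O7 Hschr1_ctg3
"""


def parse_tpf(text):
    """Parse whitespace-separated TPF-like rows into ordered records."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "ACCESSION":
            continue  # skip blank lines and the header row
        if fields[0] == "GAP":
            gap_type = fields[1] if len(fields) > 1 else None
            length = int(fields[2]) if len(fields) > 2 else None
            records.append(("gap", gap_type, length))
        else:
            accession, name, contig = fields[:3]
            records.append(("component", accession, name, contig))
    return records


def contigs_from_tpf(records):
    """Group component accessions by contig name, keeping tiling order."""
    contigs = {}
    for record in records:
        if record[0] == "component":
            contigs.setdefault(record[3], []).append(record[1])
    return contigs


if __name__ == "__main__":
    for contig, accessions in contigs_from_tpf(parse_tpf(TPF_EXAMPLE)).items():
        print(contig, accessions)
```

Grouping by the third column recovers the two contigs on the slide, with the gap rows marking the breaks between them.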

GRC Assembly Management

Page 42: GRC Workshop

Full Dovetail

Half-dovetail

Contained

Short/Blunt

GRC Assembly Management

Page 43: GRC Workshop

GRC Assembly Management

Page 44: GRC Workshop

GRC Assembly Management

Page 45: GRC Workshop

GRC Assembly Management

Page 46: GRC Workshop

GRC Assembly Management

Page 47: GRC Workshop

Build sequence contigs based on contigs defined in TPF (Tiling Path File).

Check for orientation consistencies
Select switch points
Instantiate sequence for further analysis

Switch point

Representative chromosome sequence

GRC Assembly Management

Page 48: GRC Workshop

AGP: A Golden Path

Provides instructions for building a sequence
• Defines component sequences used to build scaffolds/chromosome
• Switch points
• Defines gaps and types

GRC Produces
• AGP
• FASTA

GRC Assembly Management
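As an illustration of how an AGP file acts as building instructions, the sketch below assembles an object sequence from component sequences. It assumes the standard tab-separated AGP columns (object, object start, object end, part number, component type, then either component ID, start, end and orientation, or gap length and gap attributes). The component names and sequences are invented, and the begin/end coordinates on each component line play the role of the switch points.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_from_agp(agp_text: str, components: dict) -> dict:
    """Build object sequences (scaffolds/chromosomes) from AGP lines."""
    objects = {}
    for line in agp_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        obj, component_type = cols[0], cols[4]
        if component_type in ("N", "U"):
            # Gap line: column 6 holds the gap length.
            piece = "N" * int(cols[5])
        else:
            comp_id, beg, end, orient = cols[5], int(cols[6]), int(cols[7]), cols[8]
            piece = components[comp_id][beg - 1:end]  # AGP is 1-based, inclusive
            if orient == "-":
                piece = reverse_complement(piece)
        objects[obj] = objects.get(obj, "") + piece
    return objects


# Hypothetical components and AGP lines, for illustration only.
EXAMPLE_COMPONENTS = {"cloneA.1": "AAAACCCC", "cloneB.1": "GGGGTTTTAC"}
EXAMPLE_AGP = "\n".join(
    "\t".join(fields)
    for fields in [
        ("chrT", "1", "6", "1", "F", "cloneA.1", "1", "6", "+"),
        ("chrT", "7", "11", "2", "N", "5", "scaffold", "yes", "paired-ends"),
        ("chrT", "12", "15", "3", "F", "cloneB.1", "3", "6", "-"),
    ]
)

if __name__ == "__main__":
    print(build_from_agp(EXAMPLE_AGP, EXAMPLE_COMPONENTS))
```

The FASTA that accompanies an AGP is, in effect, the result of applying these instructions to the component sequences.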

Page 49: GRC Workshop

Distributed data

Old Assembly Model
Centralized Data

Updated Assembly Model

GRC Assembly Management

Genome not in INSDC Database

Page 50: GRC Workshop

Sequences from haplotype 1
Sequences from haplotype 2

Old Assembly model: compress into a consensus

New Assembly model: represent both haplotypes

GRC Assembly Management

Page 51: GRC Workshop

Assembly (e.g. GRCh37)

Primary Assembly

Non-nuclear assembly unit

(e.g. MT)

ALT 1
ALT 2
ALT 3
ALT 4
ALT 5
ALT 6
ALT 7
ALT 8
ALT 9

PAR

Genomic Region (MHC)
Genomic Region (UGT2B17)
Genomic Region (MAPT)

GRC Assembly Management

Page 52: GRC Workshop

UGT2B17 Region

NCBI36 NC_000004.10 (chr4) Tiling Path
AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7
TMPRSS11E, TMPRSS11E2

GRCh37 NC_000004.11 (chr4) Tiling Path
AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7
TMPRSS11E

GRCh37: NT_167250.1 (UGT2B17 alternate locus)
AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7
TMPRSS11E2

Xue Y et al, 2008

GRC Assembly Management

Page 53: GRC Workshop

GRC Assembly Management

7 alternate haplotypes at the MHC

Alternate loci released as:
• FASTA
• AGP
• Alignment to chromosome

UGT2B17 MHC MAPT

GRCh37 (hg19)

Page 54: GRC Workshop

Assembly (e.g. GRCh37.p13)

Primary Assembly

Non-nuclear assembly unit

(e.g. MT)

ALT 1
ALT 2
ALT 3
ALT 4
ALT 5
ALT 6
ALT 7
ALT 8
ALT 9

PAR

Genomic Region (MHC)
Genomic Region (UGT2B17)
Genomic Region (MAPT)

Patches

Genomic Region (ABO)
Genomic Region (SMA)
Genomic Region (PECAM1)

GRC Assembly Management

Page 55: GRC Workshop

GRC Assembly Management

GRCh37.p13
• 178 Regions: 3.15% of chromosome sequence
• 131 FIX patches: add 6.8 Mb novel sequence
• 73 NOVEL patches: add >800kb novel sequence

Page 56: GRC Workshop

MHC (chr6)

Chr 6 representation (PGF)

Alt_Ref_Locus_2 (COX)

GRC Assembly Management

Page 57: GRC Workshop

17q deletion

H1

H2

Zody et al, 2008

GRC Assembly Management

Page 58: GRC Workshop

GRC Assembly Management

Page 59: GRC Workshop

chromosome

alt/patch

reads On-target alignment

Off-target alignments

(n=122,922)

GRC Assembly Management

Page 60: GRC Workshop

GRC Assembly Management

Page 61: GRC Workshop

Masks and alt aware aligners reduce the incidence of ambiguous alignments observed when aligning reads to the full assembly

Mask1: mask chr for fix patches, scaffold for novel/alts.
Mask2: mask only on scaffolds

GRC Assembly Management

Page 62: GRC Workshop

Distributed data

Genome not in INSDC Database

Old Assembly Model
Centralized Data

Updated Assembly Model

Genome in INSDC Database
Genome not in INSDC Database

GRC Assembly Management

Page 63: GRC Workshop

Outline
• Reference Assembly Basics
• GRC: Assembly management and dataflow
• GRCh38
• Accessing the assembly and data

http://genomereference.org

Page 64: GRC Workshop

GRCh38 Impact

GRCh38

Page 65: GRC Workshop

GRCh38 Impact

GRCh37 Scaff N50: 44,983,201
GRCh37B Scaff N50: 62,124,159

GRCh37 Contig N50: 38,440,852
GRCh37B Contig N50: 49,319,739

Page 66: GRC Workshop

GRCh38 Impact

Page 67: GRC Workshop

GRCh38 Impact

Page 68: GRC Workshop

Major Features of GRCh38

• Modeled centromeres
• Individual base updates
• Fixed tiling path/assembly errors
• Addition of novel sequence

GRCh38 Impact

Page 69: GRC Workshop

CENTROMERES

GRCh38 Impact

Page 70: GRC Workshop

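[Figure: sources of individual base updates]

Mismatches, MAF = 0: n=15,244 (Venn diagram of the 61-mer analysis set and the 1kG high-confidence set, with counts 9664, 1358 and 4222)
Insertions, MAF = 0: n=834
Deletions, MAF = 0: n=1541
Mismatch in pseudo/pr txpt, MAF < 5%: n=1413
Annotator and clinical requests: n= ~260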

GRCh38 Impact

Page 71: GRC Workshop

Pile-Up Analysis: “Never Seen” Mismatched Bases Originating from RP11 Components

GRCh38 Impact

79% of these bases are heterozygous in RP11 WGS

Page 72: GRC Workshop

GRCh37 Insertions Originating from RP11

GRCh38 Impact

GRCh37 Deletions Originating from RP11

17% heterozygous in RP11 WGS 18% heterozygous in RP11 WGS

Page 73: GRC Workshop

GRCh38 Impact

Page 74: GRC Workshop

GRCh38 Impact

Page 75: GRC Workshop

GRCh38 Impact

Page 76: GRC Workshop

1q32 1q21 1p21

1p21 patch alignment to chromosome 1

Dennis et al., 2012

GRCh38 Impact

Page 77: GRC Workshop

HYDIN: chr16 (16q22.2)
HYDIN2: chr1 (1q21.1)
Missing in NCBI35/NCBI36
Unlocalized in GRCh37
Finished in GRCh38

Alignment of HYDIN2 Genomic, 300 Kb, 99.4% ID
Alignment of HYDIN CHM1_1.0, >99.9% ID

Doggett et al., 2006

GRCh38 Impact

Page 78: GRC Workshop

GRCh38 Impact

Other Major Tiling Path Updates
• Single CHM1 haplotype paths for:
  • 1p12, 1q21, 1q32: SRGAP2
  • IGH
  • LRC/KIR
  • CCL3L1 (17q21)
• OM-guided
  • 10q11
  • Chr. 9 peri-centromeric inversion

Page 79: GRC Workshop

GRCh38 Impact

NOVEL GENES!
GRCh37.p13: 211 genes found only on alt loci and patches

Page 80: GRC Workshop

GRCh38 Impact

Sudmant et al., 2010

Page 81: GRC Workshop

Genovese et al., 2013

Page 82: GRC Workshop

1000G decoy sequence, viewed by:
• GenBank alignment
• Percent Repeat Masked
• Repeat Mask type
• Sequence Source (HTG, HuRef, ALLPATHS)

GRCh38 Impact

In a preliminary analysis, 90% of NA12878 reads that previously aligned uniquely to the decoy sequence had an alignment to the updated assembly.

Page 83: GRC Workshop

GRCh38 Impact

Where is the decoy sequence in GRCh38?
• Alt loci (low repeat content)
• Model centromeres (high repeat content)
• Unlocalized/Unplaced Scaffolds
• Chromosomes

Page 84: GRC Workshop

Outline
• Reference Assembly Basics
• GRC: Assembly management and dataflow
• GRCh38
• Accessing the assembly and data

http://genomereference.org

Page 85: GRC Workshop

Accessing the Data

Page 86: GRC Workshop

Accessing the Data

Page 87: GRC Workshop

Accessing the Data

Page 88: GRC Workshop

Accessing the Data

Page 89: GRC Workshop

Accessing the Data

Page 90: GRC Workshop

NCBI Genes, Ensembl Genes, Annotated Clone Problems, Segmental Duplications

Accessing the Data

Page 91: GRC Workshop

Accessing the Data

Page 92: GRC Workshop

Accessing the Data

Page 93: GRC Workshop

Accessing the Data

Page 94: GRC Workshop

Accessing the Data

Page 95: GRC Workshop

GRCh38 in Ensembl

GRCh38 will be incorporated into the existing Ensembl interface. Features such as genes, variation, regulation will be remade or remapped onto the new genome. Nearly 500 tracks are available.

GENCODE gene set

Page 96: GRC Workshop

Accessing the Data

Page 97: GRC Workshop

Alternate sequences in Ensembl

Haplotypes and patches on the chromosome

A fix patch around the ABO gene

Use the Region comparison view to see the difference between the patch and primary assembly

The GRC alignment track indicates edits

Page 98: GRC Workshop

View your data on the Genome

Zoomed in

Zoomed out

Follow the link from the homepage

Red bases show mismatches

Page 99: GRC Workshop

Transition to GRCh38 in Ensembl

INSDC coordinates identify the assembly as well as the position

Convert coordinates between assemblies

Our blog series details our progress with GRCh38
Ensembl.info

Page 100: GRC Workshop

Remap Set up slide

Page 101: GRC Workshop

Accessing the Data

Page 102: GRC Workshop

Accessing the Data

Page 103: GRC Workshop

1000 Genomes Browser: http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes
GeT-RM Browser: http://www.ncbi.nlm.nih.gov/variation/tools/getrm
Variation Viewer: http://www.ncbi.nlm.nih.gov/variation/view (coming Fall 2013!)

Page 104: GRC Workshop
Page 105: GRC Workshop

Tiling Path

Sequence Bar

Segmental Duplications, Eichler Lab

1000 Genomes strict accessibility mask

Annotated clone assembly problems

Page 106: GRC Workshop

dbSNP Build 138 based on annotation run 104

Model based paralogous sequence differences, NCBI annotation run #
Paralogous/pseudo gene alignments, NCBI annotation run #

Single Unique Nucleotide (SUN) map, Sudmant 2010
ClinVar Long Variations

GRC Curation Issues

ClinVar Short Variations

Page 107: GRC Workshop

http://twitter.com/[email protected]

Accessing the Data

Page 108: GRC Workshop

http://genomeref.blogspot.com/

Accessing the Data

Page 109: GRC Workshop
Page 110: GRC Workshop
Page 111: GRC Workshop
Page 112: GRC Workshop
Page 113: GRC Workshop

Accessing the Data