Next-generation genomics: an integrative approach


Permissions: you are free to blog or live-blog about this presentation as long as you attribute the work to its authors

Korea Center for Disease Control & Prevention

Next-generation genomics: an integrative approach

Chang Bum Hong

Division of Structural and Functional Genomics, Center for Genome Sciences, NIH


APPLICATIONS OF NEXT-GENERATION SEQUENCING

2011

• Genome structural variation discovery and genotyping

• RNA sequencing: advances, challenges and opportunities

• Charting histone modifications and the functional organization of mammalian genomes

2010

• Evaluating genome-scale approaches to eukaryotic DNA replication

• Advances in understanding cancer genomes through second-generation sequencing

• Genome-wide allele-specific analysis: insights into regulatory variation

• Next-generation genomics: an integrative approach

• Uncovering the roles of rare variants in common disease through whole-genome sequencing

• Principles and challenges of genome-wide DNA methylation analysis

• Prokaryotic transcriptomics: a new view on regulation, physiology and pathogenicity

• Sequencing technologies - the next generation

• RNA processing and its regulation: global insights into biological networks

2009

• The complex eukaryotic transcriptome: unexpected pervasive transcription and novel small RNAs

• ChIP-seq: advantages and challenges of a maturing technology

• Insights from genomic profiling of transcription factors

• RNA-Seq: a revolutionary tool for transcriptomics

DNA

RNA

Protein

Complete genome resequencing

Targeted genomic resequencing

de novo sequencing

Translated into proteins

DNA being transcribed into RNA

Phenotype

Disease

Chromatin immunoprecipitation sequencing

Sequencing of bisulfite-treated DNA

Epigenome

Transcriptome sequencing

Small RNA sequencing

Proteomics

Transcriptomics

Genomics

Genome-scale data

GWAS, ChIP-seq and RNA-seq

Next-generation sequencing

• We define this as the use of established sequencing platforms, including:

• Illumina/Solexa Genome Analyzer

• Roche/454 Genome Sequencer

• Applied Biosystems SOLiD

• Helicos and Pacific Biosciences

HiSeq 2000

5500xl SOLiD System

MiSeq

Ion Personal Genome Machine

Genome Sequencer FLX System

GS Junior

HeliScope Single Molecule Sequencer

PacBio RS

Jay Flatley Greg Lucier Stephen Quake

Jim Watson Craig Venter

John West

Former Illumina CEO

Founder of Helicos

Life Technologies CEO

Illumina CEO

?

BGI 1 x 454, 27 x SOLiD3/4, 128 x Illumina HiSeq

Broad Institute 94 x Illumina GA2, 10 x 454, 8 x SOLiD3/4, 1 x Heliscope, 1 x Polonator, 1 x PacBio

Next Generation Genomics: World Map of High-throughput Sequencers

http://pathogenomics.bham.ac.uk/hts/

GMI at Seoul National University College of Medicine 10 x Illumina GA2

Macrogen 10 x Illumina GA2, 1 x 454, 2 x SOLiD3/4

NICEM Illumina GA2, 454

Gachon University of Medicine and Science Illumina GA2, 2 x SOLiD3/4

KRIBB 1 x Illumina GA2

• Next-next....-generation: how many ‘next’s are there?

• First Generation: automated version of Sanger sequencing (the DNA-sequencing method invented by Fred Sanger in the 1970s)

• Second Generation

• Roche/454 sequencing machine from 454 Life Sciences (2005)

• 450 bases per read / $0.02 per 1000 bases / 2 days per Gb

• Solexa from Illumina (2006)

• 75 bases per read / $0.01 per 1000 bases / 0.5 days per Gb

• SOLiD from Applied Biosystems (2006)

• 50 bases per read / $0.001 per 1000 bases / 0.5 days per Gb

• Next-Next-Gen - Third Generation?

• HiSeq 2000 from Illumina - 0.04 days per Gb

• Helicos HeliScope

• Pacific Biosciences SMRT
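The per-platform figures above can be turned into a rough project estimate. The sketch below is only illustrative: it uses the price and throughput numbers quoted on the slide, while the genome size (3.1 Gb) and the 30x coverage are assumptions added here, and the function name `project_estimate` is hypothetical.

```python
# Figures quoted on the slide: (USD per 1000 bases, days per Gb).
PLATFORMS = {
    "Roche/454": (0.02, 2.0),
    "Illumina/Solexa": (0.01, 0.5),
    "AB SOLiD": (0.001, 0.5),
}

KB_PER_GB = 1_000_000  # 1 Gb = 1e9 bases = 1e6 kilobases


def project_estimate(usd_per_kb, days_per_gb, genome_gb, coverage):
    """Return (total Gb sequenced, cost in USD, run time in days)."""
    total_gb = genome_gb * coverage
    cost = usd_per_kb * KB_PER_GB * total_gb
    days = days_per_gb * total_gb
    return total_gb, cost, days


if __name__ == "__main__":
    # Assumptions for illustration only: ~3.1 Gb genome, 30x coverage.
    for name, (usd_per_kb, days_per_gb) in PLATFORMS.items():
        gb, cost, days = project_estimate(usd_per_kb, days_per_gb, 3.1, 30)
        print(f"{name}: {gb:.0f} Gb, ${cost:,.0f}, {days:.0f} days")
```

The estimate ignores read length, error rates and library preparation, so it is only a way to compare the quoted numbers on a common scale.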

Sequencing technologies

Shendure & Ji, 2008

Michael L. Metzker, 2010

Sequencing technologies: Feature generation

Sequencing technologies: Sequencing by synthesis

Michael L. Metzker, 2010

• Sequencing

• How deep?

• Single reads, paired reads or both

• Alignment

• Align to a reference, assemble or both

• Experiment-specific analysis

• A ‘one-size-fits-all’ program does not exist
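The "How deep?" question above is usually answered with the mean-coverage relation C = N x L / G (number of reads times read length, divided by genome size). The following is a minimal sketch of that arithmetic; the function names are hypothetical and the example genome size (4.6 Mb, roughly a bacterial genome) is an assumption chosen for illustration.

```python
import math


def mean_depth(n_reads, read_length, genome_size):
    """Expected mean coverage C = N * L / G."""
    return n_reads * read_length / genome_size


def reads_needed(target_depth, read_length, genome_size):
    """Smallest number of reads N such that N * L / G >= target_depth."""
    return math.ceil(target_depth * genome_size / read_length)


if __name__ == "__main__":
    # Assumed example: 4.6 Mb genome, 75-base reads, 30x target depth.
    n = reads_needed(30, 75, 4_600_000)
    print(n, mean_depth(n, 75, 4_600_000))
```

For paired reads, one option is to count each end as a read of length L, so a pair contributes 2 x L bases. The relation gives only an average; real coverage is uneven across the genome.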

NGS typical procedure

• Sequence assembly

• Whole Genome Assembly (Reference, De novo)

• Transcriptome Assembly

• Short Sequence Alignment

• Single read

• Paired read

• Genomic Variation Detection

• Detection of Single Nucleotide Polymorphism (SNP)

• Detection of Alternative Splicing Events

• Detection of major/minor transcript isoforms

Applications

Shendure & Ji, 2008

Applications

Bioinformatics tools

Shendure & Ji, 2008

• Sequence Reads

• fastq

• fasta

• Alignment

• Sequence Alignment Map (SAM)

• BAM (Binary Alignment Map)

• Variation

• VCF (Variant Call Format)
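As an illustration of the read formats listed above, a FASTQ record is four lines: an "@" header, the sequence, a "+" separator and a quality string of the same length as the sequence. The sketch below is a minimal reader, assuming unwrapped four-line records and Phred+33 quality encoding (older Illumina pipelines used a different offset); `read_fastq` is a hypothetical helper, not part of any library.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Assumption: Phred+33 encoding, so quality = ASCII code - 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


if __name__ == "__main__":
    example = io.StringIO("@read1\nGATTACA\n+\nIIIIII#\n")
    for name, seq, quals in read_fastq(example):
        print(name, seq, quals)
```

FASTA drops the quality line, while SAM/BAM and VCF carry alignments and variant calls respectively and are normally read with dedicated tools rather than hand-written parsers.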

File Format

Data: Sequence Reads


A challenge: a call for a new compression algorithm

Compression of genomic sequences in FASTQ format

Sebastian Deorowicz et al., 2011

Data: Sequence Reads

Compression tool   Time     Size
gzip               14 s     28 MB
bzip2              9.75 s   23 MB
dsrc               1.36 s   21 MB
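A comparison like the one above can be approximated with the Python standard library. This is only a sketch: it covers gzip and bzip2 (dsrc is a separate tool and is not included), it runs on synthetic random reads generated by the hypothetical helper `make_fastq`, and its timings and sizes will not match the figures in the table.

```python
import bz2
import gzip
import random
import time


def make_fastq(n_reads, read_length, seed=0):
    """Build synthetic FASTQ data (random bases and qualities) as bytes."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_length))
        qual = "".join(chr(33 + rng.randint(2, 40)) for _ in range(read_length))
        lines += [f"@read{i}", seq, "+", qual]
    return ("\n".join(lines) + "\n").encode("ascii")


def benchmark(data):
    """Return {tool: (seconds, compressed size in bytes)} for gzip and bzip2."""
    results = {}
    for name, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
        start = time.perf_counter()
        packed = compress(data)
        results[name] = (time.perf_counter() - start, len(packed))
    return results


if __name__ == "__main__":
    data = make_fastq(20_000, 75)
    print(f"raw: {len(data)} bytes")
    for tool, (seconds, size) in benchmark(data).items():
        print(f"{tool}: {seconds:.2f} s, {size} bytes")
```

Random reads compress worse than real data, which contains repeats and correlated quality values, so real FASTQ files are the meaningful test.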

• ChIP-Seq

• Assays where, and how strongly, a protein binds to DNA, such as a transcription factor bound at the start site of a gene, or histones of a certain type

• RNA-Seq

• Mapping transcription start sites

• Characterization of alternative splicing patterns

• Gene fusion detection

• Estimation of the abundance of the transcripts from their depth of coverage in the mapping
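Estimating transcript abundance from depth of coverage, as in the last item above, needs a normalization for transcript length and sequencing depth. One common choice, named here as an assumption since the slide does not specify a measure, is RPKM (reads per kilobase of transcript per million mapped reads). A minimal sketch:

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Hypothetical counts: (reads mapped to transcript, transcript length in bp).
    counts = {"geneA": (500, 2000), "geneB": (500, 500)}
    total = 10_000_000  # assumed total number of mapped reads
    for gene, (reads, length) in counts.items():
        print(gene, rpkm(reads, length, total))
```

In this example both transcripts receive the same number of reads, but the shorter one gets the higher value because the same reads are spread over fewer bases.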

Example of Applications

ChIP-Seq

Barski A & Zhao K, 2009

Chromatin immunoprecipitation (ChIP)

Kharchenko et al, 2008

Shirley Pepke et al., 2009

ChIP-Seq

Shirley Pepke et al., 2009

ChIP-Seq Software packages

Shirley Pepke et al., 2009

RNA-Seq

Zhong Wang, 2009

RNA-Seq (De novo transcriptome assembly)

RNA-Seq (Transcriptome resequencing)

RNA-Seq

RNA-Seq mapping of short reads over exon-exon junctions. Depending on where each end maps, a junction read can be classified as a trans or a cis event.

from wikipedia.org

RNA-Seq Software packages

Shirley Pepke et al., 2009

• Genes in DNA are transcribed into RNA, which

• may be spliced

• is transported to an appropriate cellular compartment

• is translated into proteins

• Gene expression is regulated at many levels

• DNA methylation

• chromatin modification

• binding of transcription factors to the DNA

• binding of splicing factors to the RNA and RNA transport

DNA encodes heritable traits

• What types of genomic data sets are available?

• Why perform integrative genomic analysis?

• Approaches to an integrative analysis

• Using large-scale data sets for integrative analysis

• Future perspectives

NGG (Next-generation genomics): an integrative approach

• Sequence variation data

• SNP genotyping arrays

• resequencing

• Transcriptomic data

• RNA-Seq

• identify transcripts arising from gene fusion events

• detect novel classes of non-coding RNAs

• Epigenomic data

• Bisulphite treatment

• Chromatin immunoprecipitation

• Interactome data

• RNA-protein interaction

• Protein-protein interaction networks

• define genetic and signaling pathways

What types of genomic data sets are available?

• Annotating functional features of the genome

• Inferring the function of genetic variants

• Understanding mechanisms of gene regulation

Figure 1 | Annotating the genome through detecting transcription-factor binding sites and histone-modification states.

Why perform integrative genomic analysis?

Figure 2 | Identification of regulatory SNPs

Approaches to an integrative analysis

• Data complexity reduction

• summarize each experiment as a collection of genomic regions with strong enrichment of signal

• especially important to inspect at least some of the results by eye

• Unsupervised integration

• The goal is not to find some correct answer but to discover structure within the data set

• Clustering: partitioning a large data set into easily digestible, conceptual pieces

• Supervised integration

• A technique that learns how to make predictions from example inputs and outputs

• Bayesian network
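The data complexity reduction step above, summarizing an experiment as a collection of regions with strong signal enrichment, can be sketched as follows. This is a deliberately simple illustration, not a published peak caller: it assumes read counts already binned along one chromosome for a ChIP sample and a control, and the fold threshold, pseudocount and function name `enriched_regions` are all assumptions.

```python
def enriched_regions(chip, control, bin_size, min_fold=4.0, pseudocount=1.0):
    """Merge adjacent enriched bins into (start, end) regions, end exclusive."""
    if len(chip) != len(control):
        raise ValueError("chip and control must have the same number of bins")
    regions = []
    start = None  # index of the first bin of the region being extended
    for i, (c, k) in enumerate(zip(chip, control)):
        enriched = (c + pseudocount) / (k + pseudocount) >= min_fold
        if enriched and start is None:
            start = i
        elif not enriched and start is not None:
            regions.append((start * bin_size, i * bin_size))
            start = None
    if start is not None:  # region running to the end of the chromosome
        regions.append((start * bin_size, len(chip) * bin_size))
    return regions


if __name__ == "__main__":
    chip = [1, 2, 20, 30, 2, 1, 40]
    control = [1, 1, 2, 3, 2, 1, 1]
    print(enriched_regions(chip, control, bin_size=100))
```

Each experiment then becomes a short list of intervals, which is the form that later clustering or supervised steps can work on. As the slide notes, it remains worth inspecting some of the resulting regions by eye.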

Approaches to an integrative analysis

An intronic H3K4me1 peak predicts an enhancer element

Promoter

Transcribed

UCSC browser with ENCODE data
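The kind of integration shown in this browser view, an H3K4me1 peak inside an intron being read as a candidate enhancer, comes down to intersecting interval sets from different data types. The sketch below is a heuristic illustration only: all coordinates are invented, everything is assumed to lie on one chromosome with half-open intervals, and `candidate_enhancers` is a hypothetical helper.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def candidate_enhancers(h3k4me1_peaks, introns, promoters):
    """Peaks that overlap an intron and do not overlap any promoter."""
    return [
        p for p in h3k4me1_peaks
        if any(overlaps(p, i) for i in introns)
        and not any(overlaps(p, q) for q in promoters)
    ]


if __name__ == "__main__":
    # Invented coordinates, for illustration only.
    peaks = [(100, 200), (950, 1050), (5000, 5200)]
    introns = [(150, 900), (4000, 6000)]
    promoters = [(900, 1100)]
    print(candidate_enhancers(peaks, introns, promoters))
```

The all-pairs comparison is fine for a handful of regions; genome-wide sets would normally be sorted first or handled with an interval-aware tool.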

Using large-scale data sets for integrative analysis

• For the bench scientist

• open-source web browser, such as Firefox

• add-ons: gatekeepers

Using large-scale data sets for integrative analysis

• For the bench scientist

• stand-alone analytical system: CisGenome

• genome browser: UCSC browser, Anno-J

Figure 4 | Flow chart for data analysis

Workflow for ChIP-seq analysis

Galaxy

UCSC browser

Online or stand-alone tools

Using large-scale data sets for integrative analysis

• Bioinformatics hurdles

• normalized data

Future perspectives

• Data integration itself is not an end

• designed to generate novel hypotheses and help to test them

• Community-wide effort, akin to Wikipedia

• Searchable with Google-like capabilities

Future perspectives


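Toby Segaran: designed algorithms at Genstruct for understanding how drugs work

Satnam Alag: Vice President of Engineering at NextBio, which develops a vertical search engine for the life-sciences community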
