Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Genetics 211 - 2014 Lecture 3

Genome Resequencing and Functional Genomics Gavin Sherlock January 21st 2014

SAM files

•  Sequence Alignment/Map format
•  A concise file format that records how sequence reads map to a reference genome
•  Can be further compressed into BAM, the binary form of SAM
•  Can also be sorted and indexed to provide fast random access, using SAMtools (more on this in a minute)
•  Requires ~1 byte per input base to store sequences, qualities and meta information
•  Supports paired-end reads and color space
•  Is produced by aligners such as bowtie and bwa
•  SAM can be converted to pileup format

SAM format

No.  Name   Description
1    QNAME  Query NAME of the read or read pair
2    FLAG   Bitwise FLAG
3    RNAME  Reference Sequence Name
4    POS    1-Based leftmost POSition of clipped alignment
5    MAPQ   MAPping Quality (Phred-scaled)
6    CIGAR  Extended CIGAR string (operations: MIDNSHP)
7    MRNM   Mate Reference NaMe (‘=’ if same as RNAME)
8    MPOS   1-Based leftmost Mate POSition
9    ISIZE  Inferred Insert SIZE
10   SEQ    Query SEQuence on the same strand as the reference
11   QUAL   Query QUALity (ASCII-33=Phred base quality)
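The bitwise FLAG in column 2 packs several yes/no properties of a read into one integer. The following is a minimal Python sketch of how it can be decoded, using the bit values defined in the SAM specification. The function name `decode_flag` and the label strings are illustrative choices, not part of SAMtools.

```python
# Bit values and their meanings, as defined in the SAM specification.
# The label strings are illustrative names chosen for this sketch.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
]


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


# Flags 163 and 83 are the two values used for read pair r001
# in the SAM example in these slides.
for value in (163, 83):
    print(value, decode_flag(value))
```

Flag 163 is 1 + 2 + 32 + 128, and flag 83 is 1 + 2 + 16 + 64, so the two records describe the two mates of one properly paired read, one on each strand.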

SAM Format

coor  12345678901234  5678901234567890123456789012345
ref   AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT

r001+       TTAGATAAAGGATA*CTG
r004+                  ATAGCT..............TCAGC
r001-                                        CAGCGCCAT

@SQ   SN:ref  LN:45
r001  163  ref   7  30  8M2I4M1D3M  =  37  39  TTAGATAAAGGATACTG  *
r004    0  ref  16  30  6M14N5M     *   0   0  ATAGCTTCAGC        *
r001   83  ref  37  30  9M          *   0   0  CAGCGCCAT          *
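To connect the alignment picture to the SAM records, here is a small Python sketch that splits a record into the 11 mandatory fields and works out from the CIGAR string how many reference bases and read bases an alignment covers. The names `SamRecord`, `parse_sam_line` and `cigar_lengths` are made up for this example. Real SAM files are tab-delimited; the sketch splits on any whitespace only so that it also accepts the space-separated lines shown on the slide.

```python
import re
from collections import namedtuple

# Field names follow the column table in these slides.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar mrnm mpos isize seq qual",
)

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP])")


def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory SAM fields."""
    f = line.split()[:11]
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def cigar_lengths(cigar):
    """Return (reference_span, read_length) implied by a CIGAR string."""
    ref_len = 0
    read_len = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "MDN":   # match, deletion, skipped region: use up reference
            ref_len += n
        if op in "MIS":   # match, insertion, soft clip: use up read bases
            read_len += n
    return ref_len, read_len


lines = [
    "r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *",
    "r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *",
]
for line in lines:
    rec = parse_sam_line(line)
    ref_span, read_len = cigar_lengths(rec.cigar)
    end = rec.pos + ref_span - 1  # POS is 1-based and inclusive
    print(rec.qname, rec.cigar, "covers", rec.pos, "to", end,
          "read length", read_len)
```

For r001 the CIGAR 8M2I4M1D3M spans 16 reference bases (positions 7 to 22) and 17 read bases, which matches the 17-base SEQ field. For r004 the 14N skip makes an 11-base read span 25 reference bases (positions 16 to 40).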

Alignment Visualization

•  SAMtools has its own terminal-based text visualization tool, called tview
•  Very simple, but fast, and useful for looking at SNPs and indels

Alignment Visualization

•  For more complex viewing needs, use either GenomeView or IGV
•  Can display BAM files, plus annotation tracks
•  Java-based dynamic visualization software
•  Shows SNPs, indels, spliced reads and a signal track

http://genomeview.sourceforge.net/
http://www.broadinstitute.org/igv/

•  Also check out JBrowse for online viewing of BAM files

NB: Browser-driven analyses are necessarily anecdotal and at best semi-quantitative.

Genome Resequencing

•  SNP/indel discovery
  –  Can now multiplex 48 yeast strains on a single lane of the HiSeq 2000 (~70x coverage each)
  –  1 lane of HiSeq = ~10x human genome
•  Structural variant discovery
•  Copy number changes
•  Novel sequence discovery
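The two multiplexing figures above can be sanity-checked with the usual relation coverage = sequenced bases / genome size. The sketch below is a back-of-the-envelope check only: the genome sizes (about 12.1 Mb for S. cerevisiae, about 3.1 Gb for human) are assumptions supplied for this example and do not come from the slides.

```python
# Assumed genome sizes, not taken from the slides.
YEAST_GENOME_BP = 12.1e6
HUMAN_GENOME_BP = 3.1e9


def bases_needed(genome_bp, coverage, samples=1):
    """Total sequenced bases needed to reach a given mean coverage."""
    return genome_bp * coverage * samples


def mean_coverage(total_bases, genome_bp, samples=1):
    """Mean coverage per sample when a lane is split evenly."""
    return total_bases / (genome_bp * samples)


yeast_lane = bases_needed(YEAST_GENOME_BP, coverage=70, samples=48)
human_lane = bases_needed(HUMAN_GENOME_BP, coverage=10)

print(f"48 yeast strains at 70x: {yeast_lane / 1e9:.1f} Gb")
print(f"1 human genome at 10x:   {human_lane / 1e9:.1f} Gb")
```

Under these assumed genome sizes both cases come out at a few tens of gigabases per lane, so the two bullets describe roughly the same lane output.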

SNP and indel calling

•  SNPs and indels are ‘called’ from SAM/BAM files
•  Typically use GATK (Genome Analysis Toolkit), which will identify nucleotides with support for variation
  –  Sounds trivial, but typically lots of false positives
  –  Harder in a diploid than a haploid
  –  Harder with lower coverage
  –  You *must* read the “Best Practice Variant Detection with the GATK” page before using GATK; it changes frequently
  –  Make sure the version of GATK you are using (it is updated frequently) corresponds to the latest “best practices”
  –  Always document what you did – this is best done in the form of a Perl or Python script that glues GATK calls into a pipeline, and that documents its parameters.
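One way to follow the "always document what you did" advice is a small driver script that writes down every command before running it. The sketch below shows only the logging pattern. The file names are placeholders, `run_step` is a hypothetical helper, and the variant-calling step is deliberately a stand-in: the real GATK command line depends on the GATK version and the current Best Practices page, so it is not guessed here.

```python
import json
import shlex
import subprocess
import time

# All file names below are placeholders for illustration.
PARAMS = {
    "reference": "reference.fasta",
    "input_bam": "sample.sorted.bam",
}


def run_step(name, cmd, log, dry_run=False):
    """Record the exact command for one step, then run it unless dry_run."""
    entry = {
        "step": name,
        "command": shlex.join(cmd),
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
    }
    log.append(entry)
    if not dry_run:
        # check=True stops the pipeline at the first failing step.
        subprocess.run(cmd, check=True)
    return entry


log = []
steps = [
    ("index_bam", ["samtools", "index", PARAMS["input_bam"]]),
    # Placeholder: substitute the variant-calling command recommended by the
    # GATK Best Practices for the GATK version you have installed.
    ("call_variants", ["echo", "GATK command goes here"]),
]
for name, cmd in steps:
    run_step(name, cmd, log, dry_run=True)  # set dry_run=False to execute

# The record of parameters and commands is what you keep with the results.
record = {"parameters": PARAMS, "steps": log}
print(json.dumps(record, indent=2))
```

Keeping the parameters in one dictionary and saving the JSON record next to the output files means the exact settings can be recovered later, even after GATK has been updated.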

VCF files

•  VCF stands for Variant Call Format
•  Produced by the GATK, and describes all variant positions derived from BAM files
•  Meta-information lines, preceded by ##, indicate the source of the data used and how the file was generated.
•  Variants are not annotated – to do this, use a tool such as SnpEff or ANNOVAR
  –  Helps you zero in on which variants might be of interest
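As a rough illustration of the layout described above, the Python sketch below reads a tiny VCF held in a string: lines starting with ## are meta-information, the single line starting with # names the columns, and every other line is one variant. The three variant records and the `source` value are invented for this example, and multi-sample genotype columns are left out.

```python
# A made-up three-variant VCF; the records are illustrative only.
VCF_TEXT = (
    "##fileformat=VCFv4.1\n"
    "##source=hypothetical_caller\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chrI\t1042\t.\tA\tG\t812.4\tPASS\tDP=68\n"
    "chrI\t20517\t.\tCT\tC\t95.1\tPASS\tDP=71\n"
    "chrII\t3310\t.\tG\tT\t12.3\tLowQual\tDP=9\n"
)


def parse_vcf(text):
    """Return (meta, records): meta as (key, value) pairs, records as dicts."""
    meta, header, records = [], None, []
    for line in text.splitlines():
        if line.startswith("##"):
            key, _, value = line[2:].partition("=")
            meta.append((key, value))
        elif line.startswith("#"):
            header = line[1:].split("\t")
        elif line:
            records.append(dict(zip(header, line.split("\t"))))
    return meta, records


def variant_type(rec):
    """Call a record a SNP if REF and every ALT allele are one base long."""
    alts = rec["ALT"].split(",")
    is_snp = len(rec["REF"]) == 1 and all(len(a) == 1 for a in alts)
    return "SNP" if is_snp else "indel"


meta, records = parse_vcf(VCF_TEXT)
for rec in records:
    print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"],
          variant_type(rec), rec["FILTER"])
```

Nothing here says what a variant does to a gene; that is the annotation step the slide assigns to tools such as SnpEff or ANNOVAR.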

Copy Number Variation

•  Sequence coverage of a region can allow detection of amplifications or deletions – i.e. duplicated regions will have higher coverage.

•  Power to detect such regions depends on their size, copy number and number of reads.
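The idea that duplicated regions show higher coverage can be sketched as a per-window depth ratio. The Python example below is a simplified illustration, not the method of any particular CNV caller: the window depths are invented, the median window is assumed to represent single-copy sequence, and the log2 threshold of 0.5 is an arbitrary choice for this sketch.

```python
import math
from statistics import median


def copy_number_calls(window_depths, threshold=0.5):
    """Label each window as gain, loss or neutral from its depth ratio.

    The median depth is assumed to represent one copy (e.g. a haploid
    yeast strain); threshold is a hypothetical log2-ratio cutoff.
    """
    baseline = median(window_depths)
    calls = []
    for index, depth in enumerate(window_depths):
        ratio = depth / baseline
        log2_ratio = math.log2(ratio) if ratio > 0 else float("-inf")
        if log2_ratio >= threshold:
            state = "gain"
        elif log2_ratio <= -threshold:
            state = "loss"
        else:
            state = "neutral"
        calls.append((index, round(ratio), state))
    return calls


# Invented mean read depths for nine consecutive windows at ~70x coverage.
depths = [70, 68, 72, 71, 210, 215, 69, 0, 70]
for index, copies, state in copy_number_calls(depths):
    print(index, depths[index], copies, state)
```

Windows 4 and 5 have about three times the baseline depth and are reported as gains with an estimated three copies, while window 7 has no reads and is reported as a loss. With fewer reads or smaller windows the depth estimates become noisier, which is why detection power depends on region size, copy number and read count.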

CNV Detection Power

Chiang et al, 2009. Nature Methods 6, 99-103.

Coverage Shows Amplification

[Figure: coverage plot with genes HXT7 and HXT6 labeled]

Structural Variation

•  Using paired-end read data, it’s possible to identify structural variants:

Adapted from Korbel et al., 2007.
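The figure from the slide is not reproduced in this text. As a stand-in, the Python sketch below encodes the commonly described paired-end signatures: mates that map too far apart suggest a deletion in the sample, mates that map too close together suggest an insertion, and mates on the same strand suggest an inversion. These rules, the function name `classify_pair`, and the three-standard-deviation cutoff are assumptions of this sketch rather than content taken from the slide, and pairs mapping to different chromosomes are ignored.

```python
def classify_pair(left_pos, left_strand, right_pos, right_strand,
                  read_len, mean_span, sd_span, n_sd=3):
    """Classify one read pair by mate orientation and mapped span.

    Assumes a standard paired-end library in which the left mate maps to
    the + strand and the right mate to the - strand. Positions are the
    leftmost mapped coordinates of each mate on the same chromosome.
    """
    if left_strand == right_strand:
        return "inversion"
    if left_strand == "-" and right_strand == "+":
        return "everted (possible tandem duplication)"
    span = right_pos + read_len - left_pos
    if span > mean_span + n_sd * sd_span:
        return "deletion"
    if span < mean_span - n_sd * sd_span:
        return "insertion"
    return "concordant"


# Invented library: 100 bp reads, fragments of 300 +/- 30 bp.
library = dict(read_len=100, mean_span=300, sd_span=30)
print(classify_pair(1000, "+", 1200, "-", **library))
print(classify_pair(1000, "+", 2200, "-", **library))
print(classify_pair(1000, "+", 1050, "-", **library))
print(classify_pair(1000, "+", 1200, "+", **library))
```

A single discordant pair is weak evidence; tools of this kind generally look for clusters of pairs that support the same event.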

Structural Variant Identification

•  There are several tools, e.g.:
  –  BreakDancer (used by WashU Sequencing Center)
  –  VariationHunter
  –  MoDIL
  –  inGAP-SV, which has a nice visualization tool (http://ingap.sourceforge.net/)
  –  PEMer (developed by the Gerstein lab)
•  No real robust comparison between these exists on a standard dataset
•  Ask around, find out what others are using, get the latest versions and experiment

Applications of Next-Gen Sequencing

•  Genome
  –  de novo sequencing: assembly, annotation
  –  resequencing: mapping, detection of variants
  –  Applications: genome variability, metagenomics, genome modifications, detection of mutations, association studies, phylogeny, evolution
•  Chromatin
  –  Hi-C: mapping, 3D reconstruction
  –  ChIP-Seq: mapping, detection of binding sites
  –  ATAC-Seq: mapping, identification of open chromatin
  –  Applications: nucleosome positioning, 3D genome architecture, active promoters, interactions between nucleic acids and proteins, chromatin modifications
•  Transcriptome
  –  RNA-Seq: mapping, transcript detection and quantification
  –  Applications: transcript identity, transcript abundance, RNA editing, SNPs, allele-specific expression, regulation

RNA-seq

•  RNA-Seq is similar to performing expression profiling on microarrays

•  Instead of detecting transcript abundance by hybridization signal, we count fragments of transcripts

•  Several advantages over microarrays
  –  No prior knowledge needed of which parts of the genome are expressed
  –  Allows splice site discovery
  –  5’ and 3’ UTR mapping
  –  Novel transcript discovery
  –  View RNA modifications (editing, other enzymatic changes)
  –  With longer reads, can “phase” splice sites
  –  Possibly discover many novel isoforms
  –  More sensitive

Dynamic Range

How do we sequence mRNA?

•  Total RNA
•  DNase treatment
•  Oligo-dT beads
•  PolyA purified RNA

First Strand cDNA synthesis

•  mRNA: 5’ cap structure at one end, 3’ poly A tail, AAA(A)n, at the other
•  Anneal the oligo (dT)12-18 primer to the poly A tail
•  Add dNTPs and reverse transcriptase
•  Product: cDNA:mRNA hybrid

Second Strand Synthesis

•  Start from the cDNA:mRNA hybrid
•  Add dNTPs, RNase H and E. coli polymerase I
•  Remnants of mRNA serve as primers for synthesis of the second strand of cDNA
•  Bacteriophage T4 DNA ligase
•  Product: double-stranded cDNA
•  Library construction, similar to genomic DNA, using forked adapters

Shatter RNA, Prime with Random Hexamers

•  mRNA: 5’ cap structure at one end, 3’ poly A tail, AAA(A)n, at the other
•  Fragment RNA
•  Prime 1st strand synthesis with random hexamers
•  Product: double-stranded cDNA
•  Library construction, using forked adapters

Random Hexamer Induced Sequence Errors

van Gurp TP, McIntyre LM, Verhoeven KJ. (2013). Consistent errors in first strand cDNA due to random hexamer mispriming. PLoS One 8(12):e85583.

Retaining Strand Specificity Through RNA ligation

•  PolyA purified RNA
•  Fragment with Ambion Fragmentation Buffer to give fragmented RNA
•  Polynucleotide kinase in T4 RNA ligase buffer (which has ATP). Should be fresh.
•  RNA fragments with 5’ P and 3’ OH
•  Remove ATP

Adapter Ligation

•  Ligate adenylated 3’ adaptor: T4 RNA ligase I (no ATP)
•  Size select fragments between 125 and 200 bp on a polyacrylamide gel
•  Remove unligated adaptor
•  Ligate 5’ adaptor: T4 RNA ligase

Creation of Strand Specific Library

•  Reverse transcription (Superscript II)
•  RNA hydrolysis
•  PCR
•  Sequence

Using dUTP to create strand specificity

•  mRNA
•  Fragment
•  1st strand synthesis with random hexamers and normal dNTPs
•  2nd strand synthesis with dTTP -> dUTP
•  Forked adapter ligation (Ad #1, Ad #2)

Creating Strand Specificity

•  UNG treatment
•  Pre-amplification and sequencing

Mapping of Reads

•  Map reads to both the genome, and the predicted spliced genome.
•  Un-mappable reads may span unknown exon-exon junctions from novel transcripts or exons.
  –  With short reads, a challenge to map these
  –  TopHat (works in conjunction with Bowtie) can find these quite efficiently.
•  Need to be able to accommodate mismatches.

Exonic Read Density

•  To measure abundance when sequencing entire transcripts, you must normalize the data for the transcript length.
•  Exonic Read Density = Reads per kb of gene exon per million mapped reads (RPKM)
  –  Developed by the Wold lab, but makes intuitive sense.
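The definition above translates directly into a one-line calculation. The Python sketch below applies it to three invented genes in a library of 10 million mapped reads; the gene names, read counts and lengths are made up to show why the length normalization matters.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    kilobases = exon_length_bp / 1e3
    millions = total_mapped_reads / 1e6
    return read_count / kilobases / millions


TOTAL_MAPPED = 10_000_000  # invented library size

# (gene, reads mapped to its exons, total exon length in bp); all invented
genes = [
    ("geneA", 10, 1000),
    ("geneB", 30, 3000),
    ("geneC", 30, 1000),
]
for name, reads, length in genes:
    print(name, reads, "reads", rpkm(reads, length, TOTAL_MAPPED), "RPKM")
```

geneB collects three times as many reads as geneA only because it is three times longer, so both come out at 1 RPKM. geneC has the same read count as geneB on a third of the length and comes out at 3 RPKM.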

Why Exonic Read Density?

[Figure: three panels showing what was present in RNA, what was sequenced, and what we observe in mapped reads; example transcripts labeled 1 rpkm, 3 rpkm, 1 rpkm]

Experimental Considerations

•  Paired-end vs. single end
  –  Transcriptome complexity will dictate your choice
•  Strand specific vs. strand agnostic
  –  There’s no real reason to not use strand specific library protocols
•  Total RNA vs. polyA+ purified vs. ribosomal depleted vs. other RNA subpopulations.

Analysis Considerations

•  Read Mapping
  –  Unspliced Aligners
  –  Spliced Aligners
•  Transcriptome Reconstruction
  –  Genome guided
  –  Genome independent reconstruction
•  Expression quantification
  –  Gene quantification
  –  Isoform quantification
•  Differential Expression

Unspliced Aligners

•  Limited to identifying known exons and junctions
•  Requires a good reference transcriptome
•  BWT (e.g. bowtie, bwa) based aligners are much faster
•  Seed and extend aligners (e.g. SHRiMP, Stampy) are more sensitive

Spliced Aligners

Align reads to the whole genome, including intron-spanning reads that require large gaps
•  Exon first (MapSplice, SpliceMap, TopHat)
  –  Two step process
    •  Use unspliced alignment
    •  Take unmapped reads, split, and look for possible spliced connections
  –  Typically faster
•  Seed-extend (GSNAP, QPALMA)
  –  Break reads into short seeds and place on genome, then examine with more sensitive methods
  –  Find more splice junctions, though not *yet* clear if they tend to be false positives

Garber et al, 2011, Nature Methods
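The two-step exon-first idea can be shown on the toy reference from the SAM example in these slides. The Python sketch below first tries an exact unspliced match, and only if that fails splits the read at each possible point and looks for the two pieces separated by a gap. It is a teaching sketch under strong simplifications (exact matching, no mismatches, no splice-site motifs, no index), and the function names, `min_anchor` and `max_intron` are hypothetical; it is not how TopHat or any other aligner is implemented.

```python
# Reference from the SAM example, with the two padding '*' columns removed.
GENOME = "AGCATGTTAGATAAGATAGCTGTGCTAGTAGGCAGTCAGCGCCAT"


def find_all(genome, fragment):
    """Yield every 0-based start position of fragment in genome."""
    start = genome.find(fragment)
    while start != -1:
        yield start
        start = genome.find(fragment, start + 1)


def spliced_alignments(genome, read, min_anchor=4, max_intron=10000):
    """Split the read at each point and look for pieces flanking a gap."""
    hits = []
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        for left_start in find_all(genome, left):
            left_end = left_start + len(left)
            for right_start in find_all(genome, right):
                intron = right_start - left_end
                if 0 < intron <= max_intron:
                    cigar = f"{len(left)}M{intron}N{len(right)}M"
                    hits.append((left_start + 1, cigar))  # 1-based POS
    return hits


def align(genome, read):
    """Step 1: unspliced exact match. Step 2: spliced search if unmapped."""
    unspliced = [(s + 1, f"{len(read)}M") for s in find_all(genome, read)]
    return unspliced if unspliced else spliced_alignments(genome, read)


print(align(GENOME, "CAGCGCCAT"))    # read r001- maps without a gap
print(align(GENOME, "ATAGCTTCAGC"))  # read r004 needs a spliced alignment
```

On this toy reference the sketch recovers position 37 with CIGAR 9M for the first read and position 16 with CIGAR 6M14N5M for the second, matching the POS and CIGAR fields of the corresponding SAM records.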

Transcriptome Reconstruction

•  Challenging because
  –  Transcript abundance spans several orders of magnitude
  –  Reads will originate from mature mRNA, as well as incompletely spliced precursor RNA
  –  Reads are short, and genes can have many isoforms, making it challenging to determine which isoform produced which read

Two Approaches

•  Genome Guided
  –  Relies on reference genome
  –  Uses spliced reads to reconstruct the transcriptome
  –  E.g. Cufflinks (identifies minimal set of isoforms), Scripture (identifies maximal set of isoforms)
•  Genome Independent Approach
  –  Tries to de novo assemble transcripts
  –  Trans-ABySS, Velvet
  –  Sensitive to sequencing errors
  –  Usually requires more computational resources

Two isoforms of the same gene:

Determining Differential Expression

•  A number of packages available
  –  Cuffdiff, DESeq, edgeR etc.
•  Require replicates for each condition, so that within-condition variance can be compared with between-condition variance
•  Differential expression is easier to detect for more abundant transcripts
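To see why replicates matter, the Python sketch below compares the difference between two conditions with the spread within each condition, using a plain Welch t statistic on invented log2 expression values. This is only an illustration of the within versus between comparison. It is not the model used by Cuffdiff, DESeq or edgeR (DESeq and edgeR, for example, model read counts with a negative binomial distribution), and the gene names and values are made up.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch t statistic: difference in means over its standard error."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Invented log2 expression values, three replicates per condition.
genes = {
    "tight_gene": ([5.0, 5.2, 4.9], [7.1, 6.8, 7.0]),
    "noisy_gene": ([5.0, 7.5, 3.1], [7.1, 4.0, 8.9]),
}
t_values = {}
for name, (control, treated) in genes.items():
    t_values[name] = welch_t(control, treated)
    print(name,
          "difference in means", round(mean(treated) - mean(control), 2),
          "t", round(t_values[name], 2))
```

Both genes shift by well over one log2 unit between conditions, but only the gene whose replicates agree with each other gives a large statistic. Without replicates the within-condition spread cannot be estimated at all.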

3’ SEQ

•  Sometimes you have samples with low quality, fragmented RNA

•  Alternatively, you may not want to sequence complete transcripts, but instead, simply quantitate them

•  3’ SEQ attempts to reduce read coverage across transcripts, while still quantifying them


Better Assaying Isoforms

•  To better understand a biological system, we really want to understand all transcripts
  –  Alternative splicing first seen in viruses in the 1970s
•  Splicing generates complexity
  –  Humans have only ~2X more genes than Drosophila
  –  More than “one gene, one protein”
  –  >38,000 Dscam isoforms!
  –  Alternative splicing can be altered in disease
•  With relatively short reads, even with paired-end sequencing, it’s not clear which exons end up with which other exons in mature isoforms
•  Long-Read RNA-Seq results in better isoform determination.
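The Dscam figure above is a product of independent choices. The slide does not give the breakdown; the numbers in the sketch below (12, 48, 33 and 2 mutually exclusive alternatives for exons 4, 6, 9 and 17 of Drosophila Dscam) are the commonly cited values and are supplied here as an assumption.

```python
import math

# Commonly cited counts of mutually exclusive alternative exons in
# Drosophila Dscam; these values are not taken from the slides.
ALTERNATIVES = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}

# One alternative is chosen independently at each cluster, so the number
# of possible isoforms is the product of the cluster sizes.
total_isoforms = math.prod(ALTERNATIVES.values())
print(total_isoforms)
```

The same multiplication explains why short reads struggle here: a read or read pair usually covers only one or two of the variable clusters, so it cannot say which combination of all four was present in a single transcript.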

Long-Read RNA-Seq

Sharon D, Tilgner H, Grubert F, Snyder M. (2013). A single-molecule long-read survey of the human transcriptome. Nat Biotechnol 31(11):1009-14

TIF-Seq

•  Transcript Isoform Sequencing
•  Does not capture exonic structure
•  Instead captures 5’ and 3’ ends of transcripts
•  From only ~6000 genes in yeast, almost 2 million unique transcript isoforms identified
•  371,087 major TIFs identified genome-wide

Pelechano V, Wei W, Steinmetz LM. (2013). Extensive transcriptional heterogeneity revealed by isoform profiling. Nature 497(7447):127-31.


Recommended Reading

•  Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16):2078-9.

•  Thorvaldsdóttir, H., Robinson, J.T. and Mesirov, J.P. (2013). Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 14(2):178-92.

•  Abeel, T., Van Parys, T., Saeys, Y., Galagan, J. and Van de Peer, Y. (2012). GenomeView: a next-generation genome browser. Nucleic Acids Res. 40(2):e12.

•  Westesson, O., Skinner, M., Holmes, I. (2013). Visualizing next-generation sequencing data with JBrowse. Brief Bioinform. 14(2):172-7.

•  Van der Auwera, G.A., Carneiro, M., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., Jordan, T., Shakir, K., Roazen, D., Thibault, J., Banks, E., Garimella, K., Altshuler, D., Gabriel, S. and DePristo, M. (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Current Protocols in Bioinformatics 43:11.10.1-11.10.33.

•  Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., McVean, G., Durbin, R. and the 1000 Genomes Project Analysis Group (2011). The variant call format and VCFtools. Bioinformatics 27(15):2156-8.

•  Parkhomchuk, D., Borodina, T., Amstislavskiy, V., Banaru, M., Hallen, L., Krobitsch, S., Lehrach, H., and Soldatov, A. (2009). Transcriptome analysis by strand-specific sequencing of complementary DNA. Nucleic Acids Res. 37(18):e123.

•  Borodina, T., Adjaye, J., Sultan, M. (2011). A strand-specific library preparation protocol for RNA sequencing. Methods Enzymol. 500:79-98.

•  van Gurp, T.P., McIntyre, L.M., Verhoeven, K.J. (2013). Consistent errors in first strand cDNA due to random hexamer mispriming. PLoS One 8(12):e85583.

•  Beck, A.H., Weng, Z., Witten, D.M., Zhu, S., Foley, J.W., Lacroute, P., Smith, C.L., Tibshirani, R., van de Rijn, M., Sidow, A. and West, R.B. (2010). 3'-end sequencing for expression quantification (3SEQ) from archival tumor samples. PLoS One 5(1):e8768.

•  Sharon, D., Tilgner, H., Grubert, F. and Snyder, M. (2013). A single-molecule long-read survey of the human transcriptome. Nat Biotechnol. 31(11):1009-14.

•  Pelechano, V., Wei, W. and Steinmetz, L.M. (2013). Extensive transcriptional heterogeneity revealed by isoform profiling. Nature 497(7447):127-31.

•  Trapnell, C., Pachter, L. and Salzberg, S.L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9):1105-11.

•  Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M.J., Salzberg, S.L., Wold, B.J. and Pachter, L. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 28(5):511-5. (Cufflinks)
