Genetics 211 - 2014 Lecture 3
Genome Resequencing and Functional Genomics
Gavin Sherlock
January 21st 2014
SAM files
• SAM stands for Sequence Alignment/Map format
• A concise file format that describes how sequence reads map to a reference genome
• Can be further compressed into BAM format, which is a binary version of SAM
• Can also be sorted and indexed to provide fast random access, using SAMtools (more on this in a minute)
• Requires ~1 byte per input base to store sequences, qualities and meta information
• Supports paired-end reads and color space
• Is produced by aligners such as bowtie and bwa
• SAM can be converted to pileup format
SAM format
No.  Name   Description
1    QNAME  Query NAME of the read or read pair
2    FLAG   Bitwise FLAG
3    RNAME  Reference sequence NAME
4    POS    1-based leftmost POSition of clipped alignment
5    MAPQ   MAPping Quality (Phred-scaled)
6    CIGAR  Extended CIGAR string (operations: MIDNSHP)
7    MRNM   Mate Reference NaMe (‘=’ if same as RNAME)
8    MPOS   1-based leftmost Mate POSition
9    ISIZE  Inferred Insert SIZE
10   SEQ    Query SEQuence on the same strand as the reference
11   QUAL   Query QUALity (ASCII-33 = Phred base quality)
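As a small illustration of the QUAL field described above (ASCII-33 = Phred base quality), the following sketch converts a quality string to Phred scores and a Phred score to an error probability. The function names and the example quality string are made up for illustration; only the ASCII-33 offset comes from the table.

```python
def phred_from_ascii(qual, offset=33):
    """Convert a SAM QUAL string to Phred scores (ASCII code minus offset)."""
    return [ord(ch) - offset for ch in qual]


def error_probability(phred):
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# Hypothetical quality string, not taken from the lecture.
print(phred_from_ascii("II5!"))   # [40, 40, 20, 0]
print(error_probability(30))      # about 0.001, i.e. 1 error in 1000
```

The same Phred scaling applies to MAPQ, so a mapping quality of 30 corresponds to roughly a 1 in 1000 chance that the read is placed incorrectly.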
SAM Format
coor  12345678901234  5678901234567890123456789012345
ref   AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
r001+       TTAGATAAAGGATA*CTG
r004+                  ATAGCT..............TCAGC
r001-                                       CAGCGCCAT

@SQ SN:ref LN:45
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r001 83 ref 37 30 9M * 0 0 CAGCGCCAT *
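To make the FLAG and CIGAR columns of the example records more concrete, here is a minimal sketch that decodes the bitwise FLAG and works out how many reference and read bases a CIGAR string covers. The bit meanings follow the SAM specification; the function and label names are illustrative choices, not part of SAMtools.

```python
import re

# FLAG bits as defined in the SAM specification (only the first eight shown).
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP])")


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def cigar_lengths(cigar):
    """Return (reference bases spanned, read bases consumed) for a CIGAR string."""
    ref_span = read_len = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "MDN":      # match, deletion, skipped region: advance on reference
            ref_span += n
        if op in "MIS":      # match, insertion, soft clip: advance on read
            read_len += n
    return ref_span, read_len


print(decode_flag(163))             # paired, proper_pair, mate_reverse_strand, second_in_pair
print(decode_flag(83))              # paired, proper_pair, reverse_strand, first_in_pair
print(cigar_lengths("8M2I4M1D3M"))  # (16, 17): the 17-base read r001 spans 16 reference bases
print(cigar_lengths("6M14N5M"))     # (25, 11): the 11-base read r004 spans a 14-base gap
```

Note that the read lengths returned for the two CIGAR strings agree with the lengths of the SEQ fields in the example records.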
Alignment Visualization
• SAMtools has its own terminal-based text visualization tool, called tview
• Very simple, but fast, and useful for looking at SNPs and indels
Alignment Visualization
• For more complex viewing needs, use either GenomeView or IGV
• Can display BAM files, plus annotation tracks
• Java-based dynamic visualization software
• Shows SNPs, indels, spliced reads and a signal track
• http://genomeview.sourceforge.net/
• http://www.broadinstitute.org/igv/
• Also check out JBrowse for online viewing of BAM files
NB: Browser-driven analyses are necessarily anecdotal and at best semi-quantitative.
Genome Resequencing
• SNP/indel discovery
  – Can now multiplex 48 yeast strains on a single lane of the HiSeq 2000 (~70x coverage each)
  – 1 lane of HiSeq = ~10x human genome
• Structural variant discovery
• Copy number changes
• Novel sequence discovery
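The coverage figures quoted above can be sanity-checked with simple arithmetic: coverage is total sequenced bases divided by genome size. The sketch below assumes approximate genome sizes of 12.1 Mb for yeast and 3.1 Gb for human; those sizes and the function name are assumptions added here, not numbers from the slide.

```python
def bases_needed(coverage, genome_size, n_samples=1):
    """Total sequenced bases needed to reach a given mean coverage per sample."""
    return coverage * genome_size * n_samples


YEAST_GENOME = 12.1e6   # assumed approximate size in bp
HUMAN_GENOME = 3.1e9    # assumed approximate size in bp

yeast_lane = bases_needed(70, YEAST_GENOME, n_samples=48)
human_lane = bases_needed(10, HUMAN_GENOME)

print(f"48 yeast strains at 70x: {yeast_lane / 1e9:.1f} Gb")
print(f"1 human genome at 10x:   {human_lane / 1e9:.1f} Gb")
```

Under these assumptions both scenarios need a few tens of gigabases, which is consistent with the two being quoted as roughly one lane each.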
SNP and indel calling
• SNPs and indels are ‘called’ from SAM/BAM files
• Typically use GATK (Genome Analysis Toolkit), which will identify nucleotides with support for variation
  – Sounds trivial, but typically lots of false positives
  – Harder in a diploid than a haploid
  – Harder with lower coverage
  – You *must* read the “Best Practice Variant Detection with the GATK” page before using GATK; it changes frequently
  – Make sure the version of GATK you are using (it is updated frequently) corresponds to the latest “best practices”
  – Always document what you did; this is best done in the form of a Perl or Python script that glues GATK calls into a pipeline, and that documents its parameters
VCF files
• VCF stands for Variant Call Format
• Produced by the GATK, and describes all variant positions derived from BAM files
• Meta-information lines, preceded by ##, indicate the source of data used, and how the file was generated
• Variants are not annotated; to do this, use a tool such as SnpEff or ANNOVAR
  – Helps you zero in on which variants might be of interest
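To illustrate the layout just described (## meta-information lines, then a single # header line, then one tab-separated line per variant), here is a minimal parsing sketch. The example lines, including the caller name, are invented for illustration; real files should be read with a dedicated VCF library.

```python
def split_vcf(lines):
    """Separate VCF text into meta-information, column names and variant records."""
    meta, header, records = [], None, []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line[2:])                 # e.g. "fileformat=VCFv4.1"
        elif line.startswith("#"):
            header = line[1:].split("\t")         # column names from the #CHROM line
        else:
            if header is None:
                raise ValueError("variant record found before the #CHROM header line")
            records.append(dict(zip(header, line.split("\t"))))
    return meta, header, records


# Hypothetical example content, not output from a real GATK run.
example = [
    "##fileformat=VCFv4.1",
    "##source=hypothetical_caller",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chrI\t1234\t.\tA\tG\t50\tPASS\tDP=70",
]
meta, header, records = split_vcf(example)
print(meta)
print(records[0]["POS"], records[0]["REF"], records[0]["ALT"])
```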
Copy Number Variation
• Sequence coverage of a region can allow detection of amplifications or deletions
  – i.e. duplicated regions will have higher coverage
• Power to detect such regions depends on their size, copy number and number of reads.
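The idea of reading copy number from coverage can be sketched as follows: count reads in fixed windows, compare each window to the genome-wide median, and flag windows whose log2 ratio is far from zero. This is only a toy illustration; the threshold, pseudocount and window counts are hypothetical, and it is not the method of Chiang et al.

```python
import math
from statistics import median


def log2_ratios(window_counts, pseudocount=0.5):
    """log2 of each window's read count relative to the median window."""
    med = median(window_counts)
    return [math.log2((c + pseudocount) / (med + pseudocount)) for c in window_counts]


def call_windows(window_counts, threshold=0.58):
    """Label each window as gain, loss or neutral (threshold is an arbitrary choice)."""
    calls = []
    for ratio in log2_ratios(window_counts):
        if ratio > threshold:
            calls.append("gain")
        elif ratio < -threshold:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls


# Hypothetical read counts per window: two duplicated windows, one with a deletion.
counts = [100, 100, 100, 200, 200, 100, 50, 100]
print(call_windows(counts))
```

With fewer reads per window the counts become noisier, which is one way to see why detection power depends on region size, copy number and read number.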
CNV Detection Power
Chiang et al, 2009. Nature Methods 6, 99-103.
Coverage Shows Amplification
[Figure: coverage plot; labeled genes: HXT7, HXT6]
Structural Variation
• Using paired-end read data, it’s possible to identify structural variants:
Adapted from Korbel et al., 2007.
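The general logic of paired-end structural variant detection can be sketched in a few lines: pairs that map farther apart than the library's insert size suggest a deletion in the sample, pairs that map closer together suggest an insertion, and pairs with unexpected relative orientation suggest an inversion. The function below is a simplified illustration with hypothetical parameter names and cutoffs; real tools cluster many supporting pairs before making a call.

```python
def classify_pair(mapped_span, orientation_ok, expected_mean, expected_sd, n_sd=3.0):
    """Classify one read pair by its mapped span and orientation on the reference."""
    if not orientation_ok:
        return "inversion candidate"
    if mapped_span > expected_mean + n_sd * expected_sd:
        return "deletion candidate"      # sample lacks sequence present in the reference
    if mapped_span < expected_mean - n_sd * expected_sd:
        return "insertion candidate"     # sample has extra sequence between the reads
    return "concordant"


# Hypothetical library: 300 bp mean insert size, 30 bp standard deviation.
print(classify_pair(310, True, 300, 30))
print(classify_pair(1500, True, 300, 30))
print(classify_pair(120, True, 300, 30))
print(classify_pair(300, False, 300, 30))
```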
Structural Variant Identification
• There are several tools, e.g.:
  – BreakDancer (used by WashU Sequencing Center)
  – VariationHunter
  – MoDIL
  – inGAP-SV, which has a nice visualization tool (http://ingap.sourceforge.net/)
  – PEMer (developed by the Gerstein lab)
• No real robust comparison between these exists on a standard dataset
• Ask around, find out what others are using, get the latest versions and experiment
Applications of Next-Gen Sequencing
[Figure: applications grouped by target: genome, chromatin, transcriptome]
• Genome
  – de novo sequencing: assembly, annotation
  – resequencing: mapping, detection of variants
  – Addresses: genome variability, metagenomics, genome modifications, detection of mutations, association studies, phylogeny, evolution
• Chromatin
  – Hi-C: mapping, 3D reconstruction
  – ChIP-Seq: mapping, detection of binding sites
  – ATAC-Seq: mapping, identification of open chromatin
  – Addresses: nucleosome positioning, 3D genome architecture, active promoters, interactions between nucleic acids and proteins, chromatin modifications
• Transcriptome
  – RNA-Seq: mapping, transcript detection and quantification
  – Addresses: transcript identity, transcript abundance, RNA editing, SNPs, allele specific expression, regulation
RNA-seq
• RNA-Seq is similar to performing expression profiling on microarrays
• Instead of detecting transcript abundance by hybridization signal, we count fragments of transcripts
• Several advantages over microarrays
  – No prior knowledge needed of which parts of the genome are expressed
  – Allows splice site discovery
  – 5’ and 3’ UTR mapping
  – Novel transcript discovery
  – View RNA modifications (editing, other enzymatic changes)
  – With longer reads, can “phase” splice sites
  – Possibly discover many novel isoforms
  – More sensitive
Dynamic Range
How do we sequence mRNA?
• Total RNA
• DNase treatment
• Oligo-dT beads
• PolyA purified RNA
First Strand cDNA synthesis
• mRNA: 5’ cap structure and 3’ poly A tail, AAA(A)n
• Anneal oligo (dT)12-18 primer
• Add dNTPs and reverse transcriptase
• Product: cDNA:mRNA hybrid
Second Strand Synthesis
• Add dNTPs, RNase H and E. coli polymerase I
• Remnants of mRNA serve as primers for synthesis of second strand of cDNA
• Treat with bacteriophage T4 DNA ligase
• Product: double stranded cDNA
• Library construction, similar to genomic DNA, using forked adapters
Shatter RNA, Prime with Random Hexamers
• mRNA: 5’ cap structure and 3’ poly A tail, AAA(A)n
• Fragment RNA
• Prime 1st strand synthesis with random hexamers
• Product: double stranded cDNA
• Library construction, using forked adapters
Random Hexamer Induced Sequence Errors
van Gurp TP, McIntyre LM, Verhoeven KJ. (2013). Consistent errors in first strand cDNA due to random hexamer mispriming. PLoS One 8(12):e85583.
Retaining Strand Specificity Through RNA ligation
• PolyA purified RNA
• Fragment with Ambion Fragmentation Buffer to give fragmented RNA
• Polynucleotide kinase in T4 RNA ligase buffer (which has ATP). Should be fresh.
• Result: RNA fragments with 5’P and 3’OH
• Remove ATP
Adapter Ligation
• Ligate adenylated 3’ adaptor with T4 RNA ligase I (no ATP)
• Size select fragments between 125 and 200bp on a polyacrylamide gel
• Remove unligated adaptor
• Ligate 5’ adaptor with T4 RNA ligase
Creation of Strand Specific Library
• Reverse Transcription (Superscript II)
• RNA hydrolysis
• PCR
• Sequence
Using dUTP to create strand specificity
• mRNA
• Fragment
• 1st strand synthesis with random hexamers and normal dNTPs
• 2nd strand synthesis with dUTP in place of dTTP
• Forked adapter ligation
Creating Strand Specificity
• UNG treatment
• Adapters: Ad #1, Ad #2
• Pre-amplification and sequencing
Mapping of Reads
• Map reads to both the genome and the predicted spliced genome.
• Un-mappable reads may span unknown exon-exon junctions from novel transcripts or exons.
  – With short reads, mapping these is a challenge
  – TopHat (works in conjunction with Bowtie) can find these quite efficiently
• Need to be able to accommodate mismatches.
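One way to picture a "predicted spliced genome" is as a set of exon-exon junction sequences built from annotated exons, against which reads can be mapped with an unspliced aligner. The sketch below is an illustrative assumption about how such junctions could be built, reusing the 45-base reference from the SAM example earlier in the lecture; the function name, the 0-based half-open exon coordinates and the flank length are hypothetical.

```python
from itertools import combinations


def junction_sequences(genome, exons, flank):
    """Join the end of each exon to the start of every downstream exon.

    exons are (start, end) pairs, 0-based and half-open; flank is the maximum
    number of bases kept on each side (typically read length minus one).
    """
    junctions = {}
    for (s1, e1), (s2, e2) in combinations(sorted(exons), 2):
        left = genome[max(s1, e1 - flank):e1]
        right = genome[s2:min(e2, s2 + flank)]
        junctions[(e1, s2)] = left + right
    return junctions


REF = "AGCATGTTAGATAAGATAGCTGTGCTAGTAGGCAGTCAGCGCCAT"
exons = [(15, 21), (35, 40)]          # hypothetical exons matching read r004
junctions = junction_sequences(REF, exons, flank=10)
print(junctions)                      # {(21, 35): 'ATAGCTTCAGC'}
```

The single junction sequence equals the read r004 from the SAM example, which would therefore map without gaps to this junction even though it needs a 14-base skip (6M14N5M) on the genome.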
Exonic Read Density
• To measure abundance when sequencing entire transcripts, you must normalize the data for the transcript length.
• Exonic Read Density = Reads per kb gene exon per million mapped reads
  – Developed by the Wold lab, but makes intuitive sense.
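The definition above translates directly into a formula: divide the read count by the exon length in kb and by the number of mapped reads in millions. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    kilobases = exon_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / kilobases / millions


# Made-up example: 500 reads on a 2 kb gene, 10 million mapped reads in total.
print(rpkm(500, 2_000, 10_000_000))   # 25.0

# Two equally abundant transcripts of different length yield different raw
# read counts but the same RPKM once length is taken into account.
print(rpkm(100, 1_000, 10_000_000))   # 10.0
print(rpkm(300, 3_000, 10_000_000))   # 10.0
```

The second pair of calls shows why length normalization is needed: the 3 kb transcript produces three times as many fragments as the 1 kb transcript at the same abundance.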
Why Exonic Read Density?
[Figure panels: What was present in RNA; What was sequenced; What we observe in mapped reads. Labels: 1 rpkm, 3 rpkm, 1 rpkm]
Experimental Considerations
• Paired-end vs. single end
  – Transcriptome complexity will dictate your choice
• Strand specific vs. strand agnostic
  – There’s no real reason not to use strand specific library protocols
• Total RNA vs. polyA+ purified vs. ribosomal depleted vs. other RNA subpopulations
Analysis Considerations
• Read Mapping
  – Unspliced Aligners
  – Spliced Aligners
• Transcriptome Reconstruction
  – Genome guided
  – Genome independent reconstruction
• Expression quantification
  – Gene quantification
  – Isoform quantification
• Differential Expression
Unspliced Aligners
• Limited to identifying known exons and junctions
• Require a good reference transcriptome
• BWT-based aligners (e.g. bowtie, bwa) are much faster
• Seed-and-extend aligners (e.g. SHRiMP, Stampy) are more sensitive
Spliced Aligners
Align to the whole genome, allowing the large gaps needed for intron-spanning reads
• Exon first (MapSplice, SpliceMap, TopHat)
  – Two step process
    • Use unspliced alignment
    • Take unmapped reads, split, and look for possible spliced connections
  – Typically faster
• Seed-extend (GSNAP, QPALMA)
  – Break reads into short seeds and place on genome, then examine with more sensitive methods
  – Find more splice junctions, though not *yet* clear if they tend to be false positives
Garber et al, 2011, Nature Methods
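The second step of the exon-first strategy (take unmapped reads, split them, and look for spliced connections) can be illustrated with a toy search: try each split point, place the left part by exact match, then look for the right part a short distance downstream. This is a deliberately naive sketch using exact string matching, with hypothetical parameter names; it is not how TopHat or the other named aligners are implemented.

```python
def find_spliced_alignment(read, genome, min_anchor=4, max_intron=50):
    """Return (left_start, left_length, intron_length), 0-based, or None."""
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_start = genome.find(left)
        while left_start != -1:
            left_end = left_start + split
            # The right part must start at least 1 and at most max_intron bases later.
            search_end = min(len(genome), left_end + max_intron + len(right))
            right_start = genome.find(right, left_end + 1, search_end)
            if right_start != -1:
                return left_start, split, right_start - left_end
            left_start = genome.find(left, left_start + 1)
    return None


REF = "AGCATGTTAGATAAGATAGCTGTGCTAGTAGGCAGTCAGCGCCAT"   # reference from the SAM example
hit = find_spliced_alignment("ATAGCTTCAGC", REF)          # read r004 from the SAM example
print(hit)   # (15, 6, 14): starts at 1-based position 16, 6 bases, then a 14-base gap
```

The result corresponds to the record for r004 in the SAM example: POS 16 with CIGAR 6M14N5M.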
Transcriptome Reconstruction
• Challenging because
  – Transcript abundance spans several orders of magnitude
  – Reads will originate from mature mRNA, as well as incompletely spliced precursor RNA
  – Reads are short, and genes can have many isoforms, making it challenging to determine which isoform produced which read
Two Approaches
• Genome Guided
  – Relies on reference genome
  – Uses spliced reads to reconstruct the transcriptome
  – E.g. Cufflinks (identifies minimal set of isoforms), Scripture (identifies maximal set of isoforms)
• Genome Independent Approach
  – Tries to de novo assemble transcripts
  – Trans-ABySS, Velvet
  – Sensitive to sequencing errors
  – Usually requires more computational resources
[Figure: two isoforms of the same gene]
Determining differential Expression
• A number of packages are available
  – Cuffdiff, DESeq, edgeR etc.
• These require replicates for each condition, so that within-condition variance can be compared with between-condition variance
• Differential expression is easier to detect for more abundant transcripts
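Why replicates matter can be shown with a toy comparison of between-condition difference against within-condition variance. The sketch below uses a Welch t statistic on made-up expression values purely for illustration; this is a swapped-in, simpler technique, and DESeq and edgeR rely on count-based statistical models rather than this test.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Difference in means divided by its standard error (needs >= 2 replicates each)."""
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_b) - mean(group_a)) / standard_error


# Made-up expression values for one gene, three replicates per condition.
tight_control, tight_treated = [10, 12, 11], [20, 22, 21]
noisy_control, noisy_treated = [2, 20, 11], [12, 30, 21]

print(welch_t(tight_control, tight_treated))   # large: replicates agree closely
print(welch_t(noisy_control, noisy_treated))   # small: same mean shift, noisy replicates
```

Both pairs of conditions differ by the same mean amount, but only the pair with consistent replicates gives a strong signal. Without replicates the within-condition variance cannot be estimated at all.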
3’ SEQ
• Sometimes you have samples with low quality, fragmented RNA
• Alternatively, you may not want to sequence complete transcripts, but instead, simply quantitate them
• 3’ SEQ attempts to reduce read coverage across transcripts, while still quantifying them
Better Assaying Isoforms
• To better understand a biological system, we really want to understand all transcripts
  – Alternative splicing first seen in viruses in the 1970s
• Splicing generates complexity
  – Humans have only ~2X more genes than Drosophila
  – More than “one gene, one protein”
  – >38,000 Dscam isoforms!
  – Alternative splicing can be altered in disease
• With relatively short reads, even with paired-end sequencing, it’s not clear which exons end up with which other exons in mature isoforms
• Long-read RNA-Seq results in better isoform determination.
Long-Read RNA-Seq
Sharon D, Tilgner H, Grubert F, Snyder M. (2013). A single-molecule long-read survey of the human transcriptome. Nat Biotechnol 31(11):1009-14
TIF-Seq
• Stands for Transcript Isoform Sequencing
• Does not capture exonic structure
• Instead captures 5’ and 3’ ends of transcripts
• From only ~6000 genes in yeast, almost 2 million unique transcript isoforms identified
• 371,087 major TIFs identified genome-wide
Pelechano V, Wei W, Steinmetz LM. (2013). Extensive transcriptional heterogeneity revealed by isoform profiling. Nature 497(7447):127-31.
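Because TIF-Seq reports the 5’ and 3’ end of each transcript molecule, counting distinct isoforms amounts to counting distinct (start, end) boundary pairs. The following is a minimal sketch of that bookkeeping with invented coordinates; it ignores the clustering of nearby positions and the criteria for "major" TIFs used in the paper.

```python
from collections import Counter


def count_isoforms(boundaries):
    """Count how often each (chromosome, strand, 5' end, 3' end) combination is seen."""
    return Counter(boundaries)


# Invented observations: three distinct boundary pairs, one of them seen twice.
observed = [
    ("chrI", "+", 100, 900),
    ("chrI", "+", 100, 900),
    ("chrI", "+", 120, 900),
    ("chrI", "+", 100, 950),
]
counts = count_isoforms(observed)
print(len(counts))              # 3 distinct transcript isoforms
print(counts.most_common(1))    # the most frequently observed boundary pair
```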
Recommended Reading
• Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16):2078-9.
• Thorvaldsdóttir, H., Robinson, J.T. and Mesirov, J.P. (2013). Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 14(2):178-92.
• Abeel, T., Van Parys, T., Saeys, Y., Galagan, J. and Van de Peer, Y. (2012). GenomeView: a next-generation genome browser. Nucleic Acids Res. 40(2):e12.
• Westesson, O., Skinner, M., Holmes, I. (2013). Visualizing next-generation sequencing data with JBrowse. Brief Bioinform. 14(2):172-7.
• Van der Auwera, G.A., Carneiro, M., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., Jordan, T., Shakir, K., Roazen, D., Thibault, J., Banks, E., Garimella, K., Altshuler, D., Gabriel, S. and DePristo, M. (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Current Protocols in Bioinformatics 43:11.10.1-11.10.33.
• Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., McVean, G., Durbin, R. and the 1000 Genomes Project Analysis Group (2011). The variant call format and VCFtools. Bioinformatics 27(15):2156-8.
• Parkhomchuk, D., Borodina, T., Amstislavskiy, V., Banaru, M., Hallen, L., Krobitsch, S., Lehrach, H., and Soldatov, A. (2009). Transcriptome analysis by strand-specific sequencing of complementary DNA. Nucleic Acids Res. 37(18):e123.
• Borodina, T., Adjaye, J., Sultan, M. (2011). A strand-specific library preparation protocol for RNA sequencing. Methods Enzymol. 500:79-98.
• van Gurp, T.P., McIntyre, L.M., Verhoeven, K.J. (2013). Consistent errors in first strand cDNA due to random hexamer mispriming. PLoS One 8(12):e85583.
• Beck, A.H., Weng, Z., Witten, D.M., Zhu, S., Foley, J.W., Lacroute, P., Smith, C.L., Tibshirani, R., van de Rijn, M., Sidow, A. and West, R.B. (2010). 3'-end sequencing for expression quantification (3SEQ) from archival tumor samples. PLoS One 5(1):e8768.
• Sharon, D., Tilgner, H., Grubert, F. and Snyder, M. (2013). A single-molecule long-read survey of the human transcriptome. Nat Biotechnol. 31(11):1009-14.
• Pelechano, V., Wei, W. and Steinmetz, L.M. (2013). Extensive transcriptional heterogeneity revealed by isoform profiling. Nature 497(7447):127-31.
• Trapnell, C., Pachter, L. and Salzberg, S.L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9):1105-11.
• Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M.J., Salzberg, S.L., Wold, B.J. and Pachter, L. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 28(5):511-5. (Cufflinks)