Biological question Differentially expressed genes Sample class prediction etc. Testing Biological verification and interpretation Microarray experiment Estimation Experimental design Image analysis Normalization Clustering Discrimination Churchill, March 15 Bult, Lecture 5 Bult, Lecture 6 Hibbs, Lectures 10 and Blake, Lecture 16 and 1

Page 1

[Figure: microarray analysis workflow diagram. Boxes: Biological question (differentially expressed genes, sample class prediction, etc.); Experimental design; Microarray experiment; Image analysis; Normalization; Estimation; Testing; Clustering; Discrimination; Biological verification and interpretation.]

Lecture labels on the diagram: Churchill, March 15; Bult, Lecture 5; Bult, Lecture 6; Hibbs, Lectures 10 and 11; Blake, Lectures 16 and 17.

Page 2

Project Steps

• Find and Download Array Data
• Normalize Array Data
• Analyze Data
– i.e., generate gene lists
• Differentially expressed genes, genes in clusters, etc.
• Interpret Gene Lists
– Use the annotations of genes in your lists
• Gene Ontology terms are available for many organisms, but not all

Page 3

Getting The Data

• Search GEO (or whatever) for a data set of interest.

• Download the data files
– e.g., Affy .CEL files, Affy .CDF files, etc.

• Upload to home directory

Page 4

Normalize the Data

• Sent you all a script (2/23/2012) to RMA normalize the Ackerman array data available from my home directory

Page 5

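library(affy)         # ReadAffy(), rma(), exprs()
library(makecdfenv)   # make.cdf.env()

# Build a CDF environment from the CDF file in the working directory
Array.CDF = make.cdf.env("MoGene-1_0-st-v1.cdf")
# Read all .CEL files in the working directory into an AffyBatch object
CELData = ReadAffy()
# Point the AffyBatch at the CDF environment created above
CELData@cdfName = "Array.CDF"
# RMA: background-correct, quantile-normalize, and summarize (log2 scale)
rma.CELData = rma(CELData)
rma.expr = exprs(rma.CELData)
rma.expr.df = data.frame(ProbeID=row.names(rma.expr), rma.expr)
write.table(rma.expr.df, "rma.expr.dat", sep="\t", row.names=FALSE, quote=FALSE)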

Page 6

• What is a library?
• What does the ReadAffy() function do? What are possible arguments for the ReadAffy() function?
• What class of R object is rma.CELData?
• What class of R object is rma.expr?
• What class of R object is rma.expr.df?
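
The last two questions can be explored without any arrays at hand. The following is a minimal sketch in base R, assuming a small made-up matrix stands in for the output of exprs(); the probe IDs and sample names are hypothetical. rma.CELData itself is not reproduced here because it needs the affy package and real CEL files.

```r
# Hypothetical stand-in for exprs(rma.CELData): rows = probesets, columns = arrays
rma.expr <- matrix(c(7.1, 8.3, 5.9, 7.4, 8.0, 6.2),
                   nrow = 3,
                   dimnames = list(c("10338001", "10338002", "10338003"),
                                   c("sample1.CEL", "sample2.CEL")))

# Same construction as in the script: probe IDs become an explicit first column
rma.expr.df <- data.frame(ProbeID = row.names(rma.expr), rma.expr)

is.matrix(rma.expr)    # exprs() returns a numeric matrix
class(rma.expr.df)     # the object written out by write.table() is a data frame
dim(rma.expr.df)       # one extra column (ProbeID) compared with the matrix
```

In a live session, class(rma.CELData) and slotNames(rma.CELData) answer the remaining question directly.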

Page 7

• slotNames(CELData)
• phenoData(CELData)

Page 8

This is what rma.expr.df looks like in Excel……

Page 9

Plotting probe intensities across the Ackerman arrays (non-normalized)

# boxplot() on the AffyBatch shows the raw probe-level intensities, one box per array
jpeg("boxplot.jpeg")
boxplot(CELData, names=CELData$sample, col="blue")
dev.off()

Page 10

mydata=rma.expr.df

# Drop the first column (ProbeID) so only the array columns are plotted
jpeg("normal_boxplot.jpg")
boxplot(mydata[-1], main = "Normalized Intensities", xlab="Array", ylab="Intensities", col="blue")
dev.off()

Plotting summarized probeset intensities across the Ackerman arrays (normalized)
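
The boxes line up after normalization because RMA includes a quantile-normalization step, which forces every array to share the same intensity distribution. The sketch below illustrates that idea only, in base R on made-up numbers; quantile_normalize is a hypothetical helper, not the affy implementation, and it handles ties more crudely than the package does.

```r
# Quantile normalization: force every array (column) to share one distribution
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank of each value within its array
  sorted <- apply(x, 2, sort)                         # each array sorted ascending
  target <- rowMeans(sorted)                          # mean of the k-th smallest values
  out <- apply(ranks, 2, function(r) target[r])       # replace each value by the target at its rank
  dimnames(out) <- dimnames(x)
  out
}

# Made-up intensities: 4 probes on 3 arrays
raw <- matrix(c(5, 2, 3, 4,
                4, 1, 4, 2,
                3, 4, 6, 8),
              nrow = 4,
              dimnames = list(paste0("probe", 1:4), paste0("array", 1:3)))
norm <- quantile_normalize(raw)

# After normalization the sorted columns are identical, so the boxplots match
apply(norm, 2, sort)
```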

Page 11

Next time

• Posted articles from Gary Churchill.
– If you only read one article, read Churchill 2004
– See also Gary’s web site:
• http://churchill.jax.org/software/rmaanova.shtml
– Look at Sample Data and Tutorial
• After that lecture we will begin analysis of microarray data
– MAANOVA

Page 12
Page 13

[Figure: sequencing cost per Kb (axis from $0.00 to $140.00) and throughput in gigabases (axis from 0.00 to 70,000.00) by year, 1990 to 2009. Lucinda Fulton, The Genome Center at Washington University.]

Page 14

Sequencing Technologies

http://www.geospiza.com/finchtalk/uploaded_images/plates-and-slides-718301.png

Page 15

Sequence “Space”

• Roche 454 – Flow space
– Measure pyrophosphate released by a nucleotide when it is added to a growing DNA chain
– Flow space describes sequence in terms of these base incorporations
– http://www.youtube.com/watch?v=bFNjxKHP8Jc
• AB SOLiD – Color space
– Sequencing by DNA ligation via synthetic DNA molecules that contain two nested known bases with a fluorescent dye
– Each base sequenced twice
– http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related
• Illumina/Solexa – Base space
– Single base extensions of fluorescent-labeled nucleotides with protected 3' OH groups
– Sequencing via cycles of base addition/detection followed by deprotection of the 3' OH
– http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related
• GenomeTV – Next Generation Sequencing (lecture)
– http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related

http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html
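
To make color space concrete: each color describes a pair of adjacent bases, so every base contributes to two colors (which is why each base is read twice). The base R sketch below uses the standard two-base encoding, where identical bases give color 0; to_color_space is a hypothetical helper name and the input sequence is made up.

```r
# Encode a base-space sequence in SOLiD-style color space.
# With the 2-bit codes A=0, C=1, G=2, T=3, the color of a base pair
# is the bitwise XOR of the two codes.
to_color_space <- function(seq) {
  bases <- strsplit(toupper(seq), "")[[1]]
  codes <- unname(c(A = 0L, C = 1L, G = 2L, T = 3L)[bases])
  colors <- bitwXor(codes[-length(codes)], codes[-1])
  # Keep the first base as the anchor needed to decode the colors again
  paste0(bases[1], paste(colors, collapse = ""))
}

to_color_space("ATGGC")
```

Because decoding depends on the preceding base, a single miscalled color changes every downstream base, which is one reason color-space reads are usually aligned in color space rather than converted first.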

Page 16

“Standard” File formats

Sequence containers: FASTA, FASTQ, BAM/SAM

Alignments: BAM/SAM, MAF

Annotation: BED, GFF/GTF/GFF3, WIG

Variation: VCF, GVF

Page 17

Tools

Alignments: BLAST (not for NGS), BWA, Bowtie, Maq, …

Transcriptomics: Tophat, Cufflinks, …

Variant calling: ssahaSNP, Mosaic, …

Counting (ChIP-Seq, etc.): FindPeaks, PeakSeq

Page 18

FASTQ: Data Format

• FASTQ
– Text based
– Encodes sequence calls and quality scores with ASCII characters
– Stores minimal information about the sequence read
– 4 lines per sequence
• Line 1: begins with @; followed by sequence identifier and optional description
• Line 2: the sequence
• Line 3: begins with “+” and is followed by the sequence identifier and description (both are optional)
• Line 4: encoding of quality scores for the sequence in line 2
• References/Documentation
– http://maq.sourceforge.net/fastq.shtml
– Cock et al. (2009). Nuc Acids Res 38:1767-1771.
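
The four-line layout can be unpacked in a few lines of base R. This is a sketch on a made-up record, assuming the Sanger variant of FASTQ, where the quality of each base is the ASCII code of its character minus 33.

```r
# One FASTQ record (made-up read): four lines as described above
record <- c("@read_1 example description",
            "GATTTGGGGT",
            "+",
            "II5?I+!II#")

# Basic sanity checks on the record layout
stopifnot(startsWith(record[1], "@"), startsWith(record[3], "+"))

read_id <- sub("^@", "", strsplit(record[1], " ")[[1]][1])
bases   <- strsplit(record[2], "")[[1]]
# Sanger FASTQ: quality = ASCII code of the character minus 33
quals   <- utf8ToInt(record[4]) - 33L
# PHRED definition: Q = -10 * log10(p_error), so p_error = 10^(-Q/10)
p_error <- 10^(-quals / 10)

data.frame(base = bases, quality = quals, p_error = signif(p_error, 2))
```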

Page 19

FASTQ Example

FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.

For analysis, it may be necessary to convert to the Sanger form of FASTQ. For example, Illumina (1.3+) stores PHRED quality scores ranging from 0-62 with an ASCII offset of 64; Sanger quality scores range from 0-93 with an ASCII offset of 33.

Older Solexa quality scores are on a different scale and have to be converted to PHRED quality scores.
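
The Solexa-to-PHRED conversion given by Cock et al. is Q_PHRED = 10 log10(10^(Q_Solexa/10) + 1). The base R sketch below applies it and re-encodes a quality string; the function names and the example string are hypothetical, and it assumes the input really is Solexa-scaled with ASCII offset 64.

```r
# Solexa score -> PHRED score (formula from Cock et al.)
solexa_to_phred <- function(q_solexa) {
  10 * log10(10^(q_solexa / 10) + 1)
}

# Re-encode a Solexa quality string (ASCII offset 64)
# as a Sanger quality string (PHRED scale, ASCII offset 33)
solexa_string_to_sanger <- function(qual) {
  q_solexa <- utf8ToInt(qual) - 64L
  q_phred  <- as.integer(round(solexa_to_phred(q_solexa)))
  intToUtf8(q_phred + 33L)
}

# The two scales differ most at low quality and converge at high quality
round(solexa_to_phred(c(-5, 0, 10, 40)), 2)

solexa_string_to_sanger("hhTJ@")
```

For Illumina 1.3+ files, which already hold PHRED scores, only the offset changes (subtract 64, add 33) and the formula is not needed.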

Page 20

SAM (Sequence Alignment/Map)

• It may not be necessary to align reads from scratch…you can instead use existing alignments in SAM format
– SAM is the output of aligners that map reads to a reference genome
– Tab delimited w/ header section and alignment section
• Header sections begin with @ (are optional)
• Alignment section has 11 mandatory fields
– BAM is the binary format of SAM

http://samtools.sourceforge.net/

Page 21

http://samtools.sourceforge.net/SAM1.pdf

Mandatory Alignment Fields
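
As a rough illustration of the 11 mandatory columns, the base R sketch below splits one alignment line into named fields and checks two bits of the FLAG. The alignment line is made up; the field names follow the current SAM specification (older versions call columns 7 to 9 MRNM, MPOS, and ISIZE).

```r
# The 11 mandatory SAM fields, in column order
sam_fields <- c("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
                "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")

# A made-up alignment line; real lines may carry optional tags after column 11
sam_line <- paste("read_1", "16", "chr1", "16759829", "37", "10M",
                  "*", "0", "0", "GATTTGGGGT", "IIIIIIIIII", sep = "\t")

values <- strsplit(sam_line, "\t", fixed = TRUE)[[1]]
aln <- setNames(as.list(values[1:11]), sam_fields)

# FLAG is a bit field: test individual bits with bitwAnd()
flag <- as.integer(aln$FLAG)
is_unmapped <- bitwAnd(flag, 4L) != 0    # 0x4: read is unmapped
is_reverse  <- bitwAnd(flag, 16L) != 0   # 0x10: read aligned to the reverse strand

aln$RNAME
aln$CIGAR
is_reverse
```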

Page 22

http://samtools.sourceforge.net/SAM1.pdf

Alignment Examples

Alignments in SAM format

Page 23

chr1 86114265 86116346 nsv433165
chr2 1841774 1846089 nsv433166
chr16 2950446 2955264 nsv433167
chr17 14350387 14351933 nsv433168
chr17 32831694 32832761 nsv433169
chr17 32831694 32832761 nsv433170
chr18 61880550 61881930 nsv433171

chr1 16759829 16778548 chr1:21667704 270866 -
chr1 16763194 16784844 chr1:146691804 407277 +
chr1 16763194 16784844 chr1:144004664 408925 -
chr1 16763194 16779513 chr1:142857141 291416 -
chr1 16763194 16779513 chr1:143522082 293473 -
chr1 16763194 16778548 chr1:146844175 284555 -
chr1 16763194 16778548 chr1:147006260 284948 -
chr1 16763411 16784844 chr1:144747517 405362 +

Valid BED files
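
BED files are plain tab-delimited tables, so they load directly into a data frame. The sketch below reads three of the six-column records above in base R; the column names follow the usual BED convention (chrom, chromStart, chromEnd, name, score, strand) and are supplied by hand because BED files carry no header row.

```r
# Three of the six-column records from the slide, tab-delimited as in a real file
bed_lines <- c("chr1\t16759829\t16778548\tchr1:21667704\t270866\t-",
               "chr1\t16763194\t16784844\tchr1:146691804\t407277\t+",
               "chr1\t16763411\t16784844\tchr1:144747517\t405362\t+")

bed <- read.table(text = bed_lines, sep = "\t",
                  col.names = c("chrom", "chromStart", "chromEnd",
                                "name", "score", "strand"),
                  colClasses = c("character", "integer", "integer",
                                 "character", "integer", "character"))

# BED starts are 0-based and ends are exclusive, so length is simply end - start
bed$length <- bed$chromEnd - bed$chromStart
bed[, c("chrom", "chromStart", "chromEnd", "strand", "length")]
```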

Page 24

Galaxy
http://main.g2.bx.psu.edu/

See Tutorial 1

• Build and share data and analysis workflows
• No programming experience required
• Strong and growing development and user community

Page 25

[Figure: Galaxy interface, with panels labeled Tools, Dialog/Parameter Selection, and History.]

Page 26

Tutorial Web Site
http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/index.shtml

Tutorial 5

Page 27

RNA Seq Workflow

• Convert data to FASTQ
• Upload files to Galaxy
• Quality Control
– Throw out low quality sequence reads, etc.
• Map reads to a reference genome
– Many algorithms available
– Trade off between speed and sensitivity
• Data summarization
– Associating alignments with genome annotations
– Counts
• Data Visualization
• Statistical Analysis
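
The data summarization step (associating alignments with annotations to get counts) can be sketched in base R as an interval-overlap count. Everything below is hypothetical: the gene coordinates, the read positions, and the count_reads helper. Real pipelines also have to deal with spliced reads, strand, and reads that overlap more than one gene.

```r
# Hypothetical annotation: one row per gene (1-based, inclusive coordinates)
genes <- data.frame(gene  = c("geneA", "geneB"),
                    chrom = c("chr1", "chr1"),
                    start = c(100L, 500L),
                    end   = c(300L, 900L),
                    stringsAsFactors = FALSE)

# Hypothetical alignments after QC and mapping: leftmost position and read length
reads <- data.frame(chrom  = c("chr1", "chr1", "chr1", "chr1", "chr2"),
                    pos    = c(120L, 290L, 510L, 1000L, 150L),
                    length = 50L,
                    stringsAsFactors = FALSE)
reads$end <- reads$pos + reads$length - 1L

# A read is counted for a gene when the two intervals overlap on the same chromosome
count_reads <- function(g_chrom, g_start, g_end) {
  sum(reads$chrom == g_chrom & reads$pos <= g_end & reads$end >= g_start)
}
genes$count <- mapply(count_reads, genes$chrom, genes$start, genes$end,
                      USE.NAMES = FALSE)
genes
```

The resulting gene-by-count table is the input to the visualization and statistical analysis steps.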

Page 28

Typical RNA_Seq Project Work Flow

[Figure: workflow diagram with the stages Tissue Sample, Total RNA, mRNA, cDNA, Sequencing, FASTQ file, QC, TopHat, Cufflinks, Gene/Transcript/Exon Expression, Visualization, and Statistical Analysis.]

JAX Computational Sciences Service

Page 29

TopHat

Trapnell et al. (2009). Bioinformatics 25:1105-1111.

http://tophat.cbcb.umd.edu/

Figure from: Trapnell et al. (2010). Nature Biotechnology 28:511-515.

TopHat is better suited to aligning RNA Seq data than general-purpose aligners (Maq, BWA) because it takes splicing into account during the alignment process, so reads that span exon-exon junctions can still be mapped.

Page 30

Trapnell C et al. Bioinformatics 2009;25:1105-1111

TopHat is built on the Bowtie alignment algorithm.

Page 31

Cufflinks

Trapnell et al. (2010). Nature Biotechnology 28:511-515.

http://cufflinks.cbcb.umd.edu/

• Assembles transcripts,
• Estimates their abundances, and
• Tests for differential expression and regulation in RNA-Seq samples