Contacts

• Data sets are available at: http://www.bioinformatica.unito.it/bioinformatics/DAGEL.II/

• Post-workshop support: support on the data analysis course will be supplied by Prof. Calogero. Send an email to [email protected] with the subject: Post DW support

Workshop topics

• We will analyse the following topics:
– Experimental design.
– Quality controls.
– Data filtering.
– Statistical inference of differential expression.
– Annotation.
– Assessing the biological meaning of differential expression.

Analysis pipeline

[Figure: analysis pipeline. Steps: Sample Preparation, Array Fabrication, Hybridization, Scanning + Image Analysis, Normalization, Filtering, Statistical analysis, Annotation, Biological Knowledge extraction; Quality control. Labels: Platform specific devices; Bioconductor.]

Open source software

• Bioconductor
– is an open source and open development software project to provide tools for the analysis and comprehension of genomic data.

• TMEV 4.0
– is an application that allows the viewing of processed microarray slide representations and the identification of genes and expression patterns of interest. A variety of normalization algorithms and clustering analyses allow the user flexibility in creating meaningful views of the expression data.

Analysis of microarray

• Bioconductor (www.bioconductor.org) is an open source and open development software project to provide tools for the analysis and comprehension of genomic data.

• Advantages:
– highly dynamic,
– free of charge,
– introductory and advanced courses are available in Europe/USA every year.

• Disadvantages:
– limited graphical interface,
– limited documentation.

oneChannelGUI

• This is a graphical interface to Bioconductor libraries devoted to the analysis of data derived from single channel platforms.

• affylmGUI is a graphical interface to the limma library, which allows differential expression detection by means of linear model analysis.

• oneChannelGUI is an extension of affylmGUI capabilities.


• 3’ IVT arrays (e.g. HGU133plus2):
– Primary analysis (probe level QC, probe set summary and normalization), secondary analysis (replicates QC, filtering, statistical analysis, classification) and data mining (GO enrichment).

• Exon arrays:
– Secondary analysis (replicates QC, filtering, statistical analysis, classification, basic Splice Index inspection) using Expression Console as source of primary data.

• Large data sets (i.e. probe set expression in tab-delimited format):
– Secondary analysis (replicates QC, filtering, statistical analysis, classification) using Expression Console/GEO/ArrayExpress data as source of primary data.

Setting the virtual RAM at 2GB: C:\..\R\R-2.3.0\bin\Rgui.exe --max-mem-size=2048M


Starting R and oneChannelGUI

• Double click on “R” to start.
• Click on “Package” to load Bioconductor packages.
• Click on “Load package” to select the oneChannelGUI package, then click on “OK” to load it.
• Click on “Yes” to start the affylmGUI interface.
• Click on “Yes” to start the oneChannelGUI interface.
• Wait a few seconds!

Overlaying oneChannelGUI on affylmGUI changes the default affylmGUI menu to the oneChannelGUI menu for 3’ IVT Affymetrix arrays.

[Figure: standard affylmGUI menu compared with the oneChannelGUI menu for 3’ IVT arrays.]


• Analysis steps for 3’ IVT arrays:
– Loading the .CEL files to be analyzed
– Array quality control:
• Raw data plots
• Robust probe-level model library (affyPLM)
– NUSE plot
– RLE plot

Summary of loaded data: none is available, since no CEL files have been loaded.

• Click on “File” to start a new project.
• Select as working directory the folder containing the .CEL files.
• Click on “New” to start a new project.
• Select 3’ IVT arrays.
• Select the “targets” file, then press OK to continue.

The targets file is a tab-delimited text file containing the description of the experiment. It is made of three columns:
– Name: the name you want to assign to each array.
– FileName: the name of the corresponding .CEL file.
– Target: the experimental condition associated with the array (e.g. mock, treated, etc.).
At least two conditions should be present.
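The layout of the targets file can be sketched in a few lines of Python. This is only an illustration of the three-column, tab-delimited format described above, not part of oneChannelGUI; the array names, file names and conditions are hypothetical, and the helper functions `write_targets` and `check_targets` are invented for this example.

```python
import csv
import io

# Hypothetical array names, .CEL file names and conditions (illustration only).
rows = [
    ("c1", "ctrl_1.CEL", "control"),
    ("c2", "ctrl_2.CEL", "control"),
    ("t1", "trt_1.CEL", "treated"),
    ("t2", "trt_2.CEL", "treated"),
]


def write_targets(rows, handle):
    """Write a tab-delimited targets table with the three required columns."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Name", "FileName", "Target"])
    writer.writerows(rows)


def check_targets(handle):
    """Read a targets table back; verify the header and the number of conditions."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != ["Name", "FileName", "Target"]:
        raise ValueError("header must be: Name, FileName, Target")
    records = list(reader)
    conditions = sorted({record["Target"] for record in records})
    if len(conditions) < 2:
        raise ValueError("at least two conditions should be present")
    return records, conditions


# An in-memory buffer stands in for targets.txt so the sketch leaves no files behind.
buffer = io.StringIO()
write_targets(rows, buffer)
targets_text = buffer.getvalue()
records, conditions = check_targets(io.StringIO(targets_text))
print(targets_text)
print(conditions)
```

To produce a real file, the same `write_targets` call could be pointed at `open("targets.txt", "w", newline="")` instead of the in-memory buffer.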

Define the name of your analysis and press OK to continue. The arrays will now be loaded into a specific R object called an environment.

Raw data are now loaded and are ready for normalization.

Exercise 1 (15 minutes)

• Data set for this exercise:
– spugnini.et.al: 7 CEL files were generated for two prototypic situations (biological replicas):
– C, untreated MSTO-211H cells:
• baldi.c1_b.CEL, baldi.c2.CEL, baldi.c3.CEL
– T, MSTO-211H cells treated with 780 μM piroxicam for 48 hours:
• baldi.f1_b.CEL, baldi.f2.CEL, baldi.f3.CEL, baldi.f4.CEL

• Go to the folder spugnini.et.al.
• Create, with Excel, a tab-delimited file named targets.txt:
– The targets file is made of three columns with the following header:
• Name
• FileName
• Target
– In column Name place a brief name (e.g. c1, c2, etc.)
– In column FileName place the name of the corresponding .CEL file
– In column Target place the experimental condition (e.g. control, treatment, etc.)

• Open R
• Load oneChannelGUI
• Start a new project:
– Change the working dir to spugnini.et.al
– Load the targets.txt file (previously created)
– Set as project name: ex1


[Figure: remaining pipeline steps: Normalization, Filtering, Statistical analysis, Annotation, Biological Knowledge extraction.]


The next steps are a few simple basic quality controls.

• Click on the Quality Control menu.
• You can now evaluate:
– the intensity histogram for one array at a time;
– the intensity density plot for one array at a time;
– all array intensities as box plots.

It is possible that the cRNA concentration in sample sE2 was overestimated and a low amount of cRNA was loaded on the array. As a result, many signals are below the value 100 [log2(100) = 6.64].

Some other basic controls can be done after the calculation of the probe set intensity summary, using a dedicated Bioconductor library, affyPLM.

• Fit the model (BE PATIENT!!!).
• The end of the fitting procedure is signalled by a message. Then the NUSE/RLE function is called automatically.

affyPLM QC library

• affyPLM provides a number of useful tools based on probe-level modelling procedures.
• The affyPLM package allows array quality controls.

What is a Probe Level Model?

• A Probe Level Model (PLM) is a model that is fit to probe-intensity data.
• affyPLM fits a model with probe level and chip level parameters on a probe set by probe set basis.
• In quality control, the chip level parameters are a factor variable with a level for each array.

What is a PLMset?

• The main function for fitting a PLM is the function fitPLM.
• This function will fit a linear model with an effect estimated for each chip and an effect for each probe.
• fitPLM implements iteratively re-weighted least squares M-estimation regression.
• The fitted model is stored in a PLMset object containing chip level parameter estimates and the corresponding standard errors.

Default fitted model

$$\log_2(PM_{kij}) = \alpha_{ki} + \beta_{kj} + \epsilon_{kij}$$

• where $\beta_{kj}$ is the log2 probe set expression value on array j for probe set k and $\alpha_{ki}$ are probe effects.
• To make the model identifiable the constraint $\sum_{i=1}^{I} \alpha_{ki} = 0$ is used.
• For this default model, the parameter estimates given are probe set expression values.

Relative Log Expression (RLE)

• RLE values are computed for each probe set by comparing the expression value on each array against the median expression value for that probe set across all arrays.
• Assuming that most genes do not change in expression across arrays, most of these RLE values should ideally be near 0.
• Boxplots of these values, for each array, provide a quality assessment tool.
• RLE plots:
– Estimate the expression $\hat{\theta}_{gi}$ for each gene g on each array i.
– Compute the median value across arrays for each gene.
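The RLE computation described above can be sketched in plain Python. This is a minimal illustration on a hypothetical matrix of log2 expression estimates (rows are probe sets, columns are arrays), not the affyPLM implementation. The function names `rle` and `per_array_median` are invented for this example.

```python
from statistics import median

# Hypothetical log2 expression estimates: one row per probe set, one column per array.
# The fourth array is deliberately shifted upwards to mimic a problematic chip.
expression = [
    [7.0, 7.2, 6.9, 8.5],
    [5.1, 5.0, 5.3, 6.4],
    [9.8, 10.0, 9.9, 11.2],
]


def rle(matrix):
    """Subtract from each value the median of its probe set across arrays."""
    result = []
    for row in matrix:
        row_median = median(row)
        result.append([value - row_median for value in row])
    return result


def per_array_median(matrix):
    """Median of each column, i.e. the centre of each array's RLE boxplot."""
    return [median(column) for column in zip(*matrix)]


rle_values = rle(expression)
array_medians = per_array_median(rle_values)
print(array_medians)
```

In this toy data set the first three arrays have RLE medians close to 0, while the fourth stands out, which is the pattern one would look for in the RLE boxplots.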

Normalized Unscaled Standard Errors (NUSE)

• The standard error measures the amount of error made in fitting y for each x value.
• Normalized Unscaled Standard Errors (NUSE) can also be used for assessing quality.
• The standard error estimates obtained for each gene on each array from fitPLM are taken and standardized across arrays so that the median standard error for that gene is 1 across all arrays:

$$\mathrm{NUSE}(\hat{\theta}_{gi}) = \frac{SE(\hat{\theta}_{gi})}{\mathrm{med}_i\, SE(\hat{\theta}_{gi})}$$

where $\hat{\theta}_{gi}$ is the expression estimate for gene g on array i.

• This process accounts for differences in variability between genes.
• An array where the SEs are elevated relative to the other arrays is typically of lower quality.
• Boxplots of these values, separated by array, can be used to compare arrays.
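The NUSE standardization can be sketched as follows. The standard errors below are hypothetical values, not fitPLM output, and the function name `nuse` is invented for this example. The sketch only shows the division of each standard error by the median standard error of its gene across arrays.

```python
from statistics import median

# Hypothetical standard errors: one row per gene (probe set), one column per array.
# The fourth array has inflated standard errors.
standard_errors = [
    [0.10, 0.12, 0.11, 0.30],
    [0.20, 0.18, 0.22, 0.45],
    [0.05, 0.06, 0.05, 0.12],
]


def nuse(se_matrix):
    """Divide each standard error by the median SE of its gene across arrays."""
    result = []
    for row in se_matrix:
        row_median = median(row)
        result.append([se / row_median for se in row])
    return result


nuse_values = nuse(standard_errors)
# Median NUSE per array: values well above 1 suggest a lower-quality array.
nuse_array_medians = [median(column) for column in zip(*nuse_values)]
print(nuse_array_medians)
```

After standardization every gene has a median NUSE of 1, so arrays can be compared on a common scale regardless of how variable each gene is.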

Since the fitted PLM object can be very big, it is a good idea to delete it after quality control.

[Figure: the workspace before and after “Delete PLM”.]


Exercise 2

• Starting from the data set you have loaded:
– check the raw data with the available plots;
– analyze the data using NUSE and RLE plots.

• Answer the following questions:
– Is there any array characterized by a very narrow probe intensity distribution?
• YES NO
– Is there any array which is significantly different from the others in the NUSE or RLE plots?
• YES NO

• If any array has a NUSE very different from the others, create a new targets file without it and load the remaining .CEL files again.





affylmGUI

• Analysis steps:
– Calculating probe set summaries:
• RMA
• GCRMA
• PLIER
– Normalization:
• Quantile method

Brief summary about probe set intensity calculation

• RMA methodology (Irizarry et al., 2003) performs background correction, normalization, and summarization in a modular way. RMA does not take into account non-specific probe hybridization in the probe set background calculation.

• GCRMA is a version of RMA with a background correction component that makes use of probe sequence information (Wu et al., 2004).

• The PLIER (Probe Logarithmic Error Intensity Estimate) method produces an improved signal by accounting for experimentally observed patterns in probe behavior and handling error appropriately at low and high signal values.

• Methods such as PLIER+16 and GCRMA, which use model-based background correction, maintain relatively good accuracy without losing much precision.

• G,C stretch length performs better than G + C content.

• The distribution of the 4 G-C strata in the non-specific sample is well approximated by a log-normal distribution.

[Figure: yeast DNA on human chips.]

[Figure: observed log (base 2) expression versus nominal log concentration (in picoMolar).]

Why Normalization?

To remove systematic biases, which include:
• Sample preparation
• Variability in hybridization
• Spatial effects
• Scanner settings
• Experimenter bias

Extracted from D. Hyle presentation, http://www.bioinf.man.ac.uk/microarray

What Normalization Is & What It Isn’t

• Methods and Algorithms
• Applied after some Image Analysis
• Applied before subsequent Data Analysis
• Allows comparison of experiments
• Not a cure for poor data.

Where Normalization Fits In

[Figure: workflow. Sample Preparation, Array Fabrication, Hybridization, Scanning + Image Analysis (spot location, assignment of intensities, background correction etc.), then Data Analysis (subsequent analysis, e.g. clustering, uncovering genetic networks). Normalization sits between image analysis and data analysis.]

Extracted from Irizarry presentation at Bioconductor Course (Brixen IT, 2005)

Quantile normalization

[Figure: array intensity distributions before and after quantile normalization.]
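Because the slide content for quantile normalization was a figure, the idea can be sketched in code instead. The block below is a minimal pure-Python illustration, assuming a matrix with one row per probe and one column per array, and no special handling of ties. It is not the Bioconductor implementation, and the data and the function name `quantile_normalize` are hypothetical.

```python
def quantile_normalize(matrix):
    """Give every array (column) the same distribution of values.

    The reference distribution is the mean, rank by rank, of the sorted columns.
    Each value is then replaced by the reference value of its rank in its array.
    """
    n_rows = len(matrix)
    columns = [list(column) for column in zip(*matrix)]
    sorted_columns = [sorted(column) for column in columns]
    reference = [sum(values) / len(values) for values in zip(*sorted_columns)]

    normalized_columns = []
    for column in columns:
        # Row indices ordered from the smallest to the largest value in this array.
        order = sorted(range(n_rows), key=lambda index: column[index])
        new_column = [0.0] * n_rows
        for rank, index in enumerate(order):
            new_column[index] = reference[rank]
        normalized_columns.append(new_column)
    return [list(row) for row in zip(*normalized_columns)]


# Hypothetical intensities: 4 probes (rows) measured on 3 arrays (columns).
raw = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.5, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(raw)
for row in normalized:
    print(row)
```

After normalization each array contains exactly the same set of values; only the assignment of those values to probes differs, following the original ranks within each array.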

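M-A Plots

• The M-A plot is a 45° rotation of the standard scatter plot of log R versus log G.
• M = log R – log G (M = Minus)
• A = ½[ log R + log G ] (A = Add)

[Figure: scatter plot of log R versus log G, and the same data rotated by 45° with M on the vertical axis, A on the horizontal axis and M = 0 as the reference line.]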
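The two definitions translate directly into code. The sketch below assumes base-2 logarithms, which the slide does not state explicitly, and the function name `ma_values` is invented for this example.

```python
import math


def ma_values(r, g):
    """Return (M, A) for two positive intensities R and G.

    Assumption: logarithms are taken in base 2.
    """
    log_r = math.log2(r)
    log_g = math.log2(g)
    m = log_r - log_g          # M = Minus: the log ratio
    a = 0.5 * (log_r + log_g)  # A = Add: the average log intensity
    return m, a


# Hypothetical intensity pairs (R, G).
pairs = [(256.0, 64.0), (100.0, 100.0), (32.0, 128.0)]
ma = [ma_values(r, g) for r, g in pairs]
print(ma)
```

Points with equal intensities fall on M = 0, so deviations from the horizontal line are easier to see than deviations from the diagonal of the original scatter plot.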

The next step is normalization and calculation of the probe set summary.

• Click on the probe set menu and select the probe set summary and normalization option.

Normalization and intensity calculation come together. Three normalization/intensity calculation options are available:
– RMA + quantile normalization
– GCRMA + quantile normalization
– PLM + quantile normalization

At any time it is possible to check the structure of the normalized data set.

Exercise 3a (20 minutes)

• Data set for the following exercises:
• estrogen.IGF1: 18 CEL files were generated for 6 prototypic situations (biological replicas):
– MCF7 ctrl, untreated MCF7:
• M1.CEL, M4.CEL, M7.CEL
– MCF7 E2, MCF7 treated for 3 hours with Estrogen:
• M3.CEL, M6.CEL, M9.CEL
– MCF7 IGF, MCF7 treated for 3 hours with IGF1:
• M2.CEL, M5.CEL, M8.CEL
– SKER3 ctrl, untreated SKER3:
• s1.CEL, s4.CEL, s7.CEL
– SKER3 E2, SKER3 treated for 3 hours with Estrogen:
• s3.CEL, s6.CEL, s9.CEL
– SKER3 IGF, SKER3 treated for 3 hours with IGF1:
• s2.CEL, s5.CEL, s8.CEL

• Go to the folder estrogen.IGF1.
• Create, with Excel, a tab-delimited file named targets.txt:
– The targets file is made of three columns with the following header:
• Name
• FileName
• Target
– In column Name place a brief name (e.g. c1, c2, etc.)
– In column FileName place the name of the corresponding .CEL file
– In column Target place the experimental condition (e.g. control, treatment, etc.)
• Create a targets file only for MCF7 with/without IGF1 treatment.

• Calculate probe set summaries with GCRMA for the estrogen.IGF1 set.
• Save the limma project as ex3.

Replicates quality control

• To evaluate the quality of sample replicates we will use a dimension-reduction technique called principal component analysis (PCA).

Principal component analysis

• Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components.
• The first principal component accounts for as much of the variability in the data as possible.
• Each succeeding component accounts for as much of the remaining variability as possible.
• The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.

[Figure: data cloud with the first (PCA1) and second (PCA2) principal components drawn as axes.]

• The 2nd PC will be orthogonal to the 1st.
• In many data sets the first three components account for most of the variability. Therefore, PCA results can reasonably be represented in a 3D space.
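The properties listed above (maximal variance on the first component, orthogonality of the second) can be checked on a toy example. The sketch below is a minimal pure-Python PCA using power iteration with deflation on the covariance matrix. It is an illustration under these stated assumptions, not the routine used by oneChannelGUI, and the data and all function names are hypothetical.

```python
import math

# Hypothetical probe set summaries: one row per array (sample), one column per probe set.
# The first three arrays mimic controls, the last three mimic treated samples.
samples = [
    [2.0, 1.9, 0.5],
    [2.2, 2.1, 0.4],
    [1.9, 2.0, 0.6],
    [4.0, 4.2, 0.5],
    [4.1, 3.9, 0.7],
    [3.9, 4.1, 0.4],
]


def center(data):
    """Subtract the column mean from every value."""
    n = len(data)
    means = [sum(column) / n for column in zip(*data)]
    return [[value - mean for value, mean in zip(row, means)] for row in data]


def covariance(centered):
    """Sample covariance matrix of the (already centered) columns."""
    n, p = len(centered), len(centered[0])
    return [
        [sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]


def multiply(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def leading_component(matrix, iterations=200):
    """Power iteration: returns the largest eigenvalue and its unit eigenvector."""
    p = len(matrix)
    vector = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        product = multiply(matrix, vector)
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    eigenvalue = sum(v * w for v, w in zip(vector, multiply(matrix, vector)))
    return eigenvalue, vector


def deflate(matrix, eigenvalue, vector):
    """Remove the variance already explained by one component."""
    p = len(matrix)
    return [
        [matrix[a][b] - eigenvalue * vector[a] * vector[b] for b in range(p)]
        for a in range(p)
    ]


centered = center(samples)
cov = covariance(centered)
total_variance = sum(cov[a][a] for a in range(len(cov)))

eig1, pc1 = leading_component(cov)
eig2, pc2 = leading_component(deflate(cov, eig1, pc1))

explained = [eig1 / total_variance, eig2 / total_variance]
scores_pc1 = [sum(v * w for v, w in zip(row, pc1)) for row in centered]
print(explained)
print(scores_pc1)
```

In this toy data set the two groups of arrays end up on opposite sides of the first component, which is the kind of separation one would hope to see between well-behaved control and treated replicates.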

To perform sample replicates QC we use principal component analysis (PCA). This check is performed on probe set summaries!

[Figure: PCA plot of the samples; t4 is clearly an outlier!]

Exercise 3b

• Use the estrogen.IGF1 data set to evaluate the quality of the replicates by means of PCA.

• Questions:
– Are CTRL and TRT well separated?
• YES NO
– Are the replicates homogeneous?
• YES NO





Filtering

• Filtering affects the false discovery rate.
• The researcher is interested in keeping the number of tests/genes as low as possible while keeping the interesting genes in the selected subset.
• If the truly differentially expressed genes are overrepresented among those selected in the filtering step, the FDR associated with a certain threshold of the test statistic will be lowered due to the filtering.

Extracted from: Heydebreck et al. Bioconductor Project Working Papers 2004

Filtering can be performed at various levels:

• Annotation features:
– Specific gene features (i.e. GO term, presence of transcriptional regulative elements in promoters, etc.)

• Signal features:
– % of intensities greater than a user-defined value
– Interquartile range (IQR) greater than a defined value

Specific gene feature

• In transcriptional studies focusing on genes characterized by a specific feature (i.e. transcription factor elements in promoters), the best filtering approach is to select only those genes linked to the “peculiar feature”.
• For example:
– Identification of genes modulated by estradiol:ER or IGF1 by direct binding to Estrogen-Responsive Elements (ERE):
• HGU133plus2:
– 54675 probe sets, 19951 Entrez Genes
• HGU133plus2 with ERE in putative promoter regions:
– 6764 probe sets, 3058 Entrez Genes

Specific gene feature

• Data derived from specifically devoted annotation data sets can be used for functional filtering.
• The Ingenuity Pathways Knowledge Base is the world's largest curated database of biological networks created from millions of individually modeled relationships between:
– proteins, genes, complexes, cells, tissues, drugs, diseases.
• The Ingenuity Pathways Analysis software (IPA) identifies relations between genes.
• The relations that can be grasped are:
– Regulates
– Regulated by
– Binds

• Start an Ingenuity session at: https://analysis.ingenuity.com/pa/login/login.jsp
• Specific classes of proteins can be searched and exported.
• A keyword can also be used to perform a wide search.
• After selecting the Functions & diseases of interest, genes should be visualized as gene details before being exported to a file to be used for filtering expression data.
• Export the results in a table as before.
• The Entrez Gene IDs present in this file can be used to extract a specific subset of genes. To filter using a list of Entrez Gene IDs (EG), extract from the IPA table only the Entrez genes of interest and save them in a text file without a header.

Exercise 4 (30 minutes)

• Export from Ingenuity a table related to transcription factors.
• Create a file containing only the Entrez genes, without header, and use it to filter the data set of exercise 3.
• Save the affylm project as ex4.TF.lma
• Inspect the sample PCA:
– Are CTRL and TRT well separated?
• YES NO

Non–specific filtering

• This technique has as its premise the removal of genes that are deemed to be not expressed or unchanged according to some specific criterion that is under the control of the user.
• The aim of non–specific filtering is to remove genes that, e.g. due to their low overall intensity or variability, are unlikely to carry information about the phenotypes under investigation.

Extracted from: Heydebreck et al. Bioconductor Project Working Papers 2004

Intensity distributions

[Figure: probe set intensity distributions after RMA and after GCRMA; the background (Bg) level probe sets are marked.]

How to define the efficacy of a filtering procedure?

$$\mathrm{enrichment} = 100 \times \frac{N_{\mathrm{spike\text{-}in,\,AfterFiltering}} \times N_{\mathrm{probesets}}}{N_{\mathrm{probesets,\,AfterFiltering}} \times N_{\mathrm{spike\text{-}in}}}$$

• This enrichment is very similar to that used to evaluate the purification fold of a protein after a chromatographic step:

$$\mathrm{enrichment} = 100 \times \frac{E_{\mathrm{AfterChrom}} \times gP_{\mathrm{BeforeChrom}}}{gP_{\mathrm{AfterChrom}} \times E_{\mathrm{BeforeChrom}}}$$
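As a quick check of the enrichment formula, the sketch below applies it to the counts reported in these slides for the pOverA filter (22300 probe sets and 42 spike-ins before filtering, 5553 probe sets and 42 spike-ins after filtering). The function name `enrichment` and its argument names are chosen for this illustration only.

```python
def enrichment(n_spike_after, n_probesets_after, n_spike_total, n_probesets_total):
    """Percentage enrichment of spike-in probe sets after a filtering step."""
    return 100.0 * (n_spike_after * n_probesets_total) / (
        n_probesets_after * n_spike_total
    )


# Unfiltered data: by construction the enrichment is 100%.
unfiltered = enrichment(42, 22300, 42, 22300)

# After the intensity filter: all 42 spike-ins retained among 5553 probe sets.
filtered = enrichment(42, 5553, 42, 22300)

print(unfiltered)
print(round(filtered, 1))
```

The result for the filtered data is slightly above 401%, consistent with the value reported for the pOverA filter.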

Filtering by genefilter pOverA
(keep a probe set if ≥ 25% of the arrays have intensities ≥ log2(100))

After filtering: 5553 probe sets, 42/42 SpikeIn, Enrichment: 401%

Before filtering: 22300 probe sets, 42/42 SpikeIn, Enrichment: 100%
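The enrichment figure quoted for the pOverA filter can be checked with a few lines of Python. This is only a sketch: the function name `enrichment` and its argument names are illustrative and are not part of genefilter or oneChannelGUI. The counts are the ones quoted on the slide.

```python
def enrichment(spike_after, probesets_after, spike_before, probesets_before):
    """Percent enrichment of spike-in probe sets produced by a filter.

    It is the fraction of spike-ins among the probe sets after filtering,
    divided by the same fraction before filtering, times 100.
    """
    fraction_after = spike_after / probesets_after
    fraction_before = spike_before / probesets_before
    return 100.0 * fraction_after / fraction_before


# Slide values: all 42 spike-ins survive, 5553 of 22300 probe sets are kept.
value = enrichment(42, 5553, 42, 22300)
print(f"{value:.1f}%")  # about 401.6%, reported as 401% on the slide
```

An unfiltered data set gives 100% by construction, which matches the second value on the slide.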

Filtering by InterQuartile Range
[Figure: distribution with the IQR marked between the 25% and 75% quantiles]

How does filtering by genefilter IQR work?
The distribution of all intensity values of a differential expression experiment is the summary of the distributions of each gene's expression over the experimental conditions.

The filter removes genes that show little change across the experimental points.
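A minimal Python sketch of an IQR filter is given here for illustration. It is not the genefilter implementation: the function names, the dictionary layout (probe set ID mapped to its log2 intensities across arrays) and the choice of keeping probe sets whose IQR is greater than or equal to the threshold are all assumptions.

```python
import statistics


def iqr(values):
    """Interquartile range: 75% quantile minus 25% quantile."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def iqr_filter(expr, threshold):
    """Keep the probe sets whose IQR across arrays reaches the threshold."""
    return {pid: vals for pid, vals in expr.items() if iqr(vals) >= threshold}


# Hypothetical log2 intensities over four arrays (two ctrl, two treated).
expr = {
    "flat": [7.0, 7.1, 7.0, 7.1],      # hardly changes: removed
    "changing": [6.0, 6.2, 8.9, 9.1],  # changes between conditions: kept
}
kept = iqr_filter(expr, threshold=0.25)
print(sorted(kept))
```

A probe set that stays at the same level in every array has a narrow distribution and a small IQR, so it is the first to be discarded as the threshold grows.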

Filtering by genefilter IQR
(removing if the intensity IQR is below 0.25, 0.5)

[Figure: probe set distributions before and after IQR filtering. Before filtering: 22300 probe sets, 42/42 SpikeIn, Enrichment: 100%. Values shown for the filtered sets: 32794% and 9139%]

In this example, only those genes having an intensity ≥ 100 in at least 50% of the arrays will be selected.
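The intensity filter described above can be sketched in Python as follows. The function names and the data layout are hypothetical, and the comparison uses ≥ as written on the slides; the actual genefilter pOverA function may treat the boundary differently.

```python
import math


def p_over_a(values, p, a):
    """True if at least a fraction p of the values is >= a."""
    n_over = sum(1 for v in values if v >= a)
    return n_over / len(values) >= p


def intensity_filter(expr, p, a):
    """Keep the probe sets passing the p-over-a criterion."""
    return {pid: vals for pid, vals in expr.items() if p_over_a(vals, p, a)}


# Data are assumed to be on the log2 scale, so an intensity of 100 is log2(100).
threshold = math.log2(100)
expr = {
    "expressed": [7.5, 6.0, 8.1, 5.9],   # 2 of 4 arrays over threshold: kept
    "background": [5.0, 5.2, 6.9, 4.8],  # 1 of 4 arrays over threshold: removed
}
kept = intensity_filter(expr, p=0.5, a=threshold)
print(sorted(kept))
```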


• Use the estrogen.IGF1 data normalized with gcrma (ex3).

• Perform an IQR filter at 0.25 followed by an intensity filter at 50% of the arrays with an intensity over 100.

• Export the data as a tab delimited file.
• Question:

– How many probe sets are left after the first and the second filter?
• ……………………………………………………

– Which changes are observable in the probe set distribution after the first and the second filter?

• ……………………………………………………

• Save the affylm project as ex5.lma.

• Apply the same filtering to ex4.TF.lma
• Question:

– How many probe sets are left after the first and the second filter?

Expression Console®

(Affymetrix person)
• Presentation of Expression Console®:

– General structure
– Loading exon libraries
– QC for raw data
– Exporting gene level and exon level analysis

QC and filtering for exon data

• At the time oneChannelGUI was set up, Bioconductor tools for handling raw data from Affymetrix exon arrays were not available.

• For this reason oneChannelGUI uses the libraries and primary analysis outputs from Affymetrix Expression Console®.

• Exon raw data quality control is performed using the Expression Console®.

• Sample QC and filtering are performed in oneChannelGUI.

Exon arrays on oneChannelGUI

• Gene level and exon level data from Expression Console® are loaded in oneChannelGUI.

• The user needs to specify where the Expression Console® library files are located every time a new exon data set is loaded.

When loading an exon array data set, it is necessary to indicate the organism and which kind of exon data are going to be loaded (core, extended, full).

When loading an exon array data set, it is also necessary to specify the location of the Expression Console libraries.

Subsequently three files have to be loaded:
– The target file, which has the same structure previously described.
– The tab delimited files containing GENE and EXON level data exported from the Expression Console.

A new Menu is then available for exon data.

Exon arrays QC on oneChannelGUI

The brain (b) replicates are very poor. The quality is particularly bad for exon data. However, we have to consider that these data are derived from tissues coming from different post-mortem donors.

Exon arrays filtering
Since the knowledge on exon data is still relatively limited, we have little empirical information about the background threshold. The exon/intron housekeeping gene information available in exon data might be a possible approach to define it.

Different color lines indicate the possible thresholds to be selected. The intensity density plots for introns are shown in black, those for exons in red.

The IQR filter works as described for 3’IVT arrays. However, any filter done at gene level will also affect the corresponding exon data.

[Figure panels: starting condition; after filtering]

The intensity filter is instead based on the threshold previously selected on the basis of exon/intron HK expression signals. In this example we are keeping only the genes where all samples have a signal greater than the pre-defined BG.
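As a rough illustration, the two steps (choosing a BG threshold from intron signals, then keeping the genes where all samples exceed it) could look like the Python sketch below. The rule used to pick the threshold, an upper quantile of the intron intensities, is an assumption made only for this example: in oneChannelGUI the user chooses the threshold by looking at the density plots. All function names are hypothetical.

```python
import statistics


def background_threshold(intron_signals, upper_percent=95):
    """Assumed rule: use an upper percentile of the intron signals as BG."""
    cuts = statistics.quantiles(intron_signals, n=100, method="inclusive")
    return cuts[upper_percent - 1]  # cuts[k - 1] is the k-th percentile


def keep_above_background(gene_expr, bg):
    """Keep only the genes where every sample has a signal greater than BG."""
    return {g: vals for g, vals in gene_expr.items() if all(v > bg for v in vals)}


# Hypothetical intron signals and gene level signals.
introns = [float(i) for i in range(101)]
bg = background_threshold(introns)
genes = {"g1": [96.0, 97.0, 99.0], "g2": [96.0, 90.0, 99.0]}
print(bg, sorted(keep_above_background(genes, bg)))
```

Requiring all samples to pass is the strictest form of the filter; a gene expressed in only one of the two tissues would be removed by it.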


• Use breast and cerebellum Affymetrix exon data.
• Calculate, with Expression Console, gene/exon level expression for the CORE sub set.
• Evaluate the sample quality using gene/exon PCA.
• Question:

– Are replicates homogeneous?
– YES NO

• Define a BG threshold using intron information.
• Perform an IQR filter at 0.25 followed by an intensity filter at a threshold of your choice.
• Save the affylm project as ex6.lma.

Splice Index
• The Splicing Index captures the basic metric for the analysis of alternative splicing.

• It is a measure of how much exon specific expression (with gene induction factored out) differs between two samples.
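The idea can be sketched in Python using the commonly quoted definition: the exon signal is first divided by the signal of its gene (this factors out gene induction), and the Splice Index is the log2 ratio of these normalized intensities between the two samples. The function names are illustrative, signals are assumed to be on the linear (not log) scale, and the exact calculation performed by oneChannelGUI may differ.

```python
import math


def normalized_intensity(exon_signal, gene_signal):
    """Exon signal relative to the expression of the whole gene."""
    return exon_signal / gene_signal


def splice_index(exon_a, gene_a, exon_b, gene_b):
    """log2 ratio of the gene-normalized exon intensities of samples A and B."""
    ni_a = normalized_intensity(exon_a, gene_a)
    ni_b = normalized_intensity(exon_b, gene_b)
    return math.log2(ni_a / ni_b)


# Gene and exon both double: pure gene induction, no splicing difference.
print(splice_index(200.0, 2000.0, 100.0, 1000.0))
# Same gene level, exon four times lower in sample A: candidate skipped exon.
print(splice_index(100.0, 1000.0, 400.0, 1000.0))
```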

Defining function-oriented data set for splice index calculation

Use a set of function-oriented EGs to select probe set IDs.

Use the selected probe set IDs for “Filtering using a list of probe sets”.

Splice Index inspection is performed modelling the splice index exon profiles for two experimental conditions.

ATTENTION: this is only a very rough descriptive instrument! Much work needs to be done on exon analysis!

Results are saved in a pdf file in your working dir.

The sub set of splice indexes to be inspected is defined using two filters.

Example of one gene output

Model of splice indexes over the two experimental conditions. Red dashed lines indicate the confidence interval of the model.

This plot gives some advice about the scattering levels of the Splice Indexes over the gene under analysis.

Plots of significance p-value of the alternative splicing versus the average Splice Index values. In this example only one exon seems to be differentially spliced: 14.

Filtering conditions are shown over the plot of intensity values versus exon number.


• Use ex6.lma.
• Perform a filtering on the basis of probe sets related to TF.
– Create in NetAffx a Transcript cluster view containing only Transcript Cluster ID
– Gene level probe set list is created interrogating NetAffx with the EG used in exercise 4
• Calculate the Splice Index on the data sub set.
• Save the affylm project as ex7.lma.
• Filter for p-value ≤ 0.01 and mean average Splice Index differences ≥ 2.
• Inspect the output plots.





This step is the same for 3’IVT arrays and exon arrays gene level analyses.

Log2(T/C) is frequently used to evaluate fold change variation

[Figure: differential expression of a set of genes plotted both as t/c and as log2(t/c). On the t/c scale down-regulation is compressed.]

Example ratios shown on the slide:
– up-regulation: 200, 400, 800, 1600, 32000 over 100, 100, 100, 100, 100
– down-regulation: 100, 100, 100, 100, 100 over 200, 400, 800, 1600, 32000
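The compression of down-regulation on the raw ratio scale can be seen with a short Python sketch using the intensity values from the slide (the variable and function names are illustrative).

```python
import math


def log_ratio(treated, control):
    """log2(T/C): symmetric around 0 for up- and down-regulation."""
    return math.log2(treated / control)


high = [200, 400, 800, 1600, 32000]
low = [100, 100, 100, 100, 100]

for h, l in zip(high, low):
    up_ratio, down_ratio = h / l, l / h
    # Raw ratios: up-regulation spans 2 and above, down-regulation is
    # squeezed between 0 and 1. On the log2 scale the two are mirror images.
    print(f"t/c up={up_ratio:g} down={down_ratio:g}  "
          f"log2 up={log_ratio(h, l):.2f} down={log_ratio(l, h):.2f}")
```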

Statistical analysis
• Intensity changes between experimental groups (e.g. control versus treated) are known as fold change.
– Ranking genes based on fold change alone implicitly assigns equal variance to every gene.

• Fold change alone is not sufficient to indicate the significance of the expression changes.

• Fold change has to be supported by statistical information.

Multiple testing errors

• When performing multiple statistical tests, two types of errors can occur:
– Type I error (False positive)
– Type II error (False negative)

• Reducing type I errors increases the number of type II errors.

• It is important to identify an approach that reduces false positives with the minimum loss of information (false negatives).

• As the number of samples increases, the tails of a distribution become more populated.

The multiple tests problem
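The size of the problem can be illustrated with a small Python sketch. It assumes that all tests are independent and that no gene is truly differentially expressed, which is a simplification; the function names are illustrative. The 22300 probe sets are the array size quoted earlier in the filtering slides.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of type I errors when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha):
    """Family-wise error rate for independent tests, all nulls true."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Testing every probe set at the usual 0.05 level.
print(expected_false_positives(22300, 0.05))
print(prob_at_least_one_false_positive(100, 0.05))
```

With thousands of probe sets tested at 0.05, more than a thousand false positives are expected by chance alone, which is why a correction is needed.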

Type I error correction

• Null hypothesis (H0): the mean of treated and the mean of control for a gene i belong to the same distribution.

• Type I error: H0 is rejected although it is true.

• Sidak significance point:

$$K(g, \alpha) = 1 - (1 - \alpha)^{1/g}$$

α = acceptance level (e.g. 0.05)
g = n. of independent tests

• H0 is rejected as long as the p-values are lower than K(g, α); all the remaining H0 are considered true.

Type I error correction (FWER)

P of diff. exprs. genes    α’
< 10^-6                    1 - (1 - 0.05)^(1/5) = 0.0102
< 10^-6                    1 - (1 - 0.05)^(1/4) = 0.0127
2 * 10^-5                  1 - (1 - 0.05)^(1/3) = 0.0170
0.047                      1 - (1 - 0.05)^(1/2) = 0.0253
…
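The table can be reproduced with a short Python sketch of a step-down procedure based on the Sidak significance point. The function names are illustrative, and the p-values in the example are hypothetical stand-ins for the "< 10^-6" entries of the slide.

```python
def sidak_threshold(g, alpha=0.05):
    """Sidak significance point K(g, alpha) = 1 - (1 - alpha)^(1/g)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / g)


def sidak_step_down(pvalues, alpha=0.05):
    """Return the indices of the rejected null hypotheses.

    P-values are visited from the smallest to the largest. Each one is
    compared with K(g, alpha), where g is the number of hypotheses not yet
    rejected. At the first failure all remaining H0 are considered true.
    """
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    rejected = []
    remaining = len(pvalues)
    for i in order:
        if pvalues[i] < sidak_threshold(remaining, alpha):
            rejected.append(i)
            remaining -= 1
        else:
            break
    return rejected


pvalues = [0.047, 1e-7, 2e-5, 5e-7, 0.3]
print([round(sidak_threshold(g), 4) for g in (5, 4, 3, 2)])
print(sidak_step_down(pvalues))
```

In this example the p-value 0.047 is below 0.05 but above its corrected threshold, so it is not called significant.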

Statistical analysis

• The sensitivity of statistical tests is affected by the number of available replicates.

• Replicates can be:
– Technical
– Biological

• Biological replicates better summarize the variability of samples belonging to a common group.

• The minimum number of replicates is an important issue!

How important are replicates?
Yang YH and Speed T, 2002

Sample size

• Microarray experiments are often performed with a small number of biological replicates, resulting in low statistical power for detecting differentially expressed genes and concomitant high false positive rates.

• The issue of how many replicates are required in a typical experimental system needs to be addressed.

• Of particular interest is the difference in required sample sizes for similar experiments in inbred vs. outbred populations (e.g. mouse and rat vs. human).

Assessing sample sizes in microarray experiments

• The R package, sizepower, is used to calculate sample size and power in the planning stage of a microarray study.

• It helps the user to determine how many samples are needed to achieve a specified power for a test of whether a gene is differentially expressed or, in reverse, to determine the power of a given sample size.
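The two directions of the calculation (sample size for a target power, and power for a given sample size) can be illustrated with the textbook normal approximation for a two-group comparison. This Python sketch is not the algorithm implemented in sizepower: the formula, the function names and the example values of standard deviation and significance level are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def samples_per_group(log2_fc, sd, alpha, power):
    """Replicates per group needed to detect a log2 fold change (two-sided)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) * sd / log2_fc) ** 2)


def achieved_power(n_per_group, log2_fc, sd, alpha):
    """Approximate power reached with a given number of replicates per group."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return NormalDist().cdf(abs(log2_fc) / standard_error - z_alpha)


# Hypothetical setting: |log2(FC)| = 1, gene standard deviation 0.5,
# per-gene significance level 0.001, target power 0.9.
n = samples_per_group(1.0, 0.5, 0.001, 0.9)
print(n, round(achieved_power(n, 1.0, 0.5, 0.001), 3))
```

Lowering the false positive threshold or the fold change to be detected raises the number of replicates required.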


• Use the breast/cerebellum filtered exon data (ex6.lma).

• Using the sizepower implementation:
– Check the sample size and the experimental power selecting various thresholds of false positives and log2(FC)

Comments about experimental design

• If the biological material is not a limiting factor “THINK WIDE”

Selecting differentially expressed genes

[Figure: statistical validation methods I, II and III around the differential expression linked to a specific biological event]

• Each method grasps some true signals but not all.

• Each method catches some false signals.

• The trick is to find the best condition to maximize true signals while minimizing fakes.

[Figure: distributions of population Ctrl (mean C) and population Trtd (mean T), with the sample mean “s” marked]

Less than a 5% chance that the sample with mean s came from population C, i.e., s is significantly different from “mean C” at the p < 0.05 significance level. But we cannot reject the hypothesis that the sample came from population T.

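$$t = \frac{m_T - m_C}{\sqrt{\dfrac{\sigma_T^2}{n_T} + \dfrac{\sigma_C^2}{n_C}}}$$

where m, σ² and n are the mean, the variance and the number of samples of the treated (T) and control (C) groups.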
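The t-statistic for one gene can be computed with the Python sketch below, which uses the sample variances of the two groups. The function name and the example values are illustrative.

```python
import math
import statistics


def t_statistic(treated, control):
    """(mean T - mean C) divided by sqrt(var T / n T + var C / n C)."""
    m_t, m_c = statistics.mean(treated), statistics.mean(control)
    var_t, var_c = statistics.variance(treated), statistics.variance(control)
    standard_error = math.sqrt(var_t / len(treated) + var_c / len(control))
    return (m_t - m_c) / standard_error


# Hypothetical log2 intensities of one gene, three replicates per group.
print(round(t_statistic([3.0, 4.0, 5.0], [1.0, 2.0, 3.0]), 3))
```

With only three replicates per group the variance estimates are unstable: a gene whose replicates happen to agree closely gets a very large t even when the difference between the means is small.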

• The t-statistic is widely used in assessing differential expression.

• Unstable variance estimates that arise when sample size is small can be corrected using:
– Error fudge factors (SAM)
– Bayesian methods (Limma)

SAM

Significance Analysis of Microarrays (Tusher et al. 2001)

The fudge factor regularizes the t-statistic by inflating the denominator.

s(i) is the pooled standard deviation, taking into account differing gene-specific variation across arrays.
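The regularized statistic can be sketched in Python as d(i) = (mean T - mean C) / (s(i) + s0), with s(i) the pooled standard deviation and s0 the fudge factor, following the form given by Tusher et al. In SAM the value of s0 is estimated from the whole data set; here it is simply passed in as a parameter, which is a simplification. The function name is illustrative.

```python
import math


def sam_statistic(treated, control, s0):
    """Relative difference d(i): a t-like statistic with s0 added below."""
    n_t, n_c = len(treated), len(control)
    mean_t = sum(treated) / n_t
    mean_c = sum(control) / n_c
    # Pooled sum of squared deviations from each group mean.
    ss = sum((x - mean_t) ** 2 for x in treated)
    ss += sum((x - mean_c) ** 2 for x in control)
    s_i = math.sqrt((1.0 / n_t + 1.0 / n_c) * ss / (n_t + n_c - 2))
    return (mean_t - mean_c) / (s_i + s0)


treated, control = [3.0, 4.0, 5.0], [1.0, 2.0, 3.0]
print(round(sam_statistic(treated, control, s0=0.0), 3))
print(round(sam_statistic(treated, control, s0=0.5), 3))
```

A positive s0 shrinks the statistic most for genes with a very small s(i), which are the ones whose variance is most likely underestimated.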

Some SAM designs

Two-class unpaired: to pick out genes whose mean expression level is significantly different between two groups of samples (analogous to between subjects t-test).

Two-class paired: samples are split into two groups, and there is a 1-to-1 correspondence between a sample in group A and one in group B (analogous to paired t-test).

Multi-class: picks up genes whose mean expression is different across > 2 groups of samples (analogous to one-way ANOVA).

Others: …

• SAM uses data permutations to define a set of significant differential expression.

[Figure: permutations of the N and T sample labels]

How does SAM calculate the False Discovery Rate for a specific delta?

FDR is given by p0 * False / Called, where p0 is the prior probability (pi0) that a gene is not differentially expressed.

[Figure: permutations 1 to 4 and the mean number of false calls; 720]
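The FDR formula can be sketched in Python as follows. The function name and the counts are hypothetical; the number of false calls is averaged over the permutations, as the "Mean false" label of the slide suggests.

```python
from statistics import mean


def sam_fdr(called, false_per_permutation, p0=1.0):
    """FDR = p0 * mean(False) / Called for one value of delta.

    called: genes called significant in the real data at this delta.
    false_per_permutation: genes called significant in each permuted
    data set at the same delta (all of them are false by construction).
    """
    if called == 0:
        return 0.0
    return p0 * mean(false_per_permutation) / called


# Hypothetical counts: 100 genes called, four permutations.
print(sam_fdr(100, [8, 12, 10, 10], p0=1.0))
print(sam_fdr(100, [8, 12, 10, 10], p0=0.5))
```

Raising delta lowers both the number of called genes and the number of false calls, so the user can scan a table of delta values to find an acceptable FDR.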

SAM analysis can be performed in Bioconductor using the siggenes library. Two-class or multi-class analysis is selected automatically according to the structure of the Target information.

The delta table shows the user the number of differentially expressed genes obtained at a given FDR.

The user selects a delta value and checks the behaviour of the differentially expressed genes.

Subsequently the user performs a log2(fold change) filter to produce a table of differentially expressed genes.

The table can be saved in a tab delimited file.

Columns of the output table:
– Relative difference in gene expression
– Standard deviation
– Raw p-value
– Significance measurement derived from raw p-value
– Fold change


• Use the estrogen.IGF1 data set.

• After IQR and/or intensity filtering, perform a two-class analysis on ctrl versus IGF1 treatment.

• Select an FDR ≤ 0.1 and a |log2(FC)| ≥ 1

• Save results in ex9.xls
