Unlocking the potential of public available gene expression data for large-scale analysis

Jonatan Taminau

PhD defense, November 2012

Introduction

• In this thesis:
  • Focus on the data-to-information step.
  • Focus on microarray technology.

Data → Information → Knowledge

Introduction

Data → Information

Data Repositories:
+ Massive amounts
+ Examples: GEO, ArrayExpress
+ Publicly available!

Analysis Software:
+ Commercial: CLC Bio, Spotfire, etc.
+ Free: Bioconductor, GenePattern, Galaxy, etc.
+ A lot of existing research

Introduction

“Although hundreds of thousands of samples are publicly available, and several powerful analysis software solutions exist, the research community is facing a chasm between these two resources.” (Coletta et al, 2012)

“One of the challenges for the future is how to integrate all the DNA microarray data that have been generated and deposited in public databases.” (Larsson et al, 2006)

Introduction

• We identified two hurdles for large-scale microarray analysis:

① Consistent retrieval of individual datasets.

② Integrative analysis of multiple data sets.

Outline

Chapter 1

Chapter 2

Chapter 3

Chapter 4

Chapter 5

Chapter 6 Chapter 7

Chapter 8

Chapter 9

Outline

Retrieval of data
• Problem Statement
• inSilico DB

Integrative Analysis
• Problem Statement
• Meta-Analysis
• Merging
• Application

Retrieval of genomic data

• Data is online and freely available
• But: it is difficult to retrieve the data consistently (example: Baggerly & Coombes, 2011)

• What does consistent retrieval mean?
  • Data retrieval is reproducible and tractable
  • No manual intervention needed
  • All data is preprocessed the same way

• Typical microarray workflow:

DNA microarray → Scanner → Image analysis → CEL file (numerical 'raw' data) → Preprocessing → Gene expression matrix

The preprocessing step (CEL file → gene expression matrix) is complex:
+ normalization/background correction
+ probe-to-gene mapping
+ versioning issues
+ etc.

Not documented!

“only 48% of all data in GEO and ArrayExpress was submitted with raw data” (Larsson et al. 2006)

Gene expression matrix:
+ Features: genes or probes (range: 20k-30k)
+ Instances: patients, tissues, etc. (range: 10-100)
+ Gene expression value: expression of gene i in sample j, log2 scaled, ranging between 2 and 14

• What about phenotypical data or meta-data?
  • Extra information about the samples (age, gender, disease, etc.)
  • No standard way of formatting this information
  • MIAME / Ontologies / Free text / etc.
  • Also still an open problem

• Why is consistent retrieval from public repositories so important?
  • Reproducibility of results
  • Comparison of new results with existing studies
  • Combining different studies

The inSilico Database

• Result of the InSilico project
  • Innoviris (2007-2012)
  • 8 people from VUB & ULB

•Provides consistently preprocessed and expert-curated genomic data

•Being commercialized

The inSilico Database

• What makes the inSilico Database so valuable?
  • Not the fact that all data is precomputed
  • But how it is precomputed

• What is the underlying engine?
  • Genomic Pipelines
  • Backbone

The inSilico DB | Genomic Pipelines

•For every data type there is a different pipeline

•Microarray pipeline:

  • Jobs
  • Dependencies
  • Backbone
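A pipeline made of jobs with dependencies can be pictured as a directed acyclic graph in which a job runs only after everything it depends on has finished. The sketch below illustrates that idea only; the job names are hypothetical and are not taken from the actual inSilico DB pipelines.

```python
from graphlib import TopologicalSorter

# Hypothetical job names: the slides only state that a pipeline consists of
# jobs and dependencies, not which jobs exist.
dependencies = {
    "download_cel_files": set(),
    "preprocess": {"download_cel_files"},
    "map_probes_to_genes": {"preprocess"},
    "curate_annotations": {"download_cel_files"},
    "build_expression_matrix": {"map_probes_to_genes", "curate_annotations"},
}


def execution_order(deps):
    """Return one valid order in which every job runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

A workflow system such as the backbone described in the slides would add scheduling, monitoring and error handling on top of this ordering.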

The inSilico DB | Backbone

• Automatic workflow system
• Hardly any manual intervention needed
• Control of intermediate results
• Pre-computation saves time (for the user)
• Streamlined error management
• Automatic monitoring

The inSilico DB | Backbone

• How does it work?
  • Java daemon (recently replaced by an application server)
  • Configuration files

inSilicoDb package

• One thing missing for large-scale analysis...
  • Programmatic access via scripting

• Contains the basic functionality of InSilico DB

• Makes automatic retrieval of data possible!

• Seamlessly integrates with other Bioconductor analysis tools

• Published in Bioinformatics, downloaded more than 2000 times

Integrative Analysis

•“Combining the information of multiple, independent but related studies in order to extract more general and more reliable results”

• Problem:
  • How to do it?

• Two approaches:
  • Meta-Analysis
  • Merging

Integrative Analysis

[Figure: the meta-analysis and merging workflows]

Meta-Analysis

+ Combining p-values
+ Combining effect sizes
+ Combining ranks
+ Vote counting
+ etc.

+ Depends on goal
+ Much focus on finding DEGs
+ Defines what the results look like

+ Consistent retrieval is essential!
+ inSilicoDb package
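As an illustration of the first combination strategy listed above, combining p-values, here is a minimal sketch of Fisher's method for one gene tested in several independent studies. The slides do not say which combination method was used in the thesis, so the choice of Fisher's method and the function name are assumptions made for illustration.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null hypothesis. For an even number of
    degrees of freedom the survival function has a closed form, so no
    external library is needed.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    half_x = -sum(math.log(p) for p in p_values)  # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i  # (X/2)^i / i!
        total += term
    return math.exp(-half_x) * total


# One gene, tested in three hypothetical studies.
print(fisher_combined_p([0.04, 0.10, 0.03]))
```

Fisher's method assumes the studies are independent; the other strategies on the slide (effect sizes, ranks, vote counting) trade off different assumptions.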

Meta-Analysis | Stable Genes

•365 studies were screened for stable genes

• Motivation:
  • Interested in reference genes
  • Currently used genes (housekeeping genes) are not ideal
  • Need a compact and diverse list of genes that are stable under most conditions

• In collaboration with Dr Bram de Craene (VIB-UGent)

(1) Retrieve Data
+ inSilicoDb package
+ All 365 datasets downloaded in less than 100 min

(2) Calculate Stability Scores
+ For each gene: Coefficient of Variation, CV = sd / mean
+ Avoid lowly expressed genes

(3) Combine Stability Scores
+ For each gene, take the median of the CVs
+ Rank and take the top 100

(4) Semantic Similarity Filtering
+ Exclude genes that are related
+ Uses gene annotation from GO
+ Innovative step!
+ From 100 to 10 genes
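Steps (2) and (3) can be sketched in a few lines. The data layout (one dictionary per study, mapping a gene to its expression values) and the `min_mean` cut-off used to skip lowly expressed genes are assumptions made for illustration; the slides do not state how low expression was defined. Step (4), the GO-based semantic similarity filter, is not covered here.

```python
from statistics import mean, median, stdev


def coefficient_of_variation(values):
    """CV = sd / mean, as defined in step (2)."""
    return stdev(values) / mean(values)


def rank_stable_genes(studies, min_mean, top_n=100):
    """Rank genes by the median of their per-study CVs (lowest CV first).

    studies:  list of dicts mapping gene -> list of expression values.
    min_mean: hypothetical expression cut-off below which a gene is skipped
              in that study.
    """
    cvs = {}
    for study in studies:
        for gene, values in study.items():
            if len(values) < 2 or mean(values) <= min_mean:
                continue
            cvs.setdefault(gene, []).append(coefficient_of_variation(values))
    scores = {gene: median(per_study) for gene, per_study in cvs.items()}
    return sorted(scores, key=scores.get)[:top_n]


# Toy data: two studies, three genes, three samples each.
toy_studies = [
    {"GENE_A": [8.0, 8.1, 7.9], "GENE_B": [6.0, 9.0, 12.0], "GENE_C": [2.1, 2.0, 2.2]},
    {"GENE_A": [8.2, 8.0, 8.1], "GENE_B": [5.0, 10.0, 7.5], "GENE_C": [2.0, 2.3, 2.1]},
]
print(rank_stable_genes(toy_studies, min_mean=4.0, top_n=2))
```

Taking the median across studies keeps a single atypical study from dominating a gene's stability score.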

• Status:
  • August 2012 | waiting for results…
  • September 2012 | first positive results!
  • November 2012 | second test case, positive feedback from NAR, manuscript in preparation…

Merging

+ Consistent retrieval is essential!
+ inSilicoDb package

+ Batch effects
+ Methods to remove:
  - Location-scale
  - Matrix factorization
  - Discretization
+ Makes data compatible
+ Preprocessing not sufficient

+ Same as with single studies
+ Increased sample size!
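To make the location-scale family concrete, the sketch below standardizes every gene within each study to mean 0 and standard deviation 1 before the studies are combined. It is a generic illustration of the idea, not the implementation of any specific method in the inSilicoMerging package; the function name and data layout are assumptions.

```python
from statistics import mean, pstdev


def standardize_per_batch(batches):
    """Centre and scale each gene within each batch (study).

    batches: dict mapping study name -> dict mapping gene -> list of values.
    Returns the same structure with per-batch, per-gene z-scores.
    """
    adjusted = {}
    for batch, genes in batches.items():
        adjusted[batch] = {}
        for gene, values in genes.items():
            mu = mean(values)
            sd = pstdev(values)
            if sd == 0:
                # A constant gene carries no information after centring.
                adjusted[batch][gene] = [0.0 for _ in values]
            else:
                adjusted[batch][gene] = [(v - mu) / sd for v in values]
    return adjusted


# Two hypothetical studies measuring the same gene with a systematic offset.
toy = {
    "study_1": {"GENE_A": [5.0, 6.0, 7.0]},
    "study_2": {"GENE_A": [9.0, 10.0, 11.0]},
}
print(standardize_per_batch(toy))
```

After adjustment the systematic offset between the two studies is gone, so their samples can be pooled into one larger data set. The price is that genuine biological differences between studies are removed along with the technical ones.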

Merging | Batch Effects

• Illustrative example of what batch effects can cause:
  • We merged 4 different studies with thyroid samples
  • All studies contained normal and tumor samples
  • In collaboration with Wilma Van Staveren (IRIBHM, ULB)

• Samples are plotted in MDS space
• We expect two clusters

Merging | Batch Effects

[Figure: two MDS plots of the merged samples, one for merging without batch effect removal and one for merging with batch effect removal]

Legend:
+ symbol for study
+ color for normal/tumor

inSilicoMerging package

• R/Bioconductor package combining:
  • 6 different merging methods
  • 5 visual inspection tools
  • 6 quantitative measures

• Only resource so far combining all this functionality!

•Seamlessly integrates with inSilicoDb package

Identification of DEGs in Lung Cancer

• Idea: compare meta-analysis and merging approaches for integrative analysis

• We chose lung cancer as the case study, based on the content of inSilico DB.

• Ignore subtypes: DEGs can be seen as playing a role in the basic mechanisms of lung cancer

•What is our hypothesis ?

• Due to the small sample sizes of individual studies, there are a lot of false negatives when using meta-analysis

•Can we avoid this by using merging as an alternative approach?

Identification of DEGs in Lung Cancer

Constraints:
+ fRMA preprocessed
+ > 30 samples
+ both normal and tumor
+ GPL96 or GPL570

Methodology:
+ apply limma
  - p-value < 0.05
  - FC > 2
+ robustness test
  - 100 iterations with 90% of data
  - resampling

+ Merging: inSilicoMerging package
+ Meta-Analysis: take intersection
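The robustness test can be sketched as repeated subsampling: call DEGs on a random 90% of the samples, repeat 100 times, and keep only the genes that are called every time. Two things here are assumptions rather than facts from the slides: the requirement that a gene be called in every iteration, and the toy fold-change-only caller, which stands in for limma and omits the p-value filter entirely.

```python
import random
from statistics import mean


def call_degs_by_fold_change(normal, tumor, min_log2_fc=1.0):
    """Toy stand-in for limma: on log2 data, FC > 2 means |difference| > 1."""
    return {g for g in normal if abs(mean(tumor[g]) - mean(normal[g])) > min_log2_fc}


def subsample(group, fraction, rng):
    """Keep the same random subset of samples for every gene in the group."""
    n = len(next(iter(group.values())))
    size = min(n, max(1, round(fraction * n)))
    keep = rng.sample(range(n), size)
    return {g: [values[i] for i in keep] for g, values in group.items()}


def robust_degs(normal, tumor, iterations=100, fraction=0.9, seed=0):
    """Genes called as DEG in every one of the subsampled iterations."""
    rng = random.Random(seed)
    robust = None
    for _ in range(iterations):
        degs = call_degs_by_fold_change(
            subsample(normal, fraction, rng), subsample(tumor, fraction, rng)
        )
        robust = degs if robust is None else robust & degs
    return robust


# Toy data: GENE_NOISY only passes the threshold because of one outlier sample.
normal = {"GENE_UP": [5.0] * 10, "GENE_FLAT": [7.0] * 10, "GENE_NOISY": [6.0] * 10}
tumor = {"GENE_UP": [8.0] * 10, "GENE_FLAT": [7.1] * 10, "GENE_NOISY": [6.0] * 9 + [18.0]}
print(sorted(call_degs_by_fold_change(normal, tumor)))
print(sorted(robust_degs(normal, tumor)))
```

In the toy data, a gene whose signal rests on a single outlier sample is called on the full data but drops out as soon as one subsample leaves that sample out.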

• Meta-Analysis:

• Merging:

• Findings:
  • Resampling helps to remove false positives
  • Relatively low impact of batch effect removal methods
  • More DEGs identified through merging (102) than via meta-analysis (25)

“Deriving separate statistics and then averaging is often less powerful than directly computing statistics from aggregated data.” (Xu et al, 2008)

No false positives?
+ checked literature
+ initial pathway analysis

Contributions

• Genomic pipelines / backbone (Ch 4)
• Release of 2 publicly available R/Bioconductor packages (Ch 4 & 7)
• Survey of batch effect removal methods (Ch 7)
• Two applications:
  • Identification of stable genes via meta-analysis (Ch 6)
  • Screening of potential biomarkers via integrative analysis (Ch 8)

Conclusions

• We identified two hurdles for large-scale microarray analysis:

① Consistent retrieval of individual datasets.

② Integration of multiple data sets for integrative analysis.

Conclusions

① Consistent retrieval of individual datasets → inSilicoDb package

② Integration of multiple data sets for integrative analysis → inSilicoMerging package

Paving the road towards unlocking the potential of public available gene expression studies

Thanks!

+ InSilico Team!
+ Jury!
+ Audience!
+ Yann-Michaël!
