Data Analysis in Metabolomics - Ecetoc •Overview of metabolomics data processing workflow •Differences between metabolomics and transcriptomics data •Approaches to improving

Data Analysis in Metabolomics

Tim Ebbels, Imperial College London

Themes

• Overview of metabolomics data processing workflow

• Differences between metabolomics and transcriptomics data

• Approaches to improving reproducibility & data quality

• Key challenges and bottlenecks

Metabolomics

The study of the complement of small molecules within biological systems

[Figure: metabolic profile measured at the level of cell, tissue or biofluid; example metabolite class shown: hormones]

Untargeted: No prior hypothesis of specific metabolites involved

Metabolomics workflow

Biological question → Experimental design → Sampling → Sample preparation → Data acquisition → Data pre-processing → Data analysis → Biological interpretation

Products along the way: Protocol → Samples → Raw data → Data table → Relevant metabolites, connectivities, models

Metabolite identification: Metabolites

Transcriptomic vs. Metabolomic Data

                                          Transcriptomics               Metabolomics
                                          (microarrays / sequencing)
Number of genes / metabolites known?      Yes                           No
Identity of genes / metabolites known?    Yes (sequence/locus)          No (a priori)
Coverage                                  Whole genome                  Very low (few %) of metabolome
Number of platforms                       Single                        Multiple (in same experiment)
Standardisation – analytical technology   High                          Constantly changing
Standardisation – data analysis           Relatively high               Low
Correlation between variables             Medium                        Very high

LC‐MS Metabolomics Data

LC-MS Metabolic Profiles

~10,000s signals, 100-1000s (?) metabolites

LC-MS preprocessing

From raw data to peak table:

• Peak detection

• Peak matching

• Retention time alignment

• Peak filling

• Peak integration

XCMS – Smith et al. Anal Chem 78, 779 (2006)

Quality Control Samples

• Representative biological sample, e.g. pool of study samples

• Repeated analysis throughout analytical run

[Figure: run order, with pooled QC sample injections interleaved among the study samples]

Gika, H. G., Theodoridis, G. A., Wingate, J. E., and Wilson, I. D., J. Proteome Res. 6 (8), 3291 (2007).
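One way to picture "repeated analysis throughout the analytical run" is as an injection list with a pooled QC inserted at regular intervals. The sketch below is illustrative only: the function name, the interval of five samples and the single lead-in QC are assumptions, not values from the slides.

```python
def build_run_order(study_samples, qc_every=5, lead_in_qcs=1):
    """Interleave pooled-QC injections among study samples.

    qc_every and lead_in_qcs are hypothetical defaults chosen for illustration.
    """
    order = ["QC"] * lead_in_qcs
    for i, sample in enumerate(study_samples, start=1):
        order.append(sample)
        if i % qc_every == 0:
            order.append("QC")
    # Close the run with a QC so the end of the run is also monitored
    if order[-1] != "QC":
        order.append("QC")
    return order


run = build_run_order(["S1", "S2", "S3"], qc_every=2)
print(run)
```

The QC positions in such a list are what later steps (repeatability filtering, drift correction) rely on.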

Quality Control and Data Filtering

Repeatability filter

• E.g. filter out all features with CV > 30% in QC samples

Linearity filter

• E.g. filter out all features with correlation to dilution < 0.8

Normalisation

• Correct global intensity drift

Drift correction

• Correct feature-specific drift within a batch

Batch correction

• Correct drift across batches


Drift Correction

• Instrument response changes smoothly over the run

• Use QC samples to estimate changes

• Typically local regression (e.g. LOESS) with cross‐validation

• Requires frequent QC injections

Dunn, W. B. et al. Nat Protoc 6, 1060 (2011).
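A minimal sketch of QC-based drift correction for a single feature, under simplifying assumptions: it uses one pass of tricube-weighted local linear regression with a fixed span, whereas the approach described above would choose the smoothing by cross-validation. All function names and the example numbers are hypothetical.

```python
from statistics import median


def tricube(u):
    """Tricube kernel weight; zero at and beyond the bandwidth."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_linear(x, y, x0, span=0.75):
    """LOESS-style local linear estimate at x0 (single pass, fixed span).

    Assumes at least three QC injections at distinct run-order positions.
    """
    n = len(x)
    if n < 3:
        raise ValueError("need at least three QC injections")
    k = min(n, max(3, round(span * n)))
    h = sorted(abs(xi - x0) for xi in x)[k - 1]  # distance to k-th nearest QC
    w = [tricube((xi - x0) / h) for xi in x]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


def correct_drift(run_order, intensity, is_qc, span=0.75):
    """Divide each injection by the fitted QC trend, rescaled to the QC median."""
    qc_x = [o for o, q in zip(run_order, is_qc) if q]
    qc_y = [v for v, q in zip(intensity, is_qc) if q]
    ref = median(qc_y)
    return [v * ref / local_linear(qc_x, qc_y, o, span)
            for o, v in zip(run_order, intensity)]


# Toy run: a QC every 5th injection, response rising linearly over the run;
# study samples contain twice the QC concentration of this feature.
run_order = list(range(1, 22))
is_qc = [(o - 1) % 5 == 0 for o in run_order]
intensity = [(100.0 + 2.0 * o) * (1.0 if q else 2.0)
             for o, q in zip(run_order, is_qc)]
corrected = correct_drift(run_order, intensity, is_qc)
```

After correction the QC injections sit at their median and the study samples at twice that value, which is the behaviour one would hope for when the drift is smooth. With few QC injections the local fit has little to work with, hence the need for frequent QCs.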

Filtering for Repeatability

• Remove features with low repeatability in QC samples (e.g. coefficient of variation, CV > 30%)
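The repeatability filter can be sketched as follows, assuming a table mapping each feature to its intensities in the repeated QC injections. The feature names and values are made up; the 30% threshold is the example value quoted above.

```python
from statistics import mean, stdev


def qc_cv_percent(qc_intensities):
    """Coefficient of variation (%) of one feature across QC injections."""
    m = mean(qc_intensities)
    if m == 0:
        return float("inf")  # treat an all-zero feature as unrepeatable
    return 100.0 * stdev(qc_intensities) / m


def repeatability_filter(qc_table, max_cv=30.0):
    """Return the features whose QC CV is at or below max_cv."""
    return [name for name, values in qc_table.items()
            if qc_cv_percent(values) <= max_cv]


qc_table = {
    "feature_A": [100.0, 102.0, 98.0, 101.0],  # stable across QCs
    "feature_B": [100.0, 10.0, 250.0, 40.0],   # poorly repeatable
}
kept = repeatability_filter(qc_table)
print(kept)
```

Only the stable feature survives; the noisy one has a CV well above the threshold and is dropped.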

[Figure: scores plots (t[1] vs. t[2]) for Lab C, positive ESI, at 100% CV threshold and at 10% CV threshold. Credit: COMET2 / Rob Whiffin]

Filtering for Linearity

• Some signals will come from
– Metabolite concentrations outside the linear range of the instrument, or
– Contaminants, solvent artefacts etc.

• Use a dilution series to select features which respond linearly

[Figure: intensity vs. dilution factor; R² vs. CV (%)]
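A linearity filter over a dilution series might look like the sketch below. It uses the Pearson correlation between intensity and relative concentration with a cut-off of 0.8, the example threshold given earlier in the deck; the feature names and intensities are invented for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def linearity_filter(dilution, table, min_r=0.8):
    """Keep features whose intensity correlates with dilution at or above min_r."""
    return [name for name, values in table.items()
            if pearson_r(dilution, values) >= min_r]


dilution = [0.125, 0.25, 0.5, 1.0]  # relative concentration of the pooled sample
series = {
    "linear":    [12.0, 26.0, 49.0, 101.0],  # tracks the dilution
    "saturated": [90.0, 95.0, 96.0, 96.0],   # above the linear range
    "artefact":  [50.0, 48.0, 52.0, 49.0],   # unrelated to sample amount
}
kept = linearity_filter(dilution, series)
print(kept)
```

Only the feature that scales with the dilution is retained; the saturated and artefact signals fall below the cut-off.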

NMR Metabolomics Data

NMR Metabolic Profiles

~100s signals, 10-100s (?) metabolites

NMR Metabolic Profiles: Problems

• Problems:
– Assignment (knowns, unknowns)
– Peak overlap
– Peak shift

Peak shifts

• Caused primarily by pH & ionic strength variations

• Some peaks more susceptible than others

• Peaks for same molecule generally do NOT shift
– In same direction
– By same amount

• Restrict pH variation using buffer

• Try to keep in physiological range (~7‐8)

• pH shift may be the effect you’re looking for!

[Figure: urine titration series, pH 2 to pH 12]

Binning

• Integrate spectral intensity in each region → one variable

• Benefits: reduces problems of
– Peak shift
– Large number of data points

• Drawbacks
– Bins not easily assigned: can be one or several compounds
– Statistical models not easily interpreted

[Figure: raw spectrum vs. binned spectrum]
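Binning can be sketched as summing the intensities that fall in each fixed-width chemical shift region. The 0.04 ppm width and 0 to 10 ppm range below are assumed defaults for illustration, and summation stands in for integration on a uniformly sampled spectrum.

```python
def bin_spectrum(ppm, intensity, width=0.04, lo=0.0, hi=10.0):
    """Sum intensity in consecutive ppm regions of equal width.

    Returns one value per bin, ordered from lo to hi.
    width, lo and hi are illustrative defaults, not values from the slides.
    """
    n_bins = int(round((hi - lo) / width))
    bins = [0.0] * n_bins
    for p, y in zip(ppm, intensity):
        if lo <= p < hi:
            idx = min(int((p - lo) / width), n_bins - 1)
            bins[idx] += y
    return bins


# The same peak, shifted slightly between two samples, lands in the same bin
sample_1 = bin_spectrum([2.41], [5.0])
sample_2 = bin_spectrum([2.43], [5.0])
print(sample_1 == sample_2)
```

This shows why binning reduces the peak shift problem, and also why a bin cannot easily be assigned: any compound with signal in that region contributes to the same variable.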

Full resolution spectra

• Benefits:
– Reduces difficulty of assignment (still manual)

• Drawbacks: does not overcome
– Overlap
– Shift
– Large number of data points

Full resolution + alignment

• Move peaks until positions in different spectra match

• Difficult task, usually requires manual validation

• Can produce artefacts
– Misassignment
– Artificial signal
– Warping of peak shape and/or area

[Figure: non-aligned data vs. RSPA corrected data; intensity (a.u.) by sample number against chemical shift (ppm), region approx. 2.6 to 3.1 ppm]

Veselkov et al. Anal. Chem. 2009

Peak fitting

• E.g. Chenomx NMR suite

• Manual process, requiring manual validation

[Figure: fitted peaks labelled succinate, glutamine, malate, glutamate]

Normalisation

• Transformation on each sample
– Removing unwanted variation
– Making samples more comparable

• What variation is unwanted? Examples:
– Changes in detector response
– Differences in urine volume/dilution

• Classically achieved by setting total signal to a constant (Σx = 1)
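Constant sum normalisation is simple enough to sketch in a few lines: each sample is divided by its total signal. The example values are made up to show a two-fold dilution of the same profile.

```python
def normalise_constant_sum(sample, total=1.0):
    """Scale one sample so its intensities sum to `total`."""
    s = sum(sample)
    if s == 0:
        raise ValueError("cannot normalise a sample with zero total signal")
    return [total * x / s for x in sample]


concentrated = [2.0, 6.0, 12.0]
diluted = [1.0, 3.0, 6.0]  # same profile at half the concentration
print(normalise_constant_sum(concentrated))
print(normalise_constant_sum(diluted))
```

Both samples map to the same normalised profile, so the dilution difference is removed. The method assumes the total signal is itself uninformative, which fails when a few large peaks change between samples.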

[Figure: raw data vs. data after normalisation to constant sum]

Normalisation

• Account for gross sample to sample changes

• Global, e.g.
– Median fold change
– Total intensity

• Intensity dependent, e.g.
– LOESS
– Quantile

Veselkov, K. A. et al. Anal. Chem. 83, 5864 (2011).

Median fold change normalise
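A minimal sketch of median fold change normalisation, assuming the reference profile is the feature-wise median across all samples (other reference choices are possible). Each sample is divided by the median of its fold changes relative to that reference. The data are invented.

```python
from statistics import median


def median_fold_change_normalise(samples):
    """Normalise each sample by its median fold change to a median reference."""
    n_features = len(samples[0])
    # Assumed reference: feature-wise median over all samples
    reference = [median(s[j] for s in samples) for j in range(n_features)]
    normalised = []
    for s in samples:
        folds = [s[j] / reference[j] for j in range(n_features)
                 if reference[j] > 0]
        factor = median(folds)
        normalised.append([x / factor for x in s])
    return normalised


samples = [
    [10.0, 20.0, 30.0, 40.0, 50.0],    # baseline
    [200.0, 40.0, 60.0, 80.0, 100.0],  # 2x concentrated, first feature truly raised
    [5.0, 10.0, 15.0, 20.0, 25.0],     # 2x diluted
]
result = median_fold_change_normalise(samples)
print(result[1])
```

In the second sample the estimated factor is 2 despite the large change in the first feature, so the genuine ten-fold increase is preserved after normalisation. Dividing by total intensity would instead spread that change over all features.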

Comparison of Normalisation Methods

• Simulated data

• 4 normalisations:
– Total area
– Median fold change
– Minimum entropy
– PCA scores

• Minimum entropy
– Difference (test‐ref) is constant for dilution variables → low entropy

• Other methods:
– Histogram
– Quantile
– Robust regression

Hector Keun / Jake Pearce

Summary

• Metabolomic data share many characteristics with other omics
– But fundamentally different: cannot copy data analysis pipeline

• Current bottlenecks/challenges:
– Metabolite identification
– Standardisation of sample collection, analytical procedure and data analysis
