Data Analysis in Metabolomics
Tim Ebbels, Imperial College London
Themes
• Overview of metabolomics data processing workflow
• Differences between metabolomics and transcriptomics data
• Approaches to improving reproducibility & data quality
• Key challenges and bottlenecks
Metabolomics
• The study of the complement of small molecules within biological systems
• Untargeted: no prior hypothesis of specific metabolites involved
[Figure labels: metabolic profile; cell, tissue, biofluid; hormones]
Metabolomics workflow
• Steps: biological question, experimental design, sampling, sample preparation, data acquisition, data pre-processing, data analysis, metabolite identification, biological interpretation
• Intermediate products: protocol, samples, raw data, data table, metabolites, relevant metabolites, connectivities, models
Transcriptomic vs. Metabolomic Data
Property | Transcriptomics (microarrays / sequencing) | Metabolomics
Number of genes / metabolites known? | Yes | No
Identity of genes / metabolites known? | Yes (sequence/locus) | No (a priori)
Coverage | Whole genome | Very low (few %) of metabolome
Number of platforms | Single | Multiple (in same experiment)
Standardisation – analytical technology | High | Constantly changing
Standardisation – data analysis | Relatively high | Low
Correlation between variables | Medium | Very high
LC‐MS Metabolomics Data
LC-MS Metabolic Profiles
~10,000s signals, 100-1000s (?) metabolites
LC-MS preprocessing
• Input: raw data
• Steps: peak detection, peak matching, retention time alignment, peak filling, peak integration
• Output: peak table
XCMS – Smith et al. Anal Chem 78, 779 (2006)
Quality Control Samples
• Representative biological sample, e.g. pool of study samples
• Repeated analysis throughout analytical run
Gika, H. G., Theodoridis, G. A., Wingate, J. E., and Wilson, I. D., J. Proteome Res. 6 (8), 3291 (2007).
[Figure: run order, showing study samples and pooled QC sample injections]
Quality Control and Data Filtering
• Repeatability filter: e.g. filter out all features with CV > 30% in QC samples
• Linearity filter: e.g. filter out all features with correlation to dilution < 0.8
• Normalisation: correct global intensity drift
• Drift correction: correct feature-specific drift within a batch
• Batch correction: correct drift across batches
Drift Correction
• Instrument response changes smoothly over the run
• Use QC samples to estimate changes
• Typically local regression (e.g. LOESS) with cross‐validation
• Requires frequent QC injections
Dunn, W. B. et al. Nat Protoc 6, 1060 (2011).
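The QC-based drift correction described above can be sketched as follows. This is a minimal illustration in plain Python, not the procedure from the cited protocol: it fits a simple local linear regression (tricube weights) to the QC intensities of one feature and divides every injection by the fitted trend. The function names, the fixed `span` of 0.75 and the choice to rescale to the QC median are assumptions made for the example; in practice the span would be chosen by cross-validation.

```python
from statistics import median


def tricube(u):
    """Tricube kernel: 1 at u = 0, falling smoothly to 0 at |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_linear_fit(x0, xs, ys, span):
    """Estimate y at x0 from a weighted straight-line fit to nearby points."""
    k = min(len(xs), max(2, round(span * len(xs))))  # neighbours used
    h = sorted(abs(x - x0) for x in xs)[k - 1]       # distance to k-th neighbour
    h = max(1.05 * h, 1e-9)                          # widen so it keeps some weight
    w = [tricube((x - x0) / h) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 1e-12 else 0.0
    return my + slope * (x0 - mx)


def correct_drift(run_order, intensity, is_qc, span=0.75):
    """Divide one feature by its QC-estimated trend, rescaled to the QC median."""
    qc_x = [x for x, q in zip(run_order, is_qc) if q]
    qc_y = [y for y, q in zip(intensity, is_qc) if q]
    target = median(qc_y)
    return [y * target / local_linear_fit(x, qc_x, qc_y, span)
            for x, y in zip(run_order, intensity)]


# Synthetic run (made-up numbers): a QC every 4th injection, 2% signal loss per injection.
run_order = list(range(21))
is_qc = [i % 4 == 0 for i in run_order]
intensity = [(100.0 if q else 50.0) * (1.0 - 0.02 * i)
             for i, q in zip(run_order, is_qc)]
corrected = correct_drift(run_order, intensity, is_qc)
```

The sketch treats a single feature within a single batch; applying it feature by feature gives the feature-specific correction, and it relies on QC injections bracketing the study samples so that the trend is interpolated rather than extrapolated.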
Filtering for Repeatability
• Remove features with low repeatability in QC samples (e.g. coefficient of variation, CV > 30%)
[Figure: scores plots (t[1] vs t[2]) for Lab C, positive ESI, at a 100% CV threshold and at a 10% CV threshold. COMET2 / Rob Whiffin]
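A repeatability filter of this kind can be sketched in a few lines. The example below is illustrative only: the function names and the dictionary layout (feature name mapped to its intensities in the repeated QC injections) are assumptions, and the 30% threshold is the example value quoted above.

```python
from statistics import mean, stdev


def cv_percent(values):
    """Coefficient of variation (sample standard deviation / mean) in percent."""
    m = mean(values)
    return 100.0 * stdev(values) / m if m > 0 else float("inf")


def repeatability_filter(qc_table, max_cv=30.0):
    """Return the names of features whose CV across QC injections is <= max_cv."""
    return [name for name, vals in qc_table.items()
            if cv_percent(vals) <= max_cv]


# Made-up QC intensities for two features.
qc_table = {
    "stable": [100.0, 102.0, 98.0, 101.0],
    "noisy": [100.0, 10.0, 250.0, 40.0],
}
kept = repeatability_filter(qc_table)
```

Lowering `max_cv` keeps fewer but more reproducible features, which is the trade-off the two threshold settings in the figure illustrate.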
Filtering for Linearity
• Some features will be
– Metabolites at concentrations outside the linear range of the instrument, or
– Contaminants, solvent artefacts etc.
• Use a dilution series to select features which respond linearly
[Figure axes: intensity, dilution factor, R2, CV (%)]
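As a rough sketch of the linearity filter, the code below computes the Pearson correlation between each feature's intensity and the relative concentration in a dilution series, and keeps features at or above a threshold. The names `pearson` and `linearity_filter`, the data layout and the numbers are hypothetical; the 0.8 cut-off is the example value used earlier in these slides.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def linearity_filter(concentration, table, min_r=0.8):
    """Keep features whose intensity correlates with relative concentration."""
    return [name for name, vals in table.items()
            if pearson(concentration, vals) >= min_r]


# Relative concentrations in a hypothetical dilution series of the pooled sample.
concentration = [0.125, 0.25, 0.5, 1.0]
table = {
    "linear": [12.0, 26.0, 49.0, 101.0],       # scales with concentration
    "contaminant": [50.0, 48.0, 52.0, 49.0],   # does not follow the dilution
}
kept = linearity_filter(concentration, table)
```

A feature that saturates the detector can still correlate strongly with concentration, so a stricter check (for example on the residuals of a straight-line fit) may be needed in practice.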
NMR Metabolomics Data
NMR Metabolic Profiles
~100s signals, 10-100s (?) metabolites
NMR Metabolic Profiles: Problems
• Problems:
– Assignment
  • Knowns
  • Unknowns
– Peak overlap
– Peak shift
Peak shifts
• Caused primarily by pH & ionic strength variations
• Some peaks more susceptible than others
• Peaks for same molecule generally do NOT shift
– In same direction
– By same amount
• Restrict pH variation using buffer
• Try to keep in physiological range (~7‐8)
• pH shift may be the effect you’re looking for!
[Figure: urine titration series, pH 12 to pH 2]
Binning
• Integrate spectral intensity in each region → one variable
• Benefits: reduces problems of
– Peak shift
– Large number of data points
• Drawbacks
– Bins not easily assigned – can be one or several compounds
– Statistical models not easily interpreted
[Figure: raw spectrum and binned spectrum]
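Binning can be illustrated with a short sketch. The function below sums the intensities of all data points falling in each fixed-width chemical shift region, which approximates the integral when the points are evenly spaced. The function name, the 0.04 ppm default width and the toy spectra are assumptions for the example, not values taken from the slides.

```python
from math import floor


def bin_spectrum(ppm, intensity, width=0.04):
    """Sum intensity within fixed-width ppm bins.

    Returns a sorted list of (bin lower edge in ppm, summed intensity).
    """
    bins = {}
    for x, y in zip(ppm, intensity):
        idx = floor(x / width + 1e-9)  # small offset guards against rounding at edges
        bins[idx] = bins.get(idx, 0.0) + y
    return [(round(idx * width, 6), total) for idx, total in sorted(bins.items())]


# Two toy spectra in which the same peak sits at slightly different positions.
ppm = [1.00, 1.01, 1.02, 1.03, 1.05, 1.06]
spectrum_a = [0.0, 9.0, 1.0, 0.0, 0.0, 0.0]   # peak at 1.01 ppm
spectrum_b = [0.0, 1.0, 9.0, 0.0, 0.0, 0.0]   # same peak shifted to 1.02 ppm
binned_a = bin_spectrum(ppm, spectrum_a)
binned_b = bin_spectrum(ppm, spectrum_b)
```

Both toy spectra give the same binned values because the shift stays inside one bin; a shift that crosses a bin boundary would still move intensity between variables, which is one reason bins are hard to assign.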
Full resolution spectra
• Benefits:
– Reduces difficulty of assignment (still manual)
• Drawbacks: does not overcome
– Overlap
– Shift
– Large number of data points
Full resolution + alignment
• Move peaks until positions in different spectra match
• Difficult task, usually requires manual validation
• Can produce artefacts
– Misassignment
– Artificial signal
– Warping of peak shape and/or area
[Figure: non-aligned data and RSPA corrected data; intensity (a.u.) by sample number over 2.6 to 3.1 ppm. Veselkov et al. Anal. Chem. 2009]
Peak fitting
• E.g. Chenomx NMR suite
• Manual process, requiring manual validation
[Figure: fitted peaks labelled succinate, glutamine, malate, glutamate]
Normalisation
• Transformation on each sample
– Removing unwanted variation
– Making samples more comparable
• What variation is unwanted? Examples:
– Changes in detector response
– Differences in urine volume/dilution
• Classically achieved by setting total signal to a constant (Σx = 1)
[Figure: raw data, and the same data after normalisation to constant sum]
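Constant sum normalisation is simple enough to show directly. In this minimal sketch each sample's intensities are divided by their total so that they sum to a chosen constant; the function name and the toy urine-like profiles are made up for illustration.

```python
def normalise_constant_sum(spectrum, total=1.0):
    """Scale one sample so that its intensities sum to `total`."""
    s = sum(spectrum)
    if s == 0:
        raise ValueError("cannot normalise a spectrum with zero total intensity")
    return [total * v / s for v in spectrum]


# The same profile measured at two dilutions (made-up numbers).
concentrated = [40.0, 100.0, 60.0]
dilute = [10.0, 25.0, 15.0]
norm_a = normalise_constant_sum(concentrated)
norm_b = normalise_constant_sum(dilute)
```

After normalisation the two profiles coincide, so the dilution difference has been removed. The method assumes the total signal is the same across samples, so a large change in a few intense peaks will distort all the other variables.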
Normalisation
• Account for gross sample to sample changes
• Global, e.g.
– Median fold change
– Total intensity
• Intensity dependent, e.g.
– LOESS
– Quantile
Veselkov, K. A. et al. Anal. Chem. 83, 5864 (2011).
[Figure: median fold change normalisation]
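One way to read median fold change normalisation is sketched below: each sample is divided by the median of its feature-wise fold changes relative to a reference profile. This is an illustration under stated assumptions rather than the method of the cited paper; in particular, the function name and the choice of the feature-wise median across samples as the default reference are assumptions.

```python
from statistics import median


def median_fold_change_normalise(samples, reference=None):
    """Divide each sample by its median fold change relative to a reference.

    samples: list of equal-length intensity lists, one per sample.
    """
    if reference is None:
        # Assumption: use the feature-wise median across samples as reference.
        reference = [median(col) for col in zip(*samples)]
    normalised = []
    for s in samples:
        folds = [v / r for v, r in zip(s, reference) if v > 0 and r > 0]
        factor = median(folds)
        normalised.append([v / factor for v in s])
    return normalised


# Made-up data: sample 2 is twice as concentrated and has one truly changed feature.
samples = [
    [10.0, 20.0, 30.0, 40.0, 50.0],
    [20.0, 40.0, 60.0, 80.0, 500.0],
    [5.0, 10.0, 15.0, 20.0, 25.0],
]
normalised = median_fold_change_normalise(samples)
```

Because the scaling factor is a median, the single large change in the second sample does not affect it, whereas a total intensity normalisation would shrink every other feature in that sample.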
Comparison of Normalisation Methods
• Simulated data
• 4 normalisations:
– Total area
– Median fold change
– Minimum entropy
– PCA scores
• Minimum entropy
– Difference (test‐ref) is constant for dilution variables → low entropy
• Other methods:
– Histogram
– Quantile
– Robust regression
Hector Keun / Jake Pearce
Summary
• Metabolomic data share many characteristics with other omics
– But fundamentally different: cannot copy data analysis pipeline
• Current bottlenecks/challenges:
– Metabolite identification
– Standardisation of
  • Sample collection
  • Analytical procedure
  • Data analysis