UvA-DARE is a service provided by the library of the University of Amsterdam (https://dare.uva.nl) UvA-DARE (Digital Academic Repository) Fusing prior knowledge with microbial metabolomics Verouden, M.P.H. Publication date 2012 Document Version Final published version Link to publication Citation for published version (APA): Verouden, M. P. H. (2012). Fusing prior knowledge with microbial metabolomics. General rights It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons). Disclaimer/Complaints regulations If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library: https://uba.uva.nl/en/contact, or a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible. Download date:22 May 2021

Fusing prior knowledge with microbial metabolomics · Maikel P. H. Verouden


Maikel P.H. Verouden

Fusing prior knowledge with microbial metabolomics

ISBN 978-94-6182-148-5


Fusing prior knowledge

with microbial metabolomics

Maikel P.H. Verouden


Fusing prior knowledge with microbial metabolomics
Maikel P.H. Verouden
PhD thesis with summary in Dutch
ISBN-10 94-6182-148-4
ISBN-13 978-94-6182-148-5

All rights reserved. No part of this publication may be reproduced in any form without written permission from the copyright owner.
Chapters 1, 5 and 6 Copyright © 2011–2012 Maikel Verouden.
Chapters 2 and 3 Copyright © 2007–2009 John Wiley & Sons, Ltd.
Chapter 4 Copyright © 2009 Elsevier B.V.

Typeset in LaTeX 2ε with the memoir class using TeX Live 2011 and Texmaker 3.3.4

Published and printed by Off Page, Amsterdam, the Netherlands.

Cover design by Maikel Verouden, inspired by metabolic concentration profiles.


Fusing prior knowledge

with microbial metabolomics

ACADEMIC DISSERTATION

to obtain the degree of doctor at the Universiteit van Amsterdam, on the authority of the Rector Magnificus, prof. dr. D. C. van den Boom, before a committee appointed by the Doctorate Board, to be defended in public in the Agnietenkapel on Thursday 11 October 2012, at 10:00

by

Maikel Paul Hendrik Verouden

born in Roermond


Doctoral committee

Supervisor (promotor):

• prof. dr. A. K. Smilde

Co-supervisor (copromotor):

• dr. J. A. Westerhuis

Other members:

• prof. dr. C. G. de Koster

• prof. dr. M. J. Teixeira de Mattos

• prof. dr. P. J. Punt

• prof. dr. ir. M. J. T. Reinders

• prof. dr. B. Teusink

• dr. R. H. Jellema

Faculty of Science (Faculteit der Natuurwetenschappen, Wiskunde en Informatica)

The research reported in this thesis was carried out at the Swammerdam Institute for Life Sciences, Faculty of Science, University of Amsterdam (Science Park 904, 1098 XH Amsterdam, The Netherlands) and was part of the BioRange programme of the Netherlands Bioinformatics Centre (NBIC), which is supported by a BSIK grant through the Netherlands Genomics Initiative (NGI).

The publication of this thesis was supported financially by the Netherlands Bioinformatics Centre.


Dedicated to all who have shaped me into the person I am today, in particular my parents and sister


Contents

1 General introduction
  1.1 Metabolism
    1.1.1 Metabolism: fluxes and concentrations
    1.1.2 Metabolomics
  1.2 Fermentation
    1.2.1 Microbial growth during batch fermentation
    1.2.2 Sampling and quenching
  1.3 Metabolome analysis
    1.3.1 Analysis techniques
    1.3.2 Post measurement processing
  1.4 Analysis of metabolomics data
    1.4.1 Methods, data and their challenges
    1.4.2 Fusing Prior knowledge
  1.5 Scope and outline of the thesis

2 Multi-way analysis of flux distributions across multiple conditions
  Abstract
  2.1 Introduction
  2.2 Materials
    2.2.1 Reconstruction metabolic network
    2.2.2 Constraint-based modeling to infer flux states of metabolism
  2.3 Methods
    2.3.1 Blocked reactions and futile cycles
    2.3.2 Correlation matrices
    2.3.3 Mean centering
    2.3.4 PARAFAC
    2.3.5 PCA
  2.4 Results
    2.4.1 Blocked reactions and futile cycles
    2.4.2 Correlation matrices
    2.4.3 Invariant correlations
    2.4.4 PARAFAC
    2.4.5 PCA
  2.5 Discussion and conclusion

3 Maximum likelihood scaling (MALS)
  Abstract
  3.1 Introduction
  3.2 Methods
    3.2.1 Autoscaling (AS)
    3.2.2 Maximum likelihood scaling (MALS)
    3.2.3 Maximum likelihood principal component analysis (MLPCA)
  3.3 Test cases
    3.3.1 Artificial data
    3.3.2 Homoscedastic case
    3.3.3 Heteroscedastic case
    3.3.4 Real data
    3.3.5 Determination of the weight matrix W
    3.3.6 Determination of the number of principal components
  3.4 Discussion
  Acknowledgements

4 Exploring the analysis of structured metabolomics data
  Abstract
  4.1 Introduction
  4.2 Methods and Materials
    4.2.1 Principal Component Analysis (PCA)
    4.2.2 ANOVA-Simultaneous Component Analysis (ASCA)
    4.2.3 Differences between PCA and ASCA
    4.2.4 Simulated data
    4.2.5 Measures for the ability of modeling the induced variation and non-induced variation and measurement error
    4.2.6 Escherichia coli batch fermentation metabolomics data
  4.3 Results and Discussion
    4.3.1 Simulated data
    4.3.2 PCA/ASCA on E. coli batch fermentation metabolomics data
  4.4 Conclusions

5 Weighted Smooth Principal Component Analysis: validation and application to missing value estimation
  Abstract
  5.1 Introduction
  5.2 Methods
    5.2.1 Weighted Smooth Principal Component Analysis
    5.2.2 Cross-validation procedure
    5.2.3 Missing value estimation
  5.3 Materials
    5.3.1 Simulated data
    5.3.2 E. coli batch fermentation metabolomics data
  5.4 Results and discussion
    5.4.1 WSPCA method
    5.4.2 Leave elements out cross-validation
    5.4.3 E. coli metabolomics data from a batch fermentation
  5.5 Conclusion
  Acknowledgement

6 Conclusion and Outlook
  6.1 Conclusion
  6.2 Outlook

Bibliography

Summary

Samenvatting (summary in Dutch)

Dankwoord (acknowledgements)

Chapter 1

General introduction†

†Copyright © 2012 Maikel P.H. Verouden.


1.1 Metabolism

Metabolism [1] comprises the whole set of biochemical reactions that occur in the cells of living organisms to sustain life. These reactions allow for uptake of nutrients from the environment and conversion into energy and building blocks necessary for organisms to grow, reproduce, maintain their structures, and respond to their environments. Metabolism can be divided into two categories. Catabolism is the set of reactions that break down large molecules such as polysaccharides, lipids, nucleic acids and proteins into smaller units such as monosaccharides, fatty acids, nucleotides, and amino acids, respectively, and release energy. Its opposite, anabolism, comprises the set of reactions that construct molecules from smaller units and require energy. Substrates, intermediates and end-products of metabolism are typically low molecular weight organic compounds, known as metabolites. The step-by-step modification of metabolites in living cells occurs through series of biochemical reactions (denoted as metabolic pathways), most of which are catalyzed by proteins (enzymes) that are encoded on the genome (an organism's hereditary information). Because intermediates and end-products of one biochemical reaction can be the substrate of another, metabolic pathways form extensive networks, which are referred to as metabolic networks (see Figure 1.1). Metabolites form, so to speak, the links that connect pathways in the metabolic network.

1.1.1 Metabolism: fluxes and concentrations

There are many fields of research that study metabolism and use information about metabolic networks and their pathways. For example, in cellular systems biology the behaviour of metabolism in the context of cell growth is studied in terms of metabolic fluxes [2, 3]. Another example is metabolic engineering, which is defined as the directed improvement of cellular properties through the modification of specific biochemical reactions or the introduction of new ones with the use of recombinant DNA technology. To achieve this goal, metabolic engineering does not focus on individual enzymatic reactions but on the interactions of biochemical reactions and metabolic pathways in a metabolic network, with emphasis on metabolic fluxes and their control under in vivo conditions [4].

The metabolic flux can be defined as the rate at which material is processed through a metabolic pathway [5]. The flux is a fundamental determinant of cell physiology and a critical parameter of a metabolic pathway. Along with intracellular metabolite concentrations, fluxes define the information used for capturing metabolism and cell physiology in a certain environmental condition. Fluxes also determine the degree of engagement of various enzymes in a conversion process. However, engagement does not state anything about activity, for enzymes may be present and active yet carry very little flux. The determined pathway flux defines the extent to which pathway enzymes participate in a conversion process.
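The relation between fluxes and concentrations can be illustrated with a mass balance. The sketch below is not taken from the thesis: it assumes a hypothetical three-reaction toy pathway, and it assumes the commonly used balance in which the rate of change of the internal metabolite concentrations equals the stoichiometric matrix times the flux vector. The names `S` and `concentration_change` are illustrative only.

```python
import numpy as np

# Hypothetical toy pathway (an assumption for illustration): A_ext -> A -> B -> B_ext
# Rows: internal metabolites A and B.
# Columns: reactions v1 (uptake of A), v2 (A -> B), v3 (secretion of B).
S = np.array([
    [1.0, -1.0,  0.0],   # A is produced by v1 and consumed by v2
    [0.0,  1.0, -1.0],   # B is produced by v2 and consumed by v3
])

def concentration_change(S, v):
    """Rate of change of the internal metabolite concentrations, dc/dt = S @ v."""
    return S @ np.asarray(v, dtype=float)

# Equal flux through every step: concentrations stay constant (steady state).
balanced = concentration_change(S, [2.0, 2.0, 2.0])

# Uptake exceeds conversion: metabolite A accumulates.
unbalanced = concentration_change(S, [2.0, 1.5, 1.5])
```

In this toy case a single flux value describes the whole pathway at steady state, which is one way to read the statement that the pathway flux defines the extent to which its enzymes participate in the conversion.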


Figure 1.1: Pathways in a metabolic network. Source: Kyoto Encyclopedia of Genes and Genomes

1.1.2 Metabolomics

Metabolomics [6], as one of the most recent members in functional genomics [7], has become an accepted and valuable tool in life sciences for studying metabolism. It deals with the identification and the qualitative and quantitative (in terms of concentrations) measurement of all metabolites in a sample in a specified biological state. The total set of metabolites has been termed the metabolome [8] and is a reflection of the phenotype of the system under the studied conditions [9]. Metabolomics studies are set up around a biological question with the main goal to gain new insights into biochemical processes and pathways, and to relate these to characteristic features at the other functional genomics levels (genomics, transcriptomics and proteomics) and at the physiological and phenotypic level. An important step in the pipeline or workflow [10, 11] of a metabolomics study is to set up experiments and generate data. The center of attention in this thesis is microbial metabolomics, that is, metabolomics data that originates from micro-organisms.

1.2 Fermentation

For microbial metabolomics studies, fermentations form the experiments from which samples are taken to acquire the data necessary for the study. To ensure that enough variation relevant to the biological question for which the study was set up is induced and captured, fermentations are often performed according to an experimental design. Examples of fermentations according to a full factorial design are given in [12, 13]. Fermentation is a very ambiguous term, as over time it has acquired different meanings for biochemists, microbiologists and industrial microbiologists. Industrial microbiologists use the term fermentation to describe any process for making products that are useful for humans through the intentional use of mass cultures of micro-organisms [14], such as bacteria and fungi. Fermented products find their application as food, as well as in general industry. Food products such as bread, beer, wine, cheese, curds and yoghurt come to mind; these products have been produced this way for thousands of years, long before mankind had any knowledge of the micro-organisms involved. Industry focuses on products that have economic value, such as pharmaceutical and medical compounds (e.g. antibiotics, hormones, steroids), solvents, organic acids, chemical feedstocks, amino acids, and enzymes. There are five major groups of commercially important fermentations: (i) those that produce microbial cells (or biomass) as the product (e.g. yeast), (ii) those that produce microbial enzymes, (iii) those that produce microbial metabolites, (iv) those that produce recombinant products, and (v) those that modify a compound which is added to the fermentation (the transformation process).

Fermentations are performed in bioreactors [15], as shown in Figure 1.2, ranging in size from laboratory grade to industrial. Bioreactors can be operated in different modes: (i) batch, where the fermentation proceeds without addition of fresh growth medium; (ii) fed-batch, where nutrients (in the form of growth medium) are added incrementally at various times during the fermentation and no growth medium is removed until the end of the process; (iii) continuous, where fresh growth medium is added continuously during fermentation, with concomitant removal of an equal volume of spent medium containing suspended micro-organisms. An example of a continuous bioreactor is the chemostat [16, 17], in which the growth rate of the micro-organism can be easily controlled by changing the rate at which medium is added to the bioreactor. Batch operation is the most common laboratory growth method in which bacterial growth is studied and also a commonly used mode for microbial metabolomics studies, but it is only one of many.

Figure 1.2: General design of a bioreactor

Figure 1.3: Phases of bacterial growth. Source: Medical Illustrations by Michał Komorniczak (PL)
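The control of growth rate in a chemostat, mentioned above, can be made concrete with a small calculation. The sketch below assumes the standard steady-state result that the specific growth rate equals the dilution rate (medium flow divided by culture volume), which holds as long as the dilution rate stays below the maximum growth rate of the organism. The function names and the numbers are illustrative assumptions, not values from the thesis.

```python
import math

def dilution_rate(flow_l_per_h, volume_l):
    """Dilution rate D (1/h): medium flow rate divided by culture volume."""
    if volume_l <= 0:
        raise ValueError("culture volume must be positive")
    return flow_l_per_h / volume_l

def doubling_time(mu_per_h):
    """Doubling time (h) of a culture growing at specific growth rate mu (1/h)."""
    return math.log(2) / mu_per_h

# Assumed example: 0.2 L/h of fresh medium into a 2 L culture.
# At steady state the specific growth rate equals the dilution rate.
D = dilution_rate(flow_l_per_h=0.2, volume_l=2.0)
t_d = doubling_time(D)
```

Doubling the medium flow in this example doubles the steady-state growth rate and halves the doubling time, which is the sense in which the growth rate is set by the feed rate.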

1.2.1 Microbial growth during batch fermentation

The growth of a microbial culture [18] during a batch fermentation can be divided into a number of stages, as depicted in Figure 1.3. First, a particular micro-organism needs to be introduced into the selected growth medium; the medium is said to be inoculated with that micro-organism. Growth of the inoculum does not appear to occur immediately: there is a period of adaptation, referred to as the lag phase. Following the lag phase, the rate of growth of the micro-organism steadily increases until the cells grow at a constant, maximum rate. This period is called the log, or exponential, phase. The rate of growth slows down after a certain time in the exponential phase, due to the continuously falling concentrations of nutrients and/or the continuously increasing (accumulating) concentrations of toxic substances excreted by the micro-organism into the medium. Eventually, growth ceases and the microbes enter the so-called stationary phase. Within the stationary phase the biomass remains constant, except when certain accumulated chemicals in the culture lyse the cells (chemolysis). After a further period of time the viable cell number declines as the culture enters the death phase.

In addition to the kinetic description of growth given in the previous paragraph, the behaviour of a culture may also be described according to the products it produces during the various stages of the growth curve. During the log phase of growth the products produced are essential to the growth, development and reproduction of the cells of the micro-organism and include amino acids, nucleotides, proteins, nucleic acids, lipids, carbohydrates, etc. These products are referred to as the primary metabolites [19]; many of them are of considerable economic importance and are produced industrially by means of fermentation. During the stationary phase some microbial cultures synthesize compounds that are not essential for growth or survival of the producing organism; however, this is not true for all types of microbes. These compounds are referred to as the secondary compounds of metabolism, also called secondary metabolites [20]. The production of these metabolites is tightly regulated and dependent on the immediate environment. Although secondary metabolites are not directly related to growth, development and reproduction, these compounds do find their origin in intermediates of primary metabolism. The group of secondary metabolites includes both simple molecules, such as alcohols, sugars and organic acids, and complex compounds, such as polyketides, flavonoids, terpenes and non-ribosomal peptide compounds. They show a huge functional diversity, including functional classes such as antibiotics, pigments (photoprotection), hormones/pheromones, cytostatics, systemic toxins (phytotoxins, fungicides, insecticides and immunosuppressives) and many others [21], and are, like primary metabolites, also of great interest for industry.
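The kinetic description of the growth phases at the start of this section can be sketched as a piecewise biomass curve. This is a deliberately simplified illustration, not a model used in the thesis: it ignores the gradual acceleration after the lag phase and the gradual slowdown before the stationary phase, and every parameter value (`x0`, `lag`, `mu_max`, `t_stat`, `t_death`, `k_d`) is an assumption chosen only to produce a curve of the familiar shape.

```python
import numpy as np

def batch_growth(t, x0=0.01, lag=2.0, mu_max=0.7, t_stat=10.0, t_death=16.0, k_d=0.15):
    """Piecewise biomass curve with lag, exponential, stationary and death phases.

    t is time in hours; x0 is the inoculum biomass; mu_max and k_d are the
    assumed specific growth and death rates (1/h).
    """
    t = np.asarray(t, dtype=float)
    # Biomass reached at the end of the exponential phase.
    x_max = x0 * np.exp(mu_max * (t_stat - lag))
    x = x_max * np.exp(-k_d * (t - t_death))                      # death phase
    x = np.where(t < t_death, x_max, x)                           # stationary phase
    x = np.where(t < t_stat, x0 * np.exp(mu_max * (t - lag)), x)  # exponential phase
    x = np.where(t < lag, x0, x)                                  # lag phase
    return x

t = np.array([0.0, 2.0, 6.0, 12.0, 20.0])   # one time point per region of the curve
biomass = batch_growth(t)
```

Plotting the logarithm of `biomass` against `t` over a finer time grid gives a flat segment, a straight rising segment, a plateau and a straight falling segment, corresponding to the four phases described above.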


1.2.2 Sampling and quenching

To obtain microbial metabolomics data, samples have to be taken during a fermentation from both the micro-organisms and the supernatant (the medium in which the microbes are suspended). These samples have to be taken in such a way that the micro-organisms do not get damaged, for this would cause the intracellular metabolites to leak into the supernatant. For some studies samples are taken at different points in time during the fermentation, which leads to data with a time-resolved (dynamic) aspect. Another important issue is that, after taking a sample, the metabolism inside the micro-organism must be stopped instantaneously (quenching of the sample) in order to give a good snapshot of the metabolic state at the time of sampling. During quenching, leakage of intracellular metabolites into the quenching solution should also be prevented; for some quenching methods leakage has been observed, although it has rarely been quantified [22].

1.3 Metabolome analysis

Once fermentation samples have been processed (extraction, addition of standards, etc.) to a state ready for measurement of the metabolome, a metabolite analysis strategy has to be chosen. Different strategies [9, 23–25] exist depending on coverage and quantitation (compound identity, sensitivity, accuracy):

1. Metabolite target analysis: Quantitative analysis of one or a few metabolites of interest, ignoring all the non-target peaks present in the sample.

2. Metabolite/metabolic profiling: Analysis for identification and approximate quantification of a group of metabolites related by similar physical and chemical properties or associated with specific metabolic pathways.

3. Metabolic fingerprinting: Global, rapid and high-throughput analysis of samples without identification and quantification of metabolites, for the purpose of classification or screening.

Going from targeted analysis to fingerprinting the data quality decreases, whereas the number of metabolites considered greatly increases [26]. Metabolomics, as defined in section 1.1.2, can be placed as an analysis strategy one step further than metabolic profiling [9, 27]: instead of aiming to obtain an inventory and approximate quantification of a group of metabolites present in a sample, it aims at identifying and quantifying the full metabolome. No single analytical technique, or combination of techniques, can currently determine the full metabolome in samples, for the simple reason that the metabolome comprises too many different compound classes for one technique to handle and not all metabolites are known to date. Because of this complexity of the metabolome it has been suggested that the term metabolomics should not be used as an analytical subsection but should rather be taken as a scientific keyword [28, 29]. Metabolomics therefore usually consists of a combination of the aforementioned analysis strategies. Identification of metabolites is sometimes performed after data analysis [11], which can pinpoint the most relevant compounds (biomarkers).


Apart from analysis strategies focussing on intra-cellular metabolites, there is also one called metabolic footprinting [30]. It is the high-throughput classification by global measurement of metabolites secreted from the intra-cellular volume into the extra-cellular spent growth medium, referred to as the exo-metabolome [31]. Metabolites in the intra-cellular volume are in this context referred to as the endo-metabolome. Metabolic footprinting represents a niche within metabolomics, because of its focus on the analysis of the exo-metabolome. Although metabolic footprinting covers only a fraction of the entire metabolome, it provides important information for functional genomics and strain characterization [32]. Information about the exo-metabolome could provide a key to understanding cell communication mechanisms and aid metabolic engineering of industrial biotechnological processes. Due to the strong and intertwined relationship between intracellular metabolism and the metabolic footprint, metabolic footprinting can provide valuable information about the intracellular metabolic status and assist in further interpretation of metabolic networks and metabolic fluxes.

1.3.1 Analysis techniques

As mentioned in the previous section, the metabolome consists of many different classes of compounds, and multiple analysis strategies are employed with the ultimate goal of analysing a large fraction or all of the metabolites. Therefore a range of analytical techniques, rather than a single one, needs to be employed to analyse the metabolome [33, 34].

Two analytical techniques commonly associated with metabolome analyses using the fingerprinting strategy are Nuclear Magnetic Resonance (NMR) [35–37] and Mass Spectrometry (MS) [28, 38, 39]. Both are capable of handling a wide range of metabolites in a single measurement without pre-selection of specific analytes. These technologies allow both the identification and quantification of metabolites, meaning they are also applicable with other strategies. The main advantage of NMR is that it does not require physical or chemical treatment and is non-destructive, therefore allowing samples to subsequently be analysed with another technique. On the other hand, MS is more sensitive than NMR.

Many variations of MS exist, but the more traditional approach is coupling MS as detection method to chromatographic techniques, e.g. gas chromatography (GC) and liquid chromatography (LC). GC-MS [40] provides a very sensitive technique, but is limited to small compounds that are thermally stable, volatile, or can be made volatile by means of chemical derivatization. Contrary to GC-MS, LC-MS [41] allows for the separation and characterization of the majority of metabolites, including different groups of compounds: hydrophilic as well as hydrophobic, salts, acids, bases, etc. The separation of each group in LC-MS is dictated by the properties of the metabolites, which determine which column type (stationary phase) and mobile phase to use for successful separation [42]. This makes LC-MS a powerful tool for metabolomic studies. For microbial metabolomics GC-MS and LC-MS are analysis techniques that are commonly used for identification and quantification of intra-cellular metabolites.


1.3.2 Post measurement processing

Once samples have been measured, the raw instrumental data need to be processed into extracted data (peak tables consisting of identified metabolites and their concentrations in the measured samples). Raw data from chromatographic instruments hyphenated with MS detection, like GC-MS and LC-MS as mentioned in the previous section, are two-dimensional in nature, consisting of intensities for chromatographic retention times (rt) and mass-to-charge ratios (m/z). The data become three-dimensional when multiple samples have been measured, which adds a mode to the collected raw data. Processing this type of data should take into account and/or correct:

• baseline drift with respect to rt.

• peak alignment with respect to rt and m/z, in order to line up peaks so that chromatographic shifts are reduced to a minimum and variations caused by column ageing and column cuts are eliminated.

• detector response with respect to the measured intensities of known control samples.

• noise and artefacts from the measurement instrument, by setting thresholds.

When analysing complex mixtures like metabolomes, elution of two or more metabolites with the same rt is commonly encountered, and their peaks will overlap [43, 44]. This severely complicates detection and quantification. Deconvolution [45] is the most promising strategy to extract pure metabolite signals from the data. It is based on the idea that a raw signal is a superposition of multiple single-metabolite responses. The underlying MS spectra and chromatograms can be obtained by imposing a model of single-metabolite contributions on the raw signal. The derived pure chromatograms can subsequently be used to calculate relative concentrations, and the pure MS spectra can be used for metabolite identification by matching them against reference libraries. Reference libraries exist for GC-MS [46] but are far from covering all metabolites. Reference libraries for LC-MS, however, pose a problem because of the large diversity of compounds measured, large variations in fragmentation patterns between different types of instruments and low between-instrument reproducibility. For that reason, the most common strategy for quantification in LC-MS is peak-picking. Identification and quantification is a labour-intensive process that generally takes much longer than the measurement of the samples. For absolute quantification of metabolite concentrations, internal standards are required. Measurements often remain intensities due to the lack of proper standards and the requirement for too many standards.
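The superposition idea behind deconvolution can be illustrated with a minimal numerical sketch. It is not the deconvolution procedure of [45]: the toy elution profiles and mass spectra below are hypothetical, and the pure spectra are assumed to be known, whereas a real deconvolution has to estimate spectra and chromatograms together from the raw signal.

```python
import numpy as np

# Hypothetical toy data: two co-eluting metabolites,
# 50 retention-time points and 4 m/z channels.
rt = np.linspace(0.0, 1.0, 50)

# Pure chromatograms: overlapping Gaussian elution profiles (one column each).
C_true = np.column_stack([
    np.exp(-0.5 * ((rt - 0.45) / 0.08) ** 2),
    np.exp(-0.5 * ((rt - 0.55) / 0.08) ** 2),
])

# Pure mass spectra (rows: metabolites, columns: m/z channels).
# Assumption: these are known here; in practice they must be estimated.
S_pure = np.array([[1.0, 0.2, 0.0, 0.5],
                   [0.0, 0.7, 1.0, 0.1]])

# Raw signal (rt x m/z) as a superposition of single-metabolite contributions.
X = C_true @ S_pure

# Recover the pure chromatograms by least squares: S_pure^T C^T = X^T.
C_est = np.linalg.lstsq(S_pure.T, X.T, rcond=None)[0].T

# Relative amounts taken as the summed intensity of each pure chromatogram.
areas = C_est.sum(axis=0)
```

With noise-free data and linearly independent spectra the pure chromatograms are recovered exactly; with real data the estimate depends on noise and on how similar the spectra are.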

1.4 Analysis of metabolomics data

Data analysis is of crucial importance to metabolomics studies after post measurement processing, since it is the way to extract valuable information from the processed data (peak tables) that can answer the biological question for which the study was set up. For microbial metabolomics, typical research questions include: (1) finding bottlenecks in the production of desired products for strain optimization [13], (2) optimization of the growth medium to improve product yield, (3) identification of product onset, (4) identification of differences between wild type and overproducing strains, (5) characterization of mutant strains [47], (6) finding metabolites that are related to desired product formation in order to unveil the metabolic pathway involved, (7) predicting the effect of quality differences between batches of complex media on productivity, and (8) identification of metabolite-dependent regulatory interactions.

1.4.1 Methods, data and their challenges

The methods used for analysis of microbial metabolomics data are similar or equal to methods from chemometrics [48, 49], which deals with the analysis of data obtained from experiments in chemistry. Those methods can roughly be divided into two categories: (i) methods directed at identifying differences between samples from different groups with a clear goal of prediction for new samples; these are so-called supervised methods and comprise regression as well as classification methods; and (ii) methods directed at detecting patterns by data exploration without a prediction goal, which are so-called unsupervised methods.

Methods from the first category that are commonly applied to microbial metabolomics data are Partial Least Squares (PLS) [50] and Partial Least Squares Discriminant Analysis (PLS-DA) [51]. A feature shared by these methods is that they use one or several biological properties of the data set, e.g. yield or information about the specific biological group the data belong to, like wild-type vs. mutant strain. Various modifications of PLS and PLS-DA exist, like the multiway generalization nPLS(-DA) [52, 53] and a recent modification called Orthogonal-PLS(-DA) (OPLS(-DA)) [54, 55]. In OPLS(-DA) only the part of the processed data that is linearly related to the regression/classification problem is considered, with the intention of simplifying the interpretation.

Exploratory methods for discovering patterns in highly multivariate data (the second category mentioned above) are often based on finding a low-dimensional representation of the data. Principal Component Analysis (PCA) [56] is such a method and can easily be called the workhorse of metabolomics data analysis, since it is the most widely applied method for exploring metabolomics data in general.
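As an illustration of such a low-dimensional representation, the sketch below computes PCA scores and loadings from the singular value decomposition of a mean-centered peak table. The data are simulated, and the dimensions (10 samples, 30 metabolites, 2 components) are arbitrary assumptions chosen to mimic the situation of more metabolites than samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated peak table: 10 samples (rows) x 30 metabolites (columns),
# generated from 2 underlying components plus a little noise.
scores_true = rng.normal(size=(10, 2))
loadings_true = rng.normal(size=(2, 30))
X = scores_true @ loadings_true + 0.01 * rng.normal(size=(10, 30))

# Mean centering per metabolite (column).
Xc = X - X.mean(axis=0)

# PCA via the singular value decomposition Xc = U diag(s) Vt.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                      # number of components kept
T = U[:, :k] * s[:k]       # scores: samples in the low-dimensional space
P = Vt[:k].T               # loadings: contribution of each metabolite
explained = s ** 2 / np.sum(s ** 2)   # fraction of variance per component
```

The product `T @ P.T` is the rank-2 approximation of the centered data; the loadings are orthonormal, which corresponds to the orthogonality assumption on model components.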

Each available method, whether it falls into the first or second category, has some underlying assumptions with respect to the data to be analysed. The most common assumptions of currently used methods are that the explored relationships between variables (metabolites) are linear in character and that the model components needed for describing these relationships are independent of each other (orthogonal to each other). Another very important assumption is that the measurement of each metabolite is independent and identically distributed (i.i.d.). For microbial metabolomics data these assumptions are hardly ever satisfied, which complicates data analysis and subsequent interpretation of analysis results. In addition, microbial metabolomics data themselves present multiple challenges which complicate data analysis and interpretation of analysis results:


• it is high dimensional: more metabolites than samples are measured (the so-called curse of dimensionality [57]) and metabolites are highly correlated, which can easily lead to overfitting.

• high noise levels, which can overwhelm biological phenomena present in the data.

• complex structure, where the data consist of contributions from several experimental design factors that are not separated during data analysis and can seriously hamper the interpretation.

• longitudinal character and the problem of synchronisation of data stemming from different fermentation batches.

• missing data due to instrument failure, non-detects in the measurement or metabolite concentrations below the detection limit.

The fact that processed data often need to be pretreated prior to data analysis also poses a challenge. A pretreatment can enhance the results of one data analysis method and aid the interpretation of its results, whereas it might obscure the results of another. The choice of pretreatment is, therefore, often restricted by the chosen data analysis method, which in turn depends on the biological question to be answered and the structure of the data to be analysed. Three classes of applied data pretreatment methods [58] are: (i) centering, (ii) scaling and (iii) transformations. The most applied form of centering is mean centering, which converts all metabolite concentrations to fluctuations around zero instead of around the mean of the metabolite concentrations, with the purpose of focussing on differences between samples. A consequence of mean centering, however, is that the obtained values lose their direct link to concentrations, since negative values now also occur. The value 0 also gets a new meaning and no longer means not present or not important. Scaling of processed metabolomics data is performed by dividing each metabolite by a factor, the scaling factor, which is different for each metabolite. Various forms of scaling exist and have been described in detail with respect to metabolomics data in [58]. The purpose of scaling usually is to make metabolites comparable relative to their biological response or to equalize their relative importance. Scaling of processed data prior to analysis should be done with care, for it could result in noise amplification and cause an increase in heteroscedasticity of the data. Transformations are nonlinear conversions of processed data, e.g. the log transformation and the power transformation. They are generally applied to correct for heteroscedasticity [59], convert multiplicative into additive relations and adjust skewed distributions to make them (more) symmetric. As mentioned before, currently applied methods assume relations between metabolites to be linear; however, relations between metabolites in metabolomics data can also be multiplicative in nature. A transformation of the processed data and subsequent analysis with a method that assumes a linear relation between metabolites could then be used to identify these multiplicative relations between metabolites.
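The three classes of pretreatment can be made concrete with a small sketch. The peak table is hypothetical, and autoscaling (dividing each mean-centered metabolite by its standard deviation) is used here only as one common choice of scaling factor; other choices are described in [58].

```python
import numpy as np

# Hypothetical peak table: 3 samples (rows) x 2 metabolites (columns),
# with very different concentration ranges.
X = np.array([[1.0, 100.0],
              [2.0, 400.0],
              [3.0, 700.0]])

# (i) Mean centering: fluctuations around zero, so negative values appear.
centered = X - X.mean(axis=0)

# (ii) Scaling: a different factor per metabolite. Assumption: autoscaling,
# i.e. the scaling factor is the sample standard deviation of the metabolite.
autoscaled = centered / X.std(axis=0, ddof=1)

# (iii) Transformation: the log turns multiplicative relations into additive ones.
log_X = np.log(X)
product = X[:, 0] * X[:, 1]            # a multiplicative relation
log_product = np.log(product)          # equals the sum of the log columns
```

After autoscaling both metabolites take the values -1, 0 and 1, so the large difference in concentration range no longer dominates the analysis. The same division would, however, also inflate a metabolite that consists mainly of noise, which is the noise amplification mentioned above.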


Which method to use from the large variety of available methods is not always clear, but the choice has to fit the research question and the structure of the data to be analysed. Choosing, for example, PCA when the goal is to identify metabolites that strongly relate to the yield of fermentations would not be logical, since this is typically a supervised regression problem. With respect to the structure of the data, an example would be to choose a multi-way analysis method when the data consist of metabolite concentrations from samples taken over time from fermentations performed with different carbon sources.

1.4.2 Fusing prior knowledge

As mentioned in section 1.4.1, the analysis of microbial metabolomics data and the subsequent interpretation suffer from inconsistencies regarding the underlying assumptions of the commonly applied methods and from challenges posed by the data itself. There are, therefore, two ways to deal with these problems: (1) adjust the data so that they match the assumptions underlying the methods, and (2) adjust the methods so that they can deal with the challenges the data present. Both adjustments can be guided by prior knowledge about the metabolomics study:

1. information available from previous experiments and from literature about the underlying (micro)biology with respect to the biological question for which the study was set up.

2. how the study was set up and the way experimental data were generated, e.g. the experimental design.

3. how the metabolomics data were collected and measured, e.g. whether samples were taken over time from a fermentation (longitudinal data) and/or whether something is known about the sampling/analytical measurement errors.

Development of methodology guided by prior knowledge to facilitate data analysis and interpretation has only recently been taken up, and for that reason not many methods are available in the field of metabolomics. ANOVA-Simultaneous Component Analysis (ASCA) [60, 61] is an example in which the underlying experimental design is used to partition contributions of different factors within the highly multivariate metabolomics data set. Another recent example that is more suited for incorporating (micro)biological prior knowledge is called Grey Component Analysis [62]. This exploratory method is based on PCA, but uses a soft penalty on the scores. The penalty allows the user (e.g. a microbiologist) to direct the scores towards some predefined values that represent the hard prior knowledge. The amount of emphasis on the penalty reflects the confidence the user has in the prior knowledge being present, and also allows for a weighted average between data and prior knowledge.
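The partitioning used by ASCA can be sketched for the simplest case of a single design factor. This is only an illustration under assumptions (simulated data, one factor with three levels, a balanced design), not the implementation of [60, 61]: the centered data are split into a matrix of level means and a residual matrix, and a component model is then fitted to the matrix of level means.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated design: 12 samples, one factor with 3 levels, 4 replicates each.
levels = np.repeat([0, 1, 2], 4)
X = rng.normal(size=(12, 6))          # 6 hypothetical metabolites
X[levels == 1, 0] += 3.0              # factor effect on metabolite 0
X[levels == 2, 1] -= 2.0              # factor effect on metabolite 1

# ANOVA-like partition of the centered data: Xc = X_A + E.
Xc = X - X.mean(axis=0)
X_A = np.zeros_like(Xc)
for g in np.unique(levels):
    X_A[levels == g] = Xc[levels == g].mean(axis=0)   # level means
E = Xc - X_A                                          # residuals

# Component analysis (SVD) of the effect matrix only.
U, s, Vt = np.linalg.svd(X_A, full_matrices=False)
n_comp = len(np.unique(levels)) - 1   # at most (levels - 1) components
scores_A = U[:, :n_comp] * s[:n_comp]
loadings_A = Vt[:n_comp].T
```

Because the component model is fitted to the level means rather than to the full data, variation that is unrelated to the design ends up in the residual matrix instead of in the components.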

1.5 Scope and outline of the thesis

This thesis, entitled “Fusing prior knowledge with microbial metabolomics”, deals with combining prior knowledge with microbial metabolomics. It not only focuses on fusing prior knowledge with the data analytical aspect of a microbial metabolomics study, as described in section 1.4.2, but also on the use of prior knowledge within various aspects of such a study. Figure 1.4 schematically displays how the chapters in this thesis relate to the aspects of a microbial metabolomics study visualised in a metabolomics pipeline.

Figure 1.4: Relation between chapters in the thesis and the various aspects of a microbial metabolomics study visualised in a metabolomics pipeline.

In a microbial metabolomics study the data for studying metabolism are obtained by sampling from fermentations followed by metabolome analysis of these samples. Data for studying the metabolism of a micro-organism can, however, also be obtained in silico. In the construction of a genome-scale network of a micro-organism, prior knowledge is used about the genome sequence and the gene-protein-reaction associations of the organism, obtained from databases, textbooks and other scientific publications. Within the framework of constraint-based modeling, genome-scale networks of micro-organisms can be used to infer flux states of metabolism under different environmental conditions by means of in silico simulations. In Chapter 2 the genome-scale network of a lactic acid bacterium, Lactococcus lactis MG1363, is used to generate flux distributions for multiple in silico environmental conditions, mimicking laboratory growth conditions. These flux distributions serve as raw data to investigate and pinpoint differences and similarities in metabolism between the various environmental conditions. Hence Chapter 2 exemplifies how prior knowledge can be used in the creation of data for studying metabolism.

As mentioned in section 1.4, scaling is often performed prior to data analysis, but has to be done with care, for it might result in noise amplification and increase the heteroscedasticity of the data. Additionally, methods commonly used for data analysis assume that each measured variable (metabolite) in the data set has the same probability distribution as the others and that all are mutually independent (i.i.d.); however, this is hardly ever the case for metabolomics data. Adjustment of the data guided by prior knowledge could deal with this problem, allow commonly used methods to be applied and possibly improve interpretation of the analysed data. Prior knowledge about the noise characteristics of the data is used in Chapter 3 to adjust the processed raw measurement data. A procedure is introduced for multivariate data that does not suffer from noise amplification by scaling. The procedure is used as a filter and can be applied prior to any subsequent scaling and multivariate analysis of the data with commonly applied data analysis methods.

A large number of metabolites are measured in a microbial metabolomics study that reflect the cellular state under the experimental conditions studied. On many occasions the experiments are performed according to an experimental design to make sure that sufficient variation is induced in the data. However, as metabolomics is a holistic approach, a large number of metabolites are also measured in which no variation is induced by the experimental design. The presence of such non-induced metabolites hampers traditional data analysis methods such as PCA in estimating the true model of the induced variation. The greediness of PCA leads to a clear overfit of the metabolomics data and can lead to a bad selection of important metabolites. Chapter 4 explores how, why and how severely PCA overfits data with an underlying experimental design. The results are compared to an analysis of the same data with ASCA, a method that uses the prior knowledge of the experimental design as mentioned in section 1.4.2, to show the improvement in model estimation and the reduction of overfit.

Longitudinal data play an important role in the various fields of functional genomics to improve understanding and knowledge of the dynamics within biological systems. The time-resolved data of microbial metabolomics studies are expected to contain underlying dynamic profiles that are smooth. However, estimating these underlying smooth dynamic phenomena from such data is complicated by the high complexity of the data and the limited number of techniques that can deal with this type of data. Traditional multivariate data analysis techniques, such as Principal Component Analysis, ignore the underlying dynamics in the data and give solutions that tend towards explaining variance rather than explaining dynamics and understanding biology. A new method is presented in Chapter 5 that uses the prior knowledge of expected smoothness in the underlying dynamics of the data by incorporating it into the data analysis. The method is named Weighted Smooth Principal Component Analysis (WSPCA) and incorporates smoothness into the scores of PCA by using a roughness penalty. For determination of the model meta-parameters a cross-validation procedure is introduced, based on predictions of elements randomly left out of the data. The method is applied to simulated noisy data containing true underlying smooth dynamic profiles to show its capability of capturing these profiles. Since the cross-validation is based on predictions of left-out elements, the method has also been applied to estimate truly missing data in a microbial metabolomics data set.
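The notion of a roughness penalty can be illustrated on a single time profile. The sketch below is not WSPCA: it is a generic penalized least squares smoother (a Whittaker-type smoother) that trades fidelity to the data against the squared second differences of the fitted profile. The function name, the simulated profile and the value of the penalty weight `lam` are assumptions made for the illustration.

```python
import numpy as np

def smooth_with_roughness_penalty(y, lam):
    """Minimise ||y - z||^2 + lam * ||D z||^2 over z, where D takes
    second differences, so that rough (wiggly) profiles are penalised."""
    n = y.shape[0]
    D = np.diff(np.eye(n), n=2, axis=0)          # (n - 2) x n operator
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 40)                    # hypothetical sampling times
true_profile = np.sin(2 * np.pi * t)             # smooth underlying dynamics
noisy = true_profile + 0.3 * rng.normal(size=t.size)

smoothed = smooth_with_roughness_penalty(noisy, lam=10.0)

def roughness(z):
    # Sum of squared second differences of a profile.
    return np.sum(np.diff(z, n=2) ** 2)
```

A weight of zero returns the data unchanged, while a larger weight yields a smoother profile; choosing the weight is the kind of meta-parameter selection for which the cross-validation procedure is introduced.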

Chapter 6, the final one, contains a conclusion about the work described and an outlook on future research opportunities with respect to the use of biological knowledge in a microbial metabolomics study.

The chapters in this thesis represent a collection of articles, which are either published in or submitted to several scientific journals. As outlined in this section, the chapters are linked into a framework but can be read independently of each other.


Chapter 2

Multi-way analysis of flux distributions across multiple conditions†

†This chapter is published as: Maikel P.H. Verouden, Richard A. Notebaart, Johan A. Westerhuis, Mariët J. van der Werf, Bas Teusink and Age K. Smilde. J. Chemom. 2009;23(7–8):406–420. Copyright © 2009 John Wiley & Sons, Ltd. DOI: 10.1002/cem.1238


Abstract

With the availability of genome sequences of many organisms and information about gene-protein-reaction associations for these organisms, genome-scale metabolic networks can be reconstructed. In cellular systems biology these networks are used to model the behavior of metabolism in the context of cell growth in terms of fluxes (reaction rates) through reactions in the network. Because the flux through each reaction can generally vary within a range, many flux distributions of the entire network are possible. However, since reactions are connected by common metabolites, reactions that are functionally coherent are expected to correlate highly in terms of their flux values over different flux distributions.

In this paper the genome-scale network of a lactic acid bacterium, Lactococcus lactis MG1363, is used to generate flux distributions for multiple in silico environmental conditions, mimicking laboratory growth conditions. The flux distributions per condition are used to calculate a correlation matrix for each condition. Subsequently the correlations between the reactions are analyzed in a multivariate approach across the in silico environmental conditions in order to identify correlations that are invariant (i.e. independent of the environment) and correlations that are variant across conditions (i.e. dependent on the environment). The applied multivariate methods are parallel factor analysis (PARAFAC) and principal component analysis (PCA). The discussion of the results of both methods leads to the question whether latent variable models are suitable for analyzing this type of data.

KEYWORDS: PARAFAC, PCA, correlation, flux distributions, genome-scale network.


2.1 Introduction

The field of molecular cell biology experienced rapid progress from the moment that methods became available to sequence the genomes of organisms. The genome represents all the hereditary material of an organism. Parts of the genome are referred to as genes and code for proteins, which carry out all kinds of functions within the cell. A very important functionality of the cell is metabolism. This process allows for the uptake of nutrients from the environment and converts them into energy and the building blocks from which the cell is built. The latter is essential for reproduction (growth). Conversion of chemical components (like nutrients) into intermediates or end products is a metabolic reaction and is carried out by proteins, called enzymes, which are encoded on the genome. The whole network of all possible metabolic reactions in a cell is called the metabolic network. Since the genome sequence is available for many organisms, including bacteria and even humans, it is possible to reconstruct genome-scale metabolic networks [63, 64]. Many enzymes that are encoded on the genome can be deduced if the genome is annotated, i.e. if genes and their functions are identified. Subsequently, reactions can be assigned to enzymes (and thus genes) by exploring specific enzyme-reaction databases. The collection of gene-protein-reaction (GPR) associations forms a network of reactions that interact through the use of common low-molecular-weight chemical compounds, also known as metabolites. For example, a certain reaction produces a certain metabolite and another reaction converts it further, up to the final production of biomass components (i.e. the necessary components from which the cell is built). A reconstructed genome-scale metabolic network contains hundreds of GPRs and can be obtained using automated procedures [65–67].

It is of interest in cellular systems biology to model the behavior of metabolism in the context of cell growth at genome-scale [68–71]. This behavior can be expressed in terms of fluxes through reactions, i.e. rates of metabolite conversion, in response to nutrient uptake. Fluxes can be determined by laboratory experiments or by in silico simulations. Due to current experimental limitations in determining all fluxes at genome-scale, the flux through each reaction is said to vary within a range. This means that many flux distributions of the entire network are possible. Since reactions are connected on the basis of common metabolites, correlations between reactions in terms of their flux values over different flux distributions are to be expected. It is thus possible to infer functionally coherent reactions on the basis of the entire network behavior [72, 73].

The question now arises why we are interested in correlated reactions. Correlated reactions lead to the definition of reaction modules, which can be useful for in-depth investigation of the reactions without considering the entire network, e.g. by adding a kinetic model to the reactions in such a module to provide more mechanistic detail. Moreover, it could lead to insights into the regulation of enzymes that catalyze reactions [74]. Recently a number of methods have been developed to infer correlations between reactions within specific environments [72, 73, 75, 76]. Usually, the flux of each reaction in the network across many possible flux distributions is examined to calculate the correlation between reactions. However, for the analysis of metabolic systems it may be interesting to study reaction correlations not only in individual environments, but also across different environmental conditions. This could lead to the definition of correlating reactions that appear to be dependent on or independent of the environment. In other words, which reaction correlations are robust against a changing environment and which are not? Reactions that correlate within and between conditions can then be considered to be strongly functionally associated. Since for each environmental condition the correlations between the reactions in the metabolic network can be calculated from many possible flux distributions, the question arises how to identify which reaction correlations are invariant, i.e. do not change, and which reaction correlations are variant across environmental conditions.

In this paper flux distributions of several different in silico environmental conditions mimicking laboratory growth conditions are used to calculate correlations between reactions in a genome-scale metabolic network of a bacterium. Subsequently the correlations between the reactions across these in silico environmental conditions are analyzed in a multivariate approach with the goal of identifying correlations that are invariant and reaction correlations that are variant across conditions. The applied multivariate data analysis methods are Parallel Factor Analysis (PARAFAC) and Principal Component Analysis (PCA).

The materials section introduces the genome-scale metabolic network and explains how the flux distributions in the different environments have been obtained. The next section, ‘Methods’, describes how the flux distributions are preprocessed and how the correlations are calculated, and elaborates on PARAFAC and PCA in relation to the analysis of correlations across multiple environmental conditions. In section 2.4 the results of analyzing changes in correlations between reactions across the environmental conditions are given and commented on. Finally, in the last section the applicability of PARAFAC and PCA for analyzing these changes is discussed and some important conclusions are drawn.

2.2 Materials

2.2.1 Reconstruction metabolic network

In this study we focus on the calculation of reaction correlations within and between environmental conditions using an in silico metabolic network of a lactic acid bacterium called Lactococcus lactis MG1363. The metabolic network has been reconstructed on the basis of the sequenced genome [77] using a semi-automatic approach [65] and manual adjustments. Genome-scale metabolic networks contain information about which genes and gene products (i.e. proteins) catalyze which metabolic reactions. Using this semi-automatic approach [65] we determined equivalent genes between L. lactis and organisms such as Lactobacillus plantarum and Escherichia coli, for which manually curated genome-scale metabolic networks have been published. Thereby, we inferred metabolic reactions for L. lactis which together form a network of interacting chemical compounds leading to the production of biomass components (i.e. the components from which the bacterium is built).


The network consists of 616 metabolites (M) involved in 598 metabolic reactions, of which 87% have been associated with proteins/genes. This means that 13% of the metabolic reactions have been added to allow the bacterium to grow in the in silico simulation. The model includes 513 genes and 445 proteins (including complexes). Moreover, 91 exchange reactions and 1 biomass reaction have been included, leading to a total of 690 reactions (N). Exchange reactions define the link between the metabolic network and the environment, thereby allowing for uptake and secretion of nutrients from and to the environment. The biomass reaction serves as a sink for metabolites that are precursors for growth (i.e. metabolites that are used to generate biomass).

2.2.2 Constraint-based modeling to infer flux states of metabolism

We have applied constraint-based modeling to explore flux distributions of the metabolism of L. lactis MG1363 by in silico simulation in different environments [70]. The different environments are theoretical scenarios that mimic possible real-life laboratory environmental conditions. For constraint-based modeling it is essential to structure the in silico metabolic network into the so-called stoichiometric matrix S [M×N]. Each row of the matrix represents a metabolite, each column a reaction and each element the stoichiometric coefficient of the metabolite in that reaction. The stoichiometric coefficient is negative when the metabolite is consumed in a particular reaction and positive when it is produced. After structuring the metabolic network in the stoichiometric matrix S, dynamic mass balances around metabolites are defined in terms of fluxes (metabolite conversion rates) through each reaction and the stoichiometry of those reactions around the metabolites [70, 78, 79] in the form of

$$\frac{dc}{dt} = Sv \qquad (2.1)$$

where v [N×1] denotes a vector of fluxes through all reactions in the network and c [M×1] is a vector representing all metabolites in the network. At steady state there is no accumulation or depletion of metabolites in a metabolic network; therefore, the rate of production of each metabolite in the network must equal its rate of consumption, i.e. the change in the amount of any metabolite within the network over time becomes zero for all metabolites in all reactions ($dc_m/dt = 0$, where $c_m$ is the $m$th metabolite in the vector c). In mathematical terms this is written as

$$Sv = 0 \qquad (2.2)$$
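The steady-state condition can be checked numerically on a small example. The network below is a hypothetical toy network with two metabolites (A, B) and four reactions, not the L. lactis network; the helper `null_space_basis` is likewise an assumption of this sketch and obtains a basis of all steady-state flux vectors from the singular value decomposition of S.

```python
import numpy as np

# Toy stoichiometric matrix S [M x N]: rows are metabolites A and B,
# columns are the reactions
#   R1: -> A (uptake), R2: A -> B, R3: B -> (secretion), R4: A -> (secretion)
# Negative entries: metabolite consumed; positive entries: metabolite produced.
S = np.array([[1.0, -1.0, 0.0, -1.0],
              [0.0, 1.0, -1.0, 0.0]])

def null_space_basis(S, tol=1e-10):
    """Orthonormal basis of {v : S v = 0}, computed from the SVD of S."""
    _, sing, Vt = np.linalg.svd(S)
    rank = int(np.sum(sing > tol))
    return Vt[rank:].T               # columns span the null space

K = null_space_basis(S)              # N x (N - rank) basis matrix

# A candidate flux distribution: uptake 2.0, of which 1.5 flows via B.
v = np.array([2.0, 1.5, 1.5, 0.5])
dc_dt = S @ v                        # mass balance, zero at steady state
```

For this toy network the basis has two columns, so two fluxes can be chosen freely and the remaining two follow from the mass balances.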

Equation 2.2 limits the solution space of the allowable flux distributions to the null space of matrix S and eliminates the time derivatives in equation 2.1. The steady-state assumption is relevant for intracellular reactions since these reactions are typically much faster than the rate of change in the resultant phenotype, such as cell growth (biomass production) [68]. Besides the stoichiometric constraint there is another important constraint, called the capacity constraint, which restricts the flux through reactions. This constraint defines the range of flux values that can be taken by each reaction in the in silico network and is written as

$$v_{i,\min} \le v_i \le v_{i,\max} \qquad (2.3)$$

Setting capacity constraints is especially important for the exchange reactions, because this is the way to vary the influx of nutrients from the environment between certain minimal and maximal rates. It allows for in silico simulation of metabolism in the context of the environment, i.e. simulating theoretical scenarios reflecting possible real-life laboratory environmental conditions.
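To make the two constraints concrete, the following minimal sketch checks a candidate flux vector against equations 2.2 and 2.3. The four-reaction network, the bound values and the function names are illustrative assumptions and are not taken from the L. lactis model.

```python
# Toy network (hypothetical, not the L. lactis model):
#   EX_A: -> A,   s: A -> B,   t: B -> C,   EX_C: C ->
# Rows of S are metabolites (A, B, C); columns are reactions.
S = [
    [1, -1,  0,  0],   # A: produced by EX_A, consumed by s
    [0,  1, -1,  0],   # B: produced by s, consumed by t
    [0,  0,  1, -1],   # C: produced by t, consumed by EX_C
]

def mat_vec(S, v):
    """Return the product S v as a list (one entry per metabolite)."""
    return [sum(s_mj * v_j for s_mj, v_j in zip(row, v)) for row in S]

def is_allowed(S, v, v_min, v_max, tol=1e-9):
    """Check the steady-state constraint (2.2) and the capacity constraints (2.3)."""
    steady = all(abs(x) <= tol for x in mat_vec(S, v))
    within = all(lo - tol <= x <= hi + tol for x, lo, hi in zip(v, v_min, v_max))
    return steady and within

v_min = [0.0, 0.0, 0.0, 0.0]
v_max = [10.0, 10.0, 10.0, 10.0]       # illustrative bounds

balanced = [2.0, 2.0, 2.0, 2.0]        # each metabolite is produced as fast as it is consumed
unbalanced = [2.0, 1.0, 1.0, 1.0]      # A would accumulate, so S v is not zero
too_fast = [20.0, 20.0, 20.0, 20.0]    # balanced, but outside the capacity range
```

Only `balanced` passes both checks; `unbalanced` violates the mass balance for A and `too_fast` violates the capacity constraints.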

Imposing these two constraints results in a space of allowable flux distributions of the network [80]. In practice the number of possible flux distributions is too large for direct interpretation and therefore approaches have been developed to select flux distributions. One of these methods is based on random sampling of points (i.e. flux distributions) from the solution space of allowable flux distributions [72]. In order to calculate correlations between reactions in the metabolic network from the space of allowable flux distributions we apply random sampling to draw 2000 flux distributions using the COBRA toolbox [75] with default settings. We have performed the approach for several environments to study not only correlations within single environments, but also across environments. In total six different environmental conditions (theoretical scenarios reflecting real-life laboratory conditions) are examined, including

• anaerobic growth (without the presence of oxygen), scenario in which the flux capacity constraint for the exchange reaction of oxygen has been set to zero.

• aerobic growth (in the presence of oxygen), scenario in which the flux capacity constraint for the exchange reaction of oxygen has been set between a minimum and a maximum value (unequal to zero).

• aerobic respiratory growth (in the presence of oxygen under addition of haem, which aids the transport of oxygen into L. lactis), scenario in which the flux capacity constraint for the exchange reaction of oxygen and the flux capacity constraints for haem-dependent reactions have been set between a minimum and a maximum value (unequal to zero).

in rich medium, containing glucose and all amino acids, and minimal medium, containing glucose and seven amino acids that are minimally required for growth of L. lactis. In the rich and minimal medium the flux capacity constraints for all exchange reactions of amino acids present in the medium have been set between boundaries (minimal and maximal flux). For the minimal medium the flux capacity constraints for all exchange reactions of amino acids that are not present in the medium have been set to zero. Table 2.1 displays the order and design of the environmental conditions as used throughout this paper.
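The six scenarios are a full crossing of two design factors, which can be sketched as follows. The numeric bound values, the function names and the `minimal_set` argument are hypothetical placeholders; the text only states which bounds are zero and which are nonzero, and does not list the seven amino acids of the minimal medium here.

```python
from itertools import product

MEDIA = ["rich", "minimal"]
AERATION = ["anaerobic", "aerobic", "aerobic respiratory"]

# Crossing the two design factors in this order reproduces the numbering of
# Table 2.1: conditions 1-3 use rich medium, conditions 4-6 minimal medium.
conditions = {
    k: {"medium": medium, "aeration": aeration}
    for k, (medium, aeration) in enumerate(product(MEDIA, AERATION), start=1)
}

def oxygen_bounds(aeration, o2_min=1.0, o2_max=10.0):
    """Capacity constraint for the oxygen exchange reaction.

    o2_min and o2_max are placeholder values; the text only says the bounds
    are nonzero under aerobic and aerobic respiratory growth.
    """
    if aeration == "anaerobic":
        return (0.0, 0.0)          # oxygen exchange is switched off
    return (o2_min, o2_max)

def amino_acid_bounds(medium, amino_acid, minimal_set, aa_min=0.0, aa_max=5.0):
    """Capacity constraint for one amino acid exchange reaction (placeholder values)."""
    if medium == "minimal" and amino_acid not in minimal_set:
        return (0.0, 0.0)          # amino acid is absent from the medium
    return (aa_min, aa_max)
```

For example, `oxygen_bounds(conditions[4]["aeration"])` returns the zero bound used for anaerobic growth in minimal medium.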

The sampled flux distributions through the reactions in the L. lactis metabolic network for each environmental condition have been stored in a matrix $X^{*}_{k}$ [IxJ*], with k representing the environmental condition number as given in column 1 of Table 2.1, I the number of sampled flux distributions (2000) and J* the number of reactions in the reconstructed metabolic network (690).
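As an illustration of what such a matrix of sampled flux distributions looks like, the sketch below samples a toy branch point by naive rejection sampling. This is not the sampler of the COBRA toolbox used in the text, and it is only workable for very small networks; the network, the bound `v_max` and the function name are assumptions.

```python
import random

def sample_branch_point(n_samples, v_max=10.0, seed=0):
    """Draw flux distributions (v_in, v_out1, v_out2) for a toy branch point.

    Steady state requires v_in = v_out1 + v_out2 and every flux must lie in
    [0, v_max]. The two outgoing fluxes are drawn uniformly and a draw is
    rejected when the implied incoming flux violates its capacity constraint.
    """
    rng = random.Random(seed)
    X = []
    while len(X) < n_samples:
        v_out1 = rng.uniform(0.0, v_max)
        v_out2 = rng.uniform(0.0, v_max)
        v_in = v_out1 + v_out2
        if v_in <= v_max:
            X.append([v_in, v_out1, v_out2])
    return X

# One row per sampled flux distribution, one column per reaction.
X = sample_branch_point(2000)
```

Each row of `X` satisfies the mass balance and the capacity constraints, so `X` plays the role of a (much smaller) sampled flux matrix for one condition.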


Table 2.1: Order, design and number of blocked reactions of the environmental conditions.

Environmental   Medium           Aeration                        # Blocked
condition       Rich   Minimal   Anaerob   Aerob   Aerob resp.
1               1      0         1         0       0             10
2               1      0         0         1       0             3
3               1      0         0         0       1             0
4               0      1         1         0       0             36
5               0      1         0         1       0             29
6               0      1         0         0       1             26

2.3 Methods

2.3.1 Blocked reactions and futile cycles

Before correlation matrices can be calculated from the flux distributions it is necessary to remove reactions that have zero flux in all environmental conditions for every flux distribution, and reactions that form futile cycles. Reactions with zero flux for every flux distribution in all conditions are blocked [73], i.e. inactive, for all environmental conditions. Futile cycles [81] are created when two metabolic reactions run simultaneously in opposite directions or a series of metabolic reactions forms a circular path. These cycles have no overall effect other than wasting energy, but may have a role in metabolic regulation [82]. Here, however, these cycles are an artifact of the constraint-based modeling, because only mass balance and flux capacity constraints are active, and need to be removed. Assignment of additional constraints, e.g. thermodynamic or energetic [70, 83–85], can help to prevent futile cycles from appearing in a metabolic network.

The flux distribution matrices for the different environmental conditions from which the overall blocked reactions and reactions involved in futile cycles have been removed, denoted by Xk [IxJ], may still contain reactions with zero fluxes for all flux distributions. These reactions, however, are blocked only within specific environmental conditions and are not removed from the flux distribution matrices (Xk) in order to keep the reaction dimension (J) of all environmental conditions the same to allow for comparison between and over conditions.
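The filtering step can be sketched as follows. The function names are hypothetical, and `extra_remove` merely stands in for the reactions flagged as part of futile cycles; how those are detected is not covered by this sketch.

```python
def blocked_in_condition(X, tol=0.0):
    """Indices of reactions (columns) whose flux is zero in every sampled row of X."""
    n_reactions = len(X[0])
    return {j for j in range(n_reactions) if all(abs(row[j]) <= tol for row in X)}

def remove_overall_blocked(X_star_list, extra_remove=()):
    """Drop reactions blocked in all conditions, plus any listed in extra_remove.

    Reactions blocked in only some conditions are kept, so that every reduced
    matrix X_k has the same reaction dimension J.
    Returns the reduced matrices and the kept column indices.
    """
    blocked_sets = [blocked_in_condition(X) for X in X_star_list]
    overall = set.intersection(*blocked_sets) | set(extra_remove)
    keep = [j for j in range(len(X_star_list[0][0])) if j not in overall]
    X_list = [[[row[j] for j in keep] for row in X] for X in X_star_list]
    return X_list, keep

# Two toy conditions with three reactions: reaction 1 is blocked everywhere,
# reaction 2 is blocked only in the first condition.
X1 = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
X2 = [[1.5, 0.0, 0.3], [0.5, 0.0, 0.7]]
X_list, keep = remove_overall_blocked([X1, X2])
```

Here reaction 1 is removed from both matrices, while reaction 2 stays in as an all-zero column for the first condition.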

2.3.2 Correlation matrices

The flux distribution matrices (Xk) are used to calculate, for each environmental condition, a Pearson correlation matrix between the fluxes carried by pairs of reactions over all sampled flux distributions, denoted by Φk [JxJ]. Pearson correlation is used, because the sampled fluxes of the reactions follow normal distributions. As mentioned in section 2.3.1 the flux distribution matrix of a specific condition may still contain reactions that are blocked for that specific condition and contain zero flux for all sampled flux distributions of that reaction. Obviously no correlation can be calculated between a reaction containing zero flux for all flux distributions and any other reaction. In order to create a valid correlation matrix a correlation of zero has been imputed for the correlations between a reaction containing zero flux for all flux distributions and a reaction containing fluxes. A correlation of one has been imputed for correlations between reactions that both contain zero flux for all flux distributions.
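A minimal sketch of the correlation step with the two imputation rules is given below. It assumes that every reaction that is not blocked has a non-constant flux over the samples (otherwise the Pearson correlation is undefined); the function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long flux vectors with nonzero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(X):
    """Correlation matrix of the columns of X with the imputation rules of the text."""
    cols = [list(c) for c in zip(*X)]
    blocked = [all(v == 0.0 for v in c) for c in cols]
    J = len(cols)
    phi = [[0.0] * J for _ in range(J)]
    for s in range(J):
        for t in range(J):
            if blocked[s] and blocked[t]:
                phi[s][t] = 1.0      # both reactions carry zero flux: impute one
            elif blocked[s] or blocked[t]:
                phi[s][t] = 0.0      # blocked versus active reaction: impute zero
            else:
                phi[s][t] = pearson(cols[s], cols[t])
    return phi

# Columns: two active reactions in a fixed ratio, followed by two blocked reactions.
X = [[1.0, 2.0, 0.0, 0.0],
     [2.0, 4.0, 0.0, 0.0],
     [3.0, 6.0, 0.0, 0.0]]
phi = correlation_matrix(X)
```

In this toy case the two active reactions correlate perfectly, the two blocked reactions receive an imputed correlation of one, and every active-versus-blocked pair receives zero.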

The Pearson correlation coefficient for a pair of reactions (s and t) is denoted by ϕst and by definition falls within the range −1 ≤ ϕst ≤ 1. Two special cases within this range exist [76]:

ϕst = ±1: The fluxes carried by the pair of reactions, s and t, are in a fixed ratio for all sampled flux distributions. The reactions, therefore, behave the same and belong to the same reaction subset [86].

ϕst = 0: The fluxes carried by the pair of reactions, s and t, are not related to each other over the sampled flux distributions. The reactions belong to different reaction subsets.

Correlations between zero and one or minus one can be found for fluxes around branching points. A branching point in a metabolic network is a metabolite that has one flux through an incoming reaction and two fluxes through two outgoing reactions. When the inbound flux of a metabolite can have any value within its flux capacity constraint and the outbound fluxes are not in a fixed ratio, the correlation between the incoming reaction and an outgoing reaction will take an intermediate value in the range given above for ϕst.

The sign of the correlation ϕst between a pair of reactions s and t depends on how the stoichiometry of both reactions is initially specified in the stoichiometric matrix S. Suppose, for example, that a reaction system containing metabolites A, B and C with reactions s and t among those metabolites is specified as

s : A → B
t : B → C

with the arrow indicating which metabolite is initially specified as reactant and which as product. The flux carried by reaction s will then be exactly the same as the flux carried by reaction t. The reactions behave the same, which means they belong to the same subset, and the correlation will therefore be ϕst = 1. However, if the system was initially specified as

s : A → B
t : C → B

the fluxes carried by reaction s and reaction t would have the same values but always opposite signs. The reactions would still be in the same subset, but their correlation would be ϕst = −1. Because of this sign indeterminacy of the correlation, absolute correlations will be used throughout this paper. However, the positive semidefinite property of the correlation matrices is removed by using absolute correlations. The resulting absolute correlation matrices are still symmetric and there still exists dependency between the elements in the absolute correlation matrices, meaning that changing an absolute correlation between a pair of reactions also changes the correlations with reactions that highly correlate with this pair.
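That taking absolute values can destroy positive semidefiniteness is easy to verify on a small constructed example. The sketch below uses four unit vectors in the plane, whose matrix of cosines is positive semidefinite by construction; the example is illustrative and unrelated to the L. lactis data.

```python
from math import cos, pi

# Four unit vectors in the plane at 0, 45, 90 and 135 degrees. Their matrix of
# cosines has a unit diagonal and is positive semidefinite (it is a Gram matrix).
angles = [0.0, pi / 4, pi / 2, 3 * pi / 4]
phi = [[cos(a - b) for b in angles] for a in angles]
abs_phi = [[abs(v) for v in row] for row in phi]

def quadratic_form(M, x):
    """Return x^T M x; a positive semidefinite M gives a value >= 0 for every x."""
    n = len(x)
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))

x = [1.0, -1.0, 1.0, -1.0]
q_signed = quadratic_form(phi, x)          # positive, as required for a PSD matrix
q_absolute = quadratic_form(abs_phi, x)    # negative, so abs_phi is not PSD
```

A single vector with a negative quadratic form is enough to show that the absolute matrix is not positive semidefinite, while the matrix itself remains symmetric.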

To visualize the correlation matrices of the environmental conditions one specific order of the reactions will be applied. To determine the order of the reactions, hierarchical clustering with average linkage has been used for environmental condition 1 of Table 2.1 (anaerobic growth in rich medium) with one minus the absolute correlation serving as distance measure. We will use this reaction order throughout the paper.

2.3.3 Mean centering

In this paper we are interested in whether correlations between reactions stay the same or change across different environmental conditions. We will, therefore, apply mean centering of the absolute correlations across environmental conditions [87] prior to multivariate data analysis. The correlations between reactions that are invariant across all environmental conditions, i.e. their values remain the same, will have only zeros after mean centering. These invariant correlations contain no variation and are, therefore, not described by the applied multivariate methods. The variant correlations, i.e. correlations that change across environmental conditions, contain variation after mean centering on which the applied multivariate methods will focus.

In section 2.4.3 we will show and comment on the invariant correlations. The variant correlations will be analyzed with PARAFAC and PCA.
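Mean centering across conditions amounts to subtracting, for every reaction pair, the mean of its absolute correlation over the K conditions. A minimal sketch with hypothetical names and toy values:

```python
def mean_center_across_conditions(abs_phi_list):
    """Subtract, element by element, the mean over the K conditions."""
    K = len(abs_phi_list)
    J = len(abs_phi_list[0])
    mean = [[sum(m[i][j] for m in abs_phi_list) / K for j in range(J)]
            for i in range(J)]
    centered = [[[m[i][j] - mean[i][j] for j in range(J)] for i in range(J)]
                for m in abs_phi_list]
    return centered, mean

# Two toy conditions: the diagonal is invariant, the off-diagonal pair changes.
cond_a = [[1.0, 0.2], [0.2, 1.0]]
cond_b = [[1.0, 0.6], [0.6, 1.0]]
centered, mean = mean_center_across_conditions([cond_a, cond_b])
```

The invariant entries become exactly zero in every condition, whereas the changing pair keeps its variation around the mean.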

2.3.4 PARAFAC

By stacking the absolute correlation matrices of the multiple environmental conditions, denoted by |Φk| [JxJ], on top of each other, a datacube, represented by Φ [KxJxJ], is obtained that has a three-way structure. Multi-way analysis after mean centering is, therefore, the obvious approach for identifying correlations between reactions that change across environmental conditions.

The most appropriate multi-way method for analyzing correlation matrices of multiple environmental conditions is an INDSCAL (individual difference scaling) model [88, 89]. Although our datacube contains absolute correlation matrices that are no longer positive semidefinite but still symmetric with remaining dependency between the correlations within each environmental condition, we assume that the INDSCAL model still applies. The INDSCAL model, consisting of loading matrices A [KxR] and B [JxR] both containing the same number of components R, can be represented by

$$|\Phi_k| - \bar{\Phi} = B D_k B^{T} + E_k \qquad (2.4)$$

where |Φk| is the kth horizontal slice of Φ, $\bar{\Phi}$ [JxJ] contains the mean of Φ over all environmental conditions (k = 1, . . . , K), Dk is a diagonal matrix with the kth row of loading matrix A on its diagonal (elements ak1, . . . , akR) and Ek [JxJ] denotes the residual term of the kth horizontal slice of Φ. It has been shown that a symmetric case of a canonical decomposition (CANDECOMP) model [89] can be used to estimate the parameters of the INDSCAL model [90, p.388–389]. In practice a PARAFAC model [91, 92], which is mathematically equivalent to a CANDECOMP model, can also be used to estimate the parameters of the INDSCAL model. The resulting PARAFAC model, as shown in equation 2.5 [93, p.59–64], consists of loading matrices A, B and C [JxR] with B ≈ C [94].

$$|\Phi_k| - \bar{\Phi} = B D_k C^{T} + E_k, \quad \text{with } B \approx C \qquad (2.5)$$
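Written out per element, the model part of equation 2.4 for condition k is the sum over components r of a_kr b_ir b_jr. The sketch below evaluates this for given loadings; it only reconstructs a slice from loadings and does not fit the model. The function name and the toy loadings are assumptions.

```python
def indscal_slice(A, B, k):
    """Model part B D_k B^T of equation 2.4 for condition k (0-based index).

    A is K x R (condition loadings), B is J x R (reaction loadings).
    """
    J, R = len(B), len(B[0])
    return [[sum(A[k][r] * B[i][r] * B[j][r] for r in range(R)) for j in range(J)]
            for i in range(J)]

# Toy loadings: K = 2 conditions, J = 3 reactions, R = 2 components.
A = [[1.0, 0.0],
     [0.0, 2.0]]
B = [[1.0, 0.0],
     [1.0, 1.0],
     [0.0, 1.0]]
slice_0 = indscal_slice(A, B, 0)
slice_1 = indscal_slice(A, B, 1)
```

Because the same matrix B appears on both sides, every reconstructed slice is symmetric, which is the property that the PARAFAC fit approximates through B ≈ C.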

2.3.5 PCA

Data matrix V [Kx(JxJ)] is created by putting the vectorized absolute correlation matrices of each environmental condition, |Φk|, into its rows. The rows of V represent the environmental conditions as given in Table 2.1.

After mean centering, the PCA model [95, 96] summarizes the mean-centered data matrix V in a bilinear model containing a set of scores T [KxR] and loadings P [(JxJ)xR], with the number of components R ≤ min(K, JxJ), while the residuals in E contain the nonsystematic variation that cannot be modeled.

$$V - \frac{1}{K}\mathbf{1}_K \mathbf{1}_K^{T} V = TP^{T} + E \qquad (2.6)$$

The PCA model does not consider the dependency between the absolute correlations when building the model; this dependency is, however, still present in data matrix V.
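The construction of V and the centering on the left-hand side of equation 2.6 can be sketched as follows. Only the unfolding and centering are shown, not the decomposition into scores and loadings; the function names and toy matrices are assumptions.

```python
def unfold(abs_phi_list):
    """Put the vectorized J x J matrix of each condition into one row of V."""
    return [[v for row in m for v in row] for m in abs_phi_list]

def center_columns(V):
    """Compute V - (1/K) 1 1^T V, i.e. subtract each column's mean over the K rows."""
    K = len(V)
    col_means = [sum(col) / K for col in zip(*V)]
    return [[v - m for v, m in zip(row, col_means)] for row in V]

# Two toy conditions with J = 2 reactions give a 2 x 4 matrix V.
cond_a = [[1.0, 0.2], [0.2, 1.0]]
cond_b = [[1.0, 0.6], [0.6, 1.0]]
V = unfold([cond_a, cond_b])
V_centered = center_columns(V)
```

Every column of `V_centered` sums to zero over the conditions, so the centered matrix has at most K − 1 linearly independent rows.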

2.4 Results

2.4.1 Blocked reactions and futile cycles

Within the six specific environmental conditions given in Table 2.1, in total 210 reactions are blocked for all conditions and 42 reactions involved in futile cycles have been identified and removed from the data. The remaining number of reactions, therefore, equals J = 438.
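As a check on this bookkeeping, with J* = 690 reactions in the reconstructed network:

$$J = J^{*} - 210 - 42 = 690 - 252 = 438$$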

As stated in section 2.3.1 each individual environmental condition can still contain blocked reactions, but these are specific for that environmental condition. The last column of Table 2.1 shows the number of blocked reactions per specific condition. The number of blocked reactions for environmental conditions 4–6 is higher than for environmental conditions 1–3. The explanation for this observation is that in environmental conditions 4–6 a minimal growth medium has been used that contains only seven amino acids, whereas in environmental conditions 1–3 a rich growth medium has been used containing all amino acids. Transport and exchange reactions for amino acids not present in the minimal medium, as given in Table 2.2, become blocked.


Table 2.2: Reactions that get blocked going from rich medium to minimal medium environmental conditions under the same aeration condition.

Abbreviation    Description
EX_ala-L(e)     L-Alanine exchange
EX_arg-L(e)     L-Arginine exchange
EX_asn-L(e)     L-Asparagine exchange
EX_asp-L(e)     L-Aspartate exchange
EX_gln-L(e)     L-Glutamine exchange
EX_gly(e)       Glycine exchange
EX_lys-L(e)     L-Lysine exchange
EX_orn-L(e)     Ornithine exchange
EX_phe-L(e)     L-Phenylalanine exchange
EX_pro-L(e)     L-Proline exchange
EX_ser-L(e)     L-Serine exchange
EX_thr-L(e)     L-Threonine exchange
EX_trp-L(e)     L-Tryptophan exchange
EX_tyr-L(e)     L-Tyrosine exchange
ARGORNt3        Arginine/ornithine antiporter
ARGabc          L-arginine transport via ABC system
ARGt2           L-arginine transport in via proton symport
ASNt2           L-asparagine transport in via proton symport
ASPt2           L-aspartate transport in via proton symport
GLNabc          L-glutamine transport via ABC system
LYSt6           L-lysine transport in/out via proton symport
PHEt6           L-phenylalanine transport in/out via proton symport
PROabc          L-proline transport via ABC system
SERt6           L-serine transport in/out via proton symport
TRPt6           L-tryptophan transport in/out via proton symport
TYRt6           L-tyrosine transport in/out via proton symport

Both within environmental conditions 1–3 and 4–6 there is a decrease in the number of blocked reactions that can be explained by the change in aeration condition. Table 2.3 shows which reactions are blocked under anaerobic growth conditions. The first seven reactions in this table get unblocked when changing from anaerobic to aerobic growth conditions (change from environmental condition 1 to 2 and 4 to 5); this is explained by the fact that these reactions all involve oxygen usage. The last three reactions get unblocked when changing from aerobic to aerobic respiratory conditions (change from environmental condition 2 to 3 and 5 to 6), which can be explained by their involvement in the respiratory cycle.


Table 2.3: Reactions that are blocked under anaerobic growth condition.

No.   Abbreviation   Description
1     EX_o2(e)       O2 exchange
2     ALOX           oxidative decarboxylation of acetolactate (chemical)
3     GSHPO          glutathione peroxidase
4     GTHRD          glutathione-disulfide reductase
5     NOX2           NADH oxidase (H2O forming)
6     NPR            NADH peroxidase
7     O2t            O2 transport in via diffusion
8     CYTB_B2        menaquinol oxidase (7:1 protons)
9     G3PD4          glycerol-3-phosphate dehydrogenase (menaquinone 7)
10    NADH4          NADH dehydrogenase (menaquinone 7 & no proton)

2.4.2 Correlation matrices

Pearson correlation matrices have been calculated for all environmental conditions (Table 2.1). Because of the sign indeterminacy of the correlations between reaction pairs in the metabolic network (described in section 2.3.2) absolute correlations are used. Taking the absolute value of a correlation matrix removes the positive semidefinite property, but the symmetry and the dependency between the correlations within the matrix remain unchanged.

Figure 2.1 shows the absolute correlation matrices for environmental condition 2 (aerobic growth in rich medium) and condition 5 (aerobic growth in minimal medium) of Table 2.1, respectively. The reactions in both correlation matrices have been ordered as discussed in section 2.3.2. Reactions behaving the same can be seen in Figure 2.1 as black blocks on the diagonal with correlation one. There is a very large group of reactions behaving the same and there are several small sets of reactions that perfectly correlate. When comparing Figure 2.1(a) and (b) changes in correlations can be observed. Some clusters change in size, meaning that reactions become uncorrelated to the other reactions, and sometimes the correlation between clusters is different.

To enhance the visualization of the blocks of reactions on the diagonal and the correlation between blocks, logical correlation matrices have been created by setting all correlations within each environmental condition smaller than one to a value of zero and the correlations of the blocked reactions within each condition to a value of three. Setting the blocked reactions to a value of three enables the visualization of reactions that become blocked and unblocked when comparing two environmental conditions. Figure 2.2(a) shows the difference between the logical correlation matrices of Figure 2.1(a) and Figure 2.1(b) and clearly visualizes the changes in correlations between aerobic growth in a rich medium and a minimal medium. The black dots (value of minus three) in Figure 2.2(a) are correlations between reactions,

[Figure 2.1, panel titles: (a) "absolute correlations environmental condition 2"; (b) "absolute correlations environmental condition 5". Both axes: reaction index; color scale from 0 to 1.]

Figure 2.1: Sorted absolute correlations between reactions for: (a) condition 2 of Table 2.1 (rich medium, aerobic); (b) condition 5 of Table 2.1 (minimal medium, aerobic).

[Figure 2.2, panel titles: (a) "differences logical correlations conditions 2 and 5", color scale from −3 to 0; (b) "zoomed differences logical correlations conditions 1 and 3", reactions 390 to 435, color scale from 0 to 3.]

Figure 2.2: Difference between logical correlation matrices for growth: (a) in rich medium and minimal medium under aerobic conditions; (b) under anaerobic and aerobic respiratory conditions in rich medium.

Table 2.4: Reaction clusters that add to an existing cluster of reactions when changing from aerobic growth in rich medium to minimal medium.

Bar no.   Abbreviation   Description
1         G5SADs         L-glutamate 5-semialdehyde dehydratase (spontaneous)
          G5SD           glutamate-5-semialdehyde dehydrogenase
          GLU5K          glutamate 5-kinase
          P5CR           pyrroline-5-carboxylate reductase
          ANPRT          anthranilate phosphoribosyltransferase
          ANS1           anthranilate synthase
          IGPS           indole-3-glycerol-phosphate synthase
          PRAI           phosphoribosylanthranilate isomerase
          TRPS1          tryptophan synthase (indoleglycerol phosphate)
2         ADPDS          acetyl-diaminopimelate deacetylase
          ADPTA          acetyl-diaminopimelate transaminase
          APAT           apolipoprotein N-acyl transferase
          DAPDC          diaminopimelate decarboxylase
          DAPE           diaminopimelate epimerase
          DHDPRy         dihydrodipicolinate reductase (NADPH)
          DHDPS          dihydrodipicolinate synthase
3         PPND           prephenate dehydrogenase
4         ACGK           acetylglutamate kinase
          ACOTA          acetylornithine transaminase
          AGPR           N-acetyl-g-glutamyl-phosphate reductase
          ORNTAC         ornithine transacetylase
5         ARGSL          argininosuccinate lyase
          ARGSS          argininosuccinate synthase
          OCBT           ornithine carbamoyltransferase

that get blocked when L. lactis grows in a minimal medium (Table 2.2) and involve transport and exchange reactions of amino acids that are not present in the minimal growth medium. The grey bars (value of minus one) in Figure 2.2(a) contain reaction clusters (Table 2.4) whose correlation changes with an existing cluster of reactions (they add to the existing cluster, as visible in Figure 2.1(b)). The reactions in these clusters are related to the biosynthesis of amino acids that cannot be taken up from the medium, because they are simply not present. Figure 2.2(b) visualizes the changes in correlations between anaerobic and aerobic respiratory growth conditions in rich medium (only the part containing changes is shown; the rest of the figure contained only zeros). The reactions in the lower right corner of this figure (values of plus three) are the ones that become unblocked going from anaerobic to aerobic respiratory growth condition (Table 2.3). In this cluster of reactions there are some reactions (values of plus two) that become unblocked under growth in aerobic respiratory condition and cluster together (numbers 1/7 and 3/4 from Table 2.3 form two separate clusters). In the upper left part of Figure 2.2(b) there are also three reactions visible in light grey, i.e. EX_diact(e): diacetyl exchange; DIACTt: diacetyl diffusion and ACTD: acetoin dehydrogenase, that form a cluster in rich medium under anaerobic growth. However, in rich medium under aerobic respiratory growth condition one of the reactions (ACTD: acetoin dehydrogenase) drops out of this cluster.
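The logical correlation matrices and their difference can be sketched as follows. Two points are assumptions of this sketch and not statements of the text: the value three is assigned only to entries where both reactions are blocked, and "equal to one" is tested with a small tolerance. The function names and toy matrices are hypothetical.

```python
BLOCKED = 3  # marker value used in the text for blocked reactions

def logical_matrix(abs_phi, blocked, tol=1e-9):
    """1 where the absolute correlation equals one, 0 elsewhere, 3 for blocked pairs.

    blocked is the set of reaction indices blocked in this condition.
    """
    J = len(abs_phi)
    L = [[1 if abs_phi[i][j] >= 1.0 - tol else 0 for j in range(J)] for i in range(J)]
    for i in blocked:
        for j in blocked:
            L[i][j] = BLOCKED
    return L

def difference(L_first, L_second):
    """Element-wise difference L_first - L_second."""
    return [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(L_first, L_second)]

# Toy example: in the second condition reactions 0 and 1 join one cluster,
# while reactions 2 and 3 become blocked.
phi_first = [[1.0, 0.5, 0.2, 0.1],
             [0.5, 1.0, 0.3, 0.2],
             [0.2, 0.3, 1.0, 0.4],
             [0.1, 0.2, 0.4, 1.0]]
phi_second = [[1.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]]
diff = difference(logical_matrix(phi_first, set()),
                  logical_matrix(phi_second, {2, 3}))
```

Under this reading, a pair that joins a cluster shows up as minus one and a pair of reactions that both become blocked as minus three, matching the two kinds of values discussed for Figure 2.2(a).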

Comparisons between two environmental conditions reveal the differences between those conditions, but do not reveal the underlying patterns of which correlations change (i.e. are variant) across multiple environmental conditions due to a change in a specific environmental factor (e.g. aeration condition). When the effects of environmental factors can be identified in a model of the performed experiments, this can possibly help in choosing the environmental factor settings when planning new experiments.

2.4.3 Invariant correlations

The invariant correlations can be identified by examining the standard deviation of the correlations across all environmental conditions, where a standard deviation close to zero signifies correlations that are invariant. Figure 2.3(a) displays the standard deviations of the correlations across all conditions and clearly shows which clusters of reactions (seen as blocks with standard deviation close to zero) are invariant across conditions. For example the large block almost in the middle of Figure 2.3(a) contains invariant correlations between reactions (129 out of 438 reactions) that are directly linked to the production of biomass (lipids, proteins, vitamins, polysaccharides and cell wall components). Table 2.5 shows some reactions from this biomass-production-related cluster with an indication of the class to which they belong. The variant correlations between reactions that are linked to amino acid biosynthesis, as described in the previous section (Figure 2.2(a) and Table 2.4), add to the large block of invariant correlations and are also directly linked to biomass production in the minimal medium. Figure 2.3(a) also shows which correlations between clusters or parts of clusters on the diagonal are invariant across conditions, e.g. correlations between and within reactions 55–63 and 124–131 are invariant (Table 2.6).

Invariance of correlations merely states that the correlations do not change across environmental conditions and does not imply that these correlations are high or low. By studying the mean of the correlations over environmental conditions an impression of the actual value of the correlations can be obtained. Figure 2.3(b) shows the mean correlation over all conditions; average correlations close to one indicate reaction pairs that behave very similarly and belong to the same subset. Although the correlations between and within the reactions in Table 2.6 are invariant and the reactions within each group belong to the same subset, there is no relation between the two subsets because the mean correlation of the reactions between the subsets is close to zero. The most interesting reactions are the ones that have invariant correlations across conditions and high mean correlation over conditions, as for example the reactions related to biomass production. Correlations with high mean and low standard deviation across conditions between subsets of reactions, of

[Figure 2.3, panel titles: (a) "standard deviation correlation across environmental conditions", color scale from 0 to 0.5; (b) "mean correlation across environmental conditions", color scale up to 1. Both axes: reaction index.]

Figure 2.3: Standard deviation and mean of the correlations across all environmental conditions: (a) standard deviation; (b) mean.

Table 2.5: Examples of reactions in the block related to biomass production.

Class                               Abbreviation   Description
Coenzyme A synthesis                DPCOAK         dephospho-CoA kinase
                                    PNTK           pantothenate kinase
                                    PPCDC          phosphopantothenoylcysteine decarboxylase
DNA synthesis                       DNAS_LLA       DNA synthesis, LLA specific
Fatty acid biosynthesis             ACACT1r        acetyl-CoA C-acetyltransferase
                                    FABM           Fatty acid enoyl isomerase (FabM reaction)
                                    MACPMT         Malonyl-CoA:[acyl-carrier-protein] S-malonyltransferase
                                    kaasIII        beta-ketoacyl-ACP synthase III
Lipid biosynthesis                  DMATT          dimethylallyltranstransferase
                                    DPMVD          diphosphomevalonate decarboxylase
                                    HMGCOAS        Hydroxymethylglutaryl CoA synthase
                                    MEVK           mevalonate kinase
                                    UDCPDPS        Undecaprenyl diphosphate synthase
Peptidoglycan and teichoic          GALTAL         Galactose lipoteichoic acid ligase
acid biosynthesis                   GLUR           glutamate racemase
                                    PGAMT          phosphoglucosamine mutase
                                    UDCPDP         undecaprenyl-diphosphatase
Phospholipid biosynthesis           GLYK           glycerol kinase
                                    CLPNS_LLA      Cardiolipin Synthase (lactis specific)
                                    LPGS_LLA       lysylphosphatidyl-glycerol synthetase
Polysaccharide metabolism           G1PTMT         glucose-1-phosphate thymidylyltransferase
                                    PGMT           phosphoglucomutase
                                    UDPG4E         UDPglucose 4-epimerase
Protein synthesis                   ALATRS         Alanyl-tRNA synthetase
                                    LEUTRS         Leucyl-tRNA synthetase
                                    TRPTRS         Tryptophanyl-tRNA synthetase
Pyrimidine biosynthesis             DTMPK          dTMP kinase
RNA synthesis                       RNAS_LLA       RNA synthesis, lactis specific
Vitamins and cofactor metabolism    DHFR           dihydrofolate reductase
                                    TMDS           thymidylate synthase

which the individual subsets also have high mean correlations with low standard deviation across conditions, are also very interesting because these subsets display the same behavior and might be closely related in the sense of functionality.
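The two summary statistics used in this section can be sketched as follows. The divisor K − 1 for the standard deviation and the tolerance used to call a pair invariant are assumptions; the text does not specify either. The function names and toy matrices are hypothetical.

```python
from math import sqrt

def mean_and_std_across_conditions(abs_phi_list):
    """Element-wise mean and standard deviation over the K conditions."""
    K = len(abs_phi_list)
    J = len(abs_phi_list[0])
    mean = [[0.0] * J for _ in range(J)]
    std = [[0.0] * J for _ in range(J)]
    for i in range(J):
        for j in range(J):
            values = [m[i][j] for m in abs_phi_list]
            mu = sum(values) / K
            mean[i][j] = mu
            # Sample standard deviation (divisor K - 1): an assumption of this sketch.
            std[i][j] = sqrt(sum((v - mu) ** 2 for v in values) / (K - 1))
    return mean, std

def invariant_pairs(std, tol=0.05):
    """Reaction pairs (i < j) whose correlation barely changes across conditions."""
    J = len(std)
    return [(i, j) for i in range(J) for j in range(i + 1, J) if std[i][j] <= tol]

# Toy data: pair (0, 1) is invariant and high, pair (1, 2) invariant and low,
# pair (0, 2) changes between the two conditions.
cond_a = [[1.0, 1.0, 0.2], [1.0, 1.0, 0.0], [0.2, 0.0, 1.0]]
cond_b = [[1.0, 1.0, 0.8], [1.0, 1.0, 0.0], [0.8, 0.0, 1.0]]
mean, std = mean_and_std_across_conditions([cond_a, cond_b])
pairs = invariant_pairs(std)
```

Both (0, 1) and (1, 2) are reported as invariant, but only the mean separates the pair that always behaves the same from the pair that is never related, which is the distinction drawn between Figure 2.3(a) and (b).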


Table 2.6: Clusters of reactions that have invariant correlations between each other.

Cluster   Reaction   Abbreviation   Description
          number
1         55         CHORS          chorismate synthase
          56         DAHPS          3-deoxy-D-arabino-heptulosonate 7-phosphate synthetase
          57         DHQD           3-dehydroquinate dehydratase
          58         DHQS           3-dehydroquinate synthase
          59         PSCVT          3-phosphoshikimate 1-carboxyvinyltransferase
          60         RPE            ribulose 5-phosphate 3-epimerase
          61         SHK3D          shikimate dehydrogenase
          62         SHKK           shikimate kinase
          63         TKT2           transketolase
2         124        ADSL2          adenylosuccinate lyase
          125        AIRC           phosphoribosylaminoimidazole carboxylase
          126        GARFT          phosphoribosylglycinamide formyltransferase
          127        GLUPRT         glutamine phosphoribosyldiphosphate amidotransferase
          128        PRAGS          phosphoribosylglycinamide synthetase
          129        PRAIS          phosphoribosylaminoimidazole synthetase
          130        PRASCS         phosphoribosylaminoimidazolesuccinocarboxamide synthase
          131        PRFGS          phosphoribosylformylglycinamidine synthase

2.4.4 PARAFAC

In order to find the underlying patterns of which correlations between reactions change due to changes in environmental conditions, multivariate data analysis has been applied to the multi-environmental absolute correlation matrices. We start with a PARAFAC model, because the data (stacked absolute correlation matrices) in itself has a multi-way structure and the PARAFAC model keeps the dependency between the absolute correlations intact. In the PARAFAC model we have chosen two components. Addition of a third component did not result in a more interpretable model. Mean centering across conditions is applied before data analysis to show differences between the correlations over all environmental conditions.

A two-component unconstrained PARAFAC model explains 61% of the variation in datacube Φ. Figure 2.4(a) shows that the scores for the environmental mode (A) are strongly related to each other and one component would be enough to describe the separation between the environmental conditions, whereas the expected number of components is at least two because of the two design factors (medium and aeration) in the environmental conditions. The strong relation between the scores for the environmental mode is also confirmed by the Tucker congruence

33

Page 46: Fusing prior knowledge with microbial metabolomics · M a i k e l P. H. V e r o u d e n Fusing prior knowledge with microbial metabolomics Fusing prior knowledge with microbial metabolomics

2. Multi-way analysis of flux distributions

−60 −50 −40 −30 −20 −10 0 10 20 30 40 50 60−60

−50

−40

−30

−20

−10

0

10

20

30

40

50

60

1 2 3

4 5

6

scores for mode A component 1

scor

es fo

r m

ode

A c

ompo

nent

2

scores for mode A component 2 vs. component 1

(a)

−50 −40 −30 −20 −10 0 10 20 30 40 50−50

−40

−30

−20

−10

0

10

20

30

40

50

1

2

3

4

5 6

scores for mode A component 1

scor

es fo

r m

ode

A c

ompo

nent

2

scores for mode A component 2 vs. component 1

(b)

Figure 2.4: Scores on component 2 versus component 1 for the environmental mode (A) of, (a) an uncon-strained PARAFAC model;(b) a PARAFAC model with orthogonality constraints on the environmentalmode.

34

Page 47: Fusing prior knowledge with microbial metabolomics · M a i k e l P. H. V e r o u d e n Fusing prior knowledge with microbial metabolomics Fusing prior knowledge with microbial metabolomics

2.4. Results

value between component 1 and 2 of the environmental mode A (TA12 = −1.0000),

which indicates a degeneracy of that mode. The Tucker congruence values betweencomponent 1 and 2 for mode B and C are, TB

12 = TC12 = 0.9929. Multiplication of

the Tucker congruence values of all three modes results in Tucker’s congruencecoefficient for component 1 and 2 (T12 = −0.9858). For a degenerated model, i.e.PARAFAC is unable to correctly fit the trilinear model, Tucker’s congruence coeffi-cient will be close to minus one [93, p.107–108].
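The congruence check described above can be sketched in a few lines. This is only an illustration, assuming that the Tucker congruence between two loading vectors is their uncentered cosine and that the overall coefficient is the product over the three modes, as stated in the text. The loading vectors below are hypothetical and are not taken from the fitted model.

```python
import math

def tucker_congruence(u, v):
    """Uncentered cosine between two loading vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triple_congruence(comp1, comp2):
    """Product of the per-mode congruences of two PARAFAC components.

    comp1 and comp2 are (a, b, c) tuples holding the loading vectors
    of one component for modes A, B and C.
    """
    result = 1.0
    for u, v in zip(comp1, comp2):
        result *= tucker_congruence(u, v)
    return result

# Hypothetical loadings: mode A vectors point in opposite directions,
# mode B and mode C vectors are nearly parallel.
comp1 = ([1.0, 2.0, 3.0], [1.0, 1.0, 0.0], [0.5, 0.5, 1.0])
comp2 = ([-2.0, -4.0, -6.0], [1.0, 0.9, 0.1], [0.5, 0.6, 1.0])

print(tucker_congruence(comp1[0], comp2[0]))  # congruence for mode A
print(triple_congruence(comp1, comp2))        # overall coefficient
```

A value of the overall coefficient close to minus one, as in this toy case, is the pattern that the text associates with a degenerate two-component solution.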

To resolve the relationship between the two components in the PARAFAC model we have built a two-component PARAFAC model with an orthogonality constraint on environmental mode A. This model explains 45% of the variation in datacube Φ. The scores on the environmental mode shown in Figure 2.4(b) now display a direction clearly visualizing the separation between rich and minimal growth medium and a direction that describes some of the variation caused by changes in aeration. However, both effects are mixed over the two components. The Tucker congruence values of the second and third mode for components 1 and 2 ($T^{B}_{12} = T^{C}_{12} = 0.9196$) show that, although the degeneracy of the environmental mode has been resolved ($T^{A}_{12} = 0$), a degeneracy problem within the components of modes B and C still exists. The outer product of the second (B) and third (C) mode is shown in Figure 2.5(a) for component 1 and in Figure 2.5(b) for component 2. Although the values in both figures are not exactly the same, the pattern displayed is very similar for both components, as could be expected from the degeneracy of the second and third mode. Comparison of the patterns in Figure 2.5 with the pattern in Figure 2.2(a) shows that both components mainly describe the variation in the correlations caused by a change in growth medium. The variation in correlations caused by a change in aeration condition, which results in changes of blocked reactions (visible in the lower right part of Figure 2.2(b) and given in Table 2.3), is not explained at all by the PARAFAC model. The outer product between the loadings of modes B and C for both components (Figure 2.5) also shows that the large block of reactions related to biomass product is assigned non-zero values, even though those correlations are invariant across conditions (section 2.4.3). The reason is that it is impossible for multi-way component models to simultaneously explain zeros and non-zero values when there are blocks with zero value present in the data. This raises the question whether PARAFAC is a good choice for analyzing these data.

2.4.5 PCA

The second multivariate data analysis method applied is PCA, used after mean centering of the vectorized absolute correlation matrices of all environmental conditions. Mean centering is again applied to show differences between correlations across environmental conditions.
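The preprocessing described above (vectorizing each absolute correlation matrix and mean centering over the conditions) can be sketched as follows. This is a minimal illustration with hypothetical 3 x 3 matrices. Unfolding the full matrix row by row is an assumption, since the text does not specify the unfolding order or whether only one triangle is used.

```python
def vectorize(matrix):
    """Unfold a square matrix row by row into one long list."""
    return [value for row in matrix for value in row]

def mean_center_columns(rows):
    """Subtract the mean over conditions from every vectorized entry."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

# Hypothetical absolute correlation matrices for three conditions.
# Reactions 1 and 2 are always fully correlated; their correlation
# with reaction 3 changes from condition to condition.
conditions = [
    [[1.0, 1.0, 0.2], [1.0, 1.0, 0.2], [0.2, 0.2, 1.0]],
    [[1.0, 1.0, 0.8], [1.0, 1.0, 0.8], [0.8, 0.8, 1.0]],
    [[1.0, 1.0, 0.5], [1.0, 1.0, 0.5], [0.5, 0.5, 1.0]],
]

V = mean_center_columns([vectorize(m) for m in conditions])

# Entries that are zero for every condition after centering are invariant.
invariant = [j for j in range(len(V[0]))
             if all(abs(row[j]) < 1e-12 for row in V)]
print(invariant)
```

In this toy case the invariant correlations become exact zeros after centering, which is the block structure that the latent variable models have to cope with.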

Figure 2.5: Outer product between mode B and mode C for the PARAFAC model with orthogonality constraints on the environmental mode A: (a) component 1; (b) component 2.

A two-component PCA model of the mean-centered data matrix V captures 82% of the variation. Two components have been chosen because a scree graph of the eigenvalues versus the number of components [96] clearly showed an elbow at two components. This number of components matches the underlying design of the environmental conditions, namely growth medium and aeration condition. One component for the aeration condition might not be sufficient because it has three levels, but addition of an extra component did not reveal extra information with respect to the aeration condition.

Figure 2.6: Scores on PC2 (11%) versus PC1 (71%) for a two-component PCA model.

A plot of the scores on principal component 2 (PC2) versus principal component 1 (PC1), as shown in Figure 2.6, reveals that PC1 (explaining 71% of the variation in the data) clearly separates the rich medium environmental conditions from the minimal medium conditions. PC2 does not show a clear separation between environmental conditions with respect to aeration, but does capture part of the variation in the data linked to a change in aeration. The fact that a change in growth medium accounts for most of the variation in the correlations between reactions over all environmental conditions is not very surprising: the comparison of two conditions (Figure 2.1 and Figure 2.2) already showed that a change in growth medium causes many changes in the correlations, even between only two conditions. Figure 2.7 shows the loadings of PC1 and PC2 of the PCA model after folding them back into a correlation matrix structure for ease of interpretation. Because the first PCA loading (Figure 2.7(a)) is linked to the change in medium, the correlations between reactions that vary the most with the change in growth medium can be recognized as correlations with a high absolute loading. Figure 2.7(a) displays the same pattern of dots and bars (Figure 2.2(a)) as described in section 2.4.1, and the reactions with highly variable correlations are indeed the same reactions as given in Tables 2.2 and 2.4. The second PCA loading (Figure 2.7(b)) displays highly variable correlations in the lower right corner. The reactions linked to these variable correlations are those that change with a change in aeration condition, as described in section 2.4.1 (Figure 2.2(b) and Table 2.3).

Figure 2.7: Absolute PCA loadings after folding back into correlation matrix structure: (a) loading for PC1; (b) loading for PC2.

2.5 Discussion and conclusion

Correlation matrices have been calculated for each condition using flux distributions through reactions in a metabolic network of L. lactis for several environmental conditions. Because of the sign indeterminacy of the correlations between reaction pairs in a metabolic network, absolute correlations are used, but this removes the positive semidefinite property of a correlation matrix. The resulting absolute correlation matrices are still symmetric and the dependency between correlations remains.

When the absolute correlation matrices of all conditions are stacked and mean centered, correlations between reactions that are invariant across environmental conditions can be identified. For identifying correlations that change (i.e. are variant) across environmental conditions, multivariate approaches (PARAFAC and PCA) are applied.

The PARAFAC model has been used because the data of stacked absolute correlation matrices has in itself a three-way structure and PARAFAC keeps the dependency among the correlations intact. The two-component PARAFAC model with orthogonality constraints on the environmental mode does not explain the variation in the data very well (only 45% of the variation is explained). The model suffers from degeneracy in the second (B) and third (C) mode, and the outer product of these modes for both components only displays the variant correlations of reactions affected by a change in growth medium. The variant correlations between reactions due to a change in aeration condition cannot be identified at all with the PARAFAC model. Furthermore, PARAFAC has difficulty in modeling the specific structure of the data, which contains many zeros after mean centering due to correlations that are invariant across environmental conditions. Techniques like weighted PARAFAC [97], maximum likelihood PARAFAC [98] or MILES [99] might help in dealing with this specific structure in the data.

Instead of dealing with the specific structure in the data with techniques as suggested above, we have chosen to vectorize the absolute correlation matrices of the environmental conditions and apply PCA. Although PCA does not consider the dependency between the correlations in each environmental condition when building the model, this dependency is of course still present in the data after vectorizing. By vectorizing the data PCA gets many more parameters for modeling the data than a PARAFAC model; combined with the fact that PCA ignores the dependency between the correlations, it is not surprising that the PCA model explains much more of the variation present in the data (82%) than the PARAFAC model. However, allowing so much more freedom in modeling the data by removing the dependency between the correlations carries a real risk of overfitting. The loadings of the PCA model do, however, clearly show the correlations that vary across conditions with the change in both growth medium and aeration condition.

The crucial question that remains is whether latent variable models are suitable for analyzing this type of data. Perhaps clustering-type models are better equipped for describing this data type, in which blocks of invariant and variant correlations exist. In our further research we intend to focus on simplivariate models [100] and three-mode partitioning [101] for identifying invariant and variant blocks of correlations.

Chapter 3

Maximum likelihood scaling (MALS)†

†This chapter is published as: Huub C.J. Hoefsloot, Maikel P.H. Verouden, Johan A. Westerhuis and Age K. Smilde. J. Chemom. 2006;20(3–4):120–127. Copyright © 2007 John Wiley & Sons, Ltd. DOI: 10.1002/cem.996


Abstract

A filtering procedure is introduced for multivariate data that does not suffer from noise amplification by scaling. A maximum likelihood principal component analysis (MLPCA) step is used as a filter that partly removes noise. This filtering can be used prior to any subsequent scaling and multivariate analysis of the data and is especially useful for data with moderate and low signal-to-noise ratios, such as metabolomics, proteomics and transcriptomics data.

KEYWORDS: scaling, preprocessing, omics-data, noise, filtering.


3.1 Introduction

In many multivariate data analysis methods scaling is performed as a preprocessing step. There are often good reasons to perform such a scaling step, for example prior to a principal component analysis (PCA) of the data [58, 87]. In many cases, the correlation between variables is more important than the covariance between these variables. In that case, autoscaling (AS) can be performed on the data to analyze the correlations. The problem with this type of scaling is that it does not take any noise characteristics into consideration. AS can easily result in noise amplification and an increase in heteroscedasticity of the data. This could lead to poor results for autoscaled PCA [102]. Other data analysis methods, such as clustering and classification, also suffer from noise amplification after scaling.

Noise amplification is especially a problem in data with a moderate or low signal-to-noise ratio. Even in data with a high signal-to-noise ratio, AS has to be performed carefully in order not to amplify the noise in signal-scarce domains. In the emerging post-genomic field, metabolomics, proteomics and transcriptomics data are becoming available. Such data usually show large variation in dynamic range between the variables, with a moderate to low signal-to-noise ratio, and we expect our method to perform well in these areas. In the literature there are several papers [59, 99, 103–105] on how to deal with noisy data in a PCA context. The methods described in these papers calculate the most likely PCA solution taking the distribution of the noise into account: the maximum likelihood solution. These methods, however, focus on modeling, whereas our procedure focuses on filtering. An example in which a component model is used as a noise filter can be found in Chen et al. [106].

The method we propose here is a two-step method. In the first step the above-mentioned maximum likelihood solution is calculated; in the second step this solution is scaled. Subsequently the data can be analyzed with any data analysis tool. The advantage of this procedure is that part of the noise is removed in the first step, so that the scaling step is performed on data with reduced noise.

We shall demonstrate the superiority of MALS over ordinary scaling using two artificial test cases with different noise characteristics. Furthermore, MALS will be used on a real-life metabolomics data set [58]. From this example it will become clear how to use MALS in practice. As an analysis tool for the data we use PCA, being one of the most used techniques in multivariate data analysis. Although there are many papers in the literature on scaling and many on noise in multivariate data analysis, this paper introduces a new method that deals with the problems arising from scaling noisy data.

3.2 Methods

In the following section we will first briefly explain how AS works and subsequently describe the MALS approach. PCA is chosen as an example of a much-used multivariate analysis method, but the results carry over to other methods as well. Matrices are written as boldface uppercase characters, vectors as boldface lowercase characters and scalars as italics.

Figure 3.1: MALS: the first step filters the data and the second step provides the user with the desired view on the data.

3.2.1 Autoscaling (AS)

Let $\mathbf{X}$ denote the matrix of $J$ variables obtained in $I$ experiments. AS comes down to mean-centering the data and scaling each column by its standard deviation:

$$x^{\mathrm{scaled}}_{ij} = \frac{x_{ij} - \bar{x}_{j}}{s_{j}} \qquad (3.1)$$

where $\bar{x}_{j}$ represents the mean of the $j$th column of $\mathbf{X}$ and $s_{j}$ is the standard deviation of that column ($x_{ij}$ is the typical element of $\mathbf{X}$). The matrix $\mathbf{X}_{\mathrm{auto}}$ contains the numbers $x^{\mathrm{scaled}}_{ij}$.
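Equation 3.1 can be illustrated with a small sketch in plain Python. The function name and the toy data are illustrative only, and the use of the sample standard deviation (divisor I - 1) is an assumption, since the text does not state which divisor is used.

```python
import math

def autoscale(X):
    """Mean-center each column and divide it by its standard deviation."""
    n_rows = len(X)
    columns = list(zip(*X))
    means = [sum(col) / n_rows for col in columns]
    # Sample standard deviation (divisor I - 1); this choice is an assumption.
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / (n_rows - 1))
        for col, m in zip(columns, means)
    ]
    return [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in X
    ]

# Toy data: two variables that differ by two orders of magnitude.
X = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
X_auto = autoscale(X)
for row in X_auto:
    print(row)
```

After autoscaling both columns have zero mean and unit standard deviation, so the small and the large variable get the same weight in a subsequent analysis. Any noise in the small variable is scaled up by the same factor.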

3.2.2 Maximum likelihood scaling (MALS)

The first step of MALS is to partly remove noise from the original data. This is achieved by MLPCA, which yields a model of the data. Next, this model can be scaled with an appropriate scaling method, for example AS. The MALS approach is depicted in Figure 3.1. Subsequently, the scaled model can be analyzed with an appropriate analysis method, for example PCA.

3.2.3 Maximum likelihood principal component analysis (MLPCA)

Standard PCA, see Jolliffe [96], minimizes the objective function $g(\mathbf{T},\mathbf{P}) = \|\mathbf{X} - \mathbf{T}\mathbf{P}^{T}\|^{2}$ for preset dimensions of the matrices $\mathbf{T}$ and $\mathbf{P}$, where the matrix $\mathbf{X}$ contains the measurements. MLPCA can use knowledge of the error in the variables and of the error covariation between variables. If there is no covariation between the errors of the variables, the problem simplifies to a weighted PCA (WPCA) formulation: $g_{w}(\mathbf{T},\mathbf{P}) = \|\mathbf{W} \circ (\mathbf{X} - \mathbf{T}\mathbf{P}^{T})\|^{2}$, where $\circ$ denotes a Hadamard (element-wise) product. Here the matrix $\mathbf{W}$ has the same dimensions as the data and contains the reciprocal measurement uncertainties.
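To make the role of the weights concrete, the sketch below evaluates the weighted objective for given scores, loadings and weights. It assumes that the norm is the sum of squared elements (a least-squares criterion) and it only evaluates the objective; it does not perform the minimization. All numbers are made up for illustration.

```python
def weighted_residual_ss(X, T, P, W):
    """Sum of squared weighted residuals of the model T P^T.

    X and W have one row per experiment and one column per variable,
    T has one row of scores per experiment and P has one row of
    loadings per variable.
    """
    total = 0.0
    for i, row in enumerate(X):
        for j, x_ij in enumerate(row):
            model_ij = sum(t * p for t, p in zip(T[i], P[j]))
            total += (W[i][j] * (x_ij - model_ij)) ** 2
    return total

# Hypothetical 2 x 2 data with a one-component model.
X = [[1.0, 10.0], [2.0, 22.0]]
T = [[1.0], [2.0]]
P = [[1.0], [10.0]]
ones = [[1.0, 1.0], [1.0, 1.0]]      # unit weights give the ordinary PCA objective
W = [[1.0, 0.1], [1.0, 0.5]]         # reciprocal uncertainties (made up)

print(weighted_residual_ss(X, T, P, ones))
print(weighted_residual_ss(X, T, P, W))
```

With unit weights the only misfit (a residual of 2 in the large variable) counts fully. With a weight of 0.5 on that element it contributes only a quarter as much, which is how a large measurement uncertainty reduces the influence of an element on the fit.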


Several algorithms exist for MLPCA and WPCA [99, 103, 104, 107, 108]. In this paper we do not consider correlated error structures but only cases that can be solved using WPCA. The proposed strategy also holds for the general case. For solving the minimization problem a strategy similar to that of Schuermans et al. [107] is adopted. The objective function is written in such a form that the minimization can be performed by the non-linear solver of MATLAB (lsqnonlin). The MATLAB m-files are available on the web: http://www.bdagroup.nl. We found this approach much faster than the available software for this type of problem, a conclusion that is in line with the results of Schuermans et al. [107]. The minimization stops at a point where either the residuals or the solution do not change by more than predefined stop criteria. In all simulations performed in this paper the criteria are tightened until the solution is stable.

3.3 Test cases

3.3.1 Artificial data

For illustration we start with artificial data to compare our approach with straightforward AS of the original (non-filtered) data. Artificial data can be made in an infinite number of ways. The first condition we put on our artificial data is that the values of the variables vary over orders of magnitude. If the importance of the small variables is comparable to the importance of the large ones, it is logical to perform some type of scaling. Secondly, the number of experiments is made smaller than the number of variables. These characteristics are also found in, for example, metabolomics data.

The data matrix $\mathbf{X}_{\mathrm{true}}$ is constructed by defining two smaller matrices $\mathbf{A}$ and $\mathbf{B}$. The size of $\mathbf{A}$ is chosen to be 6 by 2 and the elements of $\mathbf{A}$ are drawn from a standard normal distribution. The size of $\mathbf{B}$ is chosen to be 10 by 2 and the elements of $\mathbf{B}$ are drawn from a uniform distribution between 0 and 1. To introduce the differences in magnitude mentioned in the paragraph above, the values in rows 6 to 8 of $\mathbf{B}$ are multiplied by $10^{3}$ and those in rows 9 and 10 by $10^{5}$.

The matrices $\mathbf{A}$ and $\mathbf{B}$ that were obtained following the procedure described above were:

$$\mathbf{A} = \begin{pmatrix}
-2.4634 \times 10^{-1} & +4.8530 \times 10^{-1} \\
+6.6302 \times 10^{-1} & -5.9549 \times 10^{-1} \\
-8.5420 \times 10^{-1} & -1.4967 \times 10^{-1} \\
-1.2013 \times 10^{0} & -4.3475 \times 10^{-1} \\
-1.1987 \times 10^{-1} & -7.9330 \times 10^{-2} \\
-6.5294 \times 10^{-2} & +1.5352 \times 10^{0}
\end{pmatrix}$$

$$\mathbf{B} = \begin{pmatrix}
9.0161 \times 10^{-1} & 9.8299 \times 10^{-1} \\
5.5839 \times 10^{-3} & 5.5267 \times 10^{-1} \\
2.9741 \times 10^{-1} & 4.0007 \times 10^{-1} \\
4.9162 \times 10^{-2} & 1.9879 \times 10^{-1} \\
6.9318 \times 10^{-1} & 6.2520 \times 10^{-1} \\
3.7589 \times 10^{2} & 7.5367 \times 10^{2} \\
9.8765 \times 10^{0} & 7.9387 \times 10^{2} \\
4.1986 \times 10^{2} & 9.1996 \times 10^{2} \\
8.4472 \times 10^{4} & 6.2080 \times 10^{4} \\
3.6775 \times 10^{4} & 7.3128 \times 10^{4}
\end{pmatrix}$$

With these matrices the artificial data are created:

$$\mathbf{X}_{\mathrm{true}} = \mathbf{A}\mathbf{B}^{T} \qquad (3.2)$$

The absolute values in $\mathbf{X}_{\mathrm{true}}$ range from $1.2429 \times 10^{-2}$ to $1.2847 \times 10^{5}$. The autoscaled $\mathbf{X}_{\mathrm{true}}$, denoted $\mathbf{X}_{\mathrm{trueauto}}$, is the target (noiseless) data. The results of AS of the noisy data and of the MALS approach are compared with the true results. Two cases are considered, one with homoscedastic noise and the other with heteroscedastic noise:

$$\mathbf{X} = \mathbf{X}_{\mathrm{true}} + \mathbf{N}_{i} \qquad (3.3)$$

where $i = 1$ indicates the homoscedastic case and $i = 2$ the heteroscedastic case.
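As a check on Equation 3.2, the noiseless data can be rebuilt from the printed matrices A and B. The sketch below uses the five-digit values as printed, so the smallest and largest absolute values agree with the reported range only up to rounding of the inputs.

```python
A = [
    [-2.4634e-01,  4.8530e-01],
    [ 6.6302e-01, -5.9549e-01],
    [-8.5420e-01, -1.4967e-01],
    [-1.2013e+00, -4.3475e-01],
    [-1.1987e-01, -7.9330e-02],
    [-6.5294e-02,  1.5352e+00],
]
B = [
    [9.0161e-01, 9.8299e-01],
    [5.5839e-03, 5.5267e-01],
    [2.9741e-01, 4.0007e-01],
    [4.9162e-02, 1.9879e-01],
    [6.9318e-01, 6.2520e-01],
    [3.7589e+02, 7.5367e+02],
    [9.8765e+00, 7.9387e+02],
    [4.1986e+02, 9.1996e+02],
    [8.4472e+04, 6.2080e+04],
    [3.6775e+04, 7.3128e+04],
]

# X_true = A B^T: element (i, j) is the inner product of row i of A and row j of B.
X_true = [[sum(a * b for a, b in zip(row_a, row_b)) for row_b in B] for row_a in A]

abs_values = [abs(x) for row in X_true for x in row]
print(len(X_true), len(X_true[0]))       # 6 experiments, 10 variables
print(min(abs_values), max(abs_values))  # range of absolute values
```

The result is a 6 by 10 matrix of rank two whose columns span about seven orders of magnitude, which is the situation in which some form of scaling is usually considered.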

3.3.2 Homoscedastic case

Here $\mathbf{N}_{1}$ consists of i.i.d. noise from a normal distribution with zero mean and a variance of 0.2. This leads to a signal-to-noise ratio in $\mathbf{X}$ ranging from $10^{-6}$ to 10, which is of the order of magnitude found in omics data. The autoscaled matrix $\mathbf{X}$ is denoted by $\mathbf{X}_{\mathrm{auto}}$. The difference between $\mathbf{X}_{\mathrm{auto}}$ and $\mathbf{X}_{\mathrm{trueauto}}$ is the error of the traditional approach:

$$\mathrm{error} = \|\mathbf{X}_{\mathrm{auto}} - \mathbf{X}_{\mathrm{trueauto}}\|_{2} \qquad (3.4)$$

For MALS the MLPCA of $\mathbf{X}$ is calculated, using two principal components and the non-linear solver algorithm. The result is then autoscaled to obtain the matrix $\mathbf{X}_{\mathrm{MALS}}$. The difference between the true value and the MALS model is then given by:

$$\mathrm{error}_{\mathrm{MALS}} = \|\mathbf{X}_{\mathrm{MALS}} - \mathbf{X}_{\mathrm{trueauto}}\|_{2} \qquad (3.5)$$

The noise for $\mathbf{X}$ is generated 20000 times and both the MALS and the AS errors are calculated. The average error is 2.3761 for the AS approach and 1.2690 for the MALS approach. Figure 3.2 shows the distribution of the errors; a solid line is used for MALS and a dotted line for AS. In 97.835% of the simulations the MALS error is smaller than the AS error.

Figure 3.2: The error distribution (frequency versus error norm) of the homoscedastic test case, for the autoscaling (dotted line) and the MALS (solid line) approach.

Figure 3.3: The error distribution (frequency versus error norm) of the heteroscedastic test case, for the autoscaling (dotted line) and the MALS (solid line) approach.

3.3.3 Heteroscedastic case

In this case the matrix $\mathbf{N}_{2}$ is constructed from $\mathbf{X}_{\mathrm{true}}$ and from $\mathbf{N}$, a matrix with i.i.d. noise from a normal distribution with zero mean and a variance of 0.2:

$$\mathbf{N}_{2} = \mathbf{N} \circ \mathbf{X}_{\mathrm{true}} \qquad (3.6)$$

Again $\circ$ denotes a Hadamard product. The error made by the AS approach is calculated using Equation 3.4 and the MALS error is obtained from Equation 3.5. The $\mathbf{X}$ data are again generated 20000 times and both the AS and the MALS errors are calculated. The average error is 0.91989 for AS and 0.55148 for MALS. Figure 3.3 shows the error distribution: in 96.7% of the simulations the MALS error is smaller than the AS error. Figures 3.2 and 3.3 show that MALS is superior to AS.
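The two noise models of Equations 3.3 and 3.6 differ only in whether the noise is multiplied element-wise by the true signal. A minimal sketch with Python's standard random module is given below; the function names, the seed and the toy values are illustrative and not part of the original simulations.

```python
import math
import random

def normal_noise(n_rows, n_cols, variance, rng):
    """Matrix of i.i.d. zero-mean normal noise with the given variance."""
    sd = math.sqrt(variance)
    return [[rng.gauss(0.0, sd) for _ in range(n_cols)] for _ in range(n_rows)]

def add_homoscedastic(X_true, variance, rng):
    """X = X_true + N1: the same noise level for every element."""
    N1 = normal_noise(len(X_true), len(X_true[0]), variance, rng)
    return [[x + n for x, n in zip(rx, rn)] for rx, rn in zip(X_true, N1)]

def add_heteroscedastic(X_true, variance, rng):
    """X = X_true + N o X_true: noise proportional to the signal."""
    N = normal_noise(len(X_true), len(X_true[0]), variance, rng)
    return [[x + n * x for x, n in zip(rx, rn)] for rx, rn in zip(X_true, N)]

rng = random.Random(0)                     # fixed seed, for reproducibility only
X_toy = [[0.01, 1.0e5], [0.0, 2.0e3]]      # toy values, not the thesis data
X_homo = add_homoscedastic(X_toy, 0.2, rng)
X_hetero = add_heteroscedastic(X_toy, 0.2, rng)
print(X_homo)
print(X_hetero)
```

In the homoscedastic case small elements are swamped by the noise, while in the heteroscedastic case every element keeps roughly the same relative error and an element that is exactly zero stays zero.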

To illustrate the effect of the MALS approach on a multivariate analysis we performed a PCA on $\mathbf{X}_{\mathrm{trueauto}}$, $\mathbf{X}_{\mathrm{MALS}}$ and $\mathbf{X}_{\mathrm{auto}}$. We did this for ten homoscedastic noise realizations as described in subsection 3.3.2. As can be seen from the score plot in Figure 3.4, the + from the MALS approach are much closer to the diamonds that represent the true solution than the • from the AS approach. Thus using MALS as a preprocessing tool gives better results than AS.

This is no surprise, because Figure 3.2 already showed that MALS is on average better than AS. The loading plot in Figure 3.5 shows a similar result: the MALS preprocessing results resemble the true results much better than those of the AS approach. The true results and the MALS results follow the well-known circular trajectory of a two-PC model that fully describes the data. Some of the results from AS preprocessing are quite far from the true values and could thus lead to false conclusions.

3.3.4 Real data

To demonstrate the possibility of performing MALS on a real metabolomics dataset we use the data described in van den Berg et al. [58]. These data consist of measurements of a part of the metabolome of Pseudomonas putida S12 grown on four different carbon sources. In this paper we used 124 metabolites that were measured with GC-MS. Experimental details can be found in van der Werf et al. [12]. In total we consider 13 batch fermentations: three experiments had fructose as carbon source, five ran on glucose, three on gluconate and two on succinate. In the remainder of this paper the fructose experiments are denoted by the numbers 1, 2 and 3, the glucose experiments by 4, 5, 6, 7 and 8, the gluconate experiments by 9, 10 and 11, and the succinate experiments by 12 and 13. The experiments on glucose, gluconate and succinate each contain an analytical duplicate: 4 and 5, 9 and 10, and 12 and 13, respectively.

3.3.5 Determination of the weight matrix W

Figure 3.4: The scores of PC 1 (95.1%) plotted against the scores of PC 2 (4.9%) of 10 noise realizations for homoscedastic noise; the + indicate the MALS approach and the • indicate the AS results.

Figure 3.5: The loading of PC 1 (95.1%) plotted against the loading of PC 2 (4.9%) of 10 noise realizations for homoscedastic noise; the + indicate the MALS approach and the • indicate the AS results.

In a complicated data set there are usually several sources of variation. The variation that is induced by the experimental design is usually the interesting variation. Other sources of variation, for example biological, sampling and analytical, are usually unwanted. A filtering method should diminish the unwanted variation to make the effects due to the experimental design clearer. The weighting matrix $\mathbf{W}$ should reflect the type of variation that should be filtered. The most common weighting procedures are based on analytical error [105, 109]. In these two papers it is argued that the analytical error is proportional to the signal strength of the analytical device for larger values and constant for small values. We follow the same approach and take the elements of $\mathbf{W}$ inversely proportional to the corresponding data values. For zero values this would obviously lead to problems. Therefore, we introduce a cutoff value of $10^{-4}$: all values smaller than $10^{-4}$ are given a weight of $10^{4}$. A sensitivity analysis showed that the results are not sensitive to the exact cutoff value.
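The weighting rule described above can be written down directly. The sketch below is an illustration only: the function name is hypothetical, a proportionality constant of one is assumed, and taking the absolute value is an assumption that makes no difference for non-negative intensities.

```python
def weight_matrix(X, cutoff=1e-4):
    """Weights inversely proportional to the data values.

    Values below the cutoff all receive the weight 1 / cutoff, which
    avoids division by zero for empty or near-empty measurements.
    """
    return [
        [1.0 / abs(x) if abs(x) >= cutoff else 1.0 / cutoff for x in row]
        for row in X
    ]

# Toy intensities, including a zero and a value below the cutoff.
X = [[0.0, 2.0e-5, 0.5], [4.0, 250.0, 1.0e-4]]
W = weight_matrix(X)
for row in W:
    print(row)
```

Large intensities get small weights, reflecting an error that grows with the signal, while all values below the cutoff share the same large weight, reflecting a constant error for small signals.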

3.3.6 Determination of the number of principal components

Determining the number of principal components in MALS is not a trivial task. The approach we take here is just one of the possibilities. A first consideration is the number of PCs in the ultimate model: the number of PCs in MALS should be larger than the number in the ultimate model. From a scree plot of the autoscaled data it becomes clear that we need at least 3 PCs (Figure 3.6).

Figure 3.6: Scree plot (eigenvalue versus PC number) of the autoscaled data.

In the dataset there are three analytical duplicates. The weights in the matrix $\mathbf{W}$ are chosen in such a way that the analytical error should be filtered out. Thus the analytical duplicates should be closer to each other after the filtering procedure. So for every possible number of PCs the norm of the difference between the duplicates is calculated. The results are presented in Figure 3.7.

Figure 3.7: The norm of the difference between the individual duplicates as a function of the number of PCs: dashed (glucose), dotted (gluconate), dash-dotted (succinate) and the mean of the three norms (solid).

From Figure 3.7 it can be seen that the filtering does not reduce the difference between the duplicates much for glucose and succinate. Note that the point for 13 PCs is equal to that of the original data, because 13 PCs describe the data completely. For gluconate the filter brings the duplicates closer together, especially for 5 and 6 PCs. The minimum of the average curve in Figure 3.7 is at 6 PCs. We therefore took 6 PCs for the MALS filter. To show the effect on a subsequently performed AS and PCA, Figure 3.8 presents the score plots for AS and MALS.
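The selection rule used here (take the number of PCs for which the duplicates lie closest together on average) can be sketched as follows. The filtered matrices below are toy values standing in for the MLPCA results, and the Euclidean norm between the two rows of a duplicate pair is an assumption about how the distance is measured.

```python
import math

def duplicate_distance(X, pair):
    """Euclidean norm of the difference between the two rows of a duplicate pair."""
    i, j = pair
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))

def best_number_of_pcs(filtered_by_pcs, pairs):
    """Return the number of PCs with the smallest mean duplicate distance."""
    mean_distance = {
        n_pcs: sum(duplicate_distance(X, p) for p in pairs) / len(pairs)
        for n_pcs, X in filtered_by_pcs.items()
    }
    best = min(mean_distance, key=mean_distance.get)
    return best, mean_distance

# Toy filtered data for two candidate filters; rows 0 and 1 form a duplicate pair.
filtered = {
    2: [[1.0, 2.0], [1.3, 2.4], [5.0, 5.0]],
    3: [[1.0, 2.0], [1.0, 2.1], [5.0, 5.0]],
}
best, distances = best_number_of_pcs(filtered, pairs=[(0, 1)])
print(best, distances)
```

In this toy case the three-component filter brings the duplicate pair closest together and would therefore be selected.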

From this score plot it can be concluded that MALS with PCA gives a different picture compared to AS with PCA. For both methods the group structure is present in Figure 3.8, but the grouping 1, 2, 3 and 4, 5, 6, 7, 8 and 9, 10, 11 and 12, 13 is tighter for the MALS approach. Figure 3.8 is also consistent with Figure 3.7: the distances between duplicates 4, 5 and 12, 13 are not altered much, but the duplicates 9, 10 are much closer. Because we do not know the true solution it is impossible to say which method is closer to the truth. However, on the basis of theory and of the results for the simulated data, for which the true solution is known, it is more likely that MALS is less influenced by analytical errors.

Figure 3.8: The score plots (PC2 versus PC1) for AS and MALS. + indicates MALS and • indicates AS; 1, 2, 3: fructose; 4, 5, 6, 7, 8: glucose; 9, 10, 11: gluconate; 12, 13: succinate. The duplicates in the data are 4, 5 and 9, 10 and 12, 13.

3.4 Discussion

On the basis of theoretical arguments, MALS should improve the results of a subsequent scaling step due to its noise filtering capability. The artificial test cases indeed show that MALS yields superior results. In more than 97% of the simulations for both cases MALS performs better and the overall mean error is about halved. The rationale behind MALS is that it acts as a filter that reduces the noise that corrupts the data. From this filtered data, a scaling step can be performed to obtain the desired view. This scaling step is not limited to AS; the arguments put forward in favor of MALS should hold for any kind of scaling, for example level scaling or range scaling. Moreover, MALS can be combined with any subsequent multivariate analysis method. Hence, MALS is a general preprocessing tool.

We would like to make some comments on practical issues concerning MALS. First, in order to perform MALS, estimates of the unwanted variation are needed. In our opinion the best way to get such error estimates is to perform repeated measurements. If these cannot be performed, then a priori knowledge of the instrument on which the measurements were performed should generate such estimates. Secondly, the number of PCs in the MALS filtering should be such that the filtering is effective in the sense that the unwanted noise is reduced. Thirdly, if the measurement error is homoscedastic, a normal PCA is equivalent to a MLPCA. Hence, the problem can be solved with easier and faster software.

It is shown in this paper that it is possible to perform MALS on a real metabolomics dataset. The only drawback of using MALS is CPU time. While for an ordinary AS the computing time is in the order of seconds, the CPU time for MALS is in the order of hours for a dataset with the dimensions of the metabolomics data.

The MATLAB m-files that have been used to perform the calculations in this paper are available at http://www.bdagroup.nl. To execute these files the MATLAB optimization toolbox is required.

Acknowledgements

We thank Mariët J. van der Werf and Robert A. van den Berg, TNO Quality of Life, Zeist, the Netherlands, for the metabolomics data.


Chapter 4

Exploring the analysis of structured metabolomics data†

†This chapter is published as: Maikel P.H. Verouden, Johan A. Westerhuis, Mariët J. van der Werf and Age K. Smilde, Chemom. Intell. Lab. Syst. 2009;98(1):88–96. Copyright © 2009 Elsevier B.V. DOI: 10.1016/j.chemolab.2009.05.004


Abstract

In metabolomics research a large number of metabolites are measured that reflect the cellular state under the experimental conditions studied. On many occasions the experiments are performed according to an experimental design to make sure that sufficient variation is induced in the metabolite concentrations. However, as metabolomics is a holistic approach, a large number of metabolites are also measured in which no variation is induced by the experimental design. The presence of such non-induced metabolites hampers traditional data analysis methods such as PCA in estimating the true model of the induced variation. The greediness of PCA leads to a clear overfit of the metabolomics data and can lead to a bad selection of important metabolites. In this paper we explore how, why and how severely PCA overfits data with an underlying experimental design. Recently new data analysis methods have been introduced that can use prior information of the system to reduce the overfit. We show that incorporation of prior knowledge of the system under investigation leads to a better estimation of the true underlying structure and to less overfit. The experimental design information together with ASCA is used to improve the analysis of metabolomics data. To show the improved model estimation property of ASCA a thorough simulation study is used and the results are extended to a microbial metabolomics batch fermentation study. The ASCA model is much less affected by the non-induced variation and measurement error than PCA, leading to a much better model of the induced variation.

KEYWORDS: PCA, ASCA, experimental design, overfit, microbial metabolomics.


4.1 Introduction

Advances in (bio-)analytical techniques enable scientists to use more and more variables to characterize their samples. The fact that the number of experiments is often low leads to data in which the number of variables greatly exceeds the number of experiments. This type of data is often referred to as high dimensional or megavariate. Microbial metabolomics data, as an example of megavariate data, emerges from the 'omics' field that focuses on low molecular weight compounds, so-called metabolites, present in and around microbial cells at a given time during their growth or production cycle [22]. The metabolome, i.e. the concentration of all metabolites, is a reflection of the phenotype of the sample under the studied experimental conditions [9]. The experimental conditions are changed or perturbed such that sufficient variation is induced in the metabolome, which responds to these changes or perturbations in the experimental conditions. Since metabolomics is a holistic approach, covering as many metabolites as possible, there are also always many metabolites in the data set in which no variation is induced by the change or perturbation of experimental conditions. The reason for this is that a change or perturbation in an experimental condition 'hits' or excites only part(s) of the biochemical network, while the rest of the network operates as if under normal operating conditions. These so-called non-induced metabolites can still have a large variation in their concentration, i.e. the metabolites are not tightly regulated, but this variation is not caused by a change or a perturbation of the experimental condition. Furthermore, there is also always some random variation in the data set due to measurement error [58].

Ideally a data analysis method used for analyzing this type of data should only model the induced variation, leaving all other variation for the residuals. Here we define incorporation of all variation other than the induced variation as overfit. Principal Component Analysis (PCA) is often used for explorative data analysis and focuses on describing the maximum variation in the data by modeling it into scores, which provide information on the samples, and loadings, which provide information on the metabolites. By focusing on explaining as much of the variation in the data as possible PCA tends, especially with only few experiments in the data set, to be greedy and, therefore, to overfit the data by incorporating random sampling variation and the variation of the non-induced metabolites into the model.

The use of prior information can help to focus the explorative data analysis. In curve resolution, nonnegativity, unimodality and smoothness constraints help to identify chemical compounds in complex mixtures [110, 111]. In biology, knowledge of transcription factors can be used to unravel complex gene expression data [112, 113]. Recently new methods were introduced that are able to incorporate various types of prior information to focus the data analysis [61, 62]. By using these more advanced explorative analysis methods, focus is on the relevant part of the variation and thus overfit can be reduced and sometimes even additional information can be obtained [114].

An underlying structure that is nowadays often present in many of the collected megavariate 'omics' data sets is an experimental design in which experimental factors are varied to study their effect [115–117]. Several methods exist for analyzing metabolomics data with an underlying experimental design; they focus the analysis onto the variation induced by the design by taking the design into account [115, 118–122]. As a method that uses the underlying design in the microbial metabolomics data to focus the analysis, we have chosen to use ANOVA-Simultaneous Component Analysis (ASCA) [60]. In ASCA the induced variation can be separated from the non-induced variation and measurement error by creating orthogonal partitions. Subsequent simultaneous component analysis of the individual partitions may elucidate the relation between the samples and the metabolic profile for the effects. The orthogonality between the data partitions allows for individual estimation of effects without mixing of effects. Recently a method was developed that allows statistical validation of megavariate effects in ASCA [123]. It also creates the possibility of testing whether the experimental design induces any sources of variation in the data. Although many metabolomics data sets have an underlying experimental design, in the analysis this is still often neglected [124].

The major goal of this paper is to show by comparison of PCA and ASCA how, why and how severely PCA overfits data with an underlying experimental design. By comparison to ASCA the effect of incorporating prior knowledge with respect to experimental design into the explorative analysis can be shown. In a thorough simulation study we will show how PCA and ASCA behave in terms of fit (how well are the induced metabolites that vary according to the underlying design modeled?) and overfit (how much is modeled of the non-induced metabolites that do not vary according to the design?) when modeling metabolomics data in which induced and non-induced metabolites are present. The results of the simulation study will be discussed in terms of the row and column space of the data [125] in order to show why and how incorporation of design information helps to focus the explorative analysis. Furthermore both methods will be used to model microbial metabolomics data obtained from Escherichia coli batch fermentations with an underlying design.

Section 4.2 of this chapter describes PCA, ASCA and their differences. It also describes how the simulated data and the E. coli batch fermentation metabolomics data have been created and which measures were used to assess the ability of modeling the induced and non-induced variation. Section 4.3 describes the results and section 4.4 summarizes the most important findings.

4.2 Methods and Materials

In the following text bold uppercase characters (e.g. $\mathbf{X}$) represent matrices, bold lowercase characters (e.g. $\mathbf{x}$) represent vectors and scalars are displayed as italic characters (e.g. $I$).


4.2.1 Principal Component Analysis (PCA)

PCA [95] decomposes the data $\mathbf{X}$ [$I \times J$], consisting of $I$ samples with $J$ measured variables, into a bilinear model of scores $\mathbf{T}$ [$I \times R$] and loadings $\mathbf{P}$ [$J \times R$] according to

$$\mathbf{X} = \mathbf{T}\mathbf{P}^{T} + \mathbf{E} \qquad (4.1)$$

Here $\mathbf{T}\mathbf{P}^{T}$, the PCA model ($\mathbf{X}_{PCA}$), represents a lower dimensional approximation of $\mathbf{X}$ and $\mathbf{E}$ contains the residuals. The number of principal components $R$, with $R \leq \min(I, J)$, can be chosen by means of cross-validation or by using a scree graph [96]. The scores $\mathbf{T}$ and loadings $\mathbf{P}$ are calculated such that the sum of squares of the residuals $Q$, as shown in equation 4.2, is minimized.

$$Q(\mathbf{T},\mathbf{P}\,|\,\mathbf{X}) = \|\mathbf{X} - \mathbf{T}\mathbf{P}^{T}\|^{2} \qquad (4.2)$$

PCA restricts the scores and loadings with the requirements that $\mathbf{T}^{T}\mathbf{T}$ is a diagonal matrix and $\mathbf{P}^{T}\mathbf{P}$ is an identity matrix. This restriction does not decrease the explained variation but serves to arrive at easily interpretable graphs.
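The decomposition of equations 4.1 and 4.2 can be illustrated with a minimal sketch. It assumes that PCA is computed through a truncated singular value decomposition, which is one common way to obtain the least squares solution; the function name `pca` and the random test matrix are illustrative choices and are not taken from the original work.

```python
import numpy as np

def pca(X, R):
    """Rank-R PCA of a column-centered matrix X via the SVD.

    Returns scores T (I x R), loadings P (J x R) and residuals E,
    so that X = T P^T + E as in equation 4.1.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U[:, :R] * s[:R]      # scores: left singular vectors scaled by singular values
    P = Vt[:R].T              # loadings: orthonormal right singular vectors
    E = X - T @ P.T           # residuals
    return T, P, E

# Illustrative random data (8 samples, 20 variables), mean centered per column
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 20))
X -= X.mean(axis=0)

T, P, E = pca(X, R=3)
Q_res = np.sum(E ** 2)        # residual sum of squares Q of equation 4.2
```

With this construction $\mathbf{P}^{T}\mathbf{P}$ is the identity and $\mathbf{T}^{T}\mathbf{T}$ is diagonal, which matches the restrictions stated above.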

4.2.2 ANOVA-Simultaneous Component Analysis (ASCA)

A recently developed data analytical method for analyzing megavariate metabolomics data with an underlying experimental design is ASCA [60]. This method starts with the ANOVA decomposition of the data $\mathbf{X}$ [$I \times J$] and partitions the variation in the data into orthogonal parts per effect according to the experimental design. The variation partitioning for the effects with ANOVA is achieved by averaging the experiments that have been performed with the same level-setting of the corresponding factors [61]. If for instance the underlying experimental design is a full factorial design with three factors, the partitioning for the main effects can be represented by equation 4.3.

$$\mathbf{X} - \mathbf{X}_{M} = \mathbf{X}_{1} + \mathbf{X}_{2} + \mathbf{X}_{3} + \mathbf{X}_{res} \qquad (4.3)$$

In equation 4.3, $\mathbf{X}_{M}$ represents the matrix of means, which is calculated as

$$\mathbf{X}_{M} = \frac{1}{I}\,\mathbf{1}_{I}\mathbf{1}_{I}^{T}\mathbf{X},$$

with $\mathbf{1}_{I}$ [$I \times 1$] denoting a vector of ones. The matrices $\mathbf{X}_{1}$, $\mathbf{X}_{2}$ and $\mathbf{X}_{3}$ represent the variation partitions of the three main effects (the three design factors) and $\mathbf{X}_{res}$ contains the remainder variation consisting of all interaction effects and all other sources of non-induced variation and measurement error. Of course one can choose to also decompose the interaction partitions or to combine different partitions into a single matrix [61]. Equation 4.4 shows the orthogonality between the variation partitions, which means that the column spaces of the individual matrices on the right side of the equality sign in equation 4.3 are orthogonal. Proof of equation 4.4 is supplied elsewhere [61].

$$\mathbf{X}_{\alpha}^{T}\mathbf{X}_{\beta} = \mathbf{0} \quad \forall \{\alpha, \beta\} \subset \{1, 2, 3, res\} : \alpha \neq \beta \qquad (4.4)$$


The orthogonality between the variation partitions is a desirable property, because it shows that each partition per effect is calculated independently without mixing of effects. Because all variation partitions are orthogonal to each other the following statement is also true,

$$[\mathbf{X}_{1} + \mathbf{X}_{2} + \mathbf{X}_{3}]^{T}\mathbf{X}_{res} = \mathbf{0},$$

and shows that the combined variation partitions for the main effects are orthogonal to and, thus, independent of the remainder variation.

To describe ASCA as a multivariate regression model we focus here on the variation partitioning for the effects within mean centered data ($\mathbf{X}_{mc} = \mathbf{X} - \mathbf{X}_{M}$), having an underlying two level full factorial design. The various multivariate effects can also be obtained by least squares fitting of a linear model [126, 127]. The linear model for an experiment ($i$) with respect to one variable ($j$) in a two level full factorial design with three factors, concerning only the main effects, is shown in equation 4.5.

$$x_{ij} - \frac{\sum_{i=1}^{I} x_{ij}}{I} = d_{i1}\beta_{1j} + d_{i2}\beta_{2j} + d_{i3}\beta_{3j} + (x_{ij})_{res} \qquad (4.5)$$

Here $d_{i1}$, $d_{i2}$ and $d_{i3}$ denote the level-settings ($-1$, $1$) of experiment $i$ for respectively the first, second and third factor in the design. The main effects of, respectively, the first, second and third factor for variable $j$ are displayed as $\beta_{1j}$, $\beta_{2j}$ and $\beta_{3j}$. The remainder for the measured value of variable $j$ in experiment $i$, which cannot be modeled by least squares fitting of the linear model, is shown as $(x_{ij})_{res}$. Combination of the linear models for all ($i = 1, \ldots, I$) experiments and all ($j = 1, \ldots, J$) variables results in:

$$\mathbf{X}_{mc} = \mathbf{D}\mathbf{B}^{T} + \mathbf{X}_{res}, \qquad (4.6)$$

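where $\mathbf{D}$ [$I \times$ Factors] represents the design matrix, containing the level-settings per experiment of each factor for which a separate variation partition should be calculated, and $\mathbf{B}$ [$J \times$ Factors] contains the modeled effects of all factors for each variable. The variation partition for an effect can now be obtained by multiplying the design column of the effect in design matrix $\mathbf{D}$ with the corresponding row of modeled effects from matrix $\mathbf{B}^{T}$. For the main effects of a two level full factorial design with three factors the variation partitioning can, therefore, be obtained as shown in equation 4.7.

$$\begin{aligned}
\mathbf{X}_{mc} &= \mathbf{D}\mathbf{B}^{T} + \mathbf{X}_{res} \\
&= [\mathbf{d}_{\cdot 1}\ \mathbf{d}_{\cdot 2}\ \mathbf{d}_{\cdot 3}]
\begin{bmatrix} \mathbf{b}_{1\cdot}^{T} \\ \mathbf{b}_{2\cdot}^{T} \\ \mathbf{b}_{3\cdot}^{T} \end{bmatrix} + \mathbf{X}_{res} \\
&= \mathbf{d}_{\cdot 1}\mathbf{b}_{1\cdot}^{T} + \mathbf{d}_{\cdot 2}\mathbf{b}_{2\cdot}^{T} + \mathbf{d}_{\cdot 3}\mathbf{b}_{3\cdot}^{T} + \mathbf{X}_{res} \\
&= \mathbf{X}_{1} + \mathbf{X}_{2} + \mathbf{X}_{3} + \mathbf{X}_{res}
\end{aligned} \qquad (4.7)$$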

Here $\mathbf{d}_{\cdot 1}$, $\mathbf{d}_{\cdot 2}$ and $\mathbf{d}_{\cdot 3}$ denote the design columns of matrix $\mathbf{D}$ for respectively the first, second and third factor and $\mathbf{b}_{1\cdot}^{T}$, $\mathbf{b}_{2\cdot}^{T}$ and $\mathbf{b}_{3\cdot}^{T}$ represent the corresponding rows of modeled effects from $\mathbf{B}^{T}$. Equation 4.6 shows that the variation partitioning with ANOVA for a two level full factorial design can be seen as a multivariate linear regression problem, where the mean centered data ($\mathbf{X}_{mc}$) needs to be regressed onto the design matrix ($\mathbf{D}$). The well-known least squares solution for the matrix of regression coefficients $\mathbf{B}$ can, therefore, be obtained by,

$$\mathbf{B} = \mathbf{X}_{mc}^{T}\mathbf{D}(\mathbf{D}^{T}\mathbf{D})^{-1}, \qquad (4.8)$$

which shows that each measured variable in $\mathbf{X}_{mc}$ is separately and columnwise regressed onto the design matrix. Throughout this paper we will use this regression explanation of ASCA to explain its properties compared to PCA.
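The regression view of equations 4.6 to 4.8 can be checked numerically. The following is a minimal sketch assuming a $2^3$ full factorial design without replicates and arbitrary random data; the variable names (`parts`, `Xres`) are illustrative and do not come from the original MATLAB code.

```python
import numpy as np
from itertools import product

# Design matrix D of a 2^3 full factorial design (8 experiments x 3 factors)
D = np.array(list(product([-1, 1], repeat=3)), dtype=float)

# Illustrative data: 8 experiments, 12 variables, mean centered per column
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 12))
Xmc = X - X.mean(axis=0)

# Least squares regression coefficients, equation 4.8 (B is J x Factors)
B = Xmc.T @ D @ np.linalg.inv(D.T @ D)

# Variation partitions X1, X2, X3: design column times row of B^T (equation 4.7)
parts = [np.outer(D[:, k], B[:, k]) for k in range(3)]
Xres = Xmc - sum(parts)

# Orthogonality of the partitions (equation 4.4)
cross_12 = parts[0].T @ parts[1]
cross_main_res = sum(parts).T @ Xres
```

In this sketch `cross_12` and `cross_main_res` are zero up to rounding error, and each partition equals the level averages of the mean centered data, in line with the ANOVA description in section 4.2.2.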

For factorial designs with more than two qualitative levels, dummy variables need to be used in the design matrix to indicate the level settings for the experiments. The number of dummy variable columns for a factor equals the number of levels in the design. Regression of the mean centered data onto the design matrix will, therefore, for each main effect yield rows of regression coefficients, with the number of rows equal to the number of levels in the design. The variation partition of a main effect can then be obtained by multiplying all dummy variable columns of the corresponding factor in the design matrix with the corresponding rows in the matrix of regression coefficients.

After the variation partitioning of the mean centered data ($\mathbf{X}_{mc}$), a simultaneous component analysis (SCA) is performed on each partition representing an effect, as shown in equation 4.9 for the main effects of a full factorial design with three factors.

$$\begin{aligned}
\mathbf{X}_{mc} &= \mathbf{X}_{1} + \mathbf{X}_{2} + \mathbf{X}_{3} + \mathbf{X}_{res} \\
&= \mathbf{S}_{1}\mathbf{L}_{1}^{T} + \mathbf{S}_{2}\mathbf{L}_{2}^{T} + \mathbf{S}_{3}\mathbf{L}_{3}^{T} + \mathbf{X}_{res} \\
&= [\mathbf{S}_{1}\ \mathbf{S}_{2}\ \mathbf{S}_{3}][\mathbf{L}_{1}\ \mathbf{L}_{2}\ \mathbf{L}_{3}]^{T} + \mathbf{X}_{res} \\
&= \mathbf{S}\mathbf{L}^{T} + \mathbf{X}_{res}
\end{aligned} \qquad (4.9)$$

The SCA decomposes the variation partition of each effect ($k$) into score vectors $\mathbf{S}_{k}$ [$I \times V_{k}$] and loading vectors $\mathbf{L}_{k}$ [$J \times V_{k}$]. The number of components $V_{k}$ depends on the number of levels within the $k$th effect of the design and on whether a dimension reduction is applied within the SCA or not. A full factorial design with two levels [48, 128] forms a special case, because the variation partition for each effect has at most rank one and the SCA, therefore, decomposes the variation partition for each effect into one score vector and one loading vector ($V_{k} = 1$). Concatenation of the score and loading vectors, as displayed in equation 4.9, results in the ASCA model ($\mathbf{X}_{ASCA}$) denoted by $\mathbf{S}\mathbf{L}^{T}$. The relation between the variation partitioning and the subsequent SCA for a two level full factorial design can, after combining equations 4.7 and 4.9, be displayed as,

$$\mathbf{D}\mathbf{Q}\mathbf{Q}^{-1}\mathbf{B}^{T} = \mathbf{S}\mathbf{L}^{T}, \qquad (4.10)$$

where $\mathbf{Q}$ represents a diagonal matrix with on its diagonal the lengths of the rows of regression coefficients of matrix $\mathbf{B}^{T}$.
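Equation 4.10 can be illustrated for the two level case with a short sketch. It assumes a $2^3$ full factorial design and random mean centered data; taking $\mathbf{S} = \mathbf{D}\mathbf{Q}$ and $\mathbf{L}^{T} = \mathbf{Q}^{-1}\mathbf{B}^{T}$ follows the definition of $\mathbf{Q}$ given above, while all variable names are illustrative.

```python
import numpy as np
from itertools import product

# 2^3 full factorial design and illustrative mean centered data
D = np.array(list(product([-1, 1], repeat=3)), dtype=float)
rng = np.random.default_rng(2)
Xmc = rng.standard_normal((8, 12))
Xmc -= Xmc.mean(axis=0)

# Regression coefficients (equation 4.8)
B = Xmc.T @ D @ np.linalg.inv(D.T @ D)

# Q holds the length of each row of B^T (i.e. each column of B) on its diagonal
Q = np.diag(np.linalg.norm(B, axis=0))

S = D @ Q                     # scores: design columns scaled by effect size
L = B @ np.linalg.inv(Q)      # loadings: effect profiles scaled to unit length

X_asca = S @ L.T              # ASCA model, identical to D B^T
```

The scores are fixed to the design up to a scaling per factor, so only the loadings carry information estimated from the data.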


4.2.3 Differences between PCA and ASCA

Both PCA and ASCA decompose the data $\mathbf{X}$ into a bilinear model of scores and loadings as described in the previous subsections. The loadings link the modeled lower dimensional subspace to the original variable (metabolite) space and define the directions in the original variable space that span the modeled subspace. The scores define how the modeled experiments are positioned within the modeled subspace. This positioning of experiments within the modeled subspace will from here on be referred to as the configuration.

In PCA the scores $\mathbf{T}$ and loadings $\mathbf{P}$ are calculated with only the non-active restriction of $\mathbf{P}^{T}\mathbf{P}$ being an identity matrix, which means that the resulting loadings are orthonormal. Thus both scores and loadings are fitted to maximize explained variation. In ASCA, on the other hand, the scores $\mathbf{S}$ are fixed, with respect to the configuration, to the settings of the design ($\mathbf{S} = \mathbf{D}\mathbf{Q}$) and thus only the loadings $\mathbf{L}$ are fitted to maximize explained variation. As a consequence the ASCA loadings of different variation partitions are not orthogonal to each other. Thus PCA has many more degrees of freedom to fit the multivariate data than ASCA, which is restricted by prior knowledge of the system under investigation (the applied experimental design). For experimental designs with more than two levels the relation between the variation partitioning and the SCA is more complex than shown in equation 4.10, due to the possibility of dimension reduction in the SCA on a variation partition of an effect.

4.2.4 Simulated data

In this simulation study we want to explore how PCA and ASCA are able to fit metabolomics-like data and investigate the effect of incorporation of prior knowledge, in this case design information, as restrictions into the data analysis. We will make a distinction between induced metabolites that vary according to the experimental design and non-induced metabolites that vary randomly and not according to the design applied. Furthermore we study the effect of increasing the number of induced variables and increasing the number of replicates per design point.

In the first part of the simulation study only induced variables are considered. The data is simulated according to a $2^3$ full factorial design with only main effects. Table 4.1 shows the content of the design matrix $\mathbf{D}$, which consists of the level settings of the three main factors used. The induced metabolites $\mathbf{X}_{ind}$ are then formed by multiplying $\mathbf{D}$ with the matrix of weights for each induced variable per design factor, $\mathbf{C}$ [$J_{ind} \times$ Factors]. Matrix $\mathbf{C}$ has been randomly generated such that $\mathbf{C}^{T}\mathbf{C}$ equals the identity matrix.

$$\mathbf{X}_{ind} = \mathbf{D}\mathbf{C}^{T} \qquad (4.11)$$

If $N$ replicates per experiment are obtained then,

$$\mathbf{X}_{ind,N} = [\mathbf{1}_{N} \otimes \tilde{\mathbf{X}}_{ind}] \qquad (4.12)$$

The character $\otimes$ denotes the Kronecker product and $\mathbf{1}_{N}$ represents an [$N \times 1$] column of ones. The notation $\tilde{\mathbf{X}}_{ind}$ indicates that $\mathbf{X}_{ind}$ has been autoscaled to zero mean and standard deviation 1 for each column. The autoscaling of $\mathbf{X}_{ind}$ causes the resulting matrix $\mathbf{X}_{ind,N}$ to be mean centered and the standard deviation per column to be constant. With $N > 1$, $\mathbf{D}$ should be replaced by $\mathbf{D}_{N} = [\mathbf{1}_{N} \otimes \mathbf{D}]$ in equation 4.6.

Table 4.1: Tabular form of a $2^3$ full factorial design

Experiment   F1   F2   F3
1            -1   -1   -1
2            -1   -1   +1
3            -1   +1   -1
4            -1   +1   +1
5            +1   -1   -1
6            +1   -1   +1
7            +1   +1   -1
8            +1   +1   +1

In the second part of the simulation study a number of non-induced metabolites are added to the induced ones. The non-induced metabolites $\mathbf{X}_{non\text{-}ind}$ do not vary according to the experimental design and are, therefore, simulated as normally distributed random data with zero means and unit variance. If $N$ repeats per design point exist then,

$$\mathbf{X}_{non,N} = \begin{bmatrix} \tilde{\mathbf{X}}_{non\text{-}ind,1} \\ \vdots \\ \tilde{\mathbf{X}}_{non\text{-}ind,n} \\ \vdots \\ \tilde{\mathbf{X}}_{non\text{-}ind,N} \end{bmatrix} \qquad (4.13)$$

Note that the autoscaled non-induced data set is different for each repeat. Also here the autoscaling keeps $\mathbf{X}_{non,N}$ mean centered with a constant variance per column. Concatenation of $\mathbf{X}_{ind,N}$ and $\mathbf{X}_{non,N}$ yields the simulated noiseless data with $N$ replicates per experiment, $\mathbf{X}_{pure,N}$ [$NI \times J$], as displayed in equation 4.14.

$$\mathbf{X}_{N} = [\mathbf{X}_{ind,N}\ \mathbf{X}_{non,N}] + \mathbf{E}_{noise,N} = \mathbf{X}_{pure,N} + \mathbf{E}_{noise,N} \qquad (4.14)$$

By addition of noise ($\mathbf{E}_{noise,N}$) the simulated data with $N$ replicates per experiment, $\mathbf{X}_{N}$ [$NI \times J$], is obtained. $\mathbf{E}_{noise,N}$ contains normally distributed random data scaled in such a way that the ratio of the sum of the squared elements of $\mathbf{X}_{pure,N}$ and the sum of the squared elements of $\mathbf{E}_{noise,N}$ is 9:1, which results in approximately 10% noise in $\mathbf{X}_{N}$. This amount of noise was chosen because it is the amount typically encountered in metabolomics data. Note that $\mathbf{X}_{non,N}$ represents the non-induced metabolites while $\mathbf{E}_{noise,N}$ represents sampling and measurement errors, which affect both the induced and the non-induced metabolites.
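The construction of the simulated data in equations 4.11 to 4.14 can be sketched as follows. This is an illustration under stated assumptions, not the original simulation code: the orthonormal weight matrix is drawn via a QR decomposition of a random matrix, autoscaling uses the sample standard deviation, and the function names `autoscale` and `simulate` are hypothetical.

```python
import numpy as np
from itertools import product

def autoscale(X):
    """Scale every column to zero mean and unit (sample) standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def simulate(J_ind=10, J_non=10, N=2, seed=0):
    """Simulated data following equations 4.11-4.14 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    D = np.array(list(product([-1, 1], repeat=3)), dtype=float)   # Table 4.1

    # Random weights C (J_ind x 3) with C^T C = I (assumption: obtained via QR)
    C, _ = np.linalg.qr(rng.standard_normal((J_ind, 3)))

    X_ind = autoscale(D @ C.T)                                    # eq. 4.11
    X_ind_N = np.kron(np.ones((N, 1)), X_ind)                     # eq. 4.12

    # A new autoscaled non-induced block for every replicate       (eq. 4.13)
    X_non_N = np.vstack([autoscale(rng.standard_normal((8, J_non)))
                         for _ in range(N)])

    X_pure = np.hstack([X_ind_N, X_non_N])

    # Noise scaled so that SS(X_pure) : SS(E) = 9 : 1               (eq. 4.14)
    E = rng.standard_normal(X_pure.shape)
    E *= np.sqrt(np.sum(X_pure ** 2) / (9.0 * np.sum(E ** 2)))
    return D, X_pure, X_pure + E, E

D, X_pure, X_N, E = simulate(J_ind=10, J_non=10, N=2, seed=0)
```

With two replicates, 10 induced and 10 non-induced variables, `X_N` has 16 rows and 20 columns, and the induced block of `X_pure` has rank three.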

In the third and last part of the simulation study only non-induced metabolites are used to show the behavior of both methods when no variation is induced by the experimental design.

4.2.5 Measures for the ability of modeling the induced variation and non-induced variation and measurement error

In this section we introduce the figures of merit used to quantify the ability of PCA and ASCA to model metabolomics data with an underlying experimental design. We make a distinction between the induced metabolites and the non-induced metabolites. A measure for the ability of modeling the induced variation is shown in equation 4.15,

$$1 - \frac{\|\mathbf{X}_{ind,N} - \hat{\mathbf{X}}_{ind,N}\|^{2}}{\|\mathbf{X}_{ind,N}\|^{2}}, \qquad (4.15)$$

where $\hat{\mathbf{X}}_{ind,N}$ represents the modeled induced variation. This measure defines a proportion of agreement between the modeled and the true induced metabolites and will, from here on, be referred to as the fit of the model. The difference between the modeled and the true induced variation can be caused by a difference in the estimated subspace and by a different configuration of the experiments within that subspace. The fit will be 1 when the method used (PCA or ASCA) perfectly models the variation of the induced metabolites ($\mathbf{X}_{ind,N}$) in the simulated data. The more the calculated model is influenced by the variation of the non-induced metabolites or by the measurement noise, the more the fit will deviate from one and go towards zero.

The overfit of both methods is studied from the variation of the non-induced metabolites. The figure of merit that will be used as measure for the overfit is given in equation 4.16,

$$1 - \frac{\|\mathbf{X}_{non,N} - \hat{\mathbf{X}}_{non,N}\|^{2}}{\|\mathbf{X}_{non,N}\|^{2}}, \qquad (4.16)$$

where $\hat{\mathbf{X}}_{non,N}$ represents the modeled part of the non-induced variation. The more of the variation of the non-induced metabolites is described by the model, the higher the overfit, as this variation was not introduced by the experimental design. For good models this value should be as small as possible.
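Both figures of merit share the same form, so a single helper suffices. The sketch below is illustrative; the function name `proportion_modeled` is hypothetical and the tiny matrices only serve to show the two extreme values.

```python
import numpy as np

def proportion_modeled(X_true, X_model):
    """1 - ||X_true - X_model||^2 / ||X_true||^2 (equations 4.15 and 4.16)."""
    return 1.0 - np.sum((X_true - X_model) ** 2) / np.sum(X_true ** 2)

# Two extreme cases on a small example block
X_true = np.array([[1.0, -1.0],
                   [-1.0, 1.0]])
perfect = proportion_modeled(X_true, X_true)                  # block fully modeled
nothing = proportion_modeled(X_true, np.zeros_like(X_true))   # nothing modeled
```

Applied to the columns of the induced metabolites the value is the fit; applied to the columns of the non-induced metabolites, with the corresponding columns of the model, it is the overfit.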

4.2.6 Escherichia coli batch fermentation metabolomics data

For further comparison of PCA and ASCA both methods will also be applied to metabolomics data obtained from E. coli batch fermentations. The batch fermentations have been performed according to a two level full factorial design with 3 factors, being acidity (pH), carbon source and phosphate level. Each fermentation in the $2^3$ full factorial design ($I = 8$) has been performed in duplo ($N = 2$), resulting in a total number of 16 fermentations ($NI$) [129]. Table 4.2 shows the sample labels and level-settings of each batch fermentation. The columns acidity, carbon source and phosphate level in Table 4.2 form the design matrix ($\mathbf{D}_{2}$), which is needed for the calculation of the ASCA model. During the fermentations, the optical density of the batch has been monitored and in the stationary phase of the fermentation three equidistant samples were taken from the batch. These samples have been analyzed using GCMS [40] and 405 peaks (metabolites) have been recorded as peak areas for each sample. Assuming that in the stationary phase of a fermentation the E. coli bacteria have adapted to their new environmental situation and that, therefore, their metabolic profile will have stabilized, we will use the average metabolic profile of the three samples in the stationary phase per batch in order to obtain a more stable representation of the metabolome. The resulting data matrix $\mathbf{X}$ [$NI \times J$] contains 16 experiments and 405 peaks or variables ($J$), with each row representing the average metabolite profile of a batch in the stationary phase.

Table 4.2: Tabular form of the design of the E. coli batch fermentations. The first character of the sample labels indicates which fermentations have been performed with Glucose (G: -1 level-setting) and which with Succinate (S: +1 level-setting) as carbon source. The acidity of the medium, in which the fermentations were performed, was either pH 6 (6: -1 level-setting) or pH 7 (7: +1 level-setting). The second character in the sample label shows which level was used. The third character indicates whether the fermentation was performed with a low (L: -1 level-setting) or a high (H: +1 level-setting) phosphate level.

Experiment   Sample label   Acidity (pH)   Carbon source   Phosphate level
1            G6H            -1             -1              +1
2            G6H            -1             -1              +1
3            G6L            -1             -1              -1
4            G6L            -1             -1              -1
5            G7H            +1             -1              +1
6            G7H            +1             -1              +1
7            G7L            +1             -1              -1
8            G7L            +1             -1              -1
9            S6H            -1             +1              +1
10           S6H            -1             +1              +1
11           S6L            -1             +1              -1
12           S6L            -1             +1              -1
13           S7H            +1             +1              +1
14           S7H            +1             +1              +1
15           S7L            +1             +1              -1
16           S7L            +1             +1              -1
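As an illustration of how the design matrix of Table 4.2 follows from the sample labels, and of the averaging over the three stationary phase samples, consider the sketch below. The peak areas are random placeholder numbers, since the measured data is not reproduced here, and the array name `raw` and its layout (fermentation, sample, peak) are assumptions.

```python
import numpy as np

# Sample labels in the order of Table 4.2
labels = ["G6H", "G6H", "G6L", "G6L", "G7H", "G7H", "G7L", "G7L",
          "S6H", "S6H", "S6L", "S6L", "S7H", "S7H", "S7L", "S7L"]

# Level-settings encoded by the first, second and third label character
carbon = {"G": -1, "S": +1}
acidity = {"6": -1, "7": +1}
phosphate = {"L": -1, "H": +1}

# Columns ordered as in Table 4.2: acidity, carbon source, phosphate level
D2 = np.array([[acidity[s[1]], carbon[s[0]], phosphate[s[2]]] for s in labels],
              dtype=float)

# Placeholder peak areas: 16 fermentations x 3 stationary phase samples x 405 peaks
rng = np.random.default_rng(3)
raw = rng.random((16, 3, 405))

# Average metabolic profile per batch, giving X [NI x J] = [16 x 405]
X = raw.mean(axis=1)
```

Because every combination of level-settings occurs exactly twice, the columns of `D2` are mutually orthogonal.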


Figure 4.1: Average fit of PCA and ASCA on simulated data-sets with only induced variables plotted against the number of replicates in the design: (a) plot for simulated data-sets with 10 induced variables and (b) with 30 induced variables. The range of ± one standard deviation of the mean is within the marker size of the symbols used in the figure.

4.3 Results and Discussion

4.3.1 Simulated data

In the simulation part of the study, we will focus on the fit and overfit properties of PCA and ASCA to show the effect of incorporation of prior knowledge, in this case design information, of the system under investigation. The average fit and overfit values are obtained from 1000 different noise realizations ($\mathbf{E}_{noise,N}$) for each noiseless simulated data-set.

Induced variables

In the first part of the simulation only induced variables are used. Figure 4.1 showsthe average fit of the PCA and ASCA model as a function of the number of repli-cates. Note that the average fit is a measure for the ability of modeling the inducedvariation (equation 4.15). The number of induced metabolites used was varied be-tween 10 and 30. We clearly see in both figures that the ASCA model has a higheraverage fit than the PCA model. This means that ASCA models the variation of theinduced metabolites better and that the PCA model is apparently more disturbedby the added measurement noise. With increasing number of replicates both mod-els improve in modeling the induced variation. As Xind,N , is always of rank three(note that it is generated using a 23 full factorial design), it is in a three dimensionalsubspace of the variable space. The added measurement noise Enoise,N causes theexperiments to deviate away from their original position in the three dimensionalsubspace. As described in section 4.2.2 the ASCA model can be calculated by a

[Figure 4.2: four panels (a) to (d); x-axis: number of replicates; y-axis: average fit in (a) and (c), average overfit in (b) and (d); series: PCA and ASCA.]

Figure 4.2: Simulation $2^3$ full factorial design with induced and non-induced variables: (a) induced block for 10 induced and 10 non-induced variables; (b) non-induced block for 10 induced and 10 non-induced variables; (c) induced block for 30 induced and 10 non-induced variables; and, (d) non-induced block for 30 induced and 10 non-induced variables. The range of ± one standard deviation of the mean is within the marker size of the symbols used in the figure.

regression of X on D. The variance of the regression coefficients B is known to depend on the determinant of the matrix $D^T D$. With more replicates per experiment in the design, the determinant of $D^T D$ becomes much larger, which leads to a lower variance of B. This then leads on average to a better estimate of the three-dimensional subspace in the row space of $X_{ind,N}$.
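The effect of replication on $D^T D$ can be checked numerically. The sketch below assumes a -1/+1 coding of the three two-level factors (the coding is not stated in this section), for which $D^T D$ is diagonal. Under ordinary least squares the covariance of the coefficients is proportional to $(D^T D)^{-1}$, so a larger determinant goes together with a smaller coefficient variance. All function names are illustrative.

```python
from itertools import product


def full_factorial_design(replicates):
    """Rows of a 2^3 full factorial design (assumed -1/+1 coding),
    with every run repeated `replicates` times."""
    runs = product((-1, 1), repeat=3)
    return [list(run) for run in runs for _ in range(replicates)]


def gram_matrix(design):
    """Compute D^T D for a design given as a list of rows."""
    columns = list(zip(*design))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns]
            for ci in columns]


def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


for r in (1, 2, 5, 10):
    dtd = gram_matrix(full_factorial_design(r))
    # With this coding D^T D = 8r * I, so the determinant is (8r)^3 and
    # each coefficient variance scales as sigma^2 / (8r).
    print(r, dtd[0][0], det3(dtd))
```

With a different coding of the factors (for example 0/1 dummy variables) the numbers change, but the determinant still grows with the number of replicates.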

A similar argument can be used for the PCA model. Here also the estimation of the loadings P is improved when more repeats are obtained. This also leads to an improved score estimation. Both improvements of the average fit are thus obtained by a better defined three-dimensional subspace in the row space of the data matrix $X_{ind,N}$.


Although the average fit improves for both models, the average fit for ASCA improves more strongly and the average fit for PCA levels off with increasing number of replicates. Although the three-dimensional subspace in the row space is well described (this means that the correlation structure within all metabolites is well estimated), the position of the experiments within this subspace is different for the two methods (this is what we refer to as configuration). In the ASCA model, this position is forced to be the same for replicates, while in the PCA model, the differences in the position of the replicates in this subspace are modeled by the scores. Thus, although the expected position of the replicates is the same, PCA models the differences between these positions. The modeling of these differences can be regarded as overfit.

When the number of induced variables is changed from 10 to 30, a large improvement in the average fit for the PCA model is observed. The average fit for the ASCA model, however, stays the same. The fit of the ASCA model does not improve because the three-dimensional subspace in the row space is not estimated from the data, but inferred from the experimental design of the study. Thus no improvement can be obtained. In the PCA model linear combinations of the variables in $X_N$ are used to estimate the three-dimensional subspace $X_{ind,N}$ formed by $D_N C^T$. With more variables following the imposed design it becomes easier to correctly estimate that subspace. Thus the improvement of the fit for the PCA model is caused by averaging over the variables, also known as averaging over the column space, and has been observed earlier for Partial Least Squares (PLS) estimates by H. Wold, who termed it “consistency at large” [130, 131].

Induced and non-induced variables

The non-induced metabolites, which are also part of the data due to the holistic character of metabolomics, do not respond to the design, but they can still influence the analysis of the metabolomics data. Figure 4.2 shows the average fit and overfit of the PCA and ASCA models when non-induced metabolites are present in the data. The average overfit of the ASCA and PCA models in Figure 4.2(d) looks almost equal, but a t-test on the calculated averages for each number of replicates has shown the differences between the average overfit for ASCA and PCA to be statistically significant. The figure clearly shows that the ASCA model has a better average fit and a smaller overfit than PCA. The reason for this better fit and smaller overfit is the same as discussed above. The three-dimensional subspace in the row space of $X_N$ is defined by the design matrix D and is not estimated. Non-induced metabolites therefore cannot disturb the estimation of the ASCA model. In the PCA model, however, a linear combination of all variables (and thus also of the non-induced ones) is used to define the subspace. Because of this role of the non-induced variables in the estimation of the subspace, the fit becomes worse and the overfit (how much of the non-induced metabolites is described) becomes larger. A comparison between Figures 4.1 and 4.2 clearly shows that the fit of the ASCA model is not decreased by the non-induced metabolites, while the fit of the PCA model is decreased considerably. The ‘consistency at large’ concept only works when the

[Figure 4.3: two panels; (a) 0 induced variables, 10 non-induced variables; (b) 0 induced variables, 30 non-induced variables; x-axis: number of replicates; y-axis: average overfit; series: PCA and ASCA.]

Figure 4.3: Average overfit of PCA and ASCA on simulated data from a $2^3$ full factorial design with only non-induced variables plotted against the number of replicates in the design: (a) plot for simulated data with 10 non-induced variables and, (b) plot for simulated data with 30 non-induced variables. The range of ± one standard deviation of the mean is within the marker size of the symbols used in the figure.

new variables contain information regarding the underlying structure. The overfit obviously becomes smaller when the ratio of induced to non-induced variables becomes larger.

Only non-induced variables

To study whether a method that uses prior information leads to overfit when that information is not present in the data, we simulated an extreme case with only non-induced variables. Figure 4.3 shows the average overfit of the PCA and ASCA model as a function of the number of replicates for 10 and 30 non-induced variables. In both panels we clearly see that the average overfit for the ASCA model is less than the average overfit for the PCA model. This means that ASCA models less of the random variation in the data of non-induced variables. When increasing the number of replicates both models improve by modeling less of the random variation caused by the non-induced variables. The ASCA model can be calculated, as described in section 4.2.2, by a regression of X on D. Each variable is regressed separately onto D. Increasing the number of replicates in D causes the random variation in a variable to be averaged ($D^T X_{mc}$) and, therefore, with an increasing number of replicates less of the random variation of the non-induced variables is modeled.
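The averaging argument can be illustrated for a single non-induced variable. The sketch below is a simplified stand-in for the overfit measure of this chapter, not its actual definition: it assumes a -1/+1 coded $2^3$ design, standard normal noise, and mean-centring, and reports the fraction of the noise sum of squares that a main-effects regression on D reproduces. Function names, the number of realizations, and the seed are illustrative choices.

```python
import random
from itertools import product


def modeled_noise_fraction(replicates, rng):
    """Fraction of the sum of squares of one pure-noise variable that a
    main-effects regression on a replicated 2^3 design reproduces."""
    design = [run for run in product((-1, 1), repeat=3)
              for _ in range(replicates)]
    n = len(design)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mean = sum(x) / n
    xc = [v - mean for v in x]  # mean-centred variable
    # With orthogonal -1/+1 columns each coefficient is (d_k^T x) / n.
    coefs = [sum(row[k] * v for row, v in zip(design, xc)) / n
             for k in range(3)]
    ss_model = n * sum(b * b for b in coefs)
    ss_total = sum(v * v for v in xc)
    return ss_model / ss_total


def average_fraction(replicates, realizations=500, seed=0):
    """Average the modeled fraction over several noise realizations."""
    rng = random.Random(seed)
    total = sum(modeled_noise_fraction(replicates, rng)
                for _ in range(realizations))
    return total / realizations


for r in (1, 2, 5, 10):
    print(r, round(average_fraction(r), 3))
```

Under these assumptions the printed fraction should shrink as the number of replicates grows, in line with the trend described for the ASCA model.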

The improvement of the PCA model in modeling less of the random variation caused by the non-induced variables can be explained by a similar argument. Each experiment is a vector in the row space of the data matrix X. Here the data matrix X contains only random variation, because it is generated from only non-induced variables. With more experiments in the design this random variation will cancel out more quickly; therefore, with an increasing number of replicates PCA will model less of the random variation of the non-induced variables.

When the number of non-induced variables is changed from 10 to 30, a large improvement in the average overfit for the PCA model is observed. The average overfit for the ASCA model stays the same. The average overfit of the ASCA model does not improve because, as mentioned before, each variable is regressed separately onto D. In the PCA model linear combinations of the variables in $X_N$ are used. With more variables describing random variation, less of this random variation tends to be described by the PCA model.

4.3.2 PCA/ASCA on E. coli batch fermentation metabolomics data

In this final subsection the results obtained from analyzing metabolomics data of E. coli batch fermentations are presented and discussed. The first three principal components (PCs) in a PCA model explain 19.92%, 14.96% and 10.24% of the variation, respectively. The total explained variation of the three-component PCA model, therefore, is 45.15%. Figure 4.4(a) shows the scoreplot of PC2 vs. PC1 and Figure 4.4(b) displays the scoreplot of PC2 vs. PC3. Figure 4.4(a) shows a separation between the batches in which glucose (G) was used as carbon source and those in which succinate (S) was used. However, this effect of the factor carbon source is clearly mixed over all PCs. This makes it rather difficult to interpret which metabolites are affected by which of the three applied factors of the experimental design.

In ASCA the configuration of the scores is fixed and therefore the loadings exactly match the multivariate effect caused by the applied factor, thus providing direct information on which metabolites are affected by the corresponding design factor. The first loading, explaining 10.98% of the variation, is directly linked to the design factor pH. The design factor carbon source is represented by the second loading, which explains 16.98% of the variation in the data. The third loading is directly linked to the phosphate level and explains 9.74% of the variation. The design factor carbon source has the largest effect on the batch fermentations, because the second loading explains the most variation in the data. The PCA scoreplot of PC2 versus PC1, shown in Figure 4.4(a), also displays a clear separation caused by the design factor carbon source; however, due to mixing of effects over the PCs, no metabolites can be directly linked to the change in carbon source. A three-component ASCA model explains 37.70% of the variation in the data. The ASCA loadings are, as mentioned in section 4.2.3, not orthogonal. In this case, however, the loadings of the ASCA model are close to orthogonality, with angles of 79, 84 and 90 degrees. The angle between two loadings shows the dependency between them, and angles close to 90 degrees indicate that the effects of the design factors (pH, carbon source and phosphate level) on the metabolic profile are independent.
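The angle between two loading vectors follows from their normalized inner product. The sketch below shows the calculation; the three loading vectors are made-up numbers for illustration only and are not the loadings of the E. coli model.

```python
import math


def angle_between(u, v):
    """Angle in degrees between two loading vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard rounding
    return math.degrees(math.acos(cosine))


# Hypothetical loadings of three design factors over four metabolites.
p_ph = [0.8, 0.1, -0.5, 0.3]
p_carbon = [0.2, 0.9, 0.3, -0.2]
p_phosphate = [-0.3, 0.2, 0.1, 0.9]

pairs = {
    "pH vs carbon source": (p_ph, p_carbon),
    "pH vs phosphate": (p_ph, p_phosphate),
    "carbon source vs phosphate": (p_carbon, p_phosphate),
}
for name, (u, v) in pairs.items():
    print(name, round(angle_between(u, v), 1))
```

An angle of 90 degrees corresponds to a zero inner product, which is the sense in which the factor effects are called independent in the text.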

A major issue in microbial metabolomics is to rank the metabolites with respect to their explained variation, as that relates to metabolites that are highly affected by the experimental design applied. Figure 4.5 shows the explained variation of the ASCA and PCA model and the p-value for all metabolites, where the metabolites have been sorted in descending order for the explained variation of the ASCA

[Figure 4.4: two PCA scoreplots with axes running from −20 to 20; panel (a): PC2 (14.96%) against PC1 (19.92%); panel (b): PC2 (14.96%) against PC3 (10.24%); points labeled G6H, G6L, G7H, G7L, S6H, S6L, S7H and S7L, each label appearing twice.]

Figure 4.4: PCA scoreplots for E. coli batch fermentation metabolomics data: (a) Scores on PC2 plotted against scores on PC1; (b) Scores on PC2 plotted against scores on PC3.

[Figure 4.5: x-axis: sorted variables (0 to 400); left y-axis: % explained (0 to 100); right y-axis: p-value (0 to 1); series: ASCA and PCA; annotation on the circled metabolite: ASCA 31%, PCA 85%.]

Figure 4.5: Percentage explained variation per metabolite for ASCA and PCA and p-value (– · – ·) per metabolite for ASCA on E. coli batch fermentation metabolomics data. Both the percentage explained and the p-values have been sorted according to the percentage explained variation of the ASCA model.

model. The p-value is calculated per metabolite from a univariate ANOVA in which the mean square of the main effects model is compared with the mean square of the remaining variation. In Figure 4.5 we see that for the ASCA model only a small fraction of the metabolites (132) is explained for more than 47% (corresponding to a p-value of 0.05); when Bonferroni correction is applied only 11 metabolites are explained for more than 82% (p-value of $0.05/405$). This shows that in a univariate sense only few metabolites are significantly induced by the experimental design. More striking, however, is that the majority of metabolites in the PCA model have a much higher percentage explained than in the ASCA model. As an example one metabolite is circled in Figure 4.5. This circled metabolite is explained for 31% by the ASCA model, while it is explained for 85% by the PCA model. Of course there are also counterexamples where the explained variation in the PCA model is much lower than in the ASCA model. However, with the simulated data we have already shown that ASCA performs better in terms of fit and overfit when modeling data with an underlying $2^3$ full factorial design. This suggests that the PCA model also overfits the data here. Ranking metabolites based on the PCA model might therefore be misleading due to the overfit of the PCA model.
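The two thresholds mentioned above can be related with a few lines of arithmetic. The sketch below computes the Bonferroni-corrected significance level for 405 metabolites and the F ratio that belongs to a given explained fraction in a main-effects ANOVA with three factors. It assumes 16 fermentations (eight design points in duplicate, as the sample labels in Figure 4.4 suggest) and a mean-centred variable; converting the F ratio into a p-value requires the F distribution and is left out. The function names are illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


def f_statistic(explained_fraction, n_effects, n_samples):
    """Ratio of the main-effects mean square to the residual mean square
    for a mean-centred variable with the given explained fraction."""
    df_model = n_effects
    df_residual = n_samples - n_effects - 1  # one degree lost to the mean
    ms_model = explained_fraction / df_model
    ms_residual = (1.0 - explained_fraction) / df_residual
    return ms_model / ms_residual


print(bonferroni_threshold(0.05, 405))  # about 1.23e-4
# Assumed: 3 main effects and 16 samples.
print(round(f_statistic(0.47, 3, 16), 2))
print(round(f_statistic(0.82, 3, 16), 2))
```

A higher explained fraction gives a larger F ratio and therefore a smaller p-value, which is why the corrected threshold corresponds to a higher percentage explained.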

The ASCA model as used in this paper only considers the main effects of the full factorial design. This means that interactions and individual variation are not modeled. Some of the difference between the modeled variation can be due to the interaction. However, the ASCA model can easily be extended to include interactions between the main effects.

In this paper ASCA was shown to work as a filter that separates the induced variation from the non-induced variation. After such a separation, the induced variation could potentially be used in a regression to provide a regression model that can be much better interpreted [61]. Also the non-induced variation can be studied for its effect on a response variable to study e.g. the biological variation on that response. The major gain of the ASCA filtering is that estimated effects can be assigned to a specific source of variation.

4.4 Conclusions

When there is structure in complex data (like experimental design), incorporation of this structure in analyzing the data helps in lowering the overfit. PCA does not use the prior information about the underlying experimental design and therefore often gives largely overfitted models of the data, particularly with few replicates. ASCA uses the available prior information about the underlying design of the experiments. With increasing number of replicates both PCA and ASCA improve in modeling the induced variation by better defining the subspace in the row space of the data. PCA, however, overall has a larger overfit by modeling the differences in the position of the replicates within the subspace in the scores. When increasing the number of induced variables the PCA model greatly improves in modeling the induced variation, which is known as averaging over the column space (‘consistency at large’ concept), whereas this has no effect on the ASCA model. Obviously the overfit of the PCA model also becomes less when the ratio between induced and non-induced variables becomes larger, because with more variables containing information with respect to the underlying structure it becomes easier to approximate this structure.

The percentage of explained variation of the PCA model is higher than that of the ASCA model, but this is due to overfit. PCA is less able to estimate the true underlying model of the data because it is eager to explain as much variation as possible. This leads to a larger overfit of the data than obtained with the ASCA model. As PCA always explains the maximum amount of variation in the data, any restriction applied will lead to a lower explained variation. However, the restriction applied in ASCA not only leads to a lower explained variation but also to a better estimate of the true underlying configuration of the data.

PCA suffers strongly from metabolites that were not induced by the experimental design. This degrades the estimation of the true data even more. As metabolomics is a holistic approach, many non-induced metabolites can be expected in a metabolomics data set. Using PCA to find highly affected metabolites from a designed experiment may not lead to good results.


Chapter 5

Weighted Smooth Principal Component Analysis (WSPCA): validation and application to missing value estimation†

†This chapter has been submitted for publication: Maikel P.H. Verouden, Johan A. Westerhuis and Age K. Smilde. Copyright © 2011


Abstract

In the various fields of functional genomics longitudinal data play an important role in improving understanding and knowledge of the dynamics within biological systems. The time-resolved data are expected to contain underlying dynamic profiles that are smooth. However, estimating these underlying smooth dynamic phenomena from such data is complicated due to the high complexity of the data and the limited number of techniques that can deal with this type of data. Traditional multivariate data analytical techniques, such as Principal Component Analysis, ignore the underlying dynamics in the data and give solutions that tend towards explaining variance rather than explaining dynamics and understanding biology. In this paper we present WSPCA, a method that incorporates smoothness into the scores of PCA by using a roughness penalty. WSPCA can be used for data with consecutive samples that are linked by time (time-resolved) or position (spatially resolved). By means of a synthetic data set we show that applying this restriction leads to a better estimation of the dynamic phenomena underlying the data. For determination of the model meta parameters (the number of components and the smoothness parameter) we present a leave-elements-out cross-validation procedure, which for our synthetic data set is capable of estimating the underlying noiseless data. The WSPCA method and leave-elements-out cross-validation are then applied to a real-life metabolomics data set from an E. coli batch fermentation, sampled over time, to estimate the missing elements in the data set.

KEYWORDS: PCA, smooth PCA, weighting, cross-validation, missing values.


5.1 Introduction

Time-resolved experiments play an increasingly important role within the various fields of functional genomics research to improve understanding of the dynamics within biological systems. The longitudinal data obtained from such experiments are expected to contain smooth dynamic profiles of interest. However, due to the high complexity of the data and the limited number of techniques for analyzing this type of data [132], it is hard to discern the underlying dynamic phenomena. Traditional multivariate data analytical techniques such as Principal Component Analysis (PCA) ignore the underlying dynamics in the data. PCA focusses on maximizing the explained variation, without taking into account the relation that exists between samples obtained at consecutive time points. Addition of prior information in the form of restrictions to these multivariate data analytical techniques can improve the biological interpretability of the created models and increase the predictive power of those models. Similar to widely used restrictions such as non-negativity or unimodality in Multivariate Curve Resolution (MCR), we intend the roughness penalty to steer the analysis of the data towards more “biological” and less “variance explained” solutions.

In this paper the goal is to show that incorporation of smoothness into the scores of PCA can lead to a better estimation of the dynamic phenomena underlying the data, when consecutive objects or samples are not independent of each other but are connected for example by time, i.e. time-resolved, or sample position, i.e. spatially resolved. The better estimation of the underlying longitudinal phenomena can e.g. be used to improve the estimation of missing values. Adding roughness penalties to data analysis methods is often used for smoothing data [133], and extensions of PCA with roughness penalties, such as smoothed PCA [134, 135], have been proposed in the context of functional data analysis (FDA) [111]. However, these methods are designed for data where the variables have a functional relationship, whereas our method focusses on multivariate data with the functional relationship between the samples. Note that simply taking the transpose of the data before using the FDA approach does not solve the problem of analyzing multivariate data with the functional relationship between the samples, as we would like to be able to predict new smooth scores for a new set of samples using the loadings of the model. Recently Yamamoto et al. introduced smooth PCA with smooth scores by using a generalized eigenvalue approach [136]. Their approach calculates the smooth score as a linear combination of the variables in the data, whereas our WSPCA method does not use this restriction.

For the introduction of the weighted smooth PCA method and describing its properties we use a synthetic dataset, based on biologically relevant time profiles. In the simulation study we aim to show that, although the fit with respect to the analysed data becomes lower, the true underlying structure in the data is better captured. We aim to show that by restricting the PCA model with a roughness penalty, the well-known overfit of the method is reduced. By using cross-validation the optimal roughness penalty and number of components to be used in the model will be determined. The cross-validation is based on leaving multiple elements out of the data and estimating them back with the created model. This approach creates the necessity for the smooth principal component analysis to be weighted. Next we will apply smoothness in combination with the same cross-validation procedure to microbial metabolomics data with missing values, obtained from an Escherichia coli batch fermentation, to estimate those missing values.

In Section 5.2 weighted smooth principal component analysis and the cross-validation procedure will be explained and discussed in detail. Section 5.3 describes how the simulated data has been created and how the E. coli batch fermentation metabolomics data has been obtained experimentally. The results are presented and discussed in Section 5.4. Finally in Section 5.5 the conclusions of our study will be presented.

5.2 Methods

5.2.1 Weighted Smooth Principal Component Analysis

Weighted Smooth Principal Component Analysis (WSPCA) is a penalized form of weighted principal component analysis (PCAW) [99, 103, 105] that can be used for multivariate data with objects or samples that are not independent of each other but are linked via a clear relationship with expected smoothness in the underlying phenomena. The consecutive objects or samples are for example time-resolved or spatially resolved, forming longitudinal profiles. The variables in the data used for WSPCA can have relationships among each other but there is no expected smoothness between them. Metabolomics data from a microbial batch fermentation experiment sampled over time is an example of this type of data. In WSPCA the scores, calculated for successive samples, are penalized for their roughness by means of a roughness penalty, while still providing a good description of the data. By imposing smoothness on the scores in WSPCA we assume that the underlying latent components are smooth, not necessarily the manifest variables. The smoothed score profiles relate to the major dynamic events in the data set.

We use a weighted version of the smooth PCA model. Weighting can be used to incorporate prior information about the experimental error into the data analytical method [103, 105, 137]. Here the weighted approach is used to indicate the missing elements in the data to be estimated and for the selection of test and validation samples in the cross-validation of the model meta parameters. The weight matrix W is constructed as a binary matrix containing ones (1) for elements that are present and zeros (0) for elements that are missing or have been excluded for cross-validation.
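Constructing such a binary weight matrix is straightforward. The sketch below assumes that missing elements are stored as `None` or NaN and that elements excluded for cross-validation are passed as (row, column) index pairs; both conventions and the function name are illustrative choices, not taken from the original implementation.

```python
import math


def weight_matrix(data, held_out=()):
    """Binary weights: 0 for missing (None or NaN) or held-out elements,
    1 for all elements that are present and used for fitting."""
    held_out = set(held_out)
    weights = []
    for i, row in enumerate(data):
        w_row = []
        for j, value in enumerate(row):
            missing = value is None or (isinstance(value, float)
                                        and math.isnan(value))
            w_row.append(0 if missing or (i, j) in held_out else 1)
        weights.append(w_row)
    return weights


# Small example: two missing values and one element left out for validation.
X = [[1.2, None, 0.7],
     [0.9, 1.1, float("nan")],
     [1.0, 1.3, 0.8]]
print(weight_matrix(X, held_out=[(2, 0)]))
# prints [[1, 0, 1], [1, 1, 0], [0, 1, 1]]
```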

Weighted smooth PCA (WSPCA) is implemented using the solver for nonlinear least-squares problems (lsqnonlin) from the Optimization Toolbox (version 5.1) for Matlab®*, where the calculated scores Z [I × R] are penalized for their roughness. The sum of squares Q (Equation 5.1) is minimized with respect to the smoothed scores and the corresponding loadings P [J × R] for a data set X [I × J], a given weight matrix W [I × J] and a given smoothing parameter λ, with R representing the number of components in the model.

*R2010b Version 7.11.0.584, The Mathworks®, Inc., 3 Apple Hill Drive, Natick, MA 01760-2098, U.S.A.

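$$Q(\mathbf{Z},\mathbf{P}\mid\mathbf{X},\mathbf{W},\lambda) = \left\{\, \|\mathbf{W}\circ(\mathbf{X}-\mathbf{Z}\mathbf{P}^{T})\|^{2} + \lambda\,\|\mathbf{D}_{n}\mathbf{Z}\|^{2},\ \lambda \ge 0 \,\right\} \tag{5.1}$$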

In Equation 5.1 the character ◦ denotes the Hadamard or element-wise product. Matrix $D_n$ represents the nth order difference matrix, which is used for the finite differences approximation of derivatives. Commonly a first or second order difference matrix is used, but difference matrices of higher order can also be applied. When $D_n$ is set to 1st order ($D_1$) the consecutive score values are penalized for their slope (the first derivative) and are forced to become equal, leading in the extreme (large λ) to straight flat lines with zero slope. With a 2nd order difference matrix ($D_2$) the consecutive scores are penalized for the change in slope (the second derivative) and are forced to become straight lines with arbitrary slope in the extreme. The choice between a first and a second order difference matrix depends on the problem at hand; for removing noise a first order difference will generally suffice. However, if the expected smoothness in the underlying profiles has a parabolic or higher order polynomial form, a difference matrix of second or higher order might be more appropriate, as a first order difference has a tendency to quickly deteriorate the smooth profiles by pushing them to straight flat lines with zero slope.
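The terms of Equation 5.1 can be evaluated directly for given scores and loadings. The sketch below is not the lsqnonlin-based implementation described above; it only computes Q in plain Python, applying the difference operator repeatedly instead of forming $D_n$ explicitly. Matrices are stored as lists of rows, missing entries of X are assumed to hold a numeric placeholder (their weight is zero), and all function names are illustrative.

```python
def nth_difference(values, order):
    """Apply the finite-difference operator `order` times to a sequence,
    which equals multiplying the column by the difference matrix D_n."""
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values


def roughness(Z, order):
    """Squared Frobenius norm of D_n Z for scores Z stored as I rows of R."""
    return sum(d * d
               for column in zip(*Z)
               for d in nth_difference(list(column), order))


def wspca_objective(X, W, Z, P, lam, order):
    """Evaluate Q of Equation 5.1 for scores Z (I x R), loadings P (J x R),
    data X (I x J) and binary weights W (I x J)."""
    fit = 0.0
    for x_row, w_row, z_row in zip(X, W, Z):
        for x, w, p_row in zip(x_row, w_row, P):
            model = sum(z * p for z, p in zip(z_row, p_row))
            fit += (w * (x - model)) ** 2
    return fit + lam * roughness(Z, order)


Z = [[0.0], [1.0], [2.0], [3.0]]           # one linear score profile
P = [[1.0], [0.5]]                         # loadings for two variables
X = [[z[0] * p[0] for p in P] for z in Z]  # noiseless data, X = Z P^T
W = [[1, 1] for _ in Z]
print(roughness(Z, 1))  # the slope is penalized: 3.0
print(roughness(Z, 2))  # a straight line has no second difference: 0.0
print(wspca_objective(X, W, Z, P, lam=2.0, order=1))  # 0 + 2 * 3 = 6.0
```

The example reflects the behavior described in the text: a first order penalty acts on any slope, whereas a second order penalty leaves straight lines with arbitrary slope unpenalized.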

In our presented WSPCA method the only constraint is the roughness penalty on the scores. There is no active constraint on the loadings (P). We only normalize the loadings to length 1 and adapt the smoothed scores (Z) accordingly. This is necessary because otherwise the scores would become very small and the roughness penalty would not have an effect. When λ is set to 0, the penalty will not be active and the scores and loadings obtained from weighted smooth PCA will be equal to those of regular weighted PCA, or a rotated version of them. If λ is increased, the obtained score vectors will become smoother; however, this will result in a reduced description of the variation in the data X. The smoothing parameter λ determines the amount of emphasis that is put on the roughness penalty.
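The need for the normalization can be seen from a scaling argument: multiplying Z by a constant c and dividing P by c leaves $ZP^T$ unchanged while the penalty $\lambda\|D_n Z\|^2$ shrinks by a factor $c^2$, so without a fixed loading length the penalty could be made arbitrarily small. A minimal sketch of the rescaling step is given below; the function name and the small example matrices are illustrative and the code is not taken from the original implementation.

```python
import math


def normalize_loadings(Z, P):
    """Scale each loading column of P (J x R) to unit length and rescale the
    matching score column of Z (I x R) so that Z P^T stays the same."""
    n_components = len(P[0])
    norms = [math.sqrt(sum(row[r] ** 2 for row in P))
             for r in range(n_components)]
    P_unit = [[row[r] / norms[r] for r in range(n_components)] for row in P]
    Z_scaled = [[row[r] * norms[r] for r in range(n_components)] for row in Z]
    return Z_scaled, P_unit


def reconstruct(Z, P):
    """Return the model Z P^T as a list of rows."""
    return [[sum(z * p for z, p in zip(z_row, p_row)) for p_row in P]
            for z_row in Z]


Z = [[0.1, 0.2], [0.3, 0.1]]
P = [[3.0, 0.0], [4.0, 2.0]]   # column lengths 5 and 2
Z_new, P_new = normalize_loadings(Z, P)
print(P_new)                   # columns now have length 1
print(reconstruct(Z, P))
print(reconstruct(Z_new, P_new))  # same model up to rounding
```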

The smoothed scores (Z) obtained from minimization of Equation 5.1 are not orthogonal to each other: $Z^T Z$ is a symmetric but not a diagonal matrix. The loadings are not orthogonal to each other either: $P^T P$ is a symmetric matrix with ones on its diagonal but not an identity matrix of order R. We did not want to enforce orthogonality onto the scores and loadings, for reasons which will be clarified in the following section, and have therefore implemented WSPCA as a method in which all components are estimated simultaneously.

On the freedom for smoothed scores Z and loadings P.

Contrary to the common assumption of keeping the scores in the column space of X and loadings in the row space of X, either as a feature of or as a restriction in the applied data analysis method, we do not impose this as a restriction in our method. This means that in WSPCA both can be varied freely. When the smoothing parameter (λ) is unequal to zero, the smoothed scores Z are not restricted to the range of X. The space spanned by X depends highly on the noise realization, and remeasured data usually span a different space. In our opinion it is, therefore, unwanted to constrain modelled information about the samples or objects (e.g. the score profiles) to being linear combinations of the columns of X. In general smooth score profiles cannot be found in the range of X; the optimal smooth solution lies outside that range.

In Burnham et al. [138] a statistical framework for Latent Variable Multivariate Regression (LVMR) is explored with various degrees of error variance in X and Y. In this framework the estimates T, whose columns provide a basis for the latent variable space, are restricted to T = XW, where W is a function of the training data. This restriction is also used in more commonly applied methods (e.g. PLS, PCR, RRR and CCR) and depending on the degree of error variance in X and Y the framework behaves like some of these methods. When the restriction is not applied the framework is referred to as maximum likelihood latent root regression (MLRR). In cases where the errors in X are large relative to those in Y the unrestricted version of the framework (MLRR) performs uniformly better. Some algorithms provide sub- and even final solutions (concentration and spectral profiles) that can be outside the range of the data matrix, without the user being aware of it, as was recently shown for the multivariate curve resolution-alternating least squares (MCR-ALS) algorithm [139–141].

5.2.2 Cross-validation procedure

Cross-validation is a resampling technique which is used to simplify the selection of model parameters, such as the number of components to use, and provides a basis for residual and influence analysis [142]. Here cross-validation will be applied to determine the model meta parameters, being the number of components (R) to use and the smoothing parameter (λ).

Most cross-validation methods leave out one or more complete objects or samples and predict them back to assess the model meta parameters. This approach cannot be applied in the case of WSPCA, because leaving out one or more complete objects or samples would disrupt the relationship between the objects or samples, which are not independent of each other. Therefore, we have developed a cross-validation procedure in which randomly selected elements xij are left out of the data matrix X. Our procedure is similar to the cross-validation method for principal component analysis proposed by Wold [143]. The user has to choose the number of elements to leave out, e.g. one could perform a 20 fold cross-validation by leaving out 5% of the elements at each round. A WSPCA model consisting of smoothed scores (Z) and loadings (P), with a pre-set number of components (R) and λ, is fitted to the remaining data and the values of the left out elements are predicted from the calculated model. The values for the left out elements are obtained by first reconstructing the data matrix from the calculated Z and P as X = ZP^T and then selecting the left out elements. The step of leaving out elements and fitting a WSPCA model is repeated until all elements xij, with the exception of missing elements, have been left out once and predicted by the various constructed WSPCA models. The resulting data matrix of predicted elements (Xleo, of size I × J), which is obtained via the described leave elements out (leo) procedure, can be used to calculate a predicted residual error sum of squares (PRESS) as shown in Equation 5.2.

$$\mathrm{PRESS} = \lVert W \circ (X - X_{\mathrm{leo}}) \rVert^{2} \qquad (5.2)$$

As in Equation 5.1 the character ◦ denotes the Hadamard product. Here the weight matrix W is, as mentioned before in Section 5.2.1, also a binary matrix containing ones (1) for elements that are present and zeros (0) for elements that are missing. During construction of Xleo in the leave elements out cross-validation procedure the weight matrix is adjusted such that not only missing elements but also the left out elements are not used in the calculation of the various WSPCA models.

Each calculated PRESS value depends on which elements are randomly selected for the calculation of Xleo. To correct for this random selection, 10 PRESS values are calculated, each one originating from a different random selection of elements for estimating Xleo. By varying R and λ and repeating the cross-validation procedure, the optimal number of components and smoothing parameter can be assessed by finding the minimum of the average of the 10 calculated PRESS values. We used 10 PRESS values because this proved to have a strong stabilizing effect on the average PRESS value.
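The leave elements out procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this chapter: the function names are hypothetical, and `fit_lowrank` is a plain rank-R stand-in (iterative imputation with a truncated SVD) that is not WSPCA and has no smoothing parameter. Any model that accepts a data matrix and a binary weight matrix and returns a reconstruction could be passed as `fit` instead.

```python
import numpy as np

def fit_lowrank(X, W, R, n_iter=100):
    """Stand-in model (NOT WSPCA): rank-R fit to the elements with W == 1,
    obtained by repeatedly imputing the unused elements from a truncated SVD."""
    Xf = np.where(W == 1, X, 0.0)                  # unused elements start at zero
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Xf, full_matrices=False)
        X_hat = (U[:, :R] * s[:R]) @ Vt[:R]
        Xf = np.where(W == 1, X, X_hat)            # keep used data, refresh the rest
    return X_hat

def leave_elements_out(X, W, fit, n_folds=20, rng=None):
    """Predict every present element once from a model that did not use it."""
    rng = np.random.default_rng(rng)
    present = np.flatnonzero(W.ravel() == 1)       # truly missing elements never take part
    rng.shuffle(present)
    X_leo = np.zeros(X.shape)
    for fold in np.array_split(present, n_folds):
        W_fold = W.copy()
        W_fold.flat[fold] = 0                      # temporarily mark the left out elements
        X_hat = fit(X, W_fold)
        X_leo.flat[fold] = X_hat.flat[fold]        # keep only the left out predictions
    return X_leo

def press(X, X_leo, W):
    """Equation 5.2: squared norm of W o (X - X_leo)."""
    return float(np.sum(np.where(W == 1, X - X_leo, 0.0) ** 2))

def average_press(X, W, fit, n_repeats=10, n_folds=20, seed=0):
    """Average PRESS over several random partitions of the elements."""
    rng = np.random.default_rng(seed)
    values = [press(X, leave_elements_out(X, W, fit, n_folds, rng), W)
              for _ in range(n_repeats)]
    return float(np.mean(values))
```

To select meta parameters, `average_press` would be evaluated on a grid of candidate settings (for WSPCA, combinations of R and λ) and the setting with the lowest value retained.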

5.2.3 Missing value estimation

When a multivariate data set of related samples with expected smoothness in the underlying phenomena (e.g. microbial metabolomics fermentation data) contains values that are missing completely at random, WSPCA in combination with the leave elements out cross-validation can be used to estimate the missing values. First the missing values have to be located and marked in the weight matrix W. After that the cross-validation procedure described in Section 5.2.2 can be used to estimate the model meta parameters for WSPCA. Keep in mind that the truly missing values do not participate in this cross-validation and that various sets of additional elements are sequentially and temporarily marked for the construction of the matrix Xleo. Once the optimal model meta parameters R and λ (for the roughness penalty) have been determined, a WSPCA model can be built using the weight matrix W in which the missing values have been marked. By reconstructing the data matrix (X) from the obtained smoothed scores Z and accompanying loadings P as X = ZP^T, estimates for the missing elements can be obtained by selecting the missing values in X. The estimates calculated in this way are adjusted for the dynamic aspects in the data.

5.3 Materials

5.3.1 Simulated data

To show that incorporation of smoothness into the scores of PCA can lead to a better estimation of the smooth phenomena underlying the time-resolved data, and to show that the proposed leave elements out cross-validation procedure can be used for the determination of model parameters, a simulated data set has been created. The simulated data set consists of a noiseless data set and white noise. The noiseless data set (Xtrue), consisting of 25 evenly spaced timepoints for 45 variables, has been generated by multiplying three score vectors with three loading vectors (as depicted in Figure 5.1).

Figure 5.1: Simulated data set: (a) noiseless score vectors (s1–s3, score versus time); (b) noiseless loading vectors (l1–l3, loading versus variable).

The first and second score vectors (s1 and s2) were generated using time-shifted logistic functions, with t representing time, to represent the well-known dynamic behavior of metabolites that are formed or disappear during a batch fermentation, as described by Equations 5.3 and 5.4. The first score vector is, apart from being time-shifted, also inverted and the second score vector contains an offset value.

$$s_1(t) = 1 - \frac{1}{1 + e^{-t+8}} \qquad (5.3)$$

$$s_2(t) = 0.3 + \frac{1}{2 + e^{-t+13}} \qquad (5.4)$$

The third score vector (s3) represents a metabolite that is first being formed and later consumed during the batch fermentation. The following Gaussian function (Equation 5.5) was used:

$$s_3(t) = \frac{3}{\sqrt{2\pi\sigma^{2}}} \, e^{-\frac{(t-\mu)^{2}}{2\sigma^{2}}} \qquad (5.5)$$

with µ = 8 and σ = 2.

The three loading vectors, displayed in Figure 5.1(b) as loading profiles l1–l3, are used to make linear combinations of the three dynamic effects. They have been constructed such that the first three variables exactly describe the three dynamic effects, whereas the remaining 42 variables create random linear combinations of the three dynamic effects. Each loading vector has been normalized to unit length and the angles between the loading vectors are given in Table 5.1.

Table 5.1: Angle between loading vectors used to generate noiseless data.

Loading vectors    Angle (∠)
l1–l2              54.7°
l1–l3              56.0°
l2–l3              55.8°

The noiseless data set (Xtrue) is obtained by multiplication of the three score and loading vectors,

$$X_{\mathrm{true}} = [s_1 \; s_2 \; s_3]\,[l_1 \; l_2 \; l_3]^{T} \qquad (5.6)$$

Addition of random normally distributed noise (E) with mean 0 and standard deviation 0.045 to the noiseless data (Xtrue) gives the data set (X), as represented by Equation 5.7.

$$X = X_{\mathrm{true}} + E \qquad (5.7)$$

The resulting data set (X) contains approximately 8.2% noise, calculated as:

$$\frac{\lVert E \rVert^{2}}{\lVert X \rVert^{2}} \cdot 100\% \qquad (5.8)$$
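The construction in Equations 5.3 to 5.8 can be reproduced approximately with the Python sketch below. Several details are assumptions rather than facts from the text: the time axis is taken as 25 points evenly spaced on [0, 16] (read from the axis of Figure 5.1(a)), the 42 random loading rows are drawn from a uniform distribution, and the function name `simulate` and the seed are hypothetical. Because the random loadings and noise differ from those used in the thesis, the resulting noise percentage and loading angles will only be of similar magnitude, not identical.

```python
import numpy as np

def simulate(n_time=25, n_var=45, noise_sd=0.045, t_max=16.0, seed=0):
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, t_max, n_time)             # assumed time axis

    s1 = 1.0 - 1.0 / (1.0 + np.exp(-t + 8.0))       # Eq. 5.3: inverted, shifted logistic
    s2 = 0.3 + 1.0 / (2.0 + np.exp(-t + 13.0))      # Eq. 5.4: shifted logistic with offset
    mu, sigma = 8.0, 2.0
    s3 = (3.0 / np.sqrt(2.0 * np.pi * sigma**2)
          * np.exp(-(t - mu) ** 2 / (2.0 * sigma**2)))   # Eq. 5.5: Gaussian
    S = np.column_stack([s1, s2, s3])

    # First three variables carry one dynamic effect each; the remaining rows
    # are random mixtures (uniform draws are an assumption).
    L = np.vstack([np.eye(3), rng.uniform(0.0, 1.0, size=(n_var - 3, 3))])
    L /= np.linalg.norm(L, axis=0)                  # unit-length loading vectors

    X_true = S @ L.T                                # Eq. 5.6
    E = rng.normal(0.0, noise_sd, size=X_true.shape)
    X = X_true + E                                  # Eq. 5.7
    noise_pct = 100.0 * np.sum(E**2) / np.sum(X**2) # Eq. 5.8
    return t, S, L, X_true, X, noise_pct
```

As a plausibility check on the reported figure, the expected value of ‖E‖² for 25 × 45 elements with standard deviation 0.045 is 1125 × 0.045² ≈ 2.28.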

5.3.2 E. coli batch fermentation metabolomics data

Escherichia coli batch fermentation experiments have been performed under different conditions to maximize the variation in the production of the amino acid phenylalanine, using a full factorial design with three factors on two levels. During the time course of these batch fermentations 12 samples were taken from each bioreactor and analysed for intracellular metabolites by means of GC-MS and LC-MS. The experimental design, the experiments and the data collection are extensively described in Rubingh et al. [129]. To show the practical application of the leave elements out cross-validation procedure for model parameter estimation and the use of the WSPCA method for missing data estimation we used GC-MS data of one of the E. coli batch fermentations, performed with glucose at pH 6 and low phosphate concentration. The data consist of intensity values that represent the non-calibrated concentration levels of 21 putative intracellular metabolites. Prior to analyzing the data we range scaled each putative metabolite, making all metabolites equally important and comparable relative to their biological response range [58]. The range scaled measured intensities have been stored in a matrix consisting of 12 rows, representing the samples taken during the batch fermentation, and 21 columns, representing the putative metabolites (variables).

Figure 5.2: Sum of Squares penalty, Sum of Squares residuals and Q plotted versus λ. Although λ = 0 cannot be shown on a common (base 10) logarithmic scale, it is shown here to indicate the non-regularized solution.

Missing values

The described E. coli batch fermentation metabolomics data set has 6 missing elements out of a total of 12 × 21 = 252 elements, making the percentage of missing values 2.38%. The missing values are present in the putative metabolite profiles of variables 018, 019 and 220. Variable 018 is missing timepoints 2 and 4, variable 019 is missing timepoints 10 and 11, and variable 220 is missing timepoints 4 and 7.
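The bookkeeping for the missing elements can be illustrated with a short Python sketch. The data here are random placeholders, not the E. coli measurements, and the column positions chosen for variables 018, 019 and 220 are hypothetical. The scaling function divides each column by its observed range, which is an assumption based on the axis label I/(Imax − Imin) in Figure 5.9; the range scaling of reference [58] may additionally center the data.

```python
import numpy as np

def weight_matrix(X):
    """Binary weights: 1 where an element is present, 0 where it is missing (NaN)."""
    return (~np.isnan(X)).astype(float)

def range_scale(X):
    """Divide each column by its observed range, ignoring missing elements."""
    col_range = np.nanmax(X, axis=0) - np.nanmin(X, axis=0)
    return X / col_range

# Placeholder 12 x 21 intensity matrix (rows: timepoints, columns: variables).
X = np.random.default_rng(0).uniform(1.0, 10.0, size=(12, 21))

# Hypothetical column positions for variables 018, 019 and 220, with the
# 1-based timepoints listed in the text.
missing = {0: [2, 4], 1: [10, 11], 2: [4, 7]}
for col, timepoints in missing.items():
    for tp in timepoints:
        X[tp - 1, col] = np.nan

W = weight_matrix(X)
X_scaled = range_scale(X)
pct_missing = 100.0 * np.sum(W == 0) / W.size
```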

5.4 Results and discussion

5.4.1 WSPCA method

To test WSPCA we applied the implemented method with a second order difference matrix (D2) to the simulated data set described in Section 5.3.1, using 3 components and 25 different values (ranging from 10^-4 to 10^1.5) for the smoothing parameter λ. We chose 3 components because the simulated data was generated using 3 score and 3 loading profiles (Figure 5.1). Figure 5.2 graphically displays the obtained sum of squared penalty (‖D2 Z‖²), the obtained sum of squared residuals (‖W ◦ (X − ZP^T)‖²) and Q (see Equation 5.1) as a function of λ. The graph shows that over 2 orders of magnitude, from λ = 0.001 to 0.1, the sum of squared penalty rapidly decreases, whereas the sum of squared residuals and Q only slightly increase. In other words, over those two orders of magnitude in λ the scores get smoothed quite strongly, hence the rapid drop in the sum of squared penalty. However, the created model still fits the data quite well, since the sum of squared residuals has not increased much. Determining an optimal value for the smoothing parameter λ based on the graph in Figure 5.2 is, however, impossible. Qualitatively it is expected that the optimal λ will be found where the sum of squared penalty has reduced considerably and the sum of squared residuals has not increased much, meaning that the created model still fits the data well. Table 5.2 contains some of the values used for making Figure 5.2 and additionally contains the fit, given as a percentage, which is calculated as

$$\left(1 - \frac{\lVert X - ZP^{T} \rVert^{2}}{\lVert X \rVert^{2}}\right) \cdot 100\% \qquad (5.9)$$

The table shows that with increasing λ the fit of the calculated model (ZP^T) with respect to the measured data (X) becomes smaller, which is logical as the sum of squared residuals increases with increasing λ. The table also shows that even when the sum of squared penalty has dropped considerably (see Log10 λ = −0.500) the method still slightly tends to overfit the data by incorporating noise into the model: the fit is about 93%, although the data contain approximately 8.2% white noise, so a fit of roughly 92% would be expected if only the noiseless structure were modelled.

Table 5.2: Values of Sum of Squares residuals, Sum of Squares penalty, Q and Fit for various Log10 λ values, as shown in Figure 5.2.

Log10 λ    SS residuals    SS penalty    Q       Fit (%)
-4.000     1.85            5.72          1.85    93.33
-3.000     1.85            5.16          1.86    93.33
-2.000     1.87            2.31          1.89    93.28
-1.500     1.89            0.92          1.92    93.20
-1.000     1.92            0.33          1.96    93.09
-0.500     1.95            0.15          2.00    92.98
-0.250     1.97            0.11          2.03    92.92
+0.000     1.99            0.08          2.07    92.85
+0.125     2.00            0.07          2.09    92.81
+0.250     2.01            0.06          2.12    92.75
+0.375     2.03            0.05          2.16    92.69
+0.500     2.05            0.04          2.19    92.62
+0.750     2.09            0.03          2.25    92.48
+0.875     2.12            0.02          2.28    92.37
+1.000     2.15            0.02          2.33    92.26
+1.125     2.19            0.01          2.38    92.13
+1.250     2.23            0.01          2.44    91.99
+1.375     2.27            0.01          2.50    91.82
+1.500     2.32            0.01          2.57    91.64
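The quantities reported in Figure 5.2 and Table 5.2 can be computed for any given Z and P with the Python sketch below. Equation 5.1 is not reproduced in this section, so the form Q = SS residuals + λ · SS penalty is an assumption; it is consistent with the tabulated values (for example 1.99 + 1 × 0.08 = 2.07 at Log10 λ = 0 and 1.87 + 0.01 × 2.31 ≈ 1.89 at Log10 λ = −2). The function names are hypothetical, and the fit of Equation 5.9 is evaluated assuming a complete data matrix.

```python
import numpy as np

def second_difference_matrix(n):
    """(n-2) x n matrix D2 whose rows contain [1, -2, 1], so that D2 @ z
    gives the second order differences of a profile z."""
    return np.diff(np.eye(n), n=2, axis=0)

def wspca_diagnostics(X, W, Z, P, lam):
    """Return SS residuals, SS penalty, Q (assumed form) and fit in percent."""
    X_hat = Z @ P.T
    ss_res = np.sum(np.where(W == 1, X - X_hat, 0.0) ** 2)   # weighted residuals
    D2 = second_difference_matrix(Z.shape[0])
    ss_pen = np.sum((D2 @ Z) ** 2)                           # roughness of the scores
    q = ss_res + lam * ss_pen                                # assumed Equation 5.1
    fit = (1.0 - np.sum((X - X_hat) ** 2) / np.sum(X**2)) * 100.0   # Equation 5.9
    return float(ss_res), float(ss_pen), float(q), float(fit)
```

A score profile that is linear in time has zero second order differences and therefore incurs no penalty, which is why the penalty only discourages curvature and not slopes or offsets.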

To find the optimal λ for our simulated data set we have plotted the sum of squared penalty against the sum of squared residuals (Figure 5.3) according to the L-curve [144], as introduced by Lawson and Hanson [145]. The corner of the L-shaped curve should provide the optimal smoothing parameter λ. Unfortunately, the graph does not display the characteristic L-shaped curve, making it impossible to determine an optimal λ in this way for our WSPCA method.

The goal of WSPCA is to obtain a better estimation of the true dynamic phenomena underlying the data by imposing smoothness on the scores. To get insight into what happens to the estimated smooth scores (Z) when increasing the smoothing parameter λ, we calculated the angle between the subspace spanned by the true underlying smooth scores and the subspace spanned by the estimated smooth scores of the WSPCA models built on X with three components for creating Figure 5.2. The angle between the subspaces has been calculated as described by Björck et al. [146] and describes the similarity (or resemblance) between two subspaces, with two subspaces being identical when the angle between them is zero. To see simultaneously what happens to the accompanying loadings (P) when increasing λ, we also calculated the angle between the subspace spanned by the true loadings and the subspaces spanned by the loadings of the same WSPCA models. The true underlying smooth scores and loadings have been obtained by performing WSPCA on Xtrue with λ = 0 and three components (R = 3). Figure 5.4 displays the calculated angles as a function of λ. Figure 5.4(a) clearly shows that when increasing λ the subspace spanned by the smooth scores of the calculated models increasingly resembles the subspace spanned by the true underlying scores. Maximal resemblance is reached around λ = 0.7; beyond this value of λ the scores get over-smoothed. The over-smoothing of the scores causes the resemblance between the subspaces to drop, which is seen as an increase in the angle between the subspaces. Figure 5.4(b) shows that when penalizing the scores for their roughness the subspace spanned by the accompanying loadings also slowly adjusts towards the subspace spanned by the true loadings. However, maximal resemblance is reached at a much larger λ value than found for the subspaces spanned by the scores.

With increasing λ not only does the subspace spanned by the calculated smooth scores start to resemble the subspace of the true underlying scores, but the score profiles also become better estimates of the true underlying score profiles. This is clearly shown in Figure 5.5, in which the true smooth score profiles and the smooth score (Z) profiles for two λ values (10^-3 and 0.7) from Figure 5.2 are visualized. Comparison of the smooth score profiles in Figure 5.5(b) and 5.5(c) shows that the scores get nicely smoothed when changing from λ = 10^-3 to 0.7 and that the smooth scores at λ = 0.7 estimate the true underlying scores, shown in Figure 5.5(a), very well. We also looked at the accompanying loading profiles, but as was to be expected from Figure 5.4(b) they hardly change when changing from λ = 10^-3 to 0.7.
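An angle between two subspaces, as used for Figure 5.4, can be computed from the singular values of the product of two orthonormal bases. The Python sketch below is an illustration only: it returns the largest principal angle, which is an assumption about the convention used here, and the function name is hypothetical. It uses the cosine formulation, which loses accuracy for very small angles; a sine-based variant is preferable in that regime.

```python
import numpy as np

def subspace_angle_deg(A, B):
    """Largest principal angle (in degrees) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)                 # orthonormal basis for range(A)
    Qb, _ = np.linalg.qr(B)                 # orthonormal basis for range(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    cosines = np.clip(cosines, -1.0, 1.0)   # guard against rounding just outside [-1, 1]
    return float(np.degrees(np.arccos(cosines.min())))
```

For example, the angle between the estimated and true score subspaces would be obtained by passing the two score matrices (each of size I × R) as `A` and `B`.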

Figure 5.3: L-curve: Sum of Squares penalty plotted against Sum of Squares residuals for λ ranging from 10^-4 to 10^1.5.

Figure 5.4: (a) Angle between the subspace spanned by the smooth scores obtained from WSPCA on Xtrue (λ = 0 and three components, R = 3) and the subspace spanned by the smooth scores of WSPCA models built with three components on X for creating Figure 5.2, as a function of λ; (b) angle between the subspace spanned by the loadings obtained from WSPCA on Xtrue (λ = 0 and R = 3) and the subspace spanned by the loadings of WSPCA models built with three components on X for creating Figure 5.2, as a function of λ. Although λ = 0 cannot be shown on a common (base 10) logarithmic scale, it is shown here to indicate the non-regularized solution.

Figure 5.5: Smooth score profiles (smooth score Z versus time, components 1–3) for R = 3: (a) smooth score profiles (Z) calculated on Xtrue with λ = 0; (b) smooth score profiles (Z) calculated on X for λ = 10^-3; (c) smooth score profiles (Z) calculated on X for λ = 0.7.

5.4.2 Leave elements out cross-validation

To determine the optimal model meta parameters for the WSPCA method the leave elements out cross-validation procedure, described in Section 5.2.2, has been applied in combination with the WSPCA method to the simulated data (see Section 5.3.1). For the cross-validation procedure we have chosen to leave out 5% of the elements per fold in a 20 fold cross-validation; the number of components was varied from 2 to 4 and λ was varied between 0 and 10.

Figure 5.6: Predicted error sum of squares (PRESS) using the leave elements out cross-validation procedure, plotted against λ for nc = 2, 3 and 4 components: (a) average PRESS calculated with respect to X, ‖X − Xleo‖²; (b) average PRESS calculated with respect to Xtrue, ‖Xtrue − Xleo‖². Although λ = 0 cannot be shown on a common (base 10) logarithmic scale, it is shown here to indicate the non-regularized solution.

Figure 5.6 shows two graphs in which the average PRESS values divided by the number of elements in X are plotted against λ for all combinations of λ and the number of components used for modelling. In Figure 5.6(a) the average PRESS values are calculated with respect to X, whereas in Figure 5.6(b) the average PRESS values are calculated with respect to Xtrue. Both Figure 5.6(a) and Figure 5.6(b) show a minimum for three components between λ = 0.4 and 0.7 (the optimal meta parameters for the WSPCA method). This means that the cross-validation procedure is also applicable in situations where the true underlying dynamic phenomena are unknown, as is normally the case for measured data. Comparison of Figure 5.6(a) and Figure 5.6(b) also shows that the average PRESS with respect to Xtrue is lower than the average PRESS with respect to X. The cross-validated predictions (Xleo) closely represent the true data. This is exactly what the procedure is expected to do: it is able to capture the true underlying phenomena in the data. The meta parameters that capture the real data as well as possible can be found by cross-validation of measured data, because the minimum PRESS obtained with the measured data in Figure 5.6(a) is found at the same meta parameter settings.

For four components (nc = R = 4) the curve of the average PRESS against λ is not as smooth as for two and three components; this is due to the existence of local minima in the calculation of the WSPCA model for several sets of left out elements in the cross-validation procedure. Another interesting feature can be seen in the curve for nc = 4 in Figure 5.6(b): the minimum in the curve is reached at a larger λ value compared to the curves for two and three components. Modeling the data with 4 components leads to an over-fitted model, as can be observed from the high PRESS value at λ = 0. When more noise gets integrated into the model, more emphasis needs to be put on the roughness penalty to reach a minimum.

Figure 5.7: Concentration profiles (concentration versus time) for the first three variables: (a) estimated by using the leave elements out cross-validation procedure (Xleo) with λ = 10^-3; (b) estimated by using the leave elements out cross-validation procedure (Xleo) with λ = 0.7; (c) estimated by using the leave elements out cross-validation procedure (Xleo) with λ = 10^1; (d) X; and (e) Xtrue.

The observation that Xleo is closer to Xtrue than to X is also clearly visualized by Figure 5.7, in which the concentration profiles of the first three variables are plotted as estimated for three different λ values, together with the original data X and the original noiseless data Xtrue. The concentration profiles for λ = 0.7 (close to the optimal λ) in Figure 5.7(b) closely resemble the concentration profiles of the original noiseless data (Xtrue), as shown in Figure 5.7(e). The concentration profiles for λ = 10^-3 (a very small value of λ), shown in Figure 5.7(a), look more like the concentration profiles of the original data, displayed in Figure 5.7(d). This can be explained by the fact that the solution is very similar to the non-regularized solution (λ = 0), or the PCA solution, which tries to model as much of the variation present in the data as possible. For λ = 10, visualized by Figure 5.7(c), the concentration profiles get smoothed too much, resulting in an increase of the average PRESS values in Figure 5.6. Note that the maximum value of var 3 has dropped to 0.2, whereas at λ = 0.7 this maximum closely resembles the true value of 0.27.

5.4.3 E. coli metabolomics data from a batch fermentation

As an example of real life data we applied our WSPCA method to the E. coli batch fermentation metabolomics data set described in Section 5.3.2 to estimate the missing data in this data set, applying smoothness over the time-resolved samples. To determine the model parameters for the WSPCA method we first applied the leave elements out cross-validation procedure of Section 5.2.2 to the data set, where the number of components was varied from 3 to 6 and the smoothing parameter λ was varied between 0 and 1. Figure 5.8 displays the results of the leave elements out cross-validation procedure: the average PRESS divided by the number of elements in the data set, plotted against the smoothing parameter for all combinations. It shows that there is a trade-off between the number of components and the effect of the smoothness parameter. With 3 components the data matrix predicted by means of the leave elements out cross-validation (Xleo) clearly underestimates the underlying noiseless data, as the curve for 4 components lies below the curve for 3 components. When the number of components is high, e.g. 6 components, the unrestricted solution (λ = 0) leads to a high PRESS value. In these cases the introduction of more smoothness by increasing the smoothing parameter helps enormously in lowering the average PRESS. The minimum in the curve for six components, compared to four and five components, clearly shifts towards a higher smoothing parameter value. This is due to the fact that with six components more noise gets modelled and more smoothness needs to be applied to obtain a minimum in the PRESS curve, as was also described in the previous section. The effect of the smoothing parameter λ with four components is only small, since the average PRESS is not reduced much when increasing the smoothing parameter. However, the optimal model parameters, leading to the lowest average PRESS per element in Figure 5.8, are found at four components and λ = 0.06.

Figure 5.8: Average Predicted Error Sum of Squares (avg PRESS) divided by the number of elements in the data set as a function of the smoothing parameter λ, for nc = 3, 4, 5 and 6 components, from applying the leave elements out cross-validation procedure to real life E. coli batch fermentation metabolomics data. Although λ = 0 cannot be shown on a common (base 10) logarithmic scale, it is shown here to indicate the non-regularized solution.

With the optimal model parameters a WSPCA model was built on all available data to estimate the truly missing elements (described in Section 5.3.2). The missing elements were also estimated using a non-regularized WSPCA model (λ = 0) with four components. Figure 5.9 shows the time profiles of the range scaled intensities of the putative metabolites (measured variables) containing the missing elements, substituted with estimates from both the non-regularized model and the model with optimal smoothness. The missing elements predicted with smoothness are close to the non-regularized estimations. This result is not unexpected, since the cross-validation results of Figure 5.8 showed that incorporating smoothness with four components had only a very small effect on the average PRESS. However, based on the simulated data, the estimations of the missing elements with incorporated smoothness should be closer to the true underlying noiseless data and, therefore, be better estimates.

Figure 5.9: Range scaled measured intensities, I/(Imax − Imin), and predicted range scaled intensities (for λ = 0 and λ = 0.06) for missing data in the time profiles of real life E. coli batch fermentation metabolomics data: (a) putative metabolite variable 18; (b) putative metabolite variable 19; (c) putative metabolite variable 220.

5.5 Conclusion

In this paper we have shown that when a multivariate data set consists of samples that are not independent of each other but linked via a clear relationship with expected smoothness in the underlying phenomena, incorporation of smoothness into data analysis can help to estimate the underlying phenomena. The proposed WSPCA method, which puts the smoothness on the calculated scores, is able to capture the true underlying scores and accompanying loadings very well, as shown with our synthetic data set. Contrary to most other multivariate analysis methods we do not restrict the scores to the column space of the data, as smooth solutions most probably lie outside that range. Our leave elements out cross-validation procedure was able to find the number of components and the optimal smoothing parameter λ and proved to give a good estimation of the underlying noiseless data. The WSPCA method, with the model parameters determined by the leave elements out cross-validation procedure, can be applied for missing value estimation and gives better estimations for the missing elements, as they are calculated by incorporating the dynamic phenomena underlying the data.

Acknowledgement

The authors thank Peter J. Punt, TNO, Microbiology & Systems Biology, Zeist, the Netherlands, for permission to use the E. coli batch fermentation data.

Chapter 6: Conclusion and Outlook†

†Copyright © 2012 Maikel P.H. Verouden.

6.1 Conclusion

This thesis shows that fusing prior knowledge with microbial metabolomics is a useful approach. The advantage of this approach was displayed for various aspects of a microbial metabolomics study. First, the use of prior knowledge in the generation of data for studying metabolism has been exemplified by using the genome-scale network of Lactococcus lactis MG1363 to generate flux distributions for multiple in silico environmental conditions, mimicking laboratory growth conditions. This is beneficial, because microbial metabolomics data for answering a specific biological question are not always available or can be too expensive to generate. Next, the advantage of using prior knowledge in preprocessing of data prior to data analysis was clearly demonstrated by a new scaling method, which uses the measurement error of metabolites as prior. Application of this scaling method adjusts the processed measurement data by partly filtering out noise, subsequently allowing the use of other scaling and commonly applied data analysis methods, which would otherwise not be advisable. The advantage of applying prior knowledge in data analysis was shown for two methods. The first method used the experimental design of the E. coli fermentations used in the microbial metabolomics study as prior, and in the second method the expected smoothness of underlying dynamic profiles was used. The first method displayed less overfit and both methods showed improved model estimation with application of prior knowledge. The better model estimation of the second method was successfully applied for estimation of missing values in the data.

6.2 Outlook

Although the benefit of fusing prior knowledge into the data analysis of a microbial metabolomics study was shown for two methods, both using prior knowledge about the samples in the data sets, the big remaining challenge is the fusion of biological knowledge into data analysis and other aspects of a microbial metabolomics study. This means that instead of using knowledge about the samples in the data set, knowledge about the variables (metabolites) needs to be fused into data analysis and other aspects of the study. As mentioned in the general introduction (Chapter 1), the development of methodology guided by biological knowledge to facilitate data analysis has only recently begun. This suggests that plenty of opportunities remain to be explored. Data analysis consists of much more than just the analysis of the data; interpretation and validation of the analysis results are also part of it. In the following, some recent developments and suggestions are discussed with respect to the use of biological knowledge in data analysis.

In a recent paper, methods were explored to focus data analysis on specific areas of interest within the metabolome [147]. The selected methods explored the relation between selected metabolites (e.g. biochemically related metabolites or metabolites of a specific pathway) and the remainder of the metabolome data set. One method focussed on searching for major trends in the behaviour of metabolite concentrations that are common to the metabolites of interest and the remainder of the metabolome, whereas the other method identified the strongest correlations between the metabolites of interest and the remainder of the metabolome. The selected methods proved to be complementary data analysis tools that can both focus the data analysis on areas in the data that are of specific interest and enhance the biological interpretability. It could, however, also be desirable not just to focus the exploration of data on a specific area of interest, but to be able to direct data analysis by putting either more or less emphasis on a particular metabolic pathway. If, for example, it is known beforehand that a certain pathway predominates in the generated data, the ability to put less emphasis on that pathway might reveal interesting information that would otherwise be overwhelmed. Sometimes multiple hypotheses exist with respect to biological knowledge, for example two alternative metabolic pathways. In this case it would be interesting to be able to test which of the two proposed models fits the data best. Conversely, such a test could be used to confirm (validate) the biological knowledge given the measured data. These suggestions require the introduction of metabolic pathways into data analysis methods. This, however, touches upon the main question of how to encode the information [11]. An additional difficulty with respect to the encoding problem is that biological knowledge ranges from purely qualitative information to quantitative information, e.g. detailed kinetic models.
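The idea of relating metabolites of interest to the remainder of the metabolome can be illustrated with a minimal Python sketch. It is not the method of [147]; it simply ranks pairwise Pearson correlations. The function names `pearson` and `strongest_links`, the metabolite names and all concentration values are hypothetical and chosen for illustration only.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def strongest_links(data, of_interest, top=3):
    """Return the `top` (interest, other, r) triples ranked by |r|."""
    others = [m for m in data if m not in of_interest]
    pairs = [(a, b, pearson(data[a], data[b]))
             for a in of_interest for b in others]
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)[:top]

# Hypothetical concentrations over five samples (illustrative values only).
data = {
    "pyruvate": [1.0, 2.0, 3.0, 4.0, 5.0],
    "lactate":  [2.1, 3.9, 6.2, 8.0, 9.9],
    "alanine":  [5.0, 4.1, 2.9, 2.2, 1.0],
    "citrate":  [1.0, 3.0, 2.0, 3.0, 1.0],
}

# Metabolite of interest: pyruvate; everything else is "the remainder".
links = strongest_links(data, of_interest=["pyruvate"], top=2)
for interest, other, r in links:
    print(f"{interest} ~ {other}: r = {r:+.3f}")
```

Ranking by the absolute value of the correlation keeps strong negative relations in view alongside strong positive ones; a real study would also need to account for the number of samples and for multiple testing.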

Interpretation of analysis results is a complicated task due to the large number of metabolites measured in a microbial metabolomics study. The use of a target analysis approach can simplify the interpretation, but only reveals information about the selected metabolites. In a microbial metabolomics study the goal is to gain new insights into the metabolism of the microbe studied; therefore as many metabolites as possible are measured. Apart from metabolites that are related to the research question, many other metabolites are measured that are unrelated to the question. Combining data analysis with a genome-scale metabolic model of the micro-organism could aid the interpretation of analysis results. It could link the analysis results to certain pathways or exclude pathways thought to be involved, thereby generating new insights.

Prior to interpretation of analysis results, validation of those results is also very important. During validation the analysis results are evaluated to determine whether or not they are caused by chance. This is often done by permuting the samples in the data set many times, reanalyzing the permuted data sets and evaluating whether the original result differs significantly from the results of the permuted data sets. This is, however, a statistical approach, which states nothing about the biological validity of the obtained results. Recently two frameworks, named Metabolite Set Enrichment Analysis (MSEA) [148] and Metabolite Pathway Enrichment Analysis (MPEA) [149], have been introduced that allow for inference of biologically meaningful patterns, functions and pathways from metabolomic data. Within such a framework it is possible to evaluate whether a found set of metabolites is represented more than expected by chance within a given compound list. This type of framework is currently available for human and mammalian metabolite data; it would be very interesting to explore possibilities to extend it to microbial metabolomics data.
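The permutation-based validation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the "analysis" is reduced to the absolute difference between two group means, which stands in for whatever multivariate statistic a real study would use. The function name `permutation_p_value` and all data values are hypothetical.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Estimate how often shuffled group labels give a mean difference
    at least as large as the one observed with the original labels."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)            # permute the samples over the groups
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical concentrations of one metabolite in two conditions.
control = [1.1, 0.9, 1.0, 1.2, 0.8, 1.0]
treated = [2.0, 2.2, 1.9, 2.1, 2.3, 1.8]

# Clearly separated groups: few permutations reach the observed difference.
p_separated = permutation_p_value(control, treated)

# A second group made of the same values as the control: no difference at all.
p_same = permutation_p_value(control, [1.0, 1.1, 0.9, 1.2, 0.8, 1.0])

print(p_separated, p_same)
```

A small estimated p-value indicates only that the observed result is unlikely under random relabelling of the samples; as noted in the text, it says nothing about biological validity.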


Bibliography

[1] Madigan MT, Martinko JM, Stahl DA, Clark DP. Brock Biology of Microorganisms. 13th edn. Pearson Education, Inc./Benjamin Cummings, San Francisco, 2010. 2

[2] Fischer E, Sauer U. Large-scale in vivo flux analysis shows rigidity and suboptimal performance of Bacillus subtilis metabolism. Nat. Genet. 2005; 37(6):636–640. 2

[3] Haverkorn van Rijsewijk BRB, Nanchen A, Nallet S, Kleijn RJ, Sauer U. Large-scale 13C-flux analysis reveals distinct transcriptional control of respiratory and fermentative metabolism in Escherichia coli. Mol. Syst. Biol. 2011; 7:477. 2

[4] Stephanopoulos GN, Aristidou AA, Nielsen J. Metabolic Engineering: Principles and Methodologies. 1st edn. Academic Press, San Diego, 1998. 2

[5] Stephanopoulos GN. Metabolic fluxes and metabolic engineering. Metab. Eng. 1999; 1(1):1–11. 2

[6] Fiehn O. Metabolomics – the link between genotypes and phenotypes. Plant Mol. Biol. 2002; 48:155–171. 3

[7] Bino RJ, Hall RD, Fiehn O, Kopka J, Saito K, Draper J, Nikolau BJ, Mendes P, Roessner-Tunali U, Beale MH, Trethewey RN, Lange BM, Wurtele ES, Sumner LW. Potential of metabolomics as a functional genomics tool. Trends Plant Sci. 2004; 9(9):418–425. 3

[8] Oliver SG, Winson MK, Kell DB, Baganz F. Systematic functional analysis of the yeast genome. Trends Biotechnol. 1998; 16(9):373–378. 4

[9] van der Werf MJ, Jellema RH, Hankemeier T. Microbial metabolomics: replacing trial-and-error by the unbiased selection and ranking of targets. J. Ind. Microbiol. Biotechnol. 2005; 32(6):234–252. 4, 7, 57

[10] Brown M, Dunn WB, Ellis DI, Goodacre R, Handl J, Knowles JD, O’Hagan S, Spasic I, Kell DB. A metabolome pipeline: from concept to data to knowledge. Metabolomics 2005; 1:39–51. 4

[11] Hendriks MMWB, van Eeuwijk FA, Jellema RH, Westerhuis JA, Reijmers TH, Hoefsloot HCJ, Smilde AK. Data-processing strategies for metabolomics studies. TrAC, Trends Anal. Chem. 2011; 30(10):1685–1698. 4, 7, 97


[12] van der Werf MJ, Pieterse B, van Luijk N, Schuren F, van der Werff-van der Vat B, Overkamp K, Jellema RH. Multivariate analysis of microarray data by principal component discriminant analysis: prioritizing relevant transcripts linked to the degradation of different carbohydrates in Pseudomonas putida S12. Microbiology 2006; 152(1):257–272. 4, 48

[13] Braaksma M, Bijlsma S, Coulier L, Punt PJ, van der Werf MJ. Metabolomics as a tool for target identification in strain improvement: the influence of phenotype definition. Microbiology 2011; 157(1):147–159. 4, 10

[14] Willey JM, Sherwood LM, Woolverton CJ. Prescott, Harley, and Klein’s Microbiology. 7th edn. McGraw-Hill Higher Education, New York, 2008. 4

[15] Stanbury PF, Whitaker A, Hall SJ. Principles of Fermentation Technology. 2nd edn. Elsevier Science Ltd./Butterworth-Heinemann, Burlington, 1995. 4

[16] Novick A, Szilard L. Description of the Chemostat. Science 1950; 112(2920):715–716. 4

[17] Herbert D, Elsworth R, Telling RC. The continuous culture of bacteria; a theoretical and experimental study. Microbiology 1956; 14(3):601–622. 4

[18] Novick A. Growth of Bacteria. Annu. Rev. Microbiol. 1955; 9(1):97–110. 6

[19] Demain AL. Microbial production of primary metabolites. Naturwissenschaften 1980; 67:582–587. 6

[20] Fraenkel GS. The raison d’être of secondary plant substances. Science 1959; 129(3361):1466–1470. 6

[21] Hanson JR. Natural Products: The Secondary Metabolites, Tutorial Chemistry Texts, vol. 17. The Royal Society of Chemistry, Cambridge, UK, 2003. 6

[22] Mashego MR, Rumbold K, De Mey M, Vandamme E, Soetaert W, Heijnen JJ. Microbial metabolomics: past, present and future methodologies. Biotechnol. Lett. 2007; 29(1):1–16. 7, 57

[23] Fiehn O. Combining genomics, metabolome analysis, and biochemical modelling to understand metabolic networks. Comparative and Functional Genomics 2001; 2(3):155–168. 7

[24] Allen J, Davey HM, Broadhurst D, Heald JK, Rowland JJ, Oliver SG, Kell DB. High-throughput classification of yeast mutants for functional genomics using metabolic footprinting. Nat. Biotechnol. 2003; 21(6):692–696. 7

[25] Dunn WB, Bailey NJC, Johnson HE. Measuring the metabolome: current analytical technologies. Analyst 2005; 130:606–625. 7

[26] Krastanov A. Metabolomics - The State of Art. Biotechnol. & Biotechnol. Eq. 2010; 24(1):1537–1543. 7


[27] Sweetlove LJ, Last RL, Fernie AR. Predictive Metabolic Engineering: A Goal for Systems Biology. Plant Physiol. 2003; 132(2):420–425. 7

[28] Villas-Bôas SG, Mas S, Åkesson M, Smedsgaard J, Nielsen J. Mass spectrometry in metabolome analysis. Mass Spectrom. Rev. 2005; 24(5):613–646. 7, 8

[29] Villas-Bôas SG, Højer-Pedersen J, Åkesson M, Smedsgaard J, Nielsen J. Global metabolite analysis of yeast: evaluation of sample preparation methods. Yeast 2005; 22(14):1155–1169. 7

[30] Kell DB, Brown M, Davey HM, Dunn WB, Spasic I, Oliver SG. Metabolic footprinting and systems biology: the medium is the message. Nat. Rev. Microbiol. 2005; 3(7):557–565. 8

[31] Nielsen J, Oliver S. The next wave in metabolome analysis. Trends Biotechnol. 2005; 23(11):544–546. 8

[32] Mapelli V, Olsson L, Nielsen J. Metabolic footprinting in microbiology: methods and applications in functional genomics and biotechnology. Trends Biotechnol. 2008; 26(9):490–497. 8

[33] Goodacre R, Vaidyanathan S, Dunn WB, Harrigan GG, Kell DB. Metabolomics by numbers: acquiring and understanding global metabolite data. Trends Biotechnol. 2004; 22(5):245–252. 8

[34] van der Werf MJ, Overkamp KM, Muilwijk B, Coulier L, Hankemeier T. Microbial metabolomics: Toward a platform with full metabolome coverage. Anal. Biochem. 2007; 370(1):17–25. 8

[35] Bundy JG, Willey TL, Castell RS, Ellar DJ, Brindle KM. Discrimination of pathogenic clinical isolates and laboratory strains of Bacillus cereus by NMR-based metabolomic profiling. FEMS Microbiol. Lett. 2005; 242(1):127–136. 8

[36] Nicholson JK, Lindon JC. Systems biology: Metabonomics. Nature 2008; 455(7216):1054–1056. 8

[37] Boroujerdi AFB, Vizcaino MI, Meyers A, Pollock EC, Huynh SL, Schock TB, Morris PJ, Bearden DW. NMR-Based Microbial Metabolomics and the Temperature-Dependent Coral Pathogen Vibrio coralliilyticus. Environ. Sci. Technol. 2009; 43(20):7658–7664. 8

[38] Dettmer K, Aronov PA, Hammock BD. Mass Spectrometry-based metabolomics. Mass Spectrom. Rev. 2007; 26(1):51–78. 8

[39] Morrow Jr KJ. Mass Spec Central to Metabolomics. Genetic Engineering & Biotechnology News 2010; 30(7):1–3. 8

[40] Koek MM, Muilwijk B, van der Werf MJ, Hankemeier T. Microbial Metabolomics with Gas Chromatography/Mass Spectrometry. Anal. Chem. 2006; 78(4):1272–1281. 8, 65


[41] Zhou B, Xiao JF, Tuli L, Ressom HW. LC-MS-based metabolomics. Mol. BioSyst. 2012; 8:470–481. 8

[42] Issaq HJ, Abbott E, Veenstra TD. Utility of separation science in metabolomic studies. J. Sep. Sci. 2008; 31(11):1936–1947. 8

[43] Colby BN. Spectral deconvolution for overlapping GC/MS components. J. Am. Soc. Mass Spectrom. 1992; 3(5):558–562. 9

[44] Herron NR, Donnelly JR, Sovocool GW. Software-based mass spectral enhancement to remove interferences from spectra of unknowns. J. Am. Soc. Mass Spectrom. 1996; 7(6):598–604. 9

[45] Jellema RH, Krishnan S, Hendriks MMWB, Muilwijk B, Vogels JTWE. Deconvolution using signal segmentation. Chemom. Intell. Lab. Syst. 2010; 104(1):132–139. 9

[46] Halket JM, Waterman D, Przyborowska AM, Patel RKP, Fraser PD, Bramley PM. Chemical derivatization and mass spectral libraries in metabolic profiling by GC/MS and LC/MS/MS. J. Exp. Bot. 2005; 56(410):219–243. 9

[47] Kol S, Merlo ME, Scheltema RA, de Vries M, Vonk RJ, Kikkert NA, Dijkhuizen L, Breitling R, Takano E. Metabolomic Characterization of the Salt Stress Response in Streptomyces coelicolor. Appl. Environ. Microbiol. 2010; 76(8):2574–2581. 10

[48] Massart DL, Vandeginste BGM, Buydens LMC, de Jong S, Lewi PJ, Smeyers-Verbeke J. Handbook of Chemometrics and Qualimetrics: Part A, Data handling in science and technology, vol. 20A. Elsevier Science B.V., Amsterdam, 1997. 10, 61, 108

[49] Vandeginste BGM, Massart DL, Buydens LMC, de Jong S, Lewi PJ, Smeyers-Verbeke J. Handbook of Chemometrics and Qualimetrics: Part B, Data handling in science and technology, vol. 20B. Elsevier Science B.V., Amsterdam, 1997. 10

[50] Wold S, Sjöström M, Eriksson L. PLS-regression: a basic tool of chemometrics. Chemom. Intell. Lab. Syst. 2001; 58(2):109–130. 10

[51] Barker M, Rayens W. Partial least squares for discrimination. J. Chemom. 2003; 17(3):166–173. 10

[52] Bro R. Multiway calibration. Multilinear PLS. J. Chemom. 1996; 10(1):47–61. 10

[53] Smilde AK. Comments on multilinear PLS. J. Chemom. 1997; 11(5):367–377. 10

[54] Trygg J, Wold S. Orthogonal projections to latent structures (O-PLS). J. Chemom. 2002; 16(3):119–128. 10


[55] Bylesjö M, Rantalainen M, Cloarec O, Nicholson JK, Holmes E, Trygg J. OPLS discriminant analysis: combining the strengths of PLS-DA and SIMCA classification. J. Chemom. 2006; 20(8–10):341–351. 10

[56] Jackson JE. A User’s Guide to Principal Components. 1st edn. Wiley Series in Probability and Statistics, John Wiley & Sons, Inc., New York, 1991. 10

[57] Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edn. Springer Series in Statistics, Springer Science+Business Media, LLC, New York, 2009. 11

[58] van den Berg RA, Hoefsloot HCJ, Westerhuis JA, Smilde AK, van der Werf MJ. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 2006; 7(1):142. 11, 43, 48, 57, 83

[59] Kvalheim OM, Brakstad F, Liang Y. Preprocessing of analytical profiles in the presence of homoscedastic or heteroscedastic noise. Anal. Chem. 1994; 66(1):43–51. 11, 43

[60] Smilde AK, Jansen JJ, Hoefsloot HCJ, Lamers RJAN, van der Greef J, Timmerman ME. ANOVA-simultaneous component analysis (ASCA): a new tool for analyzing designed metabolomics data. Bioinformatics 2005; 21(13):3043–3048. 12, 58, 59

[61] Jansen JJ, Hoefsloot HCJ, van der Greef J, Timmerman ME, Westerhuis JA, Smilde AK. ASCA: analysis of multivariate data obtained from an experimental design. J. Chemom. 2005; 19(9):469–481. 12, 57, 59, 73

[62] Westerhuis JA, Derks EPPA, Hoefsloot HCJ, Smilde AK. Grey Component Analysis. J. Chemom. 2007; 21(10–11):474–485. 12, 57

[63] Francke C, Siezen RJ, Teusink B. Reconstructing the metabolic network of a bacterium from its genome. Trends Microbiol. 2005; 13(11):550–558. 17

[64] Reed JL, Famili I, Thiele I, Palsson BØ. Towards multidimensional genome annotation. Nat. Rev. Genet. 2006; 7(2):130–141. 17

[65] Notebaart RA, van Enckevort FHJ, Francke C, Siezen RJ, Teusink B. Accelerating the reconstruction of genome-scale metabolic networks. BMC Bioinf. 2006; 7:296. 17, 18

[66] Satish Kumar V, Dasika MS, Maranas CD. Optimization based automated curation of metabolic reconstructions. BMC Bioinf. 2007; 8:212. 17

[67] DeJongh M, Formsma K, Boillot P, Gould J, Rycenga M, Best A. Toward the automated generation of genome-scale metabolic networks in the SEED. BMC Bioinf. 2007; 8:139. 17


[68] Varma A, Palsson BØ. Stoichiometric flux balance models quantitatively predict growth and metabolic by-product secretion in wild-type Escherichia coli W3110. Appl. Environ. Microbiol. 1994; 60(10):3724–3731. 17, 19

[69] Ibarra RU, Edwards JS, Palsson BØ. Escherichia coli K-12 undergoes adaptive evolution to achieve in silico predicted optimal growth. Nature (London, U.K.) 2002; 420(6912):186–189. 17

[70] Price ND, Reed JL, Palsson BØ. Genome-scale models of microbial cells: evaluating the consequences of constraints. Nat. Rev. Microbiol. 2004; 2(11):886–897. 17, 19, 21

[71] Teusink B, Wiersma A, Molenaar D, Francke C, de Vos WM, Siezen RJ, Smid EJ. Analysis of growth of Lactobacillus plantarum WCFS1 on a complex medium using a genome-scale metabolic model. J. Biol. Chem. 2006; 281(52):40041–40048. 17

[72] Price ND, Schellenberger J, Palsson BØ. Uniform Sampling of Steady-State Flux Spaces: Means to Design Experiments and to Interpret Enzymopathies. Biophys. J. 2004; 87(4):2172–2186. 17, 20

[73] Burgard AP, Nikolaev EV, Schilling CH, Maranas CD. Flux Coupling Analysis of Genome-Scale Metabolic Network Reconstructions. Genome Res. 2004; 14(2):301–312. 17, 21

[74] Notebaart RA, Teusink B, Siezen RJ, Papp B. Co-regulation of metabolic genes is better explained by flux coupling than by network distance. PLoS Comput. Biol. 2008; 4(1):e26. 17

[75] Becker SA, Feist AM, Mo ML, Hannum G, Palsson BØ, Herrgard MJ. Quantitative prediction of cellular metabolism with constraint-based models: the COBRA Toolbox. Nat. Protoc. 2007; 2(3):727–738. 17, 20

[76] Poolman MG, Sebu C, Pidcock MK, Fell DA. Modular decomposition of metabolic systems via null-space analysis. J. Theor. Biol. 2007; 249(4):691–705. 17, 22

[77] Wegmann U, O’Connell-Motherway M, Zomer A, Buist G, Shearman C, Canchaya C, Ventura M, Goesmann A, Gasson MJ, Kuipers OP, van Sinderen D, Kok J. Complete genome sequence of the prototype lactic acid bacterium Lactococcus lactis subsp. cremoris MG1363. J. Bacteriol. 2007; 189(8):3256–3270. 18

[78] Lee JM, Gianchandani EP, Papin JA. Flux balance analysis in the era of metabolomics. Brief. Bioinform. 2006; 7(2):140–150. 19

[79] Edwards JS, Covert M, Palsson BØ. Metabolic modelling of microbes: the flux-balance approach. Environ. Microbiol. 2002; 4(3):133–140. 19

[80] Wagner C, Urbanczik R. The Geometry of the Flux Cone of a Metabolic Network. Biophys. J. 2005; 89(6):3837–3845. 20


[81] Schwender J, Ohlrogge J, Shachar-Hill Y. Understanding flux in plant metabolic networks. Curr. Opin. Plant Biol. 2004; 7(3):309–317. 21

[82] Samoilov M, Plyasunov S, Arkin AP. Stochastic amplification and signaling in enzymatic futile cycles through noise-induced bistability with oscillations. Proc. Natl. Acad. Sci. U. S. A. 2005; 102(7):2310–2315. 21

[83] Beard DA, Liang Sd, Qian H. Energy Balance for Analysis of Complex Metabolic Networks. Biophys. J. 2002; 83(1):79–86. 21

[84] Kümmel A, Panke S, Heinemann M. Systematic assignment of thermodynamic constraints in metabolic network models. BMC Bioinf. 2006; 7:512. 21

[85] Henry CS, Broadbelt LJ, Hatzimanikatis V. Thermodynamics-Based Metabolic Flux Analysis. Biophys. J. 2007; 92(5):1792–1805. 21

[86] Pfeiffer T, Sánchez-Valdenebro I, Nuño J, Montero F, Schuster S. METATOOL: for studying metabolic networks. Bioinformatics 1999; 15(3):251–257. 22

[87] Bro R, Smilde AK. Centering and scaling in component analysis. J. Chemom. 2003; 17(1):16–33. 23, 43

[88] Horan C. Multidimensional scaling: Combining observations when individuals have different perceptual structures. Psychometrika 1969; 34(2):139–165. 23

[89] Carroll JD, Chang JJ. Analysis of individual differences in multidimensional scaling via an n-way generalization of “Eckart-Young” decomposition. Psychometrika 1970; 35(3):283–319. 23, 24

[90] Carroll JD, Pruzansky S. The CANDECOMP-CANDELINC Family of Models and Methods for Multidimensional Data Analysis, chap. 10. In Law et al. [150], 372–402. 24

[91] Harshman R. Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multimodal factor analysis. UCLA Working Papers in Phonetics 1970; 16:1–84. 24

[92] Harshman RA, Lundy ME. The PARAFAC Model for Three-Way Factor Analysis and Multidimensional Scaling, chap. 5. In Law et al. [150], 122–215. 24

[93] Smilde A, Bro R, Geladi P. Multi-way analysis with applications in the chemical sciences. John Wiley & Sons, Ltd, Chichester, 2004. 24, 35

[94] ten Berge JM, Kiers HA. Some clarifications of the CANDECOMP algorithm applied to INDSCAL. Psychometrika 1991; 56(2):317–326. 24

[95] Wold S, Esbensen K, Geladi P. Principal Component Analysis. Chemom. Intell. Lab. Syst. 1987; 2(1–3):37–52. 24, 59

[96] Jolliffe IT. Principal Component Analysis, Springer Series in Statistics, vol. XXIX. 2nd edn. Springer-Verlag, New York, 2002. 24, 35, 44, 59


[97] Andersson GG, Dable BK, Booksh KS. Weighted parallel factor analysis for calibration of HPLC-UV/Vis spectrometers in the presence of Beer’s law deviations. Chemom. Intell. Lab. Syst. 1999; 49(2):195–213. 39

[98] Vega-Montoto L, Wentzell PD. Maximum likelihood parallel factor analysis (MLPARAFAC). J. Chemom. 2003; 17(4):237–253. 39

[99] Bro R, Sidiropoulos ND, Smilde AK. Maximum likelihood fitting using ordinary least squares algorithms. J. Chemom. 2002; 16(8–10):387–400. 39, 43, 45, 78

[100] Hageman JA, Hendriks MMWB, Westerhuis JA, van der Werf MJ, Berger R, Smilde AK. Simplivariate Models: Ideas and First Examples. PLoS One 2008; 3(9):e3259. 40

[101] Schepers J, van Mechelen I, Ceulemans E. Three-mode partitioning. Comput. Stat. Data Anal. 2006; 51(3):1623–1642. 40

[102] Keller HR, Massart DL, Liang YZ, Kvalheim OM. Evolving factor analysis in the presence of heteroscedastic noise. Anal. Chim. Acta 1992; 263(1–2):29–36. 43

[103] Kiers HAL. Weighted least squares fitting using ordinary least squares algorithms. Psychometrika 1997; 62(2):251–266. 43, 45, 78

[104] Wentzell PD, Andrews DT, Hamilton DC, Faber K, Kowalski BR. Maximum likelihood principal component analysis. J. Chemom. 1997; 11(4):339–366. 43, 45

[105] Jansen JJ, Hoefsloot HCJ, Boelens HFM, van der Greef J, Smilde AK. Analysis of longitudinal metabolomics data. Bioinformatics 2004; 20(15):2438–2446. 43, 50, 78

[106] Chen ZP, Morris J, Martin E, Hammond RB, Lai X, Ma C, Purba E, Roberts KJ, Bytheway R. Enhancing the signal-to-noise ratio of X-ray diffraction profiles by smoothed principal component analysis. Anal. Chem. 2005; 77(20):6563–6570. 43

[107] Schuermans M, Markovsky I, Van Huffel S. An adapted version of the element-wise weighted total least squares method for applications in chemometrics. Chemom. Intell. Lab. Syst. 2007; 85(1):40–46. 45

[108] Schuermans M, Markovsky I, Wentzell PD, Van Huffel S. On the equivalence between total least squares and maximum likelihood PCA. Anal. Chim. Acta 2005; 544(1–2):254–267. 45

[109] Hendriks MMWB, Cruz-Juarez L, De Bont D, Hall RD. Preprocessing and exploratory analysis of chromatographic profiles of plant extracts. Anal. Chim. Acta 2005; 545(1):53–64. 50


[110] Tauler R, Kowalski B, Fleming S. Multivariate curve resolution applied to spectral data from multiple runs of an industrial process. Anal. Chem. 1993; 65(15):2040–2047. 57

[111] Ramsay JO, Silverman BW. Functional Data Analysis. 2nd edn. Springer Series in Statistics, Springer Science+Business Media, Inc., 233 Spring St., New York, NY 10013, USA, 2005. 57, 77

[112] Liao JC, Boscolo R, Yang YL, Tran LM, Sabatti C, Roychowdhury VP. Network component analysis: reconstruction of regulatory signals in biological systems. Proc. Natl. Acad. Sci. U. S. A. 2003; 100(26):15522–15527. 57

[113] Galbraith SJ, Tran LM, Liao JC. Transcriptome network component analysis with limited microarray data. Bioinformatics 2006; 22(15):1886–1894. 57

[114] Trygg J, Holmes E, Lundstedt T. Chemometrics in Metabonomics. J. Proteome Res. 2007; 6(2):469–479. 57

[115] Antti H, Ebbels TM, Keun HC, Bollard ME, Beckonert O, Lindon JC, Nicholson JK, Holmes E. Statistical experimental design and partial least squares regression analysis of biofluid metabonomic NMR and clinical chemistry data for screening of adverse drug effects. Chemom. Intell. Lab. Syst. 2004; 73(1):139–149. 58

[116] Gullberg J, Jonsson P, Nordstrom A, Sjostrom M, Moritz T. Design of experiments: an efficient strategy to identify factors influencing extraction and derivatization of Arabidopsis thaliana samples in metabolomic studies with gas chromatography/mass spectrometry. Anal. Biochem. 2004; 331(2):283–295. 58

[117] Pieterse B, Leer RJ, Schuren FH, van der Werf MJ. Unravelling the multiple effects of lactic acid stress on Lactobacillus plantarum by transcription profiling. Microbiology (Reading, U.K.) 2005; 151(12):3881–3894. 58

[118] Ståhle L, Wold S. Multivariate analysis of variance (MANOVA). Chemom. Intell. Lab. Syst. 1990; 9(2):127–141. 58

[119] Höskuldsson A. Experimental design and priority PLS regression. J. Chemom. 1996; 10(5–6):637–668. 58

[120] Martens H, Høy M, Westad F, Folkenberg D, Martens M. Analysis of designed experiments by stabilised PLS Regression and jack-knifing. Chemom. Intell. Lab. Syst. 2001; 58(2):151–170. 58

[121] de B Harrington P, Vieira NE, Espinoza J, Nien JK, Romero R, Yergey AL. Analysis of variance-principal component analysis: a soft tool for proteomic discovery. Anal. Chim. Acta 2005; 544(1–2):118–127. 58

[122] Van den Brink PJ, Ter Braak CJ. Principal Response Curves: analysis of time-dependent multivariate responses of biological community to stress. Environ. Toxicol. Chem. 1999; 18(2):138–148. 58


[123] Vis DJ, Westerhuis JA, Smilde AK, van der Greef J. Statistical validation of megavariate effects in ASCA. BMC Bioinf. 2007; 8(1):322. 58

[124] Ducruix C, Vailhen D, Werner E, Fievet JB, Bourguignon J, Tabet JC, Ezan E, Junot C. Metabolomic investigation of the response of the model plant Arabidopsis thaliana to cadmium exposure: Evaluation of data pretreatment methods for further statistical analyses. Chemom. Intell. Lab. Syst. 2008; 91(1):67–77. 58

[125] Smilde AK, Hoefsloot HCJ, Westerhuis JA. The geometry of ASCA. J. Chemom. 2008; 22(8):464–471. 58

[126] Deming SN, Morgan SL. Approximating a Region of a Multifactor Response Surface, chap. 11. Vol. 3 of Data handling in science and technology [127], 181–221. 60

[127] Deming SN, Morgan SL. Experimental design: a chemometric approach, Data handling in science and technology, vol. 3. Elsevier Science Publishers B.V., Amsterdam, 1987. 60, 108

[128] Massart DL, Vandeginste BGM, Buydens LMC, de Jong S, Lewi PJ, Smeyers-Verbeke J. Two-level Factorial Designs, chap. 22. Vol. 20A of Data handling in science and technology [48], 659–682. 61

[129] Rubingh CM, Bijlsma S, Jellema RH, Overkamp KM, van der Werf MJ, Smilde AK. Analyzing Longitudinal Microbial Metabolomics Data. J. Proteome Res. 2009; 8(9):4319–4327. 65, 83

[130] Hui BS, Wold HOA. Consistency and consistency at large of Partial Least Squares estimates, chap. 5. In Jöreskog and Wold [131], 119–130. 68

[131] Jöreskog KG, Wold HOA (Eds.) Systems Under Indirect Observation; Part II. North-Holland Publishing Company, Amsterdam, 1982. 68, 108

[132] Smilde AK, Westerhuis JA, Hoefsloot HCJ, Bijlsma S, Rubingh CM, Vis DJ, Jellema RH, Pijl H, Roelfsema F, van der Greef J. Dynamic metabolomic data analysis: a tutorial review. Metabolomics 2010; 6(1):3–17. 77

[133] Eilers PHC. A Perfect Smoother. Anal. Chem. 2003; 75(14):3631–3636. 77

[134] Silverman BW. Smoothed Functional Principal Components Analysis by Choice of Norm. Ann. Stat. 1996; 24(1):1–24. 77

[135] Chen ZP, Liang YZ, Jiang JH, Li Y, Qian JY, Yu RQ. Determination of the number of components in mixtures using a new approach incorporating chemical information. J. Chemom. 1999; 13(1):15–30. 77

[136] Yamamoto H, Yamaji H, Abe Y, Harada K, Waluyo D, Fukusaki E, Kondo A, Ohno H, Fukuda H. Dimensionality reduction for metabolome data using PCA, PLS, OPLS, and RFDA with differential penalties to latent variables. Chemom. Intell. Lab. Syst. 2009; 98(2):136–142. 77


[137] Hoefsloot HCJ, Verouden MPH, Westerhuis JA, Smilde AK. Maximum likelihood scaling (MALS). J. Chemom. 2006; 20(3–4):120–127. 78

[138] Burnham AJ, MacGregor JF, Viveros R. A statistical framework for multivariate latent variable regression methods based on maximum likelihood. J. Chemom. 1999; 13(1):49–65. 80

[139] Rajkó R. Some surprising properties of multivariate curve resolution-alternating least squares (MCR-ALS) algorithms. J. Chemom. 2009; 23(4):172–178. 80

[140] Tauler R. Comments on a recently published paper ‘Some surprising properties of multivariate curve resolution-alternating least squares (MCR-ALS) algorithms’. J. Chemom. 2010; 24(2):87–90. 80

[141] Rajkó R. Rejoinder to ‘Comments on a recently published paper “Some surprising properties of multivariate curve resolution-alternating least squares (MCR-ALS) algorithms”’. J. Chemom. 2010; 24(2):91–93. 80

[142] Bro R, Kjeldahl K, Smilde AK, Kiers HAL. Cross-validation of component models: a critical look at current methods. Anal. Bioanal. Chem. 2008; 390(5):1241–1251. 80

[143] Wold S. Cross-Validatory Estimation of the Number of Components in Factor and Principal Components Models. Technometrics 1978; 20(4):397–405. 80

[144] Hansen PC. Analysis of Discrete Ill-Posed Problems by Means of the L-Curve. SIAM Rev. 1992; 34(4):561–580. 86

[145] Lawson CL, Hanson RJ. Solving least squares problems. Prentice-Hall series in automatic computation, Prentice-Hall, Inc., Englewood Cliffs, NJ, USA, 1974. 86

[146] Björck A, Golub GH. Numerical methods for computing angles between linear subspaces. Math. Comp. 1973; 27(123):579–594. 86

[147] van den Berg RA, Rubingh CM, Westerhuis JA, van der Werf MJ, Smilde AK. Metabolomics data exploration guided by prior knowledge. Anal. Chim. Acta 2009; 651(2):173–181. 96

[148] Xia J, Wishart DS. MSEA: a web-based tool to identify biologically meaningful patterns in quantitative metabolomic data. Nucleic Acids Res. 2010; 38(suppl 2):W71–W77. 97

[149] Kankainen M, Gopalacharyulu P, Holm L, Orešič M. MPEA—metabolite pathway enrichment analysis. Bioinformatics 2011; 27(13):1878–1879. 97

[150] Law HG, Snyder Jr CW, Hattie JA, McDonald RP (Eds.) Research methods for multimode data analysis. Praeger Publishers, New York, 1984. 105


Summary

This thesis, entitled “Fusing prior knowledge with microbial metabolomics”, deals with incorporating prior knowledge into microbial metabolomics. Chapter 1 contains a general introduction that puts the thesis into a framework. It describes the background of metabolomics and discusses the various aspects of a (microbial) metabolomics study: 1) the origin of samples, 2) measurement of samples and processing of the raw data, and finally 3) data analysis and the challenges posed by the data and the methods used. In the section about data analysis, fusion of prior knowledge about the study is introduced as a solution to deal with the challenges the data and methods present. However, fusion of prior knowledge is not restricted solely to data analysis but can also be applied within the other aspects of a microbial metabolomics study.

In a microbial metabolomics study the data for studying metabolism are obtained by sampling from fermentations followed by metabolome analysis of these samples. Data for studying the metabolism of a micro-organism can, however, also be obtained in silico. The genome sequences of many organisms and information about gene-protein-reaction associations with respect to these organisms can be obtained from databases, textbooks and other scientific publications. This information can subsequently be used as prior knowledge for constructing genome-scale metabolic networks. In cellular systems biology these networks are used to model and study the behaviour of metabolism in the context of cell growth in terms of fluxes (reaction rates) through reactions in the network. Because the flux through each reaction can generally vary within a range, many flux distributions of the entire network are possible. However, since reactions are connected by common metabolites, reactions that are functionally coherent are expected to correlate highly in terms of their flux value over different flux distributions.
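The expectation that coupled reactions correlate over flux distributions can be illustrated with a minimal sketch. The five-reaction toy network, the uniform sampling of the two free fluxes and all variable names below are assumptions made for illustration only; they are not the genome-scale model or the sampling procedure used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000

# Hypothetical toy network (not the L. lactis model):
#   v1: -> A,  v2: A -> B,  v3: B ->,  v4: A -> C,  v5: C ->
# Steady state forces v1 = v2 + v4, v3 = v2 and v5 = v4.
v2 = rng.uniform(0.0, 10.0, n_samples)
v4 = rng.uniform(0.0, 10.0, n_samples)
fluxes = np.column_stack([v2 + v4, v2, v2, v4, v4])  # columns v1..v5

# Stoichiometric matrix S (rows: metabolites A, B, C; columns: v1..v5).
S = np.array([[1, -1, 0, -1, 0],
              [0, 1, -1, 0, 0],
              [0, 0, 0, 1, -1]])
# Every sampled flux distribution satisfies the steady-state condition S v = 0.
assert np.allclose(S @ fluxes.T, 0)

# Correlation between reactions over the sampled flux distributions.
corr = np.corrcoef(fluxes, rowvar=False)
print(np.round(corr, 2))
```

In this toy case the reactions on the same linear branch (v2 and v3, or v4 and v5) are perfectly correlated, whereas reactions on different branches are nearly uncorrelated.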

In Chapter 2 the genome-scale network of a lactic acid bacterium, Lactococcus lactis MG1363, is used to generate flux distributions for multiple in silico environmental conditions, mimicking laboratory growth conditions. The flux distributions per condition are used to calculate a correlation matrix for each condition. Subsequently the correlations between the reactions are analyzed in a multivariate approach across the in silico environmental conditions in order to identify correlations that are invariant (i.e. independent of the environment) and correlations that are variant across conditions (i.e. dependent on the environment). The applied multivariate methods are parallel factor analysis (PARAFAC) and principal component analysis (PCA). The discussion of the results of both methods leads to the question of whether latent variable models are suitable for analyzing this type of data.

One of the aims of metabolomics is to quantify the full metabolome (all metabolites present in a sample). Commonly used data analytical methods for analyzing metabolomics data work optimally when each measured variable has the same probability distribution as the others and all variables are mutually independent, that is, when each variable is independent and identically distributed (i.i.d.). Prior to data analysis the data are often first preprocessed by scaling. Metabolomics data, however, are hardly ever i.i.d., and scaling can easily result in noise amplification and an increase in heteroscedasticity. In many cases prior knowledge exists about the noise characteristics of the data, either from experience with the measurement platform(s), or from estimation of the measurement error from repeated measurements or biological repeats.
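A small numerical sketch may help to see how scaling can amplify noise. It assumes autoscaling (division of each mean-centered variable by its standard deviation) as the scaling method and uses made-up noise levels; both are illustrative assumptions, not settings taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
signal = np.linspace(-5.0, 5.0, n)

# Column 0: induced variation plus noise; column 1: noise only.
# The noise levels (0.5 and 0.05) are hypothetical.
X = np.column_stack([signal + rng.normal(0.0, 0.5, n),
                     rng.normal(0.0, 0.05, n)])

Xc = X - X.mean(axis=0)                   # mean-centering
autoscaled = Xc / Xc.std(axis=0, ddof=1)  # autoscaling to unit variance

# Share of the total variance carried by each variable.
var_before = Xc.var(axis=0, ddof=1)
var_after = autoscaled.var(axis=0, ddof=1)
share_before = var_before / var_before.sum()
share_after = var_after / var_after.sum()
print(share_before, share_after)
```

Before scaling, the noise-only variable carries a negligible share of the total variance; after autoscaling it carries exactly as much as the informative variable, so a subsequent variance-based method treats its noise as equally important.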

In Chapter 3 a filtering procedure for multivariate data is introduced that uses the noise characteristics of the data and does not suffer from noise amplification by scaling. A maximum likelihood principal component analysis (MLPCA) step is used as a filter that partly removes noise. This filtering can be used prior to any subsequent scaling and multivariate analysis of the data and is especially useful for data with moderate and low signal-to-noise ratios, such as metabolomics data, but it is also applicable to proteomics and transcriptomics data with these properties.
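The idea of using known noise characteristics for filtering can be sketched for one simplified special case: independent noise whose standard deviation is known per variable and constant over samples. Under that assumption the maximum likelihood rank-k estimate can be obtained by weighting each column with its noise standard deviation, truncating the SVD and undoing the weighting. General MLPCA handles more general error structures with an iterative algorithm; the function name `ml_lowrank_filter` and the simulated data are hypothetical.

```python
import numpy as np

def ml_lowrank_filter(X, noise_sd, n_components):
    """Rank-k ML estimate of X for independent noise with known per-column SD.

    Simplified special case only; not the general MLPCA algorithm.
    """
    Z = X / noise_sd                                   # weight by noise SD
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    Z_hat = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components]
    return Z_hat * noise_sd                            # back to original units

rng = np.random.default_rng(2)
n, p, k = 50, 20, 2
truth = rng.normal(size=(n, k)) @ rng.normal(size=(k, p))   # noiseless rank-2 data
noise_sd = rng.uniform(0.1, 1.0, p)                          # assumed known
X = truth + rng.normal(size=(n, p)) * noise_sd

filtered = ml_lowrank_filter(X, noise_sd, k)
err_raw = np.linalg.norm(X - truth)
err_filtered = np.linalg.norm(filtered - truth)
print(err_raw, err_filtered)
```

The filtered matrix is expressed in the original units, so any scaling and multivariate analysis can still be applied afterwards.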

In metabolomics research a large number of metabolites are measured that reflect the cellular state under the experimental conditions studied. In many cases the experiments are performed according to an experimental design to make sure that sufficient variation is induced in the metabolite concentrations. However, as metabolomics is a holistic approach, a large number of metabolites are also measured in which no variation is induced by the experimental design. The presence of such non-induced metabolites hampers traditional data analysis methods such as PCA in estimating the true model of the induced variation. The greediness of PCA leads to a clear overfit of the metabolomics data and can lead to a bad selection of important metabolites.

Chapter 4 explores how, why and how severely PCA overfits data with an underlying experimental design. Recently, new data analysis methods have been introduced that can use prior information about the system to reduce the overfit. This chapter shows that incorporation of prior knowledge of the system under investigation leads to a better estimation of the true underlying structure and to less overfit. The experimental design information together with ANOVA-simultaneous component analysis (ASCA) is used to improve the analysis of metabolomics data. To show the improved model estimation property of ASCA, a thorough simulation study is used and the results are extended to a microbial metabolomics batch fermentation study. The ASCA model is much less affected by the non-induced variation and measurement error than PCA, leading to a much better model of the induced variation.
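How design information enters an ASCA-type analysis can be sketched for the simplest case of a single balanced design factor: the centered data are split into a matrix of level means (the induced variation) and a residual matrix, and the component analysis is applied to the level-means matrix only. The design, effect sizes and variable names below are hypothetical and serve only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical one-factor design: 3 levels x 10 replicates, 30 metabolites.
levels = np.repeat([0, 1, 2], 10)
n, p = levels.size, 30
effect = np.zeros((3, p))
effect[:, :5] = np.array([[-2.0], [0.0], [2.0]])   # only 5 metabolites are induced
X = effect[levels] + rng.normal(0.0, 1.0, (n, p))

Xc = X - X.mean(axis=0)

# ANOVA step: replace every sample by the mean of its design level.
level_means = np.vstack([Xc[levels == g].mean(axis=0) for g in range(3)])
X_A = level_means[levels]       # induced (between-level) variation
residual = Xc - X_A             # within-level variation

# Component step: PCA (via SVD) on the effect matrix only.
U, s, Vt = np.linalg.svd(X_A, full_matrices=False)
asca_loading = Vt[0]
induced_weight = np.sum(asca_loading[:5] ** 2)
print(round(induced_weight, 3))
```

Because averaging over replicates suppresses the within-level noise before the component step, the first loading concentrates on the induced metabolites in this toy example.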

Longitudinal data play an important role in the various fields of functional genomics to improve understanding and knowledge of the dynamics within biological systems. The time-resolved data of metabolomics are expected to contain underlying dynamic profiles that are smooth. However, estimating these underlying smooth dynamic phenomena from such data is complicated due to the high complexity of the data and the limited number of techniques that can deal with this type of data. Traditional multivariate data analytical techniques, such as principal component analysis, ignore the underlying dynamics in the data and give solutions that tend towards explaining variance rather than explaining dynamics and understanding biology.

In Chapter 5 Weighted Smooth Principal Component Analysis (WSPCA) is presented, a method that incorporates smoothness into the scores of PCA by using a roughness penalty. WSPCA can be used for data with consecutive samples that are linked by time (time-resolved) or position (spatially resolved) and that are expected to be smooth. By means of a synthetic data set it is shown that applying this restriction leads to a better estimation of the dynamic phenomena underlying the data. For determination of the model meta-parameters (the number of components and the smoothness parameter) a leave-elements-out cross-validation procedure is presented that, for the synthetic data set, is capable of estimating the underlying noiseless data. The WSPCA method and leave-elements-out cross-validation are then applied to a real-life metabolomics data set from an E. coli batch fermentation that has been sampled over time, to estimate the missing elements in the data set.
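The effect of a roughness penalty on PCA scores can be sketched with a strongly simplified, unweighted, one-component variant. The sketch minimizes ||X - t pᵀ||² + λ||D t||² with ||p|| = 1, where D takes second differences of the scores t, by alternating between the two closed-form updates. The function name `smooth_rank_one`, the value of `lam` and the simulated data are assumptions for illustration; the weighting and the cross-validation described in the thesis are not included.

```python
import numpy as np

def smooth_rank_one(X, lam, n_iter=100):
    """One smooth component: min ||X - t p'||^2 + lam * ||D t||^2 with ||p|| = 1."""
    n = X.shape[0]
    D = np.diff(np.eye(n), n=2, axis=0)               # second-difference operator
    A = np.eye(n) + lam * D.T @ D
    p = np.linalg.svd(X, full_matrices=False)[2][0]   # start from the PCA loading
    for _ in range(n_iter):
        t = np.linalg.solve(A, X @ p)                 # penalized score update
        p = X.T @ t
        p /= np.linalg.norm(p)                        # unit-norm loading update
    t = np.linalg.solve(A, X @ p)
    return t, p

def roughness(t):
    """Sum of squared second differences of a score vector."""
    return np.sum(np.diff(t, n=2) ** 2)

rng = np.random.default_rng(4)
time = np.linspace(0.0, 1.0, 40)
true_scores = np.sin(2 * np.pi * time)                # smooth underlying profile
true_loading = rng.normal(size=15)
true_loading /= np.linalg.norm(true_loading)
X = np.outer(true_scores, true_loading) + rng.normal(0.0, 0.2, (40, 15))

t_pca = X @ np.linalg.svd(X, full_matrices=False)[2][0]   # ordinary PCA scores
t_smooth, p_smooth = smooth_rank_one(X, lam=10.0)
print(roughness(t_pca), roughness(t_smooth))
```

With λ = 0 the iteration reduces to the power method for the first principal component; larger values trade explained variance for smoother scores.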

Chapter 6, the final chapter of this thesis, contains a conclusion about the work described and an outlook for future research opportunities with respect to the use of biological knowledge in a microbial metabolomics study.


Samenvatting (Dutch summary, translated)

This thesis, entitled “Fusing prior knowledge with microbial metabolomics”, deals with incorporating prior knowledge into microbial metabolomics. Chapter 1 contains a general introduction that places the thesis in a framework. It describes the background of metabolomics and discusses the various aspects of a (microbial) metabolomics study: 1) the origin of samples, 2) measurement of samples and processing of raw data, and finally 3) data analysis and the challenges generated by the data and the methods used. In the section on data analysis, incorporating prior knowledge about the study is introduced as a solution for dealing with the challenges that the data and methods bring about. Incorporating prior knowledge is, however, not limited to data analysis alone, but can also be applied in other aspects of a microbial metabolomics study.

In a microbial metabolomics study, the measurement data for studying metabolism are obtained by sampling microbial fermentations and subsequently analyzing the metabolome of these samples. Data for studying the metabolism of micro-organisms can, however, also be obtained in silico from computer simulations. The genome sequence of many organisms and information about gene-protein-reaction associations for these organisms can be obtained from databases, textbooks and other scientific publications. This information can subsequently be used as prior knowledge when constructing genome-scale metabolic networks. In cellular systems biology these networks are used to model and study metabolism in the context of cell growth in terms of fluxes (reaction rates) through reactions in the network. Because the flux through each reaction can generally vary within a certain range, many flux distributions over the whole network belong to the possible solutions. However, since reactions are connected by common metabolites, reactions that are functionally coherent are expected to correlate strongly in terms of their flux value over different flux distributions.

The genome-scale network of a lactic acid bacterium, Lactococcus lactis MG1363, is used in Chapter 2 to generate flux distributions for several in silico environmental conditions that mimic laboratory growth conditions. The flux distributions per condition are used to calculate a correlation matrix for each condition. The correlations between the reactions are then analyzed in a multivariate approach across the in silico environmental conditions to identify correlations that are invariant (i.e. independent of the environment) and correlations that vary between conditions (i.e. dependent on the environment). The multivariate methods applied are parallel factor analysis (PARAFAC) and principal component analysis (PCA). The discussion of the results of both methods leads to the question of whether latent variable models are actually suitable for analyzing this type of data.

One of the aims of metabolomics is to quantify the complete metabolome (all metabolites present in a sample). Common data analytical methods for analyzing metabolomics data work optimally when each measured variable follows the same probability distribution as all the others and all variables are independent of each other, i.e. when each variable is independent and identically distributed (i.i.d.). Prior to data analysis the data are often first preprocessed by means of scaling. Metabolomics data are, however, rarely i.i.d., and scaling them can easily result in amplification of the noise present and lead to an increase in heteroscedasticity. In many cases prior knowledge is available about the noise characteristics of the data, from experience with the measurement platform or from estimation of the measurement errors by means of repeated measurements or the presence of biological replicates.

In Chapter 3 a filtering procedure for multivariate data is introduced that makes use of the noise characteristics of the data and does not suffer from noise amplification by scaling. A maximum likelihood principal component analysis (MLPCA) step is applied as a filter that partly removes the noise. This filtering can be used prior to any subsequent scaling and multivariate analysis of the data and is particularly useful for data with a moderate or low signal-to-noise ratio, such as metabolomics data, but it is also applicable to proteomics and transcriptomics data with this property.

In metabolomics research a large number of metabolites are measured that reflect the cellular state under the experimental conditions studied. Many experiments are performed according to a particular experimental design to make sure that sufficient variation is induced in the metabolite concentrations. However, since metabolomics is a holistic approach, large numbers of metabolites are also measured in which no variation is induced by the experimental design used. The presence of such non-induced metabolites hampers traditional data analysis methods such as PCA in estimating the true underlying model of the induced variation. The greediness of PCA leads to a clear overfit of the metabolomics data and can lead to a poor selection of important metabolites.

Chapter 4 investigates how, why and how strongly PCA overfits data with an underlying experimental design. Recently, new data analysis methods have been introduced that can make use of prior knowledge about the system to reduce the overfit. This chapter shows that incorporating prior knowledge about the system under study leads to a better estimate of the true underlying structure and to less overfit. The information about the experimental design is used together with ANOVA-simultaneous component analysis (ASCA) to improve the analysis of metabolomics data. To demonstrate the improved model estimation property of ASCA, a thorough simulation study is used and the results are extended to a microbial metabolomics batch fermentation study. The ASCA model is much less affected by the non-induced variation and measurement errors than PCA, which leads to a much better model of the induced variation.

Longitudinal data play an important role in the various domains of functional genomics in improving the understanding and knowledge of the dynamics within biological systems. The time-resolved measurement data of metabolomics are expected to contain underlying dynamic profiles that are smooth. However, estimating these underlying smooth dynamic phenomena from this kind of data is complicated by the high complexity of the data and the limited number of methods that can deal with this type of data. Traditional multivariate data analysis techniques, such as principal component analysis, ignore the underlying dynamics in the data and give solutions that tend towards explaining variance rather than explaining the dynamics and contributing to the understanding of the underlying biology.

In Chapter 5 Weighted Smooth Principal Component Analysis (WSPCA) is presented, a method that builds smoothness into the scores of PCA by using a penalty on the roughness of a metabolite profile. WSPCA can be used for data from consecutive samples that are related to each other by time (time-resolved) or location (spatially resolved) and that are in addition expected to be smooth over time or location. By means of a synthetic data set it is shown that applying this restriction leads to a better estimate of the underlying dynamic phenomena in the data. For determining the model meta-parameters (the number of components and the smoothness parameter) a cross-validation procedure based on leaving elements out of the data matrix is proposed that, for the synthetic data set, is able to estimate the underlying noiseless data. The WSPCA method and the cross-validation based on leaving out elements are then applied to a real-life metabolomics data set from an E. coli batch fermentation that was sampled over time, in order to estimate the missing elements in the data set.

The final chapter of this thesis, Chapter 6, consists of a conclusion regarding the work described in this academic thesis and an outlook on opportunities for further research regarding the use of biological knowledge in a microbial metabolomics study.


Acknowledgements

My thesis is finished! The comic strip below (Figure 1) illustrates the sometimes frustrating process of doing PhD research. Fortunately, you never do a PhD entirely on your own. I would therefore like to thank all my colleagues, friends and, not least, my family for their trust and support over the past years. All of you have contributed in one way or another to the completion of this thesis, and I am very grateful to all of you for that. Here in my acknowledgements I would like to mention a number of them in particular.

First of all, I would like to thank my promotor Age Smilde for the trust he placed in me at the end of my Master's by asking me to join his Biosystems Data Analysis (BDA) group as a PhD student. Your never-failing enthusiasm for multivariate data analysis is very contagious and, after every meeting, provides fresh motivation to put my shoulder to the wheel. Next, I thank my co-promotor Johan Westerhuis for his daily supervision and patience, particularly with the writing of publications; it will never become my hobby. Over the years I have in any case learned a lot from your vision and your approach to getting many a job done. The other staff members of the group, Huub, Gooitzen and Antoine, have also made working at BDA a very pleasant experience. Huub, thank you for thinking along where needed and for the collaboration that resulted in Chapter 3 of this thesis. Your way of communicating and “management by walking around” remain legendary to me. Perhaps one day we will have a cup of “dropsoep” (liquorice soup) together! Gooitzen, I always enjoyed being able to drop by to ask something or to have a sometimes much-needed relaxing chat about our shared interests (computers/Linux/LaTeX 2ε). You did not approve of my LVT (“leuk voor thuis”, fun for at home) editions, but strangely enough people did occasionally turn up at my desk asking for an LVT edition on your advice.

Figure 1: “Fading” from Piled Higher and Deeper by Jorge Cham

My PhD project was funded by the Netherlands Bioinformatics Centre, and I would therefore like to thank all members of the project cluster spX (later sp3.7.1) for the pleasant collaboration. I have always appreciated your feedback during our half-yearly meetings and taken it to heart. I thank Bas and Richard for the particularly pleasant collaboration that resulted in a fine publication (Chapter 2). Mariët, you deserve a special mention. Without your input in terms of data, ideas, suggestions and your microbiological knowledge I would have been quite lost, for which my sincere thanks.

I thank Sara* and Lisa† from the bottom of my heart; without you, many a chapter in this thesis would not have been possible.

I thank all colleagues of the BDA group and the former Process Analysis & Chemometrics (PAC) group for the pleasant collaboration and the good atmosphere in the group. In particular I thank Jeroen, without whose offer of housing I would most probably have had to live on the street, and Suzanne: your tip about living in Zuidoost turned out to be the golden one, and I greatly appreciated your help (and Arjen's). I thank my office mates: Daniel (with you I shared an office the longest), Jeroen, Olja, Susana, Tunahan (thanks for the unforgettable visit to Istanbul), Marcel, Ewa and Chengjian (not at University, but thanks for being my roommate at home). Those who are not mentioned by name I have certainly not forgotten, but there are too many to list. I also thank the colleagues of Analytical Chemistry (HIMS) and all colleagues of SILS for the good company during coffee breaks and drinks. The ladies of the secretariat (Ans, Bondien and later also Maartje), you are too often left unmentioned but fulfil a very important role during PhD research. Bondien, I thank you in particular, because I greatly appreciated our chats. Especially grumbling and complaining to each other every now and then was very entertaining and made everything a bit more bearable.

Now a few words about my paranymphs. Jaap Hans, our history goes back to the beginning of our studies in Nijmegen, and I am happy and proud that we have already shared so much with each other. As I always say, we do not see each other often, but when we see or speak to each other I always have the feeling that we saw or spoke to each other only yesterday. Jan Paul, we have known each other since my internship at Tauw. I find it very special that we have stayed in touch all these years. Especially our concert evenings of the past (and hopefully also of the future) are etched in my memory. Fortunately I can always count on both of you, and I am therefore very honoured that you are willing to act as paranymphs during my PhD defence.

* Stichting Academisch Rekencentrum Amsterdam
† Linux Supercomputer Almere


I thank my former fellow students from the HLO for our annual weekend away (Survivall). After 15 years it is still special every year and a highlight; hopefully we will keep this up for a long time. I hope you can be there, even though Sur5all 2012 is planned for the weekend after my PhD defence.

Just as in the list of authors of a publication, the most important places in acknowledgements are the first and the last. That is why my loved ones come last. I still find it sad that my mother is no longer with us and that she will not witness my PhD defence. I still miss you every day! My father is also very important to me. Especially the remark “Then just quit and go find a job” has often spurred me on in the past, probably without him realizing it, to want to finish after all. Anneke, I am very happy that you and my father get along so well. It is wonderful to be able to witness such a piece of happiness. Helga, my sweet and above all caring sister. I think it is great how we have always treated each other. Sometimes I tease you a little, especially when you call again to ask whether I have turned off the gas, but we both know it comes from a good heart. Giuseppe, thank you for your humour and relaxed attitude towards everything, and stay yourself. You make my sister happy (at least most of the time, haha). You are at the root of everything; thank you for your love, faith and trust in me. The very last person I want to thank in particular, and therefore also the most important, is Ewa. Since the day you came to work in Amsterdam and we, a few months later, began a relationship, I count myself lucky every day to be with you. This thesis is dedicated to all who have shaped me into the person I am today. You, however, manage to extend this, because thanks to you I also try to be a better person in every respect. That does not always succeed and remains a “work in progress”, but I am doing, and will keep doing, my very best. There is so much I would like to say to you, but most of it I prefer to say in private. The most important thing, however, in your own language: “Dziękuję za całą twoją miłość i wsparcie, kocham cię bardzo!”.

Amsterdam, 2012

Maikel P.H. Verouden
