
Future Generation Computer Systems 23 (2007) 55–60, www.elsevier.com/locate/fgcs, doi:10.1016/j.future.2006.04.011

Using ontologies for preprocessing and mining spectra data on the Grid

M. Cannataro∗, P.H. Guzzi, T. Mazza, G. Tradigo, P. Veltri

Department of Experimental Medicine and Clinic, Bioinformatics Laboratory, University “Magna Græcia” of Catanzaro, Viale Europa, 88100 Catanzaro, Italy

∗ Corresponding author. Tel.: +39 0961 3694100; fax: +39 0961 3694073. E-mail addresses: [email protected] (M. Cannataro), [email protected] (P.H. Guzzi), [email protected] (T. Mazza), [email protected] (G. Tradigo), [email protected] (P. Veltri).

Available online 27 June 2006

Abstract

The analysis of mass spectrometry proteomics data requires the composition of different software tools devoted to the loading, management, preprocessing, mining, and visualization of spectra data. This paper proposes the use of ontologies to guide the composition of preprocessing and data mining tools and describes the approach through MS-Analyzer, a software tool for the integrated management, preprocessing and mining of spectra data on the Grid.
© 2006 Elsevier B.V. All rights reserved.

1. Introduction

Mass Spectrometry (MS) proteomics is a powerful technique for identifying different molecular targets in different pathological conditions [1]. Data produced by MS, the spectra, may be represented as a very large set of measures (intensity, m/z), representing the abundance (intensity) of biomolecules having certain mass-to-charge ratio (m/z) values. Such data can be used for various analyses, such as biomarker discovery (i.e. finding a list of peaks characterizing a disease), peptide/protein identification, and sample classification. Since spectra have a high dimensionality and are often affected by errors and noise, preprocessing techniques are required before the application of any data analysis, and especially before Data Mining (DM).

The increasing use of MS in clinical studies leads to the collection of spectra data from large sample populations, e.g. to monitor the progression of a disease. Moreover, the comparative study of a disease may require the analysis of spectra produced in different laboratories, so it is possible to envision that, in a few years, biomedical researchers will need to collect and analyse more and more data coming from remote proteomics laboratories. Grid technology can be useful for providing the efficient storage space needed to maintain large spectra datasets on-line, the broadband infrastructure needed to collect proteomics data coming from remote laboratories in a secure and efficient way, and the computational power needed by the preprocessing and data mining algorithms.

On the other hand, the building of such a Virtual Proteomics Laboratory involves different technological platforms related to the various steps of a proteomics experiment, such as sample treatment, MS techniques, spectra processing, data mining analysis, and results visualization. Choosing the right tools requires multidisciplinary knowledge spanning MS specialists, biologists, and computer scientists; thus modelling the semantics of processes, tools, and data sources is a key issue for simplifying the design of applications.

This paper proposes the combined use of workflow techniques and domain ontologies as a semantic guide for experiment formulation, tool exploitation, and application design, i.e. services composition. The proposed WekaOntology and ProtOntology ontologies describe concepts and relationships of data mining, bioinformatics software tools, and data sources. The use of such ontologies to compose applications, and the ability to combine preprocessing and data mining tools in different ways, are described through a case study developed with MS-Analyzer.

MS-Analyzer is a Grid-based Problem Solving Environment [5] that uses a Service Oriented Architecture, i.e. it offers spectra analysis services built as a composition of specialized spectra management and data mining services. Functions are thus implemented as Web/Grid Services, i.e. independent, self-describing programs that interact using the following standards: SOAP (Simple Object Access Protocol) for application invocation, WSDL (Web Services Description Language) for service description, and UDDI (Universal Description, Discovery and Integration) for service publishing and discovery. Using such an approach, MS-Analyzer can offer its spectra analysis services to many users, who can submit their analysis tasks concurrently.
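To make the service-oriented setting concrete, the following is a minimal sketch of how a single preprocessing step could be exposed as a self-describing Web Service using JAX-WS annotations. The service name, method signature, and normalization policy are illustrative assumptions, not MS-Analyzer's actual interface.

```java
import javax.jws.WebMethod;
import javax.jws.WebService;

// Illustrative sketch only: a spectra preprocessing step exposed as a
// self-describing Web Service. The WSDL contract published for discovery
// (e.g. via UDDI) can be generated from these annotations. The class and
// method names are hypothetical, not MS-Analyzer's actual interface.
@WebService(name = "NormalizationService")
public class NormalizationService {

    // Rescales intensities by the total ion current so that spectra from
    // different samples become comparable (see Section 3).
    @WebMethod
    public double[] normalize(double[] intensities) {
        double total = 0.0;
        for (double v : intensities) {
            total += v;
        }
        double[] normalized = new double[intensities.length];
        for (int i = 0; i < intensities.length; i++) {
            normalized[i] = (total > 0.0) ? intensities[i] / total : 0.0;
        }
        return normalized;
    }
}
```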

2. Related work

In the last few years, many systems dealing with ontologies and workflows, as well as spectra management, have been developed but, to the best of our knowledge, none of these systems implements a complete knowledge discovery process for the analysis of mass spectra and biological information extraction. Systems like SpecAlign [15], MSAnalyzer (http://sashimi.sourceforge.net/; please note the name similarity with MS-Analyzer, the system proposed here), and those developed in [7], are all specialized in the preprocessing, visualization, and analysis of mass spectra, but do not support the data mining of spectra and workflow composition, nor do they include domain ontologies. LabBase [6] is a LIMS (Laboratory Information Management System) that is useful for managing laboratory experiments and related data, but is inadequate for supporting sophisticated analyses.

The genomics Research Network Architecture (gRNA) [8] provides an environment that focuses on data management and integration and supports genomic, gene-expression, and molecular structure analysis. The Pegasys [12] bioinformatics system includes many tools for pair-wise and multiple sequence alignment, as in gRNA, but its main focus is workflow composition. Neither system supports mass spectra data mining, nor the various knowledge discovery steps needed for mining other data types.

myGrid [13] is a powerful toolkit for building bioinformatics workflows that offers a large set of bioinformatics tools wrapped as Web Services and uses ontologies to guide the construction of workflows, but does not provide any spectra management. General purpose workflow editors [16], such as [2], Kepler, Pegasus, Triana, and Taverna, are suitable for supporting the composition of knowledge discovery workflows, but all lack spectra management functions. Finally, distributed data mining environments such as Discovery Net (www.discovery-on-the.net) or the Knowledge Grid [4], though supporting the entire knowledge discovery process, do not offer efficient spectra management services.

3. Mass spectrometry data analysis

Mass spectrometry [1] is a technique used to identify macromolecules, such as proteins or peptides, in a compound. The mass spectrometer separates gas-phase ions according to their m/z (mass-to-charge ratio) values. Commonly used MS techniques are SELDI (Surface Enhanced Laser Desorption/Ionization) and MALDI (Matrix-Assisted Laser Desorption/Ionization) TOF (Time Of Flight), and Liquid Chromatography tandem mass spectrometry (LC-MS/MS).

Although each instrument usually produces data in a proprietary format, MS output can easily be represented as a (large) sequence of value pairs: intensity, which depends on the quantity of the detected biomolecules, and mass-to-charge ratio (m/z), which depends on the molecular mass of the detected biomolecules. There are two main efforts toward the standardization of MS data formats: mzXML, developed at the Seattle Proteome Center of the Institute for Systems Biology [11], and, more recently, mzData, defined by the Proteomics Standard Initiative of the Human Proteome Organization [10].
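As an illustration of this representation, a spectrum reduces to an ordered sequence of (m/z, intensity) pairs. The minimal sketch below shows one possible in-memory form; the class and field names are ours and are not taken from the mzXML or mzData schemas.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the representation described above: a spectrum as a
// (large) ordered sequence of (m/z, intensity) pairs. Names are
// illustrative and not taken from the mzXML or mzData schemas.
public class Spectrum {

    public static final class Peak {
        public final double mz;        // mass-to-charge ratio
        public final double intensity; // abundance of the detected ions

        public Peak(double mz, double intensity) {
            this.mz = mz;
            this.intensity = intensity;
        }
    }

    private final List<Peak> peaks = new ArrayList<Peak>();

    public void add(double mz, double intensity) {
        peaks.add(new Peak(mz, intensity));
    }

    public List<Peak> getPeaks() {
        return peaks;
    }
}
```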

File dimensions span a broad range: from a few kilobytes per spectrum to a few gigabytes. This variability depends mainly on two parameters: the type of spectrometer and the bin dimension (that is, the total number of measurements). In particular, LC-MS/MS is able to carry out a very deep analysis and requires a lot of storage space. As an example, in an LC-MS/MS experiment performed in our proteomics laboratory, the spectrometer executed 4000 scans, acquiring approximately 2000 mass spectra and another 2000 spectra of selected and fragmented peptides per biological sample: the size of a single uncompressed LC-MS/MS sample is about 200 MB, while a complete dataset of 20 samples is about 4 GB.

Examples of spectra analysis are: supervised classification (e.g. of healthy and diseased patients); unsupervised clustering (e.g. finding patients at different stages of a disease, or with different causes/evolutions of the disease); and peptide/protein identification in a spectrum. To perform such analyses, data needs to be preprocessed in order to reduce noise, reduce the amount of data, and make spectra comparable. The main preprocessing tasks [3] are: baseline subtraction, which flattens the base profile of a spectrum, and smoothing, which reduces the noise level, both used for noise reduction; normalization of intensities, used to enable the comparison of different samples; binning, which performs data dimensionality reduction by grouping measured data into bins while preserving relevant information; peak extraction, which separates real peaks (e.g. corresponding to peptides) from peaks representing noise; and peak alignment, which finds a common set of peak locations (i.e. m/z values) in a set of spectra, where the same peak (e.g. the same peptide) may have different m/z values across samples.
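Binning is the easiest of these tasks to sketch. The fragment below is an assumption-laden illustration, not the algorithm of any specific tool: it groups measurements into fixed-width m/z bins and accumulates the intensities falling in each bin.

```java
// Illustrative fixed-width binning: measurements are grouped into m/z
// bins of equal width and the intensities in each bin are accumulated,
// reducing dimensionality while preserving the overall profile. Real
// preprocessing tools may offer other binning policies.
public final class Binning {

    public static double[] bin(double[] mz, double[] intensity,
                               double minMz, double maxMz, int numBins) {
        double[] bins = new double[numBins];
        double width = (maxMz - minMz) / numBins;
        for (int i = 0; i < mz.length; i++) {
            if (mz[i] < minMz || mz[i] >= maxMz) {
                continue; // measurement outside the binned m/z range
            }
            int b = (int) ((mz[i] - minMz) / width);
            bins[b] += intensity[i];
        }
        return bins;
    }
}
```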

4. Ontologies and mass spectrometry analysis

Designing a data mining application for the analysis of MS data involves different experts and requires the simultaneous use of different kinds of knowledge, among them: (i) basic concepts of MS, related to the content and the format of spectra generated by different MS techniques; (ii) concepts of data management, related to the efficient organization, retrieval, and preprocessing of spectra; (iii) concepts of knowledge discovery, related to the different available data mining algorithms, to the generated knowledge models, and to the data reorganization requirements of data mining algorithms; and (iv) biomedical knowledge, which directly guides first the biological analysis (e.g. the choice of sample preparation and MS techniques) and then the knowledge discovery task.

Ontologies are a well-established tool for modelling the knowledge of different domains, thus they can play an important role in modelling the different steps of a data mining application and in supporting application design. An ontology consists of: (i) a set of concepts; (ii) a set of relations describing a concept hierarchy or taxonomy; (iii) a set of relations linking concepts non-taxonomically; and (iv) a set of axioms, usually expressed in a formal language (see [9] for a formal definition of ontology).

WekaOntology. WekaOntology is an ontology of the data mining domain that is used to describe the tools of the Weka suite [14]; it has been enriched with the description of relevant MS datasets and preprocessing algorithms, so that it can model data mining applications on MS data. The categorization of the data mining software tools is based on the following classification parameters: (i) the kind of data sources and output on which the software works; (ii) the type of methodologies used by the data mining algorithms; (iii) the data mining tasks; and (iv) the knowledge models inferred by data mining tools. The basic concepts are: (i) Spectrum, the basic datum; (ii) Task, the goal of a data mining process; (iii) Method, the applied methodology; (iv) Algorithm, the specification of a method; (v) Software, an implementation of an algorithm; and (vi) Knowledge Model, the model inferred by the analysis. Several small taxonomies derived from the specialization of these basic classes are described in the following.

Modelling Spectra. A main concept of the ontology is the Spectrum. A spectra dataset is obtained by merging multiple spectra of an experiment. Spectra are classified by considering the type of spectrometer, thus the Spectrum class has the following subclasses: (i) SELDI Spectra; (ii) MALDI Spectra; and (iii) LC Spectra. The super-class Data models the fact that each spectrum can be input or output data for preprocessing software, and has two specializations: Data Source and Data Output. Finally, Data can be: (i) raw; (ii) preprocessed, i.e. treated with one or more preprocessing methods; (iii) prepared, i.e. reorganized in the format required by data mining software (e.g. the Weka Attribute-Relation File Format (ARFF) [14]); or (iv) compressed, i.e. using mzData. Each preprocessing service is connected with two non-taxonomical relations to a corresponding subclass of Data: receives in input and produces as output. For instance, binning software receives as input a raw or preprocessed spectra dataset and produces preprocessed spectra. A fragment of the Spectrum taxonomy is depicted in Fig. 1(a).

Modelling Tasks. The Task class represents the knowledge discovery goal of a data mining computation. How a Task is performed is modelled by the Method class. An Algorithm is a way to perform a Task, and Software is the implementation of an Algorithm. Sub-classes of Task are Classification, Clustering, Attribute Selection, Preprocessing, and Visualization. Fig. 1(b) shows the taxonomy of the spectra Preprocessing Software.

Modelling Knowledge Models. The Knowledge Model class currently comprises classification and clustering models.

Fig. 1. Taxonomy of (a) Spectrum and (b) Preprocessing Software in WekaOntology.

PMML (Predictive Model Markup Language) is a language used to describe knowledge models, while CRISP-DM (CRoss Industry Standard Process for Data Mining) describes the overall knowledge discovery process.
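The role of the receives in input / produces as output relations can be illustrated with a small sketch. The types below are a deliberately simplified, hypothetical rendering of a few WekaOntology concepts in Java; the real ontology is far richer and is not expressed in code.

```java
import java.util.EnumSet;

// Simplified, hypothetical rendering of a few WekaOntology concepts:
// the Data statuses and the non-taxonomical relations that connect each
// preprocessing service to the Data subclasses it consumes and produces.
public final class WekaOntologySketch {

    enum DataStatus { RAW, PREPROCESSED, PREPARED, COMPRESSED }

    static final class PreprocessingService {
        final String name;
        final EnumSet<DataStatus> receivesInInput;
        final DataStatus producesAsOutput;

        PreprocessingService(String name, EnumSet<DataStatus> in,
                             DataStatus out) {
            this.name = name;
            this.receivesInInput = in;
            this.producesAsOutput = out;
        }
    }

    public static void main(String[] args) {
        // As stated above: binning accepts raw or preprocessed spectra
        // datasets and produces preprocessed spectra.
        PreprocessingService binning = new PreprocessingService(
                "Binning",
                EnumSet.of(DataStatus.RAW, DataStatus.PREPROCESSED),
                DataStatus.PREPROCESSED);
        System.out.println(binning.name + " produces "
                + binning.producesAsOutput);
    }
}
```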

ProtOntology. ProtOntology models concepts, methods, algorithms, tools and databases relevant to the proteomics domain, and provides a biological background to the data mining analysis. It comprises taxonomies for biological concepts (e.g. proteins) and non-biological concepts (e.g. software). Biological concepts comprise: (i) Aminoacid, (ii) Protein, and (iii) Structure. Non-biological concepts comprise: (i) Analysis (e.g. in silico experiments or in vitro experiments); (ii) Task (e.g. the identification of proteins expressed in a sample); and (iii) Activity, which links concepts of ProtOntology and WekaOntology (e.g. a user can retrieve Activities whose software tools are described in WekaOntology).

5. Ontology-based application design using MS-Analyzer

MS-Analyzer is a software platform for bioinformatics applications that allows the integrated preprocessing, management and data mining analysis of MS proteomics data. MS-Analyzer provides various services that implement spectra management and preprocessing, as well as data mining and visualization functions; the latter have been obtained by wrapping Weka tools [14]. In particular, management, preprocessing and preparation services implement, respectively, format conversion and efficient spectra storage, preprocessing algorithms, and the spectra reorganization required by data mining tools. The composition and execution of such services is carried out through an ontology-based workflow editor and scheduler that uses WekaOntology and ProtOntology. A spectra database is in charge of enhancing spectra management. Fig. 2 shows the graphical user interface of MS-Analyzer, which offers the following tools.

The Dataset Manager (see left-hand side of Fig. 2) manages the experiment data modelled through the dataset concept, i.e. a set of spectra in a raw, preprocessed or prepared status. Different source spectra are supported, such as those provided by the main spectrometer providers, as well as the portable mzData spectra format. Spectra can be stored either on the local file system or in the MS-Analyzer spectra database.


Fig. 2. MS-Analyzer Graphical User Interface.

Table 1
Execution times and spectra sizes vs preprocessing strategy (times in ms; sizes in KB)

Execution times    No prep   Normalization   Binning   Binning + Alignment
Loading             21 831          21 831    21 831                21 831
Preprocessing            0             201       120                 1 456
Preparation          5 432           5 432       599                   607
Mining                 720             722       120                   122
Total time          27 983          28 184    22 670                24 106

Spectra sizes      No prep   Normalization   Binning   Binning + Alignment
Dataset sizes       48 505          48 505       524                   998
ARFF sizes          24 505          24 505       255                   492


The Ontology-based Workflow Editor comprises an Ontology Manager and a Graphical Editor (see central and right-hand side of Fig. 2). The former, after loading the available ontologies, allows the browsing, searching and selection of bioinformatics tools. The latter allows us to design bioinformatics workflows by using a notation based on the Unified Modeling Language (UML), providing basic control blocks such as start/end, fork/join, etc. The system provides ontology-based service discovery based on the integration of ontology data and metadata stored in the Globus Metacomputing Directory Service (MDS). For instance, for the J48 classification service, the following attribute-value pairs are stored in the MDS record: Task = classification, Method = decision tree, Receives in input = ARFF, Produces as output = decision tree. In this way, it is possible to obtain metadata about a service (e.g. its WSDL file) by querying the MDS with such semantic information. This scheme can also be applied to metadata stored in the UDDI registry for Web Services. Constraints expressed by the ontology are enforced at composition time. The produced abstract workflow schema is translated into a schedulable concrete workflow using the Business Process Execution Language for Web Services (BPEL4WS), a language for orchestrating Web Services. BPEL4WS workflows are in turn scheduled by a Grid workflow scheduler. Orchestration refers to an executable business process that can interact with Web Services and describes a flow from the perspective of, and under the control of, a single party (e.g. a BPEL4WS workflow).
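To illustrate, the matching step might behave like the sketch below, where an in-memory list of attribute-value records stands in for the MDS/UDDI registries (which this code does not actually contact); the WSDL URL is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hedged sketch of ontology-based service discovery: services are
// published with attribute-value metadata (as in the J48 record above)
// and found by matching on those attributes. An in-memory list stands
// in for MDS/UDDI, which this code does not actually query.
public final class DiscoverySketch {

    static Map<String, String> j48Record() {
        Map<String, String> r = new HashMap<String, String>();
        r.put("Task", "classification");
        r.put("Method", "decision tree");
        r.put("Receives in input", "ARFF");
        r.put("Produces as output", "decision tree");
        r.put("WSDL", "http://example.org/services/J48?wsdl"); // hypothetical
        return r;
    }

    // Returns the records containing every attribute-value pair of the
    // query, e.g. {Task=classification, Receives in input=ARFF}.
    static List<Map<String, String>> find(List<Map<String, String>> registry,
                                          Map<String, String> query) {
        List<Map<String, String>> hits =
                new ArrayList<Map<String, String>>();
        for (Map<String, String> record : registry) {
            if (record.entrySet().containsAll(query.entrySet())) {
                hits.add(record);
            }
        }
        return hits;
    }
}
```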


5.1. A case study of classification of mass spectrometry data

Using MS-Analyzer, it is possible to execute a complete knowledge discovery process in a simple way. Ontologies support cooperation between the biological and bioinformatics groups and provide step-by-step support to the users. The steps necessary to design and execute an application are explained through the following case study.

Dataset selection. First of all, a user needs to load the spectra dataset to be analysed. In this case study, the Ovarian Cancer dataset made available at Stanford University has been considered (www-stat.stanford.edu/~tibs/PPC).

Ontology-based workflow design. By browsing the Activities from the Tasks taxonomy in ProtOntology, the user finds suitable preprocessing techniques. Details about services are found both in WekaOntology and through the ontology-based service discovery system; they are: (i) a reference to the WSDL file; (ii) a short description of the algorithm; (iii) a pre-condition (receives in input); and (iv) a post-condition (produces as output). Fig. 2 shows a fragment of the case study workflow, where the Ovarian Cancer dataset is analysed in parallel by using different combinations of preprocessing techniques (no preprocessing, normalization, binning, binning plus alignment) to evaluate both the quality of data mining and the execution performance. For each pipeline, the J48 classification algorithm of Weka is applied, as sketched below [14]. The knowledge models produced are decision trees.
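Since the data mining services are obtained by wrapping Weka tools, the invocation inside each pipeline ultimately reduces to a call into the public Weka API, roughly as follows. This is a sketch of standard Weka usage with a placeholder file name, not MS-Analyzer's actual wrapper code.

```java
import java.io.BufferedReader;
import java.io.FileReader;

import weka.classifiers.trees.J48;
import weka.core.Instances;

// Sketch of the wrapped Weka invocation behind a J48 pipeline step:
// load a prepared ARFF dataset, build the decision tree, and emit its
// textual form (the knowledge model shown by MS-Analyzer). The file
// name is a placeholder.
public final class J48Step {

    public static void main(String[] args) throws Exception {
        Instances data = new Instances(
                new BufferedReader(new FileReader("ovarian-prepared.arff")));
        data.setClassIndex(data.numAttributes() - 1); // class = last attribute

        J48 tree = new J48();
        tree.buildClassifier(data);

        System.out.println(tree); // textual decision tree model
    }
}
```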

Soundness check of the workflow graph. Using the pre- and post-conditions modelled by the ontologies, MS-Analyzer automatically checks the soundness of the workflow, avoiding feeding wrong data to the algorithms; i.e. it checks that: (i) there is a data source; (ii) there are no missing arcs; and (iii) each pre/post-condition is adhered to.
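A minimal sketch of such a check, assuming a simplified workflow representation (nodes with declared input and output types, arcs between them), could look as follows; the real editor works on the ontology itself rather than on plain strings.

```java
import java.util.List;
import java.util.Set;

// Illustrative soundness check in the spirit of the rules above: along
// every arc, the output type of the producer must satisfy the input
// pre-condition of the consumer. The string-typed representation is a
// simplifying assumption.
public final class SoundnessSketch {

    static final class Node {
        final String name;
        final Set<String> receivesInInput; // accepted input types
        final String producesAsOutput;     // produced output type

        Node(String name, Set<String> in, String out) {
            this.name = name;
            this.receivesInInput = in;
            this.producesAsOutput = out;
        }
    }

    static final class Arc {
        final Node from;
        final Node to;

        Arc(Node from, Node to) {
            this.from = from;
            this.to = to;
        }
    }

    // Returns false on the first arc whose pre-condition is violated.
    static boolean isSound(List<Arc> arcs) {
        for (Arc a : arcs) {
            if (!a.to.receivesInInput.contains(a.from.producesAsOutput)) {
                System.err.println("Pre-condition violated on arc "
                        + a.from.name + " -> " + a.to.name);
                return false;
            }
        }
        return true;
    }
}
```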

Workflow translation and enactment. The graphical workflow is translated into BPEL4WS and stored in a repository for further use. Then the workflow is scheduled, taking care of events related to service invocation, execution, and termination. The scheduling is performed by the Karajan scheduler provided by the CoG Toolkit, available from Globus (www.globus.org).

Visualization of knowledge models. Currently, MS-Analyzer captures the textual output of the wrapped Weka tools and shows it as the resulting knowledge model.

To evaluate how preprocessing affects the execution times, memory occupancy and, most importantly, the quality of classification, a set of experiments with different combinations of preprocessing techniques (including no preprocessing) was applied to the Ovarian Cancer dataset. Table 1 shows the execution times and spectra sizes vs preprocessing strategy. Applying normalization alone increased the overall experiment time, whereas binning and binning plus alignment considerably reduced the size of the data and thus decreased the preparation and classification times and the overall experiment time, even though extra preprocessing time was spent. Moreover, the application of such preprocessing techniques also enhanced the true positive rate, precision, and recall of the classification process.

In conclusion, MS-Analyzer simplifies application building and allows us to produce, in little time, different workflows of the same application by considering different combinations of preprocessing and data mining techniques. The performance of the scheduled workflows can be visualized easily, allowing us to compare the effects of different preprocessing and data mining techniques and to evaluate the best strategies for analysing mass spectra data.

References

[1] R. Aebersold, M. Mann, Mass spectrometry-based proteomics, Nature 422 (2003) 198–207.

[2] M. Bubak, T. Gubala, M. Kapalka, M. Malawski, K. Rycerz, Workflow composer and service registry for grid applications, Future Gener. Comput. Syst. 21 (1) (2005) 79–86.

[3] M. Cannataro, P. Guzzi, T. Mazza, G. Tradigo, P. Veltri, Preprocessing of mass spectrometry proteomics data on the grid, in: CBMS'05, Dublin, Ireland, 2005.

[4] M. Cannataro, D. Talia, P. Trunfio, Distributed data mining on the grid, Future Gener. Comput. Syst. 18 (8) (2002) 1101–1112.

[5] J. Cunha, O. Rana, P. Medeiros, Future trends in distributed applications and problem-solving environments, Future Gener. Comput. Syst. 21 (6) (2005) 843–855.

[6] N. Goodman, S. Rozen, L.D. Stein, A. Smith, The LabBase system for data management in large scale biology research laboratories, Bioinformatics 14 (7) (1998) 562–574.

[7] N. Jeffries, Algorithms for alignment of mass spectrometry proteomic data, Bioinformatics 21 (14) (2005) 3066–3073.

[8] A. Laud, S. Bhowmick, P. Cruz, D. Singh, G. Rajesh, The gRNA: A highly programmable infrastructure for prototyping, developing and deploying genomics-centric applications, in: VLDB 2002, Hong Kong, China, 2002.

[9] A. Maedche, Ontology Learning for the Semantic Web, Kluwer Academic, 2002.

[10] S. Orchard, H. Hermjakob, P.A. Binz, C. Hoogland, C.F. Taylor, W. Zhu, R.K. Julian Jr., R. Apweiler, Further steps towards data standardisation: The proteomic standards initiative, Proteomics 5 (2) (2005) 337–339.

[11] P.G. Pedrioli, A common open representation of mass spectrometry data and its application to proteomics research, Nat. Biotechnol. 22 (11) (2004) 1459–1466.

[12] S.P. Shah, D.Y.M. He, J.N. Sawkins, J.C. Druce, G. Quon, D. Lett, G.X.Y. Zheng, T. Xu, B.F.F. Ouellette, Pegasys: Software for executing and integrating analyses of biological sequences, BMC Bioinformatics 5 (40) (2004).

[13] R.D. Stevens, A.J. Robinson, C.A. Goble, myGrid: Personalised bioinformatics on the information grid, Bioinformatics 19 (1) (2004) 302–302.

[14] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann, San Francisco, 2005.

[15] J.W.H. Wong, G. Cagney, H.M. Cartwright, SpecAlign - processing and alignment of mass spectra datasets, Bioinformatics 21 (9) (2005) 2088–2090.

[16] J. Yu, R. Buyya, A taxonomy of scientific workflow systems for grid computing, SIGMOD Rec. 34 (3) (2005) 44–49.

Mario Cannataro has been an Associate Professor of Computer Science at the University Magna Græcia of Catanzaro, Italy, since November 2002. He received his Laurea Degree cum Laude in Computer Engineering from the University of Calabria, Rende, Italy, in 1993. He has worked on parallel computing, massively parallel architectures, parallel implementation of logic programs, and cellular automata. His current research interests are in grid computing, grid-based problem-solving environments for bioinformatics, proteomics, ontologies, and adaptive hypermedia systems. He has authored a book on the parallel implementation of logic programs and over 100 scientific papers published in international journals and conference proceedings. He is a member of ACM, ACM SIGMOD, and the IEEE Computer Society, and he serves as a program committee member for several international conferences. He is a co-founder and a member of Exeura (www.exeura.com). Mario Cannataro can be reached at [email protected].

Pietro Hiram Guzzi received his Laurea degree in Computer Engineering in 2004 from the University of Calabria, Rende, Italy. He has been pursuing his Ph.D. in Informatics and Biomedical Engineering at the University Magna Græcia of Catanzaro, Italy, since November 2004. His research interests comprise bioinformatics, the analysis of proteomics data, and the distributed management of ontologies. Pietro H. Guzzi can be reached at [email protected].

Tommaso Mazza received his Laurea degree in Computer Engineering in 2004 from the University of Calabria, Rende, Italy. He has been pursuing his Ph.D. in Informatics and Biomedical Engineering at the University Magna Græcia of Catanzaro, Italy, since November 2004. His research interests comprise web/grid services, workflow technology, and statistical and data mining algorithms for the analysis of mass spectra and biomedical data. Tommaso Mazza can be reached at [email protected].

Giuseppe Tradigo received his Laurea degree in Computer Engineering from the University of Calabria, Rende, Italy, in 2003. He is currently working with the bioinformatics group of the University “Magna Græcia” of Catanzaro. His research interests are in geographical databases and GIS, grid computing, software engineering, protein structure prediction, and grid environments for bioinformatics applications. Giuseppe Tradigo can be reached at [email protected].

Pierangelo Veltri is an Assistant Professor of Computer Science at the University Magna Græcia of Catanzaro, Italy. He received his Ph.D. in Computer Science in 2002 at INRIA Rocquencourt, France, and obtained his Laurea degree in Computer Engineering in 1998 at the University of Calabria, Italy. From 1998 to 2000 he was a member of the Verso group and participated in the Chorochronos project (1998–1999) and the Xyleme project (1999–2001). Since 2002 he has participated in several projects whose main topics are computer science (such as grid computing, geographical and temporal databases, and XML views) and bioinformatics (spectra data analysis for cancer detection, proteomics, and prediction of protein structures). His main research interests are: XML and semi-structured databases, data modelling, spatial and multidimensional databases, protein structure matching, protein structure prediction, and data management on grid computing. Pierangelo Veltri can be reached at [email protected].