15
Conceptual Modeling for Genomics: Building an Integrated Repository of Open Data Anna Bernasconi (B ) , Stefano Ceri, Alessandro Campi, and Marco Masseroli Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy {anna.bernasconi,stefano.ceri,alessandro.campi,marco.masseroli}@polimi.it Abstract. Many repositories of open data for genomics, collected by world-wide consortia, are important enablers of biological research; more- over, all experimental datasets leading to publications in genomics must be deposited to public repositories and made available to the research community. These datasets are typically used by biologists for validating or enriching their experiments; their content is documented by metadata. However, emphasis on data sharing is not matched by accuracy in data documentation; metadata are not standardized across the sources and often unstructured and incomplete. In this paper, we propose a conceptual model of genomic metadata, whose purpose is to query the underlying data sources for locating rel- evant experimental datasets. First, we analyze the most typical meta- data attributes of genomic sources and define their semantic proper- ties. Then, we use a top-down method for building a global-as-view inte- grated schema, by abstracting the most important conceptual properties of genomic sources. Finally, we describe the validation of the concep- tual model by mapping it to three well-known data sources: TCGA, ENCODE, and Gene Expression Omnibus. Keywords: Conceptual model · Data integration · Genomics · Next Generation Sequencing · Open data 1 Introduction Thanks to Next Generation Sequencing, a recent technological revolution for reading the DNA, a huge number of genomic datasets have become avail- able. Sequencing machines perform the primary data analysis and produce raw datasets (a single human genome requires about 200 GB). Computationally expensive pipelines, collectively regarded as secondary data analysis [30], are then applied to raw data for extracting signals from the genome (such as: muta- tions, expression levels, peaks of binding enrichment, chromatin states, etc.), thereby producing processed genomic data, which are much smaller in size. Processed datasets are collected by worldwide consortia, such as TCGA (The Cancer Genome Atlas) [36], ENCODE (the Encyclopedia of DNA Elements) [28], c Springer International Publishing AG 2017 H.C. Mayr et al. (Eds.): ER 2017, LNCS 10650, pp. 325–339, 2017. https://doi.org/10.1007/978-3-319-69904-2_26

Conceptual Modeling for Genomics: Building an Integrated ... · feature vector, e.g. Type [RSM] denotes Type as an attribute which is mandatory, restricted and single-valued, while

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Conceptual Modeling for Genomics: Building an Integrated ... · feature vector, e.g. Type [RSM] denotes Type as an attribute which is mandatory, restricted and single-valued, while

Conceptual Modeling for Genomics: Buildingan Integrated Repository of Open Data

Anna Bernasconi(B), Stefano Ceri, Alessandro Campi, and Marco Masseroli

Dipartimento di Elettronica, Informazione e Bioingegneria,Politecnico di Milano, Milan, Italy

{anna.bernasconi,stefano.ceri,alessandro.campi,marco.masseroli}@polimi.it

Abstract. Many repositories of open data for genomics, collected byworld-wide consortia, are important enablers of biological research; more-over, all experimental datasets leading to publications in genomics mustbe deposited to public repositories and made available to the researchcommunity. These datasets are typically used by biologists for validatingor enriching their experiments; their content is documented by metadata.However, emphasis on data sharing is not matched by accuracy in datadocumentation; metadata are not standardized across the sources andoften unstructured and incomplete.

In this paper, we propose a conceptual model of genomic metadata,whose purpose is to query the underlying data sources for locating rel-evant experimental datasets. First, we analyze the most typical meta-data attributes of genomic sources and define their semantic proper-ties. Then, we use a top-down method for building a global-as-view inte-grated schema, by abstracting the most important conceptual propertiesof genomic sources. Finally, we describe the validation of the concep-tual model by mapping it to three well-known data sources: TCGA,ENCODE, and Gene Expression Omnibus.

Keywords: Conceptual model · Data integration · Genomics · NextGeneration Sequencing · Open data

1 Introduction

Thanks to Next Generation Sequencing, a recent technological revolution forreading the DNA, a huge number of genomic datasets have become avail-able. Sequencing machines perform the primary data analysis and produce rawdatasets (a single human genome requires about 200 GB). Computationallyexpensive pipelines, collectively regarded as secondary data analysis [30], arethen applied to raw data for extracting signals from the genome (such as: muta-tions, expression levels, peaks of binding enrichment, chromatin states, etc.),thereby producing processed genomic data, which are much smaller in size.

Processed datasets are collected by worldwide consortia, such as TCGA (TheCancer Genome Atlas) [36], ENCODE (the Encyclopedia of DNA Elements) [28],c© Springer International Publishing AG 2017H.C. Mayr et al. (Eds.): ER 2017, LNCS 10650, pp. 325–339, 2017.https://doi.org/10.1007/978-3-319-69904-2_26

Page 2: Conceptual Modeling for Genomics: Building an Integrated ... · feature vector, e.g. Type [RSM] denotes Type as an attribute which is mandatory, restricted and single-valued, while

326 A. Bernasconi et al.

Roadmap Epigenomics [19], and 1000 Genomes [27]; moreover, it is customaryfor authors of biological articles to publish their processed datasets on reposito-ries such as GEO (the Gene Expression Omnibus) [4]. These datasets constitutea wealth of information, as they are open and can be used for secondary research.Processed datasets are used in tertiary data analysis for giving a global sense to het-erogeneous genomic and epigenomic signals, thereby answering complex biologicalqueries. Several systems are dedicated to tertiary data analysis, including Fire-Cloud1, SciDB-Paradigm4 [26], and BLUEPRINT [1]. In the context of the GeCoProject2, we developed GMQL [17,23], a high-level query language for genomics;we also proposed GDM [24], a unifying model for processed data formats.

While a lot of efforts are made for the production of genomic datasets, muchless emphasis is given to the structured description of their content. Such descrip-tions, collectively regarded as metadata, are fundamental for understanding howeach biological sample was processed, to which biological or clinical condition itis associated, which technological process has been used for its production, andso on. There is no standard for metadata, thus each source/consortium enforcessome rules autonomously; a conceptual design for metadata is either missing or,when present, overly complex and useless3. In summary, in spite of a growinginterest on tertiary data analysis and of the availability of many valuable datasources, genomic metadata are lacking a conceptual model for understandingwhich sources and datasets are most suitable for answering a genomic question.

One of the far-reaching goals of the GeCo project is the development of anintegrated repository of open processed data, supporting both structured andsearch queries; the GMQL prototype4 already integrates data from three reposi-tories (TCGA, ENCODE, and Roadmap Epigenomics) and structured methodsfor periodically loading and keeping updated their contents. To overcome thelack of standards, metadata are stored in GMQL as generic attribute-value pairs;with such format, metadata are used for the initial selection of relevant datasets.However, we are aware of the fact that attribute-value pairs are just providinga viable solution, but do not carry enough semantics.

In this paper, we present the Genomic Conceptual Model (GCM), aconceptual model for describing metadata of genomic data sources. GCM iscentered on the notion of the experiment item, typically a file containing genomicregions and their properties, which is analyzed from three points of view:

– The technology used in the experiment, including information about itemcontainers and their formats.

– The biological process observed in the experiment, in particular the samplebeing sequenced (derived from a tissue or a cell culture) and its preparation,including its donor.

1 https://software.broadinstitute.org/firecloud/.2 Data-Driven Genomic Computing, http://www.bioinformatics.deib.polimi.it/geco/,ERC Advanced Grant, 2016–2021.

3 At https://www.encodeproject.org/profiles/graph.svg see the conceptual model ofENCODE, an ER schema with tens of entities and hundreds of relationships, whichis neither readable nor supported by metadata for most concepts.

4 http://www.bioinformatics.deib.polimi.it/GMQL/interfaces/.

Page 3: Conceptual Modeling for Genomics: Building an Integrated ... · feature vector, e.g. Type [RSM] denotes Type as an attribute which is mandatory, restricted and single-valued, while

Conceptual Modeling for Genomics 327

– The management of the experiment, describing the organizations/projectswhich are behind the production of each experiment.

The conceptual schema is constructed top-down, based on a systematic analysisof metadata attributes and of their properties in many genomic sources, and thenverified bottom-up, on TCGA, ENCODE, and GEO; we show that ER schemasdescribing these sources can be constructed as subsets of GCM. Arbitrary querieson GCM can be propagated to sources, using the global-as-view approach [20].We also show that GCM provides the skeleton to a simple query interface, similarto the one provided by DeepBlue [2]. Driven by GCM, we will add many moredata sources to our integrated repository of open data for genomics.

2 Design of GCM

2.1 Analysis of Metadata Attributes

Most data sources provide interfaces for metadata extraction; these are basedon simple query templates or application programming interfaces (APIs), andenable the selection of experimental data. Some sources also provide tabulardescriptions of the metadata that can be more systematically queried, or enablethe extraction of matching metadata in semistructured format (XML or JSONfiles).

Taxonomy of Metadata Attributes. As a first step in developing GCM,we defined a taxonomy of the main properties of metadata attributes; we thensystematically applied the taxonomy to each considered source, so as to bettercharacterize its content. According to our taxonomy, attributes are:

– contextual (C) when they are present (or absent) only within specific con-texts, typically because another attribute takes a specific value. In such cases,there is an existence dependency between the two attributes.

– dependent (D) when the domain of their possible values is restricted, typ-ically because another attribute takes a specific value. In such cases, there isa value dependency between the two attributes.

– restricted (R) when their value must be chosen from a controlled vocabu-lary.

– single-valued (S) when they assume at most one value for each specificexperiment.

– mandatory (M) when they must have a value, either for all experiments orwithin a specific context.

The resulting taxonomy is shown in Table 1; it includes orthogonal features,and we targeted both completeness and minimality. By default (and in mostcases), attributes do not have any of the above properties. Very few attributesare mandatory and unfortunately sources do not always agree on them; in manycases they are named and typed somehow differently.

Page 4: Conceptual Modeling for Genomics: Building an Integrated ... · feature vector, e.g. Type [RSM] denotes Type as an attribute which is mandatory, restricted and single-valued, while

328 A. Bernasconi et al.

Table 1. Taxonomy of features for metadata attributes

Level Symbol Feature Default

Source C Contextual Non-contextual

D Dependent Independent

R Restricted Free

S Single-valued Multi-valued

M Mandatory Optional

Integrated Repository H Human Curated Extracted

O Ontological Ordinary

We use these five categories to describe the attributes that are included in theconceptual model, as explained in the next section; we label the attributes with afeature vector, e.g. Type[RSM ] denotes Type as an attribute which is mandatory,restricted and single-valued, while Pipeline[D(Technique)S] denotes Pipeline as asingle-valued attribute with a value dependency from the attribute Technique.

Source Analysis. We examined several sources; among them TCGA andENCODE provide the most comprehensive collection of metadata attributes.

– TCGA reports many experiment pipeline-specific metadata attributes; outof them we selected 22 attributes, common to all pipelines, which are themost interesting from a biological point of view (Table 2).

– ENCODE includes both a succinct and an expanded list of metadataattributes; while the expanded list has over 2000 attributes, the succinctlist has 49 attributes for experiments, 44 attributes for biosamples, and 28attributes for file descriptions.

Other Properties. We next define properties that we could not observe inthe sources, but will be used for characterizing the metadata attributes of ourintegrated repository (they are also included in Table 1). Accordingly, attributesare:

– human curated (H) when their value is provided by a curator of the repos-itory (and not extracted from the underlying data source).

– ontological (O) when an interface supports similarity-based matches basedupon semantic properties, e.g. through the connection to external ontologies.

Rules. Rules may be used for expressing existence and value dependencies.

– The existence dependency Technique = “Chip-seq” → M(Target) indicatesthat Target is a mandatory attribute if Technique takes the value “Chip-seq”,while Technique �= “Chip-seq” → NULL(Target) indicates that Technique isnot specified otherwise.

Page 5: Conceptual Modeling for Genomics: Building an Integrated ... · feature vector, e.g. Type [RSM] denotes Type as an attribute which is mandatory, restricted and single-valued, while

Conceptual Modeling for Genomics 329

Table 2. TCGA metadata attributes analysis

C D R S M Dependency Attribute

× × clinical.demographic.id

× clinical.demographic.year of birth

× × × clinical.demographic.gender

× × × clinical.demographic.ethnicity

× × × clinical.demographic.race

× × biospecimen.sample.id

× × × biospecimen.sample.sample type

× × × sample type biospecimen.sample.tissue type

× × generated data files.data file.〈type〉.id× × × generated data files.data file.〈type〉.data type

× × × × data type generated data files.data file.〈type〉.data format

× × generated data files.data file.〈type〉.file size

× × × × data type generated data files.data file.〈type〉.experimental strategy

× × × data type generated data files.data file.〈type〉.platform× × × × data type analysis.〈workflow〉.workflow type

× × analysis.〈workflow〉.workflow link

× × case.case.id

× case.case.primary site

× × primary site case.case.disease type

× × administrative.program.name

× × administrative.project.name

× administrative.tissue source site.name

– The following value dependency connects the DataType and Formatattributes: DataType = “raw data” → Format = “fastq”.

In the next section we show examples of both existence and value dependencies,that complement the conceptual model specification; when the dependenciesare specified for attributes belonging to different entities, they hold for all theinstance pairs connected with an arbitrary join path connecting the two entities(this is not ambiguous because the conceptual model is acyclic).

2.2 Genomic Conceptual Model

We next designed the Genomic Conceptual Model top-down, inclusive of themost relevant metadata attributes as scouted from the various sources, build-ing the entity-relationship schema represented in Fig. 1. The schema includesthe principal concepts; other source-specific concepts can be made available in

Page 6: Conceptual Modeling for Genomics: Building an Integrated ... · feature vector, e.g. Type [RSM] denotes Type as an attribute which is mandatory, restricted and single-valued, while

330 A. Bernasconi et al.

Fig. 1. Genomic conceptual model

semi-structured form aside from this schema (e.g. all clinical diagnosis condi-tions available for the donor in TCGA). The model is centered on the Itementity, which represents an elementary experimental unit. Three sub-schemata(or views) depart from the central entity, recalling a classic star-schema organi-zation that is typical of data warehouses; they respectively describe biological,technological, and management aspects.

Central Entity. We next describe the attributes of the Item entity andassociate each of them with their feature vector. The SourceId[SM ] andDataType[RSM ] respectively denote the item identifier within the source andthe item’s data type, and must always be included; DataType denotes the spe-cific content of the Item, e.g. “peak”. Format[D(DataType)RSM ] denotes the Itemdata file format (e.g. [“fastq”, “bam”, “wiggle”, “bed”, “tsv”, “vcf”, “maf”,“xml”]) and depends on DataType (e.g. “bed” format is compatible with “peak”and not compatible with “read”). Other attributes are: Size[SM ], SourceUrl[M ],LocalUri[C(Format)SM ], and Pipeline[D(Technique)S].

The use of the last three attributes requires some discussion. Recall that weintend to build an integrated repository that contains only processed data, whilein many cases the sources include also the raw data. In our metadata repositorywe include items relative to both raw and processed data with a reference tothe related file in the original source within the SourceUrl attribute, that canbe multi-valued in case the same data file is derived from different sources. Inaddition, items relative to processed files also exhibit an attribute LocalUri (seerule 1 in Listing 1, at the end of this section) indicating their physical location inour data repository. Pipeline is a descriptor of the specific parameters adopted inthe pipeline used for producing the processed data. The descriptor is interpretedin the general context of the Technique used for producing several items of the

Page 7: Conceptual Modeling for Genomics: Building an Integrated ... · feature vector, e.g. Type [RSM] denotes Type as an attribute which is mandatory, restricted and single-valued, while

Conceptual Modeling for Genomics 331

same type and format; hence, the feature vector notation for Pipeline. Providingparameters and references to the raw data is relevant in the case of processeddata, as sometimes biologists resort to original raw data for reprocessing; how-ever, in the data sources such attributes may be missing or hidden within textualattributes.

Biological View. This view consists of a chain of entities: Item-Replicate-BioSample-Donor describing the biological process leading to the productionof the Item. All relationships are many-to-one, hence an Item is associated witha given Replicate, each associated with a given BioSample, each associatedwith a Donor.

Donor represents the individual of a specific organism from which the bio-logical material is derived. It has attributes SourceId[S] (donor identifier relativeto a source) and Species[RSM ]; Age[S], Gender[RS], and Ethnicity[RS] are otheroptional attributes of interest.

BioSample describes the material sample taken from a biological entity andused for the experiment. Its SourceId[M ] is an identifier of the bio-sample within asource, mandatory but also multi-valued (when the same sample is linked to dif-ferent sources). Type[RSM ] is restricted to the values [“cell line”,“tissue”]. Basedon the value of this attribute, either Tissue[CSMO] or CellLine[CSMO] becomesmandatory, but not both of them; this dependency is expressed by rules 2 and 3.IsHealthy[RS] is Boolean and Disease[C(IsHealthy)D(Tissue)O] contextually dependson IsHeathy, as expressed by rule 4, and can be multi-valued; moreover, its val-ues depend on Tissue because given diseases can only be related to given tissues.We marked Tissue, CellLine and Disease as ontological5, as we intend to extendthe values of these attributes with their synonyms and generalizations/special-izations, so as to ease their search; for example, the Tissue “blood vessel” willmatch the terms “vessel”, “arteries” and “veins”. Preliminary work for giving anextended ontological interpretation to ENCODE metadata is reported in [11].

Replicate is used when multiple material samples are generated from thesame BioSample, giving rise to items that are replica for the same experiment.This entity is relevant in some epigenomic data sources (such as ENCODE), thatdifferentiate between technical and biological replication; such distinction is notpresent in most of the other sources.

Technology View. This view consists of a chain of entities: Item-Container-ExperimentType describing the used technology leading to the production ofthe Item. Through this chain, an Item is associated by means of (1:N) relation-ships to a given Container of a given ExperimentType.

5 We will use the BRENDA Tissue and Enzyme Source Ontology [32] for tissues,the Cell Line Ontology [31] for cell lines, and the Human Disease Ontology [33] forhuman diseases.

Page 8: Conceptual Modeling for Genomics: Building an Integrated ... · feature vector, e.g. Type [RSM] denotes Type as an attribute which is mandatory, restricted and single-valued, while

332 A. Bernasconi et al.

Container is used to describe common properties of homogeneous items- sharing the same data structure and produced by the same experiment type.Its attributes include Name[SM ] and Assembly[C(DataType)D(Species)RSM ]; Assem-bly is only present for items of particular types (see rule 5) and is restrictedto a smaller vocabulary according to the Species (e.g., see rules 16 and 17).The Boolean attribute IsAnn[RSM ] is used for distinguishing experimental itemsfrom known annotations (i.e., regarding known genomic regions): when true,Annotation[C(IsAnn)RSM ] exists (see rules 6 and 7); annotations have a restrictedvocabulary, including: [“Gene”, “Exon”, “TSS”, “Promoter”, “Enhancer”, “Cpg-Island”].

ExperimentType refers to the specific methods used for producing eachitem. It includes the mandatory attribute Technique[RSM ] (e.g., [“Chip-seq”,“Dnase-seq”, “RRBS”, . . . ]). Feature[D(Technique)RSMH] is a mandatory manu-ally curated attribute that we add to denote the specific feature described bythe experiment (e.g., “Copy Number Variation”, “Histone Modification”, “Tran-scription Factor”). The value of Platform[C(DataType)RSM ] illustrates the NGSplatform used for sequencing and depends on the DataType of the item (see rule8). When the Technique is “Chip-seq”, the two attributes Target[C(Technique)RSM ]

and Antibody[C(Technique)D(Target)RSM ] are present (see rules 9–12). The Targetvalue is usually aligned to the vocabulary of UniProtKB6. The Antibody valuedepends on the Target since it is specific against that antigen.

Management View. This view consists of a chain of entities: Item-Case-Project describing the organizational process for the production of each itemand the way in which items are grouped together to form a case.

Case represents a set of items that are gathered together, because theyparticipate to a same research objective.

Project represents the project or program, occurred at a given institution(e.g., individual laboratory or consortium) that was responsible of the productionof the item. ProjectName[S] and ProgramName[S] may be present, but none ofthem is mandatory.

Dependencies. Rules 1–12 of Listing 1 exhaustively describe the existencedependencies of the global schema. Rules 13–17 show some examples of valuedependencies. Note that an attribute can be contextual but not mandatory (suchas Disease, rule 4), contextual and mandatory (such as Target, rules 9 and 10),and also mandatory but not contextual (such as Technique). Note also that,when an attribute is marked as mandatory and the related information is miss-ing from the source, then either human curation or rule-based management areneeded.

6 http://www.uniprot.org/uniprot/.

Page 9: Conceptual Modeling for Genomics: Building an Integrated ... · feature vector, e.g. Type [RSM] denotes Type as an attribute which is mandatory, restricted and single-valued, while

Conceptual Modeling for Genomics 333

1Item.Format=“bed” → M(Item.LocalUri)2BioSample.Type=“tissue” → M(BioSample.Tissue)3BioSample.Type=“cell line” → M(BioSample.CellLine)4BioSample.IsHealthy → NULL(BioSample.Disease)5Item.DataType in [“aligned read”,“peak”,“signal”] → M(Container.Assembly)6Container.IsAnn → M(Container.Annotation)7NOT(Container.IsAnn) → NULL(Container.Annotation)8Item.DataType=“raw data” → M(ExperimentType.Platform)9ExperimentType.Technique=“Chip-seq”→ M(ExperimentType.Target)10ExperimentType.Technique �=“Chip-seq” → NULL(ExperimentType.Target)11ExperimentType.Technique=“Chip-seq”→ M(ExperimentType.Antibody)12ExperimentType.Technique �=“Chip-seq” → NULL(ExperimentType.Antibody)

————————————————————————————————————————–13Item.DataType=“raw data”→ Item.Format=“fastq”14BioSample.Tissue=“liver”→ BioSample.Disease ∈ [“viral hepatitis”,“liver lymphoma”,. . . ]15BioSample.Tissue=“liver” → BioSample.Disease �∈ [“acute leukemia”,“pilorus cancer”,. . . ]16Donor.Species=“Homo sapiens”→ Container.Assembly ∈ [“GRCh38”, “hg19”, “hs37d5”]17Donor.Species=“Mus musculus”→ Container.Assembly ∈ [“mm9”, “mm10”, “GRCm38”]

Listing 1. Examples of existence and value dependencies

2.3 Source-Specific Views of GCM

We verify that the global-as-view approach really captures the three data sourcesconsidered, by showing them as views of GCM in Fig. 2; we use the followingnotation:

– We place the attributes of each source in the same position as in GCM, butwe use for them the name that we found in the documentation of each source;missing attributes correspond to white circles.

– We cluster the conceptual entities corresponding to a single concept in theoriginal source by encircling them within grey shapes. The entity names cor-responding to the original source are reported with a bold bigger font on theclustered shape (e.g. Series in GEO) or directly on the new entity (e.g., Casein TCGA) when this corresponds to the name given in our GCM.

– We indicate specific relationship cardinalities where GCM differs from thesource, using a bold font (e.g., see (1,1) from Item to Case in ENCODE).

– We enclose fixed human curated values in inverted commas and use the func-tions notation tr, comb, and curated to describe a transformation of a sourcefield, a combination of multiple source fields, and curated fields, respectively.

Note that the Gene Expression Omnibus (GEO) source is at the same timea very rich public repository of genomic data (as most research publicationsinclude links to experimental data uploaded to GEO), but is also a very poorsource of metadata, which are not well structured and often lack information;hence our mapping effort is harder and less precise for GEO than for the moreorganized TCGA and ENCODE sources7. The mapping to GEO captures as wellthe mapping to Roadmap Epigenomics, another relevant source of public data.

7 Textual analysis to extract semantic information from the GEO repository isreported in [12]; we plan to reuse their library.

Page 10: Conceptual Modeling for Genomics: Building an Integrated ... · feature vector, e.g. Type [RSM] denotes Type as an attribute which is mandatory, restricted and single-valued, while

334 A. Bernasconi et al.

Fig. 2. Source-specific views of GCM for TCGA, ENCODE, and GEO

2.4 User-Friendly Interface

An important side effect of providing a global and integrated view of data sourcesis the ability to build user-friendly query interfaces for selecting items from mul-tiple data sources. We show a mock-up of an interface that supports conjunctive

Fig. 3. Retrieval interface mock-up

Page 11: Conceptual Modeling for Genomics: Building an Integrated ... · feature vector, e.g. Type [RSM] denotes Type as an attribute which is mandatory, restricted and single-valued, while

Conceptual Modeling for Genomics 335

queries over our entities, very similar to the user interface currently providedby DeepBlue [2] (Fig. 3); attributes are rendered by pop-up lists and values arethen entered by users, with autocomplete support.

3 Building the Integrated Repository

In this section, we describe high level rules for loading the content of the inte-grated repository from the original data sources, with a global-as-view approach.These transformations drive our approach.

3.1 Available Repositories at the Sources

Most genomic repositories offer Web interfaces for accessing their metadata. Inaddition, some of them offer Web APIs for querying the metadata, used foraccessing storage structures for metadata (typically relational tables). Table 3describes the schemas of the tables available at TCGA8, ENCODE, and GEO9.TCGA and ENCODE tables result from the translation of a hierarchical jsonformat representation (the only one provided by the sources) into a relationalrepresentation that has required several normalization steps and simplificationsfor illustration purposes. GEO tables result from a selection of a small subset ofattributes used for mapping GEO to GCM.

Table 3. Relational schema of TCGA(T), ENCODE(E), and GEO(G) repositories

T.Case(id,project id,disease type,primary site,tissue source site)

T.Project(id,name,program)

T.ClinicalDemographic(id,case id,year of birth,gender,ethnicity,race)

T.BiospecimenSample(id,case id,sample type,tissue type)

T.BiospecimenReadGroup(id,sample id,platform id)

T.DataFile(id,readgroup id,workflow id,data type,data format,size,experimental strategy)

T.AnalysisWorkflow(id,workflow type,workflow link)

E.Donor(id,organism scientific name,age,sex,ethnicity)

E.Biosample(id,donor id,biosample type,biosample term name,health status)

E.Replicate(id,biosample id,experiment id,biological replicate num,technical replicate num)

E.Experiment(id,assembly,assay term name,target,antibody,lab,award project,platform)

E.File(id,experiment id,output type,file type,file size,pipeline title,pipeline accession)

G.File(id,gsm id,file type,size,download)

G.Gse(id,organization name,contact,bioproject,experiment type)

G.Gsm(id,gse id,organism,age,gender,source name,disease state,genome built,instrument model)

8 The metadata is provided in the NCI Genomic Data Commons portal, https://docs.gdc.cancer.gov/Data Dictionary/viewer/.

9 GEO information can be retrieved through the R package GEOmetadb [37].

Page 12: Conceptual Modeling for Genomics: Building an Integrated ... · feature vector, e.g. Type [RSM] denotes Type as an attribute which is mandatory, restricted and single-valued, while

336 A. Bernasconi et al.

3.2 Mapping Rules

Mapping rules are used to describe how data are loaded from the sources into theintegrated repository; for illustration purposes, in Table 4 we provide some of themappings, related to the Donor, BioSample, Case, and ExperimentTypeentities. Each mapping rule is a logic formula with variables in its left end side(LHS) which are computed from the variables in its right end side (RHS). Theorder of the LHS variables is the same reported in our global schema in Fig. 1 andthe order of the RHS variables is the same reported in Table 3 for each source. Asan example, the entity ExperimentType of the global schema is filled with datafrom ENCODE’s entity Experiment, together with data from TCGA’s Biospec-imenReadGroup and DataFile (joined on the readgroup id attribute), and datafrom GEO’s Gse and Gsm (joined on the gse id attribute).

Table 4. Examples of mapping rules for building the integrated repository from thesources

Donor(SID,SP,AGE,G,E) ⊇ E.Donor(SID,SP,AGE,G,E)

Donor(SID,“Homo S.”,tr(Y ),G,comb(ET,R)) ⊇ T.ClinicalDemographic(SID, ,Y,G,EY,R)

Donor( ,SP,AGE,G, ) ⊇ G.Gsm( , ,SP,AGE,G, , , , )

BioSample(SID,T,tr(BT),tr(BT),tr(HS),tr(HS)) ⊇ E.Biosample(SID, ,T,BT,HS)

BioSample(SID,“tissue”,PS, ,tr(ST,TT),DIS) ⊇ T.BiospecimenSample(SID,CID,ST,TT),

T.Case(CID, ,DIS,PS, )

BioSample(SID,tr(SN ),tr(SN ),tr(SN ),tr(D),tr(D)) ⊇ G.Gsm(SID, , , , ,SN,D, , )

Case(SID,SS) ⊇ E.Experiment(SID, , , , , ,SS, )

Case(SID,SS) ⊇ T.Case(SID, , , ,SS)

Case(SID,tr(O,C )) ⊇ G.Gse(SID,O,C, , , )

ExperimentType(TE,comb(TE,T),P,T,A) ⊇ E.Experiment( , ,TE,T,A, , ,P)

ExperimentType(TE,tr(TE),P, , ) ⊇ T.BiospecimenReadGroup(RGID, ,P),

T.DataFile( ,RGID, , , , ,TE)

ExperimentType(tr(TE),tr(TE),tr(P), , ) ⊇ G.Gse(EID, , , ,TE),

G.Gsm( ,EID, , , , , , ,P)

As we already discussed in Sect. 2.3, the values of some of the attributes areacquired exactly as they are in the original source, others need the application ofsimple manually provided functions for textual transformation (denoted as tr),others are computed as textual combination of multiple source fields (denoted ascomb, and finally others need manual curation (values are enclosed in invertedcommas). As an example, Donor.Ethnicity corresponds to a combination ofthe attributes race and ethnicity of the ClinicalDemographic table, taken fromTCGA source. Tissue and CellLine attributes of the BioSample are both pro-duced by biosample term name of ENCODE which uses this attribute for bothof them - the content of this attribute depends on the value of biosample type(either “cell line” or “tissue”). Relevant integration efforts are addressed towardsdefining a shared set of homogenized values for each attribute. The values of theglobal attributes, given to the LHS variables, are to be intended as alreadyhomogenized to the reference ontologies (as indicated in Sect. 2.2), or to the

Page 13: Conceptual Modeling for Genomics: Building an Integrated ... · feature vector, e.g. Type [RSM] denotes Type as an attribute which is mandatory, restricted and single-valued, while

Conceptual Modeling for Genomics 337

chosen finite restricted dictionaries. Notice that all the mappings preliminarilyperform a value homogenization step, implicit in the integration process.

4 Related Works

A long stream of research tackled the problem of providing integrated accessto multiple, heterogeneous sources. A survey of very preliminary works is [14].Buneman et al. [6] described the problem of querying and transforming scientificdata residing in structured files of different formats. Along that work, BioK-leisli [8] and K2 [9] describe early systems supporting queries across multiplesources. BioKleisli was a federated database offering an object-oriented model;its main limitation was the lack of a global schema, imposing users to knowthe structure of underlying sources. To improve this aspect, K2 included GUS(Genomics Unified Schema), an extensive relational database schema support-ing a wide range of functional genomics data types. The BioProject [3] databasewas recently established to facilitate the organization and classification of projectmetadata submitted to NCBI, EBI and DDBJ databases.

A common approach in integrated data management is data warehousing,consisting of a-priori integration and reconciliation of data extracted from mul-tiple sources, such as in EnsMart/BioMart [13,34]. Along this direction, [22]describes a warehouse for integrating genomic and proteomic information usinggeneralization hierarchies and a modular, multilevel global schema to overcomedifferences among data sources. ER modeling (and UML class diagrams) wereused in [5]; models describe protein structures and genomic sequences, withrather complex concepts aiming at completely representing the underlying biol-ogy. [35] is a biomedical data warehouse supporting a data model (called BioStar)capturing the semantics of biomedical data and providing some extensibility tocope with the evolution of biological research methodologies.

Many other works [10,15,16,18,21,25,29] present conceptual models forexplaining biological entities and their interactions in terms of conceptual datastructures. With our approach, similar to DeepBlue [2], we instead use concep-tual modeling for driving the continuous process of metadata integration and foroffering high-level query interfaces on metadata for locating relevant datasets,under the assumption that users will then manage these datasets for solvingbiological or clinical questions. Similarly to DeepBlue, we hide the data sourcedifferences so as to provide easy-to-use interfaces, but differently from them wedisclose the semantic properties of the underlying sources and the metadata inte-gration process; moreover, we cover a broader spectrum of sources and providea richer set of concepts, including the management view.

5 Conclusions

The interest on an integrated repository for genomics stems from the hugeamount of resources that are becoming available. In this paper we provide GCM,a genomic conceptual model capable of capturing the metadata of heterogeneous

Page 14: Conceptual Modeling for Genomics: Building an Integrated ... · feature vector, e.g. Type [RSM] denotes Type as an attribute which is mandatory, restricted and single-valued, while

338 A. Bernasconi et al.

sources with a global-as-view approach. The model is supported by a method forconceptually designing global metadata through source attribute analysis and isvalidated by using three data sources: TCGA, ENCODE, and GEO.

Our GMQL system already provides access to datasets from TCGA,ENCODE, and Roadmap Epigenomics, that were identified as the most rele-vant in the course of collaborative projects with many biologists; we alreadydeveloped some tools for automatically importing such datasets and for convert-ing them to an integrated format, e.g., TCGA2BED [7]. Thanks to GCM, wecan also provide a coherent semantics to the metadata of integrated sources;throughout the GeCo project we plan to add more sources, according to needsof biologists, and to continuously integrate their metadata within GCM.

Acknowledgement. This research is funded by the ERC Advanced Grant projectGeCo (Data-Driven Genomic Computing), 2016–2021.

References

1. Adams, D., et al.: BLUEPRINT to decode the epigenetic signature written inblood. Nat. Biotechnol. 30(3), 224–226 (2012)

2. Albrecht, F., et al.: DeepBlue epigenomic data server: programmatic data retrievaland analysis of epigenome. Nucleic Acids Res. 44(W1), W581–W586 (2016)

3. Barrett, T., et al.: BioProject and BioSample databases at NCBI: facilitating cap-ture and organization of metadata. Nucleic Acids Res. 40(D1), 57–63 (2012)

4. Barrett, T., et al.: NCBI GEO: archive for functional genomics data sets – update.Nucleic Acids Res. 41(Database issue), D991–D995 (2013)

5. Bornberg-Bauer, E., Paton, N.W.: Conceptual data modelling for bioinformatics.Brief. Bioinform. 3(2), 166–180 (2002)

6. Buneman, P., et al.: A data transformation system for biological data sources. In:International Conference on Very Large Data Bases, pp. 158–169 (1995)

7. Cumbo, F., et al.: TCGA2BED: extracting, extending, integrating, and queryingThe Cancer Genome Atlas. BMC Bioinform. 18(6), 1–9 (2017)

8. Davidson, S.B., et al.: Biokleisli: a digital library for biomedical researchers. Int.J. Digit. Libr. 1(1), 36–53 (1997)

9. Davidson, S.B., et al.: K2/Kleisli and GUS: experiments in integrated access togenomic data sources. IBM Syst. J. 40(2), 512–531 (2001)

10. El-Ghalayini, H., et al.: Deriving conceptual data models from domain ontologiesfor bioinformatics. In: 2006 2nd Information and Communication Technologies,ICTTA 2006, vol. 2, pp. 3562–3567 (2006)

11. Fernandez, J.D., et al.: Ontology-based search of genomic metadata. IEEE/ACMTrans. Comput. Biol. Bioinform. 13(2), 233–247 (2016)

12. Galeota, E., Pelizzola, M.: Ontology-based annotations and semantic relations inlarge-scale (epi)genomics data. Brief. Bioinform. 18(3), 403–412 (2017)

13. Haider, S., et al.: BioMart Central Portal - unified access to biological data. NucleicAcids Res. 37(Web Server issue), 23–27 (2009)

14. Hernandez, T., Kambhampati, S.: Integration of biological sources: current systemsand challenges ahead. SIGMOD Rec. 33(3), 51–60 (2004)

15. Idrees, M., et al.: A review: conceptual data models for biological domain. JAPS,J. Anim. Plant Sci. 25(2), 337–345 (2015)

Page 15: Conceptual Modeling for Genomics: Building an Integrated ... · feature vector, e.g. Type [RSM] denotes Type as an attribute which is mandatory, restricted and single-valued, while

Conceptual Modeling for Genomics 339

16. Ji, F., Elmasri, R., et al.: Incorporating concepts for bioinformatics data modelinginto EER models. In: ACS/IEEE International Conference on Computer Systemsand Applications, pp. 189–192. IEEE Computer Society, Washington, DC, USA(2005)

17. Kaitoua, A., Pinoli, P., Bertoni, M., Ceri, S.: Framework for supporting genomicoperations. IEEE Trans. Comput. 66(3), 443–457 (2017)

18. Keet, M.C.: Biological data and conceptual modelling method. J. Concept. Model.29(1), 1–14 (2003)

19. Kundaje, A., et al.: Integrative analysis of 111 reference human epigenomes. Nature518(7539), 317–330 (2015)

20. Lenzerini, M.: Data integration: a theoretical perspective. In: Symposium on Prin-ciples of Database Systems, PODS, pp. 233–246. ACM, New York, NY, USA (2002)

21. Louie, B., et al.: Data integration and genomic medicine. J. Biomed. Inform. 40(1),5–16 (2007)

22. Masseroli, M., Canakoglu, A., Ceri, S.: Integration and querying of genomic andproteomic semantic annotations for biomedical knowledge extraction. IEEE/ACMTrans. Comput. Biol. Bioinform. 13(2), 209–219 (2016)

23. Masseroli, M., et al.: GenoMetric Query Language: a novel approach to large-scalegenomic data management. Bioinformatics 31(12), 1881–1888 (2015)

24. Masseroli, M., et al.: Modeling and interoperability of heterogeneous genomic bigdata for integrative processing and querying. Methods 111, 3–11 (2016)

25. Rechenmann, F.: Data modeling: the key to biological data integration. EMBnet.J. 18(B), 59–60 (2012)

26. Anonymous paper. Accelerating bioinformatics research with new software for bigdata to knowledge (BD2K), Paradigm4, April 2015. www.paradigm4.com

27. Consortium 1000Genomes: A map of human genome variation from population-scale sequencing. Nature 467(7319), 1061–1073 (2010)

28. Consortium ENCODE: An integrated encyclopedia of DNA elements in the humangenome. Nature 489(7414), 57–74 (2012)

29. Reyes Roman, J.F., Pastor, O., Casamayor, J.C., Valverde, F.: Applying conceptualmodeling to better understand the human genome. In: Comyn-Wattiau, I., Tanaka,K., Song, I.-Y., Yamamoto, S., Saeki, M. (eds.) ER 2016. LNCS, vol. 9974, pp. 404–412. Springer, Cham (2016). doi:10.1007/978-3-319-46397-1 31

30. Roy, A., et al.: Massively parallel processing of whole genome sequence data: anin-depth performance study. In: Proceedings of the 2017 ACM International Con-ference on Management of Data, SIGMOD 2017, Chicago, Illinois, USA, 14–19May 2017, pp. 187–202. ACM, New York (2017)

31. Sarntivijai, S., et al.: CLO: the cell line ontology. J. Biomed. Semant. 5(1), 37(2014)

32. Schomburg, I., et al.: BRENDA in 2013: new options and contents in BRENDA.Nucleic Acids Res. 41(Database issue), D764–D772 (2013)

33. Schriml, L.M., et al.: Disease Ontology: a backbone for disease semantic integration.Nucleic Acids Res. 40(Database issue), 940–946 (2012)

34. Smedley, D., et al.: The BioMart community portal: an innovative alternative tolarge, centralized data repositories. Nucleic Acids Res. 43(W1), 589–598 (2015)

35. Wang, L., et al.: BioStar models of clinical and genomic data for biomedical datawarehouse design. Int. J. Bioinform. Res. Appl. 1(1), 63–80 (2005)

36. Weinstein, J.N., et al.: The cancer genome atlas pan-cancer analysis project. Nat.Genet. 45(10), 1113–1120 (2013)

37. Zhu, Y., et al.: Geometadb: powerful alternative search engine for the gene expres-sion omnibus. Bioinformatics 24(23), 2798–2800 (2008)