

APPLICATION

Harmonizing, annotating and sharing data in biodiversity–ecosystem functioning research

Karin Nadrowski1*, Sophia Ratcliffe1, Gerhard Bönisch2, Helge Bruelheide3, Jens Kattge2, Xiaojuan Liu4, Lutz Maicher5,6, Xiangcheng Mi4, Michael Prilop5,6, Daniel Seifarth5, Karl Welter1,5, Sven Windisch5,7 and Christian Wirth1

1Institute of Special Botany and Functional Biodiversity Research, University of Leipzig, 04103 Leipzig, Germany; 2Max Planck Institute of Biogeochemistry, 07745 Jena, Germany; 3Institute of Botany/Geobotany and Botanical Garden, Martin Luther University Halle Wittenberg, 06120 Halle, Germany; 4Institute of Botany, Chinese Academy of Sciences, 20 Nanxincun, Beijing, Xiangshan, 100093, China; 5Topic Maps Lab, Natural Language Processing Group, University of Leipzig, 04109 Leipzig, Germany; 6Fraunhofer-Zentrum für Mittel- und Osteuropa (MOEZ), 04109 Leipzig, Germany; and 7Business Development, ESEMOS GmbH, 04109 Leipzig, Germany

Summary

1. The integrative research field of biodiversity–ecosystem functioning (BEF) requires close collaboration between researchers from different disciplines working on different scales in time, space as well as taxon resolution. Data can describe anything from abiotic ecosystem components, to organisms, parts of organisms, genetic information or element stocks and flows. Researchers prefer the convenience of spreadsheets for data preparation, which can lead to isolated data sets that are diverse in structure and follow diverging naming conventions.

2. BEFdata (https://github.com/befdata/befdata) is a new, open source web platform for the upload, validation and storage of data from a formatted Excel workbook. Metadata can be downloaded in Ecological Metadata Language (EML). BEFdata allows the harmonization of naming conventions by generating category lists from the primary data, which can be reviewed and managed via the Excel workbook or directly on the platform. BEFdata provides a secure environment during ongoing analysis; project members can only access primary data from other researchers after the acceptance of a data request.

3. Due to its generic database schema, BEFdata platforms can be used for any research domain working with tabular data. It supports the compilation of coherent data sets at the level of the primary data, allowing researchers to explicitly model correlation structures across data sets for synthesis. The EML export enables efficient publishing of data in global repositories.

Key-words: Ecoinformatics, BEF-China, cooperating research groups, web applications, Ruby on Rails, knowledge management, Ecological Metadata Language, semantic web, spreadsheets, Web of Data

Introduction

In biodiversity–ecosystem functioning (BEF) research, both the predictor – biodiversity – and the dependent variables – ecosystem services and functions – represent complex concepts. The data needed to establish BEF relationships are themselves highly heterogeneous and are typically generated by collaborative, interdisciplinary research consortia assembling expertise from various disciplines ranging from molecular ecology to remote sensing (Michener & Jones 2012). The diversity of data structures and scientific disciplines poses significant challenges when merging data sets to perform overarching meta-analyses. Here, we introduce the BEFdata platform that allows researchers to manage naming conventions between data sets and to import metadata and primary data from the same spreadsheet. It includes a transparent data sharing mechanism for cooperative research projects. We use a generic data structure to accommodate the complexity of BEF research, which makes our approach useful to other scientific disciplines.

In the following, we review the challenges of managing complex data, including (1) the heterogeneity of data structures, (2) the need to manage naming conventions at the primary data level and (3) the need for transparent data sharing mechanisms.

The transdisciplinary nature, as well as the range of spatial and temporal scales typical of BEF research, is reflected in the complexity of BEF data sets. They may describe the properties of soil layers, plant traits, occurrences of individual organisms, parts of individual organisms possibly at the molecular level or aggregated properties of conceptual entities such as vegetation layers or ecosystem matter pools. Additionally, the majority of data sets are entered by hand and contain fewer than 1000 rows (Heidorn 2008; Lotz et al. 2012), each with a unique data structure. Researchers prefer to use spreadsheets to prepare their data for analysis (Tenopir et al. 2011), but without proper annotation, even simple data sets can be difficult to understand.

*Correspondence author. E-mail: [email protected]

© 2012 The Authors. Methods in Ecology and Evolution © 2012 British Ecological Society

Methods in Ecology and Evolution 2013, 4, 201–205 doi: 10.1111/2041-210x.12009

When data sets are prepared independently in each research project, it is often easier to generate new names for physical or conceptual objects than to work with names developed by other groups. Examples of such naming conventions are the codes given to plots, species names, individual IDs or categorical parameter values. Diverging naming conventions increase the effort required to harmonize data sets a posteriori. One way of promoting data harmonization is by prescribing fixed data structures that enforce the use of naming conventions. For example, the Diversity Workbench (Triebel 2012) offers validation against many different web services, including services for scientific species names, habitat types, institutions or geographic context. However, these represent only a small subset of the data resulting from BEF research. Another approach is to allow any type of data file to be uploaded but ensure that detailed metadata are included in a standard form. For example, Metacat (KNB 2010) uses the Ecological Metadata Language (EML) format (Fegraus et al. 2005). See Hernández-Ernst et al. (2008) for a review of ecological information standards.

Data requests called ‘paper proposals’ or ‘proposals’ are often used within cooperative research projects as a way to make data exchange more transparent, to help in attributing credit to data contributors and to increase trust and team spirit (Stokstad 2011). They are formulated research ideas that specify what data are needed and whose expertise should be consulted to answer a specific question. Cooperative research projects that use paper proposals include, for example, the TRY initiative (Kattge et al. 2011a), BEF-China (this article) and the Nutrient Network (Stokstad 2011). To our knowledge, there are no data management solutions that offer paper proposal mechanisms both to share data sets and to keep a record of data exchange.

BEFdata platform

The ‘BEFdata’ platform (Fig. 1) was developed within the Biodiversity-Ecosystem Functioning Research Unit of the German Research Foundation (BEF-China, http://www.bef-china.de, FOR 891). BEFdata is an open source web application written in Ruby on Rails (Ruby, Thomas & Heinemeier Hansson 2011) and PostgreSQL (PostgreSQL Global Development Group 2012). During upload, the data are harmonized against existing data sets at the primary data level. We use a generic data structure in that we store all primary data in a single ‘sheetcells’ table (Kattge et al. 2011b, Appendix 1). BEFdata provides an EML metadata export (Appendix 2). A detailed user manual is provided in Appendix 3. Information on setting up and managing the platform can be found online (https://github.com/befdata/befdata). BEFdata platforms are currently implemented by the BEF-China project (http://china.befdata.biow.uni-leipzig.de) and its Chinese partner projects (http://159.226.89.107) and by the FunDivEUROPE project (http://fundiv.befdata.biow.uni-leipzig.de).

Fig. 1. Screenshots of the welcome pages of the BEF-China group (http://china.befdata.biow.uni-leipzig.de), its Chinese partner projects (http://159.226.89.107) and the FunDivEUROPE (http://fundiv.befdata.biow.uni-leipzig.de) BEFdata platforms. Data sets and paper proposals are grouped by projects (A), by user (B) and on a separate data view (C). Primary data as well as metadata are uploaded exclusively through a formatted Excel 2003 workbook (Appendix 4) to minimize user interaction with web forms. For a user manual, see Appendix 3 or the BEFdata code repository (https://github.com/befdata/befdata).
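The idea of storing all primary data in a single ‘sheetcells’ table can be illustrated with a minimal sketch. It is written in Python with SQLite rather than the Rails and PostgreSQL stack of the platform, and every name except ‘sheetcells’ (the ‘datacolumns’ table, the column names, the helper functions and the example values) is an assumption made for illustration, not the BEFdata schema.

```python
import sqlite3

# Hypothetical, simplified schema: one row per spreadsheet cell.
# Only the table name "sheetcells" comes from the article.
SCHEMA = """
CREATE TABLE datacolumns (
    id      INTEGER PRIMARY KEY,
    dataset TEXT NOT NULL,
    header  TEXT NOT NULL
);
CREATE TABLE sheetcells (
    id            INTEGER PRIMARY KEY,
    datacolumn_id INTEGER NOT NULL REFERENCES datacolumns(id),
    row_index     INTEGER NOT NULL,
    import_value  TEXT
);
"""


def store_sheet(conn, dataset, headers, rows):
    """Store a rectangular sheet as one sheetcells row per cell."""
    for col_index, header in enumerate(headers):
        cur = conn.execute(
            "INSERT INTO datacolumns (dataset, header) VALUES (?, ?)",
            (dataset, header),
        )
        column_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO sheetcells (datacolumn_id, row_index, import_value) "
            "VALUES (?, ?, ?)",
            [(column_id, i, str(row[col_index]))
             for i, row in enumerate(rows, start=1)],
        )


def values_for_header(conn, header):
    """Collect all values stored under one header, across data sets."""
    query = """
        SELECT d.dataset, s.import_value
        FROM sheetcells s JOIN datacolumns d ON d.id = s.datacolumn_id
        WHERE d.header = ?
        ORDER BY d.dataset, s.row_index
    """
    return conn.execute(query, (header,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Two data sets with different shapes end up in the same two tables.
store_sheet(conn, "soil", ["plot", "ph"], [["A1", 5.2], ["A2", 4.9]])
store_sheet(conn, "trees", ["plot", "species", "height_m"],
            [["A1", "Schima superba", 7.5]])
plots = values_for_header(conn, "plot")
```

Because data sets of any shape share the same tables, a single query can assemble, for example, all plot codes used across data sets, at the cost of reconstructing the original rectangular sheet with joins.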

Data harmonization

BEFdata platforms use a bottom-up approach to developing naming conventions driven by the data. Primary data are uploaded from the import workbook (Appendix 4). BEFdata platforms currently support text, date, number and category data types; each type has its own validation rules. Original import values are stored in the ‘sheetcells’ table and are not altered thereafter. A separate ‘categories’ table enables adherence to naming conventions across data sets (Appendix 1). Data columns from the import data are assigned to data groups, and the upload process ensures that categories are unique within data groups. Primary data of number, date and category data types are matched to existing categories within their data group during upload (Fig. 2). Having different categories available for numeric data allows the explicit definition of missing data values. Invalid values are flagged to the user for manual checking. See the user manual in Appendix 3 for further information.
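The validation and matching step described above might look roughly as follows. This is a sketch under stated assumptions, not the platform's validation code: the function names, the ISO date format and the three status labels are hypothetical, and the actual rules are documented in the user manual (Appendix 3).

```python
from datetime import datetime


def validate(value, datatype):
    """Return True if a raw cell value (a string) fits the column's data type."""
    if datatype == "number":
        try:
            float(value)
            return True
        except ValueError:
            return False
    if datatype == "date":
        try:
            # Assumption: dates arrive as ISO strings such as "2012-07-20".
            datetime.strptime(value, "%Y-%m-%d")
            return True
        except ValueError:
            return False
    return datatype == "text"


def import_column(values, datatype, categories):
    """Classify each raw value as 'valid', 'category' or 'invalid'.

    `categories` is the set of category names already known in the
    data group the column was assigned to.
    """
    report = []
    for value in values:
        if datatype != "text" and value in categories:
            # Number, date and category columns are matched to categories,
            # e.g. an explicit missing-value code in a numeric column.
            status = "category"
        elif datatype != "category" and validate(value, datatype):
            status = "valid"
        else:
            # Flagged to the user for manual checking.
            status = "invalid"
        report.append((value, status))
    return report


ph_report = import_column(["5.2", "NA", "five"], "number", {"NA"})
```

In this example the numeric column accepts `5.2`, resolves `NA` through the category list of its data group and flags `five` for manual checking.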

The bottom-up approach to naming conventions requires a level of data management, which would not be needed when using fixed naming conventions. Categories and data groups can be browsed by members and managed by data owners and administrators (Fig. 2). All the categories in the data groups are listed on the individual data group page. Each category also has its own page that lists all the primary data linked to the category and the original uploaded value.

Administrators can rename, merge and split categories on the platform (Fig. 2). Any changes are reflected in every data set that is linked to the category.
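Why a rename or merge is reflected in every linked data set can be seen in a small sketch in which cells reference a category by identifier and keep their original import value. The class, the functions and the example names are hypothetical and only mirror the behaviour described in the text; splitting a category is omitted.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    dataset: str
    import_value: str  # original uploaded value, never altered
    category_id: int   # link to the shared category


def rename_category(categories, category_id, new_name):
    """Rename a category; every linked cell sees the new name."""
    if new_name in categories.values():
        raise ValueError("category names must be unique within a data group")
    categories[category_id] = new_name


def merge_categories(categories, cells, source_id, target_id):
    """Re-link all cells of `source_id` to `target_id`, then drop the source."""
    for cell in cells:
        if cell.category_id == source_id:
            cell.category_id = target_id
    del categories[source_id]


# One data group holding two spellings of the same species.
categories = {1: "Schima superba", 2: "S. superba"}
cells = [
    Cell("trees", "Schima superba", 1),
    Cell("litter", "S. superba", 2),
]
merge_categories(categories, cells, source_id=2, target_id=1)
resolved = [categories[c.category_id] for c in cells]
```

After the merge both data sets resolve to the same category name, while the spelling that was originally uploaded stays available in `import_value`.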

Data owners can edit their data sets and reassign data

groups, which restarts the validation process. The data owner

can also download the workbook at any point. Any invalid

categories will be highlighted in the downloaded file, and any

missing or invalid data can be corrected in the workbook and

the workbook re-uploaded.

Fig. 2. Data group and category pages of a BEFdata platform. Categories are unique within data groups, and a data group page lists all its categories (A). During data import, primary data are matched to existing categories. Each category links to its own page (B), listing all the primary data it is associated with (C), including their original import values (D). Administrators can rename or merge categories from the data group pages (E) and split categories from the category page (B). See the text and the user manual in Appendix 3 for further information on how to manage categories and data groups.

Data sharing workflow: paper proposals

Access to data sets is restricted to the data owners. Members who would like to use data sets for analysis must submit a paper proposal, which contains a list of the data sets. Data sets can be added to a logged-in user’s cart, and this collection of data sets can then be used for a paper proposal. The proposal is initially reviewed by a project board to make sure that it is novel, complementary and does not compete with other activities, and then by all data set owners listed in the proposal. Once all owners have approved the proposal, proponents gain download access to the requested data sets.
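The two-stage approval described above can be summarized as a small state model. This is a sketch of the workflow only; the class, its methods and the user names are hypothetical and do not reflect the platform's code.

```python
class Proposal:
    """Hypothetical model of a paper proposal and its approval steps."""

    def __init__(self, proponent, dataset_owners):
        # dataset_owners maps each requested data set to its owner.
        self.proponent = proponent
        self.dataset_owners = dict(dataset_owners)
        self.board_approved = False
        self.owner_votes = set()

    def approve_by_board(self):
        """Stage 1: the project board accepts the proposal."""
        self.board_approved = True

    def approve_by_owner(self, owner):
        """Stage 2: an owner of a requested data set accepts the proposal."""
        if not self.board_approved:
            raise RuntimeError("the project board reviews the proposal first")
        if owner not in self.dataset_owners.values():
            raise ValueError(f"{owner} owns none of the requested data sets")
        self.owner_votes.add(owner)

    def can_download(self, user):
        """Access is granted only after the board and all owners approved."""
        all_owners = set(self.dataset_owners.values())
        return (
            user == self.proponent
            and self.board_approved
            and self.owner_votes == all_owners
        )


proposal = Proposal("dana", {"soil": "alice", "trees": "bob"})
proposal.approve_by_board()
proposal.approve_by_owner("alice")
partial = proposal.can_download("dana")   # one owner has not yet approved
proposal.approve_by_owner("bob")
granted = proposal.can_download("dana")
```

Keeping the votes in the proposal object also provides the record of who agreed to which data exchange, which is the transparency the paper proposal mechanism aims for.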

Discussion

BEFdata platforms allow the harmonization of both primary data and metadata for collaborating research projects. In comparison with initiatives that concentrate on managing data set metadata (for example, Metacat, KNB 2010 or BExIS, Lotz et al. 2012), the focus of BEFdata is on the primary data and specifically naming conventions within primary data. Having complex but consistent sets of primary data offers new possibilities for analysing ecosystem functioning. Current approaches to interdisciplinary synthesis in BEF research compare the regression slopes from separate analyses (Balvanera et al. 2006; Nadrowski, Wirth & Scherer-Lorenzen 2010; Maestre et al. 2012). Consistent data sets enable synthesis at the level of the primary data where the correlation structure of data points can be modelled explicitly using hierarchical modelling techniques (Ogle & Barber 2007).

The categories of BEFdata platforms are not controlled vocabularies (NISO 2005). While homonyms can be resolved because categories are nested within data groups, it is not possible to specify narrower or broader terms or to flag synonyms. However, BEFdata can make the use of existing semantic tools easier: custom naming conventions are exposed on a common platform where they can be reviewed; data and metadata are stored in one relational database, enabling seamless data and metadata interrogation; and metadata can be exported in standard EML format. A logical further step is to implement data validation against existing web services or thesauri (Nadrowski et al. 2012). The possibility of using web services to exchange information between repositories will be the subject of future BEFdata development. We are additionally evaluating the integration of BEFdata platforms into Kepler workflows (Gries & Porter 2011; Pfaff, Nadrowski & Wirth 2012).
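To give an impression of what an EML metadata export contains, the following sketch assembles a heavily reduced EML-like document with the Python standard library. It is not the BEFdata export (see Appendix 2 for that) and is not guaranteed to validate against the EML schema: the namespace version, the `packageId` and `system` values, the function name and the example metadata are all assumptions.

```python
import xml.etree.ElementTree as ET

# Assumption: EML 2.1.1 namespace; the article does not state the version.
EML_NS = "eml://ecoinformatics.org/eml-2.1.1"


def build_eml(title, creator_surname, columns):
    """Return a reduced EML-like XML string for one tabular data set.

    `columns` is a list of (header, definition) pairs.
    """
    ET.register_namespace("eml", EML_NS)
    root = ET.Element(
        f"{{{EML_NS}}}eml",
        {"packageId": "example.1.1", "system": "example"},  # placeholders
    )
    dataset = ET.SubElement(root, "dataset")
    ET.SubElement(dataset, "title").text = title
    creator = ET.SubElement(dataset, "creator")
    name = ET.SubElement(creator, "individualName")
    ET.SubElement(name, "surName").text = creator_surname
    table = ET.SubElement(dataset, "dataTable")
    ET.SubElement(table, "entityName").text = title
    attributes = ET.SubElement(table, "attributeList")
    for header, definition in columns:
        attribute = ET.SubElement(attributes, "attribute")
        ET.SubElement(attribute, "attributeName").text = header
        ET.SubElement(attribute, "attributeDefinition").text = definition
    return ET.tostring(root, encoding="unicode")


xml_text = build_eml(
    "Tree heights",
    "Example",
    [("plot", "Plot code"), ("height_m", "Tree height in metres")],
)
```

Each data column of a sheet maps onto one `attribute` element, which is why annotating columns in the import workbook is sufficient to produce column-level metadata on export.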

Conclusions

BEFdata platforms are communication tools that help researchers in cooperative research projects speak the same language using shared naming conventions, while having the convenience of working with spreadsheets. Our implementation of the paper proposal process makes the data use more transparent, which can increase synergies in cooperative research programs. Global data visibility can lead to new scientific collaborations, and data can be exported in EML format. BEFdata platforms do not contain prescribed domain logic and can thus be used by any scientific domain working with tabular data. With this, we hope to make data management and reuse within cooperating research projects more efficient and enjoyable.

Current managers of BEFdata platforms have profited from the speed of installation and customization (1 to 3 days for managers inexperienced with Rails applications). They continue to profit from bug fixes and new features added to the common code repository. Initial feedback from the current users has been positive. Researchers have found it especially helpful to be able to extract automatically assembled lists of names across data sets for species or plots.

Acknowledgements

This manuscript was greatly improved by the comments of two anonymous reviewers. The authors wish to thank all the members of the BEF-China project for essential help and feedback in crafting the functionality of the BEFdata platform. K. N., M. P., D. S., K. W. and S. W. were supported by the German Science Foundation (DFG) through the BEF-China project (FOR 891, sub-project ‘Data management’) of C. W. and H. B., and S. R. was supported by the EU project FunDivEUROPE (265171, Work package 1, Task I.4 ‘Data management, data quality assessment and control’) of C. W.

References

Balvanera, P., Pfisterer, A.B., Buchmann, N., He, J.-S., Nakashizuka, T., Raffaelli, D. & Schmid, B. (2006) Quantifying the evidence for biodiversity effects on ecosystem functioning and services. Ecology Letters, 9, 1146–1156.

Fegraus, E.H., Andelman, S., Jones, M.B. & Schildhauer, M. (2005) Maximizing the value of ecological data with structured metadata: an introduction to Ecological Metadata Language (EML) and principles for metadata creation. Bulletin of the Ecological Society of America, 86, 158–168.

Gries, C. & Porter, J.H. (2011) Moving from custom scripts with extensive instructions to a workflow system: use of the Kepler workflow engine in environmental information management. Environmental Information Management Conference 2011 (eds M.B. Jones & C. Gries), pp. 70–75. University of California, Santa Barbara, CA.

Heidorn, P.B. (2008) Shedding light on the dark data in the long tail of science. Library Trends, 57, 280–299.

Hernández-Ernst, V., Poigné, A., Voss, A., Voss, H., Berendsohn, W., Giddy, J., Gebhardt, M., Hardisty, A., Schentz, H. & Magagna, B. (2008) Data & Modelling Tool Structures. Status Report on Infrastructures for Biodiversity Research. Fraunhofer IAIS, Cardiff University, St. Augustin, Germany.

Kattge, J., Díaz, S., Lavorel, S., Prentice, I.C., Leadley, P., Bönisch, G., et al. (2011a) TRY - a global database of plant traits. Global Change Biology, 17, 2905–2935.

Kattge, J., Ogle, K., Bönisch, G., Díaz, S., Lavorel, S., Madin, J., Nadrowski, K., Nöllert, S., Sartor, K. & Wirth, C. (2011b) A generic structure for plant trait databases. Methods in Ecology and Evolution, 2, 202–213.

KNB (2010) Administrator’s Guide for Metacat 1.9.3. National Center for Ecological Analysis and Synthesis (NCEAS); Knowledge Network of Biocomplexity, Santa Barbara, CA.

Lotz, T., Nieschulze, J., Bendix, J., Dobbermann, M. & König-Ries, B. (2012) Diverse or uniform? - Intercomparison of two major German project databases for interdisciplinary collaborative functional biodiversity research. Ecological Informatics, 8, 10–19.

Maestre, F.T., Quero, J.L., Gotelli, N.J., Escudero, A., Ochoa, V., Delgado-Baquerizo, M., et al. (2012) Plant species richness and ecosystem multifunctionality in global drylands. Science, 335, 214–218.

Michener, W.K. & Jones, M.B. (2012) Ecoinformatics: supporting ecology as a data-intensive science. Trends in Ecology & Evolution, 27, 85–93.

Nadrowski, K., Wirth, C. & Scherer-Lorenzen, M. (2010) Is forest diversity driving ecosystem function and service? Current Opinion in Environmental Sustainability, 2, 75–79.

Nadrowski, K., Seifarth, D., Ratcliffe, S., Wirth, C. & Maicher, L. (2012) Identifiers in e-Science platforms for the ecological sciences. Communities in New Media: Virtual Enterprises, Research Communities & Social Media Networks. Proceedings of GeNeMe 2012 (eds T. Kohler & N. Kahnwald), pp. 259–272. TUDpress, Dresden.

NISO (2005) Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies. NISO Press, Bethesda, MD.

Ogle, K. & Barber, J. (2007) Bayesian data-model integration in plant physiological and ecosystem ecology. Progress in Botany (eds K. Esser, U. Lüttge, W. Beyschlag & J. Murata), pp. 281–311. Springer, Berlin, Heidelberg, Germany.

Pfaff, C.-T., Nadrowski, K. & Wirth, C. (2012) Using Kepler Workflows In Ecology. F1000Posters, 3, 1356.

PostgreSQL Global Development Group (2012) PostgreSQL – the world’s most advanced open source database. URL www.postgresql.org [accessed 9 November 2012].

Ruby, S., Thomas, D. & Heinemeier Hansson, D. (2011) Agile Web Development with Rails. The Pragmatic Bookshelf, Raleigh, North Carolina.

Stokstad, E. (2011) Open-source ecology takes root across the world. Science, 334, 308–309.

Tenopir, C., Allard, S., Douglass, K., Aydinoglu, A.U., Wu, L., Read, E., Manoff, M. & Frame, M. (2011) Data sharing by scientists: practices and perceptions (C. Neylon, Ed.). PLoS ONE, 6, e21101.

Triebel, D. (2012) Diversity Workbench. Staatlichen Naturwissenschaftlichen Sammlungen Bayerns (SNSB), München. URL www.diversityworkbench.net [accessed 9 November 2012].

Received 20 July 2012; accepted 15 October 2012

Handling Editor: Nick Isaac

Supporting Information

Additional Supporting Information may be found in the online version

of this article.

Appendix S1. Class diagram of the BEFdata platform. Rails applications by default provide a class for every database table. Relationships between tables are not stored as foreign keys in the database but are defined in the classes.

Appendix S2. EML document as downloaded from the BEF-China BEFdata platform (http://china.befdata.biow.uni-leipzig.de), including a version using pseudo-code that refers to BEFdata classes.

Appendix S3. The BEFdata platform user manual.

Appendix S4. Excel workbook for importing data into BEFdata platforms.
