APPLICATION
Harmonizing, annotating and sharing data in
biodiversity–ecosystem functioning research
Karin Nadrowski1*, Sophia Ratcliffe1, Gerhard Bönisch2, Helge Bruelheide3, Jens Kattge2,
Xiaojuan Liu4, Lutz Maicher5,6, Xiangcheng Mi4, Michael Prilop5,6, Daniel Seifarth5,
Karl Welter1,5, Sven Windisch5,7 and Christian Wirth1
1Institute of Special Botany and Functional Biodiversity Research, University of Leipzig, 04103 Leipzig, Germany; 2Max Planck
Institute of Biogeochemistry, 07745 Jena, Germany; 3Institute of Botany/Geobotany and Botanical Garden, Martin Luther
University Halle-Wittenberg, 06120 Halle, Germany; 4Institute of Botany, Chinese Academy of Sciences, 20 Nanxincun,
Beijing, Xiangshan, 100093, China; 5Topic Maps Lab, Natural Language Processing Group, University of Leipzig, 04109
Leipzig, Germany; 6Fraunhofer-Zentrum für Mittel- und Osteuropa (MOEZ), 04109 Leipzig, Germany; and 7Business
Development, ESEMOS GmbH, 04109 Leipzig, Germany
Summary
1. The integrative research field of biodiversity–ecosystem functioning (BEF) requires close collaboration
between researchers from different disciplines working on different scales in time, space as well as taxon resolu-
tion. Data can describe anything from abiotic ecosystem components, to organisms, parts of organisms, genetic
information or element stocks and flows. Researchers prefer the convenience of spreadsheets for data prepara-
tion, which can lead to isolated data sets that are diverse in structure and follow diverging naming conventions.
2. BEFdata (https://github.com/befdata/befdata) is a new, open source web platform for the upload, validation
and storage of data from a formatted Excel workbook. Metadata can be downloaded in Ecological Metadata
Language (EML). BEFdata allows the harmonization of naming conventions by generating category lists from
the primary data, which can be reviewed and managed via the Excel workbook or directly on the platform. BEFdata
provides a secure environment during ongoing analysis; project members can only access primary data from
other researchers after the acceptance of a data request.
3. Owing to their generic database schema, BEFdata platforms can be used in any research domain working with
tabular data. They support the compilation of coherent data sets at the level of the primary data, allowing
researchers to explicitly model correlation structures across data sets for synthesis. The EML export enables efficient
publishing of data in global repositories.
Key-words: Ecoinformatics, BEF-China, cooperating research groups, web applications, Ruby on
Rails, knowledge management, Ecological Metadata Language, semantic web, spreadsheets, Web of
Data
Introduction
In biodiversity–ecosystem functioning (BEF) research, both
the predictor – biodiversity – and the dependent variables –
ecosystem services and functions – represent complex concepts.
The data needed to establish BEF relationships are themselves
highly heterogeneous and are typically generated by collabora-
tive, interdisciplinary research consortia assembling expertise
from various disciplines ranging from molecular ecology to
remote sensing (Michener & Jones 2012). The diversity of data
structures and scientific disciplines poses significant challenges
when merging data sets to perform overarching meta-analyses.
Here, we introduce the BEFdata platform that allows research-
ers to manage naming conventions between data sets and to
import metadata and primary data from the same spreadsheet.
It includes a transparent data sharing mechanism for coopera-
tive research projects. We use a generic data structure to
accommodate the complexity of BEF research, which makes
our approach useful to other scientific disciplines.
In the following, we review the challenges of managing com-
plex data, including (1) the heterogeneity of data structures, (2)
the need to manage naming conventions at the primary data
level and (3) the need for transparent data sharing mechanisms.
The transdisciplinary nature, as well as the range of spatial
and temporal scales typical of BEF research, is reflected in
the complexity of BEF data sets. They may describe the
properties of soil layers, plant traits, occurrences of individual
organisms, parts of individual organisms possibly at the
molecular level or aggregated properties of conceptual entities
such as vegetation layers or ecosystem matter pools.
Additionally, the majority of data sets are entered by hand and
contain fewer than 1000 rows (Heidorn 2008; Lotz et al.
2012), each with a unique data structure. Researchers prefer
to use spreadsheets to prepare their data for analysis
(Tenopir et al. 2011), but without proper annotation, even
simple data sets can be difficult to understand.
*Correspondence author. E-mail: [email protected]
© 2012 The Authors. Methods in Ecology and Evolution © 2012 British Ecological Society
Methods in Ecology and Evolution 2013, 4, 201–205 doi: 10.1111/2041-210x.12009
When data sets are prepared independently in each research
project, it is often easier to generate new names for physical or
conceptual objects, than to work with names developed by
other groups. Examples of such naming conventions are the
codes given to plots, species names, individual IDs or categori-
cal parameter values. Diverging naming conventions increase
the effort required to harmonize data sets a posteriori. One way
of promoting data harmonization is by prescribing fixed data
structures that enforce the use of naming conventions. For
example, the Diversity Workbench (Triebel 2012) offers validation
against many different web services, including services for
scientific species names, habitat types, institutions or geographic
context. However, these represent only a small subset
of the data resulting from BEF research. Another approach is
to allow any type of data file to be uploaded but ensure that
detailed metadata are included in a standard form. For example,
Metacat (KNB 2010) uses the Ecological Metadata Language
(EML) format (Fegraus et al. 2005). See Hernández-Ernst
et al. (2008) for a review of ecological information standards.
Data requests called ‘paper proposals’ or ‘proposals’ are
often used within cooperative research projects as a way to
make data exchange more transparent, to help in attributing
credit to data contributors and to increase trust and team spirit
(Stokstad 2011). They are formulated research ideas that spec-
ify what data are needed and whose expertise should be con-
sulted to answer a specific question. Cooperative research
projects that use paper proposals include, for example, the
TRY initiative (Kattge et al. 2011a), BEF-China (this article)
and the Nutrient Network (Stokstad 2011). To our knowledge,
there are no data management solutions that offer paper
proposal mechanisms to share data sets and keep a record of
data exchange.
BEFdata platform
The ‘BEFdata’ platform (Fig. 1) was developed within the
Biodiversity-Ecosystem Functioning Research Unit of the
German Research Foundation (BEF-China, http://www.bef-china.de,
FOR 891). BEFdata is an open source web application
built with Ruby on Rails (Ruby, Thomas & Heinemeier
Hansson 2011) and PostgreSQL (PostgreSQL Global Development
Group 2012). During upload, the data are harmonized
against existing data sets at the primary data level. We use a
generic data structure in that we store all primary data in a single
‘sheetcells’ table (Kattge et al. 2011b, Appendix 1). BEFdata
provides an EML metadata export (Appendix 2). A detailed
user manual is provided in Appendix 3. Information on setting
up and managing the platform can be found online
(https://github.com/befdata/befdata). BEFdata platforms
are currently implemented by the BEF-China project
(http://china.befdata.biow.uni-leipzig.de) and its Chinese partner
projects (http://159.226.89.107) and by the FunDivEUROPE
project (http://fundiv.befdata.biow.uni-leipzig.de).
Fig. 1. Screenshots of the welcome pages of the BEF-China group (http://china.befdata.biow.uni-leipzig.de), its Chinese partner projects
(http://159.226.89.107) and the FunDivEUROPE (http://fundiv.befdata.biow.uni-leipzig.de) BEFdata platforms. Data sets and paper proposals are
grouped by projects (A), by user (B) and on a separate data view (C). Primary data as well as metadata are uploaded exclusively through a formatted
Excel 2003 workbook (Appendix 4) to minimize user interaction with web forms. For a user manual, see Appendix 3 or the BEFdata code repository
(https://github.com/befdata/befdata).
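The idea of keeping all primary data in one generic table can be illustrated with a minimal sketch. It is not the BEFdata schema (see Appendix 1 for that): the column names, the example data sets and the use of SQLite in place of PostgreSQL are assumptions made only to keep the example self-contained.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sheetcells ("
    "dataset TEXT, column_name TEXT, row_no INTEGER, import_value TEXT)"
)


def store_sheet(conn, dataset, rows):
    """Flatten a list of row dicts into one record per spreadsheet cell."""
    for row_no, row in enumerate(rows, start=1):
        for column, value in row.items():
            conn.execute(
                "INSERT INTO sheetcells VALUES (?, ?, ?, ?)",
                (dataset, column, row_no, str(value)),
            )


# Two sheets with different structures end up in the same table.
store_sheet(conn, "soil_layers", [{"plot": "P01", "depth_cm": 10, "ph": 4.6}])
store_sheet(conn, "tree_heights", [{"plot": "P01", "tree_id": "T7", "height_m": 12.5}])

cell_count = conn.execute("SELECT COUNT(*) FROM sheetcells").fetchone()[0]
plots = conn.execute(
    "SELECT dataset, import_value FROM sheetcells "
    "WHERE column_name = 'plot' ORDER BY dataset"
).fetchall()
```

Because every cell is a row, a new data set with previously unseen columns needs no schema change, and a value such as a plot code can be looked up across all data sets with a single query.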
Data harmonization
BEFdata platforms use a bottom-up approach to developing
naming conventions driven by the data. Primary data are
uploaded from the import workbook (Appendix 4). BEFdata
platforms currently support text, date, number and category
data types; each type has its own validation rules. Original
import values are stored in the ‘sheetcells’ table and are not
altered thereafter. A separate ‘categories’ table enables adher-
ence to naming conventions across data sets (Appendix 1).
Data columns from the import data are assigned to data
groups, and the upload process ensures that categories are
unique within data groups. Primary data of number, date and
category data types are matched to existing categories within
their data group during upload (Fig. 2). Having different cate-
gories available for numeric data allows the explicit definition
of missing data values. Invalid values are flagged to the user
for manual checking. See the user manual in Appendix 3 for
further information.
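The upload check for a number column described above can be sketched as follows. This is an illustration under stated assumptions, not the BEFdata validation code: the function name, the example values and the category labels are hypothetical.

```python
def classify_number_cells(values, categories):
    """Sort the raw cells of one number column into three outcomes.

    values: raw cell contents in sheet order.
    categories: labels already defined for the column's data group,
    for example explicit codes for missing values.
    """
    numbers, matched, invalid = [], [], []
    for row, raw in enumerate(values, start=1):
        text = str(raw).strip()
        try:
            numbers.append((row, float(text)))   # a valid number
        except ValueError:
            if text in categories:
                matched.append((row, text))      # linked to an existing category
            else:
                invalid.append((row, text))      # flagged for manual checking
    return numbers, matched, invalid


# Hypothetical column of tree heights with two kinds of non-numeric entries.
height_cells = ["12.5", "NA", "7", "n.a.", "below detection limit"]
height_categories = {"NA", "below detection limit"}
numbers, matched, invalid = classify_number_cells(height_cells, height_categories)
```

In this sketch "NA" and "below detection limit" are accepted because they already exist as categories in the data group, whereas the unknown spelling "n.a." is returned for the user to correct.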
The bottom-up approach to naming conventions requires a
level of data management that would not be needed with
fixed naming conventions. Categories and data groups
can be browsed by members and managed by data owners and
administrators (Fig. 2). All the categories in the data groups
are listed on the individual data group page. Each category
also has its own page that lists all the primary data linked to
the category and the original uploaded value.
Administrators can rename, merge and split categories on
the platform (Fig. 2). Any changes are reflected in every data
set that is linked to the category.
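Why a merge is visible in every linked data set while the original import values stay untouched can be shown with a small sketch. The class, the function and the example names are hypothetical; the sketch only assumes that each stored cell keeps its import value next to a link to a category.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """One stored value: the import value is fixed, the category link can change."""
    dataset: str
    import_value: str
    category_id: int


def merge_categories(cells, categories, source_id, target_id):
    """Relink every cell from source_id to target_id, then drop the source."""
    for cell in cells:
        if cell.category_id == source_id:
            cell.category_id = target_id
    del categories[source_id]


# Two data sets that used different spellings for the same species.
categories = {1: "Castanopsis eyrei", 2: "C. eyrei"}
cells = [
    Cell("tree_inventory", "Castanopsis eyrei", 1),
    Cell("leaf_traits", "C. eyrei", 2),
]
merge_categories(cells, categories, source_id=2, target_id=1)
resolved = [(c.dataset, c.import_value, categories[c.category_id]) for c in cells]
```

After the merge both data sets resolve to the same category name, and the spelling that was originally uploaded remains available for review.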
Data owners can edit their data sets and reassign data
groups, which restarts the validation process. The data owner
can also download the workbook at any point. Any invalid
categories will be highlighted in the downloaded file, and any
missing or invalid data can be corrected in the workbook and
the workbook re-uploaded.
Data sharing workflow: paper proposals
Access to data sets is restricted to the data owners. Members
who would like to use data sets for analysis must submit
a paper proposal, which contains a list of the data
sets. Data sets can be added to a logged-in user’s cart, and
this collection of data sets can then be used for a paper
proposal. The proposal is initially reviewed by a project
board to make sure that it is novel, complementary and
does not compete with other activities, and then by all data
set owners listed in the proposal. Once all owners have
approved the proposal, proponents gain download access
to the requested data sets.
Fig. 2. Data group and category pages of a BEFdata platform. Categories are unique within data groups, and a data group page lists all its
categories (A). During data import, primary data are matched to existing categories. Each category links to its own page (B), listing all the primary data it is
associated with (C), including their original import values (D). Administrators can rename or merge categories from the data group pages (E) and
split categories from the category page (B). See the text and the user manual in Appendix 3 for further information on how to manage categories and
data groups.
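The two review stages of a paper proposal can be summarized in a short sketch. The class, its fields and the example names are assumptions for illustration and do not reproduce the BEFdata implementation.

```python
from dataclasses import dataclass, field


@dataclass
class PaperProposal:
    """Hypothetical model of a data request and its approval state."""
    proponent: str
    dataset_owners: dict                 # data set title -> owner name
    board_approved: bool = False
    owner_approvals: set = field(default_factory=set)

    def approve_by_board(self):
        self.board_approved = True

    def approve_by_owner(self, owner):
        # Owners are only asked after the project board has agreed.
        if not self.board_approved:
            raise RuntimeError("the project board reviews the proposal first")
        if owner not in self.dataset_owners.values():
            raise ValueError(f"{owner} owns none of the requested data sets")
        self.owner_approvals.add(owner)

    def download_allowed(self):
        """Access is granted only once every listed owner has approved."""
        all_owners = set(self.dataset_owners.values())
        return self.board_approved and self.owner_approvals == all_owners


proposal = PaperProposal("alice", {"soil carbon": "bob", "tree growth": "carol"})
proposal.approve_by_board()
proposal.approve_by_owner("bob")
access_after_one_owner = proposal.download_allowed()
proposal.approve_by_owner("carol")
access_after_all_owners = proposal.download_allowed()
```

Keeping the approvals as explicit state means the record of who agreed to which request is a by-product of the workflow itself.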
Discussion
BEFdata platforms allow the harmonization of both pri-
mary data and metadata for collaborating research pro-
jects. In comparison with initiatives that concentrate on
managing data set metadata (for example, Metacat, KNB
2010 or BExIS, Lotz et al. 2012), the focus of BEFdata is
on the primary data and specifically naming conventions
within primary data. Having complex but consistent sets
of primary data offers new possibilities for analysing eco-
system functioning. Current approaches to interdisciplinary
synthesis in BEF research compare the regression slopes
from separate analyses (Balvanera et al. 2006; Nadrowski,
Wirth & Scherer-Lorenzen 2010; Maestre et al. 2012). Con-
sistent data sets enable synthesis at the level of the primary
data where the correlation structure of data points can be
modelled explicitly using hierarchical modelling techniques
(Ogle & Barber 2007).
The categories of BEFdata platforms are not controlled
vocabularies (NISO 2005). While homonyms can be
resolved because categories are nested within data groups,
it is not possible to specify narrower or broader terms or
to flag synonyms. However, BEFdata can make the use of
existing semantic tools easier: custom naming conventions
are exposed on a common platform where they can be
reviewed; data and metadata are stored in one relational
database, enabling seamless data and metadata interroga-
tion; and metadata can be exported in standard EML for-
mat. A logical further step is to implement data validation
against existing web services or thesauri (Nadrowski et al.
2012). The possibility of using web services to exchange
information between repositories will be the subject of
future BEFdata development. We are additionally evaluat-
ing the integration of BEFdata platforms into Kepler
workflows (Gries & Porter 2011; Pfaff, Nadrowski & Wirth
2012).
Conclusions
BEFdata platforms are communication tools that help
researchers in cooperative research projects speak the same
language using shared naming conventions, while having the
convenience of working with spreadsheets. Our implementa-
tion of the paper proposal process makes the data use more
transparent, which can increase synergies in cooperative
research programs. Global data visibility can lead to new scientific
collaborations, and data can be exported in EML format.
BEFdata platforms do not contain prescribed domain logic
and can thus be used by any scientific domain working with
tabular data. With this, we hope to make data management
and reuse within cooperating research projects more efficient
and enjoyable.
Current managers of BEFdata platforms have profited
from the speed of installation and customization (1 to
3 days for managers inexperienced with Rails applications).
They continue to profit from bug fixes and new features
added to the common code repository. Initial feedback
from the current users has been positive. Researchers have
found it especially helpful to be able to extract automati-
cally assembled lists of names across data sets for species
or plots.
Acknowledgements
This manuscript was greatly improved by the comments of two anonymous
reviewers. The authors wish to thank all the members of the BEF-China project
for essential help and feedback in crafting the functionality of the BEFdata platform.
K.N., M.P., D.S., K.W. and S.W. were supported by the German Research
Foundation (DFG) through the BEF-China project (FOR 891, sub-project ‘Data
management’) of C.W. and H.B., and S.R. was supported by the EU project
FunDivEUROPE (265171, Work package 1, Task I.4 ‘Data management, data
quality assessment and control’) of C.W.
References
Balvanera, P., Pfisterer, A.B., Buchmann, N., He, J.-S., Nakashizuka, T., Raffaelli, D. & Schmid, B. (2006) Quantifying the evidence for biodiversity effects on ecosystem functioning and services. Ecology Letters, 9, 1146–1156.
Fegraus, E.H., Andelman, S., Jones, M.B. & Schildhauer, M. (2005) Maximizing the value of ecological data with structured metadata: an introduction to Ecological Metadata Language (EML) and principles for metadata creation. Bulletin of the Ecological Society of America, 86, 158–168.
Gries, C. & Porter, J.H. (2011) Moving from custom scripts with extensive instructions to a workflow system: use of the Kepler workflow engine in environmental information management. Environmental Information Management Conference 2011 (eds M.B. Jones & C. Gries), pp. 70–75. University of California, Santa Barbara, CA.
Heidorn, P.B. (2008) Shedding light on the dark data in the long tail of science. Library Trends, 57, 280–299.
Hernández-Ernst, V., Poigné, A., Voss, A., Voss, H., Berendsohn, W., Giddy, J., Gebhardt, M., Hardisty, A., Schentz, H. & Magagna, B. (2008) Data & Modelling Tool Structures. Status Report on Infrastructures for Biodiversity Research. Fraunhofer IAIS, Cardiff University, St. Augustin, Germany.
Kattge, J., Díaz, S., Lavorel, S., Prentice, I.C., Leadley, P., Bönisch, G., et al. (2011a) TRY - a global database of plant traits. Global Change Biology, 17, 2905–2935.
Kattge, J., Ogle, K., Bönisch, G., Díaz, S., Lavorel, S., Madin, J., Nadrowski, K., Nöllert, S., Sartor, K. & Wirth, C. (2011b) A generic structure for plant trait databases. Methods in Ecology and Evolution, 2, 202–213.
KNB (2010) Administrator’s Guide for Metacat 1.9.3. National Center for Ecological Analysis and Synthesis (NCEAS); Knowledge Network of Biocomplexity, Santa Barbara, CA.
Lotz, T., Nieschulze, J., Bendix, J., Dobbermann, M. & König-Ries, B. (2012) Diverse or uniform? – Intercomparison of two major German project databases for interdisciplinary collaborative functional biodiversity research. Ecological Informatics, 8, 10–19.
Maestre, F.T., Quero, J.L., Gotelli, N.J., Escudero, A., Ochoa, V., Delgado-Baquerizo, M., et al. (2012) Plant species richness and ecosystem multifunctionality in global drylands. Science, 335, 214–218.
Michener, W.K. & Jones, M.B. (2012) Ecoinformatics: supporting ecology as a data-intensive science. Trends in Ecology & Evolution, 27, 85–93.
Nadrowski, K., Wirth, C. & Scherer-Lorenzen, M. (2010) Is forest diversity driving ecosystem function and service? Current Opinion in Environmental Sustainability, 2, 75–79.
Nadrowski, K., Seifarth, D., Ratcliffe, S., Wirth, C. & Maicher, L. (2012) Identifiers in e-Science platforms for the ecological sciences. Communities in New Media: Virtual Enterprises, Research Communities & Social Media Networks. Proceedings of GeNeMe 2012 (eds T. Kohler & N. Kahnwald), pp. 259–272. TUDpress, Dresden.
NISO (2005) Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies. NISO Press, Bethesda, MD.
Ogle, K. & Barber, J. (2007) Bayesian data-model integration in plant physiological and ecosystem ecology. Progress in Botany (eds K. Esser, U. Lüttge, W. Beyschlag & J. Murata), pp. 281–311. Springer, Berlin, Heidelberg, Germany.
Pfaff, C.-T., Nadrowski, K. & Wirth, C. (2012) Using Kepler Workflows In Ecology. F1000Posters, 3, 1356.
PostgreSQL Global Development Group. (2012) PostgreSQL – the world’s most advanced open source database. URL www.postgresql.org [accessed 9 November 2012].
Ruby, S., Thomas, D. & Heinemeier Hansson, D. (2011) Agile Web Development with Rails. The Pragmatic Bookshelf, Raleigh, North Carolina.
Stokstad, E. (2011) Open-source ecology takes root across the world. Science, 334, 308–309.
Tenopir, C., Allard, S., Douglass, K., Aydinoglu, A.U., Wu, L., Read, E., Manoff, M. & Frame, M. (2011) Data sharing by scientists: practices and perceptions (C. Neylon, Ed.). PLoS ONE, 6, e21101.
Triebel, D. (2012) Diversity Workbench. Staatlichen Naturwissenschaftlichen Sammlungen Bayerns (SNSB), München. URL www.diversityworkbench.net [accessed 9 November 2012].
Received 20 July 2012; accepted 15 October 2012
Handling Editor: Nick Isaac
Supporting Information
Additional Supporting Information may be found in the online version
of this article.
Appendix S1. Class diagram of the BEFdata platform. Rails applica-
tions by default provide a class for every database table. Relationships
between tables are not stored as foreign keys in the database but are
defined in the classes.
Appendix S2. EML document as downloaded from the BEF-China
BEFdata platform (http://china.befdata.biow.uni-leipzig.de), including
a version using pseudo-code that refers to BEFdata classes.
Appendix S3. The BEFdata platform user manual.
Appendix S4. Excel workbook for importing data into BEFdata plat-
forms.