This document is part of a project that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 732064. It is the property of the DataBio consortium and shall not be distributed or reproduced without the formal approval of the DataBio Management Committee.
Project Acronym: DataBio
Grant Agreement number: 732064 (H2020-ICT-2016-1 – Innovation Action)
Project Full Title: Data-Driven Bioeconomy
Project Coordinator: INTRASOFT International
DELIVERABLE
D6.2 – Data Management Plan
Dissemination level PU -Public
Type of Document Report
Contractual date of delivery M06 – 30/6/2017
Deliverable Leader CREA
Status - version, date Final – v1.0, 30/6/2017
WP / Task responsible WP6
Keywords: Data management plan, big data, bioeconomy
D6.2 – Data Management Plan H2020 Contract No. 732064 Final – v1.0, 30/6/2017
Dissemination level: PU -Public Page 2
Executive Summary

This document presents DataBio deliverable D6.2, the Data Management Plan (DMP), a key element of good data management. DataBio participates in the European Commission H2020 Programme's extended open research data pilot, which requires a DMP. Consequently, the DataBio project's datasets will be as open as possible and as closed as necessary, focusing on sound big data management in the interest of best research practice and in order to create value and foster knowledge and technology from big datasets for the common good. The deliverable describes the data management life cycle for the data to be collected, processed and/or generated by the DataBio project, accounting also for the need to make research data findable, accessible, interoperable and reusable (FAIR).
DataBio's partners will be encouraged to adhere to sound data management to ensure that data are well managed, archived and preserved. Data preservation keeps data relevant because: (1) preserved data can be reused by other researchers; (2) the data collector can direct requests to the database rather than answer each request individually; (3) preserved data have the potential to lead to new, unanticipated discoveries; (4) preserved data prevent duplication of scientific studies that have already been conducted; and (5) archiving insures against loss by the data collector. The main issues addressed in this deliverable include: (1) the purpose of data collection; (2) data type, format, size, velocity, beneficiaries and provenance; (3) use of historical data; (4) making data FAIR; (5) data management support; (6) data security; and (7) ethical aspects.
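To illustrate what "making data FAIR" can mean in practice, the sketch below assembles a minimal dataset-level metadata record. The field names and the example identifier are illustrative assumptions loosely modelled on DataCite-style conventions; they are not prescribed by this deliverable.

```python
import json

def make_metadata_record(identifier, title, keywords, access_level):
    # Hypothetical minimal record; each field supports one FAIR aspect.
    return {
        "identifier": identifier,   # Findable: persistent identifier
        "title": title,
        "keywords": keywords,       # Findable: search keywords
        "access": access_level,     # Accessible: open or closed
        "format": "GeoTIFF",        # Interoperable: standard format (example)
        "license": "CC-BY-4.0",     # Reusable: explicit licence (example)
    }

record = make_metadata_record(
    "doi:10.xxxx/example",          # placeholder DOI, not a real one
    "Tree species map", ["forestry", "EO"], "open")
print(json.dumps(record, indent=2))
```

Such records could then be harvested by a catalogue so that each dataset is discoverable by keyword and retrievable under a clearly stated access level.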
Big data is a new paradigm and is driving change in businesses and other organizations. Entities across the EU are beginning to manage the massive data sets and non-traditional data structures typical of big data, extending both their data management skills and their portfolios of data management software. Big data management empowers those entities to automate business operations efficiently, operate closer to real time and, through analytics, add value and learn valuable new facts about business operations, customers, partners, etc. Within the DataBio framework, big data management (BDM) is a mixture of conventional and new best practices, skills, teams, data types, and in-house or vendor-built functionality. All of these are being realigned under the DataBio platform, built upon the partners' own experience and tools. It is anticipated that DataBio will provide a solution that assumes datasets will be distributed among different infrastructures and that their accessibility could be complex, and that therefore includes mechanisms facilitating data retrieval, processing, manipulation and visualization as seamlessly as possible. The infrastructure will open new possibilities for the ICT sector, including SMEs, to develop new Bioeconomy 4.0 solutions, and will also open new possibilities for companies from the Earth Observation sector.
Some partners have scaled up pre-existing applications and databases to handle burgeoning volumes of relational big data, or have acquired new data management platforms purpose-built for managing and analyzing multi-structured big data, including streaming big data. Others are evaluating big data platforms amid a brisk market of vendor products and services for managing and harnessing big data. The Hadoop Distributed File System (HDFS), MapReduce, various Hadoop tools, complex event processing (for streaming big data), NoSQL databases (for schema-free big data), in-memory databases (for real-time analytic processing of big data), private clouds, in-database analytics and grid computing will be among the software products implemented within the DataBio framework.
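To make the MapReduce paradigm mentioned above concrete, here is a minimal single-process sketch of its map, shuffle and reduce phases in plain Python. It is a stand-in for what Hadoop distributes across a cluster, not DataBio's actual implementation; the sample records are invented.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, 1) pairs, here one pair per word occurrence.
    for record in records:
        for word in record.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

logs = ["soil moisture low", "soil temperature high", "soil moisture low"]
counts = reduce_phase(shuffle(map_phase(logs)))
print(counts["soil"])  # 3
```

On a real cluster the same map and reduce functions run in parallel on data partitions, with the framework performing the shuffle over the network.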
During the lifecycle of the DataBio project, big data will be collected, that is, very large data sets (multi-terabyte or larger) consisting of a wide range of data types (relational, text, multi-structured data, etc.) from numerous sources. Most data will come from farm and forestry machinery, fishing vessels, remote and proximal sensors and imagery, and many other technologies. DataBio is purposefully collecting big data, specifically:
• Forestry: Big Data methods are expected to make it possible both to increase the value of forests and to decrease costs, within the sustainability limits set by natural growth and ecological considerations. The key approach is to gather ever more accurate information about the trees from a host of sensors, including a new generation of satellites, UAV images, laser scanning, mobile devices (through crowdsourcing) and machines operating in the forests.
• Agriculture: Big Data in agriculture is currently a hot topic. DataBio aims at building a European vision of Big Data for agriculture. This vision is to offer solutions that will increase the role of Big Data in agri-food chains in Europe, and to prepare recommendations for future big data development in Europe.
• Fisheries: the ambition of this project is to herald and promote the use of Big Data analytical tools in fisheries applications by initiating several pilots that will demonstrate the benefits of using Big Data analytically in fisheries, such as improved analysis of operational data, tools for planning and operational choices, and crowdsourcing methods for fish stock estimation.
This is the first version of the DataBio DMP; it will be updated over the course of the project as warranted by significant changes arising during project implementation and by the requirements of the project consortium. At least two updates will be prepared, in Months 18 and 36 of the project.
Deliverable Leader: Ephrem Habyarimana (CREA)
Contributors:
Jaroslav Šmejkal (ZETOR), Tomas Mildorf (UWB), Bernard Stevenot (SPACEBEL), Irene Matzakou (INTRASOFT), Ingo Simonis (OGCE), Christian Zinke (INFAI), Karel Charvat (LESPRO)
Reviewers: Kyrill Meyer (INFAI), Tomas Mildorf (UWB), Erwin Goor (VITO), Fabiana Fournier (IBM), Marco Folegani (MEEO)
Approved by: Athanasios Poulakidas (INTRASOFT)
Document History

Version Date Contributor(s) Description
0.1.1-2 12/05/2017 Ephrem Habyarimana TOC
0.1.3 22/05/2017 Ephrem Habyarimana Reviewed TOC, first assignments
0.2 30/05/2017 Tomas Mildorf Section 4.1 FAIR data costs
0.3 05/06/2017 Bernard Stevenot Section 6 Ethical issues
0.4 09/06/2017 Irene Matzakou, Athanasios Poulakidas Sections 5.4-5.5 Privacy and sensitive data management
0.5.1 21/06/2017 Ingo Simonis Sections 3.3 and 3.4 added
0.5.2 22/06/2017 Christian Zinke, Jaroslav Šmejkal Sections 2.2.4.4 Machine-generated data and 4.2 added
0.6 23/06/2017 Ephrem Habyarimana Added: Executive summary, sections 1.2 & 2.1, and chapter 7
0.7 27/06/2017 Ephrem Habyarimana Added section 1.3; edits throughout the document
0.8 28/06/2017 Tomas Mildorf Update of Sections 2.2.4.3, 2.5.4, 2.5.5, 3.1.3 and 4.1
0.9 30/06/2017 Ephrem Habyarimana Included all tables for currently described DataBio datasets; overall edit of entire document
1.0 30/06/2017 Athanasios Poulakidas Compliance with submission format; minor changes
Table of Contents

EXECUTIVE SUMMARY
TABLE OF CONTENTS
TABLE OF FIGURES
LIST OF TABLES
DEFINITIONS, ACRONYMS AND ABBREVIATIONS
1 INTRODUCTION
1.1 Project Summary
1.2 Document Scope
1.3 Document Structure
2 DATA SUMMARY
2.1 Purpose of data collection
2.2 Data types and formats
2.2.1 Structured data
2.2.2 Semi-structured data
2.2.3 Unstructured data
2.2.4 New generation big data
2.3 Historical data
2.4 Expected data size and velocity
2.5 Data beneficiaries
2.5.1 Agricultural Sector
2.5.2 Forestry Sector
2.5.3 Fishery Sector
2.5.4 Technical Staff
2.5.5 ICT sector
2.5.6 Research and education
2.5.7 Policy making bodies
3 FAIR DATA
3.1 Data findability
3.1.1 Data discoverability and metadata provision
3.1.2 Data identification, naming mechanisms and search keyword approaches
3.1.3 Data lineage
3.2 Data accessibility
3.2.1 Open data and closed data
3.2.2 Data access mechanisms, software and tools
3.2.3 Big data warehouse architectures and database management systems
3.3 Data interoperability
3.3.1 Interoperability mechanisms
3.3.2 Inter-discipline interoperability and ontologies
3.4 Promoting data reuse
4 DATA MANAGEMENT SUPPORT
4.1 FAIR data costs
4.2 Big data managers
4.2.1 Project manager
4.2.2 Business Analysts
4.2.3 Data Scientists
4.2.4 Data Engineer / Architect
4.2.5 Platform architects
4.2.6 IT/Operation manager
4.2.7 Consultant
4.2.8 Business User
4.2.9 Pilot experts
5 DATA SECURITY
5.1 Introduction
5.2 Data recovery
5.3 Privacy and sensitive data management
5.3.1 Introduction
5.3.2 Enterprise Data (commercial sensitive data)
5.3.3 Personal Data
5.4 General privacy concerns
6 ETHICAL ISSUES
7 CONCLUSIONS
REFERENCES
APPENDIX A DATABIO DATASETS
A.1 Smart POI data set (UWB - D03.01)
A.2 Open Transport Map (UWB - D03.02)
A.3 Sentinels Scientific Hub datasets via FedEO Gateway (SPACEBEL - D07.01)
A.4 NASA CMR Landsat datasets via FedEO Gateway (SPACEBEL - D07.02)
A.5 Open Land Use (LESPRO - D02.01)
A.6 Forest resource data (METSAK - D18.01)
A.7 Customer and forest estate data (METSAK - D18.02)
A.8 Storm damage observations and possible risk areas (METSAK - D18.03)
A.9 Quality control data (METSAK - D18.04)
A.10 Ontology for (precision) agriculture (PSNC - D09.01)
A.11 Wuudis data (MHGS - D20.01)
A.12 SIGPAC (TRAGSA - D11.05)
A.13 Field data - Pilot B2 (TRAGSA - D11.07)
A.14 IACS (NP - D13.01)
A.15 Sentinel data
A.16 Tree species map (FMI - D14.03)
A.17 Stand age map (FMI - D14.04)
A.18 Canopy height map (FMI - D14.05)
A.19 Leaf Area Index (FMI - D14.06)
A.20 Forest damage (FMI - D14.07)
A.21 Hyperspectral image orthomosaic (SENOP - D44.02)
A.22 GAIAtrons IoT (DS13.01)
A.23 Phenomics, metabolomics, genomics and environmental datasets (CERTH - DS40.01)
Table of Figures

Figure 1: DataBio's analytics and big data value approach
Figure 2: The processing data lifecycle
Figure 3: The disciplinary data integration platform: where do you sit? (source: Wyborn)
Figure 4: DataBio's data managers
Figure 5: Data lifecycle
Figure 6: The data model of Smart Points of Interest
Figure 7: The data model of Open Transport Map
Figure 8: FedEO client (C07.05)

List of Tables

Table 1: The DataBio consortium partners
Table 2: Sensor data tools, resolution and spatial density
Table 3: Geospatial data tools, format and origin
Table 4: Genomic, biochemical and metabolomic data tools, description and acquisition
Definitions, Acronyms and Abbreviations

Acronym/Abbreviation Title
BDVA Big Data Value Association
EC European Commission
EO Earth Observation
ETL Extract Transform Load
DMP Data Management Plan
GSM Global System for Mobile Communications
GPS Global Positioning System
FAIR Findable Accessible Interoperable and Reusable
HDFS Hadoop Distributed File System
ICT Information and Communications Technology
IoT Internet of Things
JDBC Java DataBase Connectivity
JSON JavaScript Object Notation
NoSQL Not Only SQL
ODBC Open Database Connectivity
OEM Object Exchange Model
OGC Open Geospatial Consortium
REST Representational State Transfer
RFID Radio-Frequency IDentification
RPAS Remotely Piloted Aircraft Systems
SME Small-Medium Enterprise
SOAP Simple Object Access Protocol
SQL Structured Query Language
UAV Unmanned Aerial Vehicle
UI User Interface
WP Work Package
XML eXtensible Markup Language
Introduction

1.1 Project Summary

The data-intensive target sector on which the DataBio project focuses is the Data-Driven Bioeconomy. DataBio aims to utilize Big Data to contribute to the production of the best possible raw materials from agriculture, forestry and fishery (aquaculture) for the bioeconomy industry, as well as to their further processing into food, energy and biomaterials, while taking into account various accountability and sustainability issues.
DataBio will deploy state-of-the-art big data technologies and existing partners’ infrastructure
and solutions, linked together through the DataBio Platform. These will aggregate Big Data
from the three identified sectors (agriculture, forestry and fishery), intelligently process them
and allow the three sectors to selectively utilize numerous platform components, according
to their requirements. The execution will be through continuous cooperation of end user and
technology provider companies, bioeconomy and technology research institutes, and
stakeholders from the big data value PPP programme.
DataBio is driven by the development, use and evaluation of a large number of pilots in the three identified sectors, in which associated partners and additional stakeholders are also involved. The selected pilot concepts will be transformed into pilot implementations using co-innovative methods and tools. The pilots select and utilize the most suitable market-ready or almost market-ready ICT, Big Data and Earth Observation methods, technologies, tools and services, to be integrated into the common DataBio Platform.
Based on the pilot results and the new DataBio Platform, new solutions and new business
opportunities are expected to emerge. DataBio will organize a series of trainings and
hackathons to support its uptake and to enable developers outside the consortium to design
and develop new tools, services and applications based on and for the DataBio Platform.
The DataBio consortium is listed in Table 1. For more information about the project see [REF-01].
Table 1: The DataBio consortium partners
Number Name Short name Country
1 (CO) INTRASOFT INTERNATIONAL SA INTRASOFT Belgium
2 LESPROJEKT SLUZBY SRO LESPRO Czech Republic
3 ZAPADOCESKA UNIVERZITA V PLZNI UWB Czech Republic
4 FRAUNHOFER GESELLSCHAFT ZUR FOERDERUNG DER ANGEWANDTEN FORSCHUNG E.V. Fraunhofer Germany
5 ATOS SPAIN SA ATOS Spain
6 STIFTELSEN SINTEF SINTEF ICT Norway
7 SPACEBEL SA SPACEBEL Belgium
8 VLAAMSE INSTELLING VOOR TECHNOLOGISCH ONDERZOEK N.V. VITO Belgium
9 INSTYTUT CHEMII BIOORGANICZNEJ POLSKIEJ AKADEMII NAUK PSNC Poland
10 CIAOTECH Srl CiaoT Italy
11 EMPRESA DE TRANSFORMACION AGRARIA SA TRAGSA Spain
12 INSTITUT FUR ANGEWANDTE INFORMATIK (INFAI) EV INFAI Germany
13 NEUROPUBLIC AE PLIROFORIKIS & EPIKOINONION NP Greece
14 Ústav pro hospodářskou úpravu lesů Brandýs nad Labem UHUL FMI Czech Republic
15 INNOVATION ENGINEERING SRL InnoE Italy
16 Teknologian tutkimuskeskus VTT Oy VTT Finland
17 SINTEF FISKERI OG HAVBRUK AS SINTEF Fishery Norway
18 SUOMEN METSAKESKUS-FINLANDS SKOGSCENTRAL METSAK Finland
19 IBM ISRAEL - SCIENCE AND TECHNOLOGY LTD IBM Israel
20 MHG SYSTEMS OY - MHGS MHGS Finland
21 NB ADVIES BV NB Advies Netherlands
22 CONSIGLIO PER LA RICERCA IN AGRICOLTURA E L'ANALISI DELL'ECONOMIA AGRARIA CREA Italy
23 FUNDACION AZTI - AZTI FUNDAZIOA AZTI Spain
24 KINGS BAY AS KingsBay Norway
25 EROS AS Eros Norway
26 ERVIK & SAEVIK AS ESAS Norway
27 LIEGRUPPEN FISKERI AS LiegFi Norway
28 E-GEOS SPA e-geos Italy
29 DANMARKS TEKNISKE UNIVERSITET DTU Denmark
30 FEDERUNACOMA SRL UNIPERSONALE Federu Italy
31 CSEM CENTRE SUISSE D'ELECTRONIQUE ET DE MICROTECHNIQUE SA - RECHERCHE ET DEVELOPPEMENT CSEM Switzerland
32 UNIVERSITAET ST. GALLEN UStG Switzerland
33 NORGES SILDESALGSLAG SA Sildes Norway
34 EXUS SOFTWARE LTD EXUS United Kingdom
35 CYBERNETICA AS CYBER Estonia
36 GAIA EPICHEIREIN ANONYMI ETAIREIA PSIFIAKON YPIRESION GAIA Greece
37 SOFTEAM Softeam France
38 FUNDACION CITOLIVA, CENTRO DE INNOVACION Y TECNOLOGIA DEL OLIVAR Y DEL ACEITE CITOLIVA Spain
39 TERRASIGNA SRL TerraS Romania
40 ETHNIKO KENTRO EREVNAS KAI TECHNOLOGIKIS ANAPTYXIS CERTH Greece
41 METEOROLOGICAL AND ENVIRONMENTAL EARTH OBSERVATION SRL MEEO Italy
42 ECHEBASTAR FLEET SOCIEDAD LIMITADA ECHEBF Spain
43 NOVAMONT SPA Novam Italy
44 SENOP OY Senop Finland
45 UNIVERSIDAD DEL PAIS VASCO/ EUSKAL HERRIKO UNIBERTSITATEA EHU/UPV Spain
46 OPEN GEOSPATIAL CONSORTIUM (EUROPE) LIMITED LBG OGCE United Kingdom
47 ZETOR TRACTORS AS ZETOR Czech Republic
48 COOPERATIVA AGRICOLA CESENATE SOCIETA COOPERATIVA AGRICOLA CAC Italy
1.2 Document Scope

This document outlines DataBio's data management plan (DMP), formally documenting how data will be handled both during the implementation and after the natural termination of the project. Many DMP aspects are considered, including metadata generation, data preservation, data security and ethics, in line with the FAIR (Findable, Accessible, Interoperable, Re-usable) data principles. DataBio, the Data-Driven Bioeconomy project, is a big-data-intensive innovation action involving a public-private partnership to promote the productivity of EU companies in three major bioeconomy sectors: agriculture, forestry and fishery. Experience from the US shows that the bioeconomy can get a significant boost from Big Data. In Europe, this sector has until now attracted few large ICT vendors. A central goal of DataBio is to increase the participation of the European ICT industry in the development of Big Data systems for boosting the lagging productivity of the bioeconomy. As a case in point, European agriculture, forestry and fishery can benefit greatly from the European Copernicus space programme, which has now launched its third Sentinel satellite, as well as from telemetry, IoT, UAVs, etc.
Farm and forestry machinery and fishing vessels in use today collect large quantities of data at an unprecedented rate. Remote and proximal sensors and imagery, and many other technologies, all work together to give details about crop and soil properties, the marine environment, weeds and pests, sunlight and shade, and many other variables relevant to primary production. Deploying big data analytics on these data can help farmers, foresters and fishers adjust and improve the productivity of their business operations. On the other hand, large data sets such as those coming from the Copernicus earth monitoring infrastructure are increasingly available at different levels of granularity, but they are heterogeneous, at times unstructured, hard to analyze and distributed across various sectors and different providers. This is where a data management plan comes in. It is anticipated that DataBio will provide a solution that assumes datasets will be distributed among different infrastructures and that their accessibility could be complex, and that therefore includes mechanisms facilitating data retrieval, processing, manipulation and visualization as seamlessly as possible. The infrastructure will open new possibilities for the ICT sector, including SMEs, to develop new Bioeconomy 4.0 solutions, and will also open new possibilities for companies from the Earth Observation sector.
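Distributed EO datasets such as Copernicus Sentinel imagery are typically discovered through OpenSearch-style catalogue interfaces (the FedEO gateway used in this project is one example). The sketch below only constructs such a query URL; the endpoint and parameter names are illustrative assumptions, not the project's actual interface.

```python
from urllib.parse import urlencode

# Hypothetical OpenSearch-style catalogue endpoint (illustrative only).
ENDPOINT = "https://example.org/opensearch/request"

def build_search_url(collection, bbox, start, end, count=10):
    # bbox is (west, south, east, north) in decimal degrees.
    params = {
        "parentIdentifier": collection,          # which collection to search
        "bbox": ",".join(str(v) for v in bbox),  # spatial filter
        "startDate": start,                      # temporal filter
        "endDate": end,
        "maximumRecords": count,
        "httpAccept": "application/atom+xml",    # requested response format
    }
    return ENDPOINT + "?" + urlencode(params)

url = build_search_url("SENTINEL-2", (5.9, 47.3, 10.5, 54.9),
                       "2017-01-01", "2017-06-30")
print(url)
```

A client would send this URL with an HTTP GET and parse the Atom feed of matching products; the same pattern applies regardless of which gateway actually serves the catalogue.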
This DMP will be updated over the course of the DataBio project whenever significant changes
arise. The updates of this document will provide increasing depth on DataBio's DMP
strategies, with particular attention to the findability, accessibility, interoperability
and reusability of the Big Data the project produces. At least two updates will be prepared,
in Month 18 and Month 36 of the project.
1.3 Document Structure
This document comprises the following chapters:
Chapter 1 presents an introduction to the project and the document.
Chapter 2 presents the data summary including the purpose of data collection, data size, type
and format, historical data reuse and data beneficiaries.
Chapter 3 outlines DataBio’s FAIR data strategies.
Chapter 4 describes data management support.
Chapter 5 describes data security.
Chapter 6 describes ethical issues.
Chapter 7 presents the concluding remarks.
Appendix A presents the managed data sets.
2 Data Summary
2.1 Purpose of data collection
During the lifecycle of the DataBio project, big data will be collected, that is, very large data
sets (multi-terabyte or larger) consisting of a wide range of data types (relational, text, multi-
structured data, etc.) from numerous sources, including relatively new-generation big data
(machines, sensors, genomics, etc.). The ultimate purpose of data collection is to use the data
as a source of information in the implementation of a variety of big data analytics algorithms,
services and applications that DataBio will deploy to create value, new business facts and
insights, with a particular focus on the bioeconomy industry. The big datasets are part of the
building blocks of DataBio's big data technology platform (Figure 1), which was designed to
help European companies increase productivity. Big data experts provide common analytic
technology support for the main common and typical bioeconomy applications/analytics that
are now emerging through the pilots in the project. Data from the past will be managed and
analyzed with descriptive analytics and classical query/reporting (in need of variety
management: handling and analysis of all of the data from the past, including performance,
transactional, attitudinal, descriptive, behavioural, location-related and interactional data
from many different sources). Big data from the present will be harnessed in monitoring and
real-time analytics pilot services (in need of velocity processing: handling of real-time data
from the present), triggering alarms, actuators, etc.
Harnessing big data for the future includes forecasting, prediction and recommendation
analytics pilot services (in need of volume processing: processing of large amounts of data,
combining knowledge from the past and present, and from models, to provide insight for the
future).
Figure 1: DataBio’s analytics and big data value approach
Specifically:
• Forestry: Big Data methods are expected to make it possible both to increase the value of forests and to decrease costs, within the sustainability limits set by natural growth and ecological aspects. The key technology is to gather more, and more accurate, information about the trees from a host of sensors, including a new generation of satellites, UAV images, laser scanning, mobile devices through crowdsourcing, and machines operating in the forests.
• Agriculture: Big Data in agriculture is currently a hot topic. The DataBio intention is to build a European vision of Big Data for agriculture: a vision that offers solutions which will increase the role of Big Data in agri-food chains in Europe, and a perspective that will prepare recommendations for future big data development in Europe.
• Fisheries: the ambition is to herald and promote the use of Big Data analytical tools within fisheries applications by initiating several pilots which will demonstrate the benefits of using Big Data analytically for fisheries, such as improved analysis of operational data, tools for planning and operational choices, and crowdsourcing methods for fish stock estimation.
• The use of Big Data analytics will bring about innovation. It will generate significant economic value, extend the relevant market sectors, and herald novel business/organizational models. The cross-cutting character of geospatial Big Data solutions allows a straightforward extension of the scope of applications beyond the bioeconomy sectors. Such extensions of the market for Big Data technologies are foreseen in economic sectors such as urban planning, water quality, public safety (incl. technological and natural hazards), protection of critical infrastructures, and waste management. On the other hand, Big Data technologies revolutionize the business approach in the geospatial market and foster the emergence of innovative business/organizational models; indeed, to achieve cost-effective services to customers, it is necessary to organize the offer to the market on a territorial/local basis, as users share the same geospatial sources of data and are best served by local players (service providers). This can be illustrated by a network of European service providers, developing proximity relationships with their customers and sharing their knowledge through the network.
2.2 Data types and formats
The DataBio-specific data types, formats and sources are listed in detail in Appendix A; the
key features of the data used in the project are described below.
2.2.1 Structured data
Structured data refers to any data that resides in a fixed field within a record or file. This
includes data contained in relational databases, spreadsheets, and data in forms of events
such as sensor data. Structured data first depends on creating a data model – a model of the
types of business data that will be recorded and how they will be stored, processed and
accessed. This includes defining what fields of data will be stored and how that data will be
stored: data type (numeric, currency, alphabetic, name, date, address) and any restrictions
on the data input (number of characters; restricted to certain terms such as Mr., Ms. or Dr.;
M or F).
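As a sketch of what such a data model might look like in practice, the snippet below defines a table with typed fields and input restrictions, then stores and retrieves a record. The table and field names are illustrative assumptions, not taken from DataBio:

```python
import sqlite3

# A minimal structured-data model: typed fields plus input restrictions,
# as described above (the schema is illustrative, not a DataBio schema).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sensor_reading (
        id          INTEGER PRIMARY KEY,
        station     TEXT NOT NULL,
        recorded_at TEXT NOT NULL,   -- ISO 8601 date/time string
        parameter   TEXT NOT NULL
                    CHECK (parameter IN ('temperature', 'humidity', 'rainfall')),
        value       REAL NOT NULL
    )
""")
conn.execute(
    "INSERT INTO sensor_reading (station, recorded_at, parameter, value) "
    "VALUES (?, ?, ?, ?)",
    ("station-01", "2017-06-30T10:00:00", "temperature", 21.4),
)
row = conn.execute("SELECT station, parameter, value FROM sensor_reading").fetchone()
print(row)  # ('station-01', 'temperature', 21.4)
```

The CHECK constraint plays the role of the input restrictions mentioned above: an INSERT with an unlisted parameter name would be rejected by the database.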
2.2.2 Semi-structured data
Semi-structured data is a cross between structured and unstructured data. It is a type of
structured data, but lacks the strict data model structure. With semi-structured data, tags or
other types of markers are used to identify certain elements within the data, but the data
doesn't have a rigid structure. For example, word processing software now can include
metadata showing the author's name and the date created, with the bulk of the document
just being unstructured text. Emails have the sender, recipient, date, time and other fixed
fields added to the unstructured data of the email message content and any attachments.
Photos or other graphics can be tagged with keywords such as the creator, date, location and
keywords, making it possible to organize and locate graphics. XML and other markup
languages are often used to manage semi-structured data. Semi-structured data is therefore
a form of structured data that does not conform with the formal structure of data models
associated with relational databases or other forms of data tables, but nonetheless contains
tags or other markers to separate semantic elements and enforce hierarchies of records and
fields within the data. Therefore, it is also known as a self-describing structure. In semi-
structured data, the entities belonging to the same class may have different attributes even
though they are grouped together, and the attributes' order is not important. Semi-structured
data have occurred increasingly since the advent of the Internet, where full-text documents
and databases are no longer the only forms of data and different applications need a
medium for exchanging information. In object-oriented databases, one often finds semi-
structured data.
XML and other markup languages, email, and EDI are all forms of semi-structured data. OEM
(Object Exchange Model) was created prior to XML as a means of self-describing a data
structure. XML has been popularized by web services that are developed utilizing SOAP
principles. Some types of data described here as "semi-structured", especially XML, suffer
from the impression that they are incapable of structural rigor at the same functional level as
Relational Tables and Rows. Indeed, the view of XML as inherently semi-structured
(previously, it was referred to as "unstructured") has handicapped its use for a widening range
of data-centric applications. Even documents, normally thought of as the epitome of semi-
structure, can be designed with virtually the same rigor as database schema, enforced by the
XML schema and processed by both commercial and custom software programs without
reducing their usability by human readers.
In view of this fact, XML might be referred to as having "flexible structure" capable of human-
centric flow and hierarchy as well as highly rigorous element structure and data typing. The
concept of XML as "human-readable", however, can only be taken so far. Some
implementations/dialects of XML, such as the XML representation of the contents of a
Microsoft Word document, as implemented in Office 2007 and later versions, utilize dozens
or even hundreds of different kinds of tags that reflect a particular problem domain - in
Word's case, formatting at the character and paragraph and document level, definitions of
styles, inclusion of citations, etc. - which are nested within each other in complex ways.
Understanding even a portion of such an XML document by reading it, let alone catching
errors in its structure, is impossible without a very deep prior understanding of the specific
XML implementation, along with assistance by software that understands the XML schema
that has been employed. Such text is not "human-understandable" any more than a book
written in Swahili (which uses the Latin alphabet) would be to an American or Western
European who does not know a word of that language: the tags are symbols that are
meaningless to a person unfamiliar with the domain.
JSON, or JavaScript Object Notation, is an open standard format that uses human-readable
text to transmit data objects consisting of attribute–value pairs. It is used primarily to transmit
data between a server and web application, as an alternative to XML. JSON has been
popularized by web services developed utilizing REST principles. There is a new breed of
databases such as MongoDB and Couchbase that store data natively in JSON format,
leveraging the pros of semi-structured data architecture.
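For illustration, the snippet below shows two JSON records of the same class carrying different attributes, the defining property of semi-structured data discussed above. The field names are invented for the example:

```python
import json

# Two records of the same class ("observation") with different attributes:
# semi-structured data does not force every record into one fixed schema.
raw = """
[
  {"id": 1, "type": "observation", "temperature": 21.4},
  {"id": 2, "type": "observation", "humidity": 55, "sensor": "field-A"}
]
"""
records = json.loads(raw)

# Each record is self-describing: its keys act as tags for its elements.
for rec in records:
    extra = sorted(set(rec) - {"id", "type"})
    print(rec["id"], extra)
```

Running this prints `1 ['temperature']` and then `2 ['humidity', 'sensor']`: both records parse without any shared schema being declared in advance.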
2.2.3 Unstructured data
Unstructured data (or unstructured information) refers to information that either does not
have a pre-defined data model or is not organized in a pre-defined manner. This results in
irregularities and ambiguities that make it difficult to understand using traditional programs
as compared to data stored in “field” form in databases or annotated (semantically tagged)
in documents. Unstructured data cannot be so readily classified and fitted into a neat box:
photos and graphic images, videos, streaming instrument data, webpages, PDF files,
PowerPoint presentations, emails, blog entries, wikis and word-processing documents.
In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80-90% of all potentially
usable business information may originate in unstructured form. This rule of thumb is not
based on primary or any quantitative research, but is nonetheless accepted by some. IDC and
EMC project that data will grow to 40 zettabytes by 2020, a 50-fold growth from the
beginning of 2010. Computerworld states that unstructured information might account for
more than 70-80% of all data in organizations.
Software that creates machine-processable structure can utilize the linguistic, auditory, and
visual structure that exists in all forms of human communication. Algorithms can infer this
inherent structure from text, for instance, by examining word morphology, sentence syntax,
and other small- and large-scale patterns. Unstructured information can then be enriched and
tagged to address ambiguities, and relevancy-based techniques then used to facilitate search
and discovery. Examples of "unstructured data" may include books, journals, documents,
metadata, health records, audio, video, analog data, images, files, and unstructured text such
as the body of an e-mail message, Web page, or word-processor document. While the main
content being conveyed does not have a defined structure, it generally comes packaged in
objects (e.g. in files or documents, …) that themselves have structure and are thus a mix of
structured and unstructured data, but collectively this is still referred to as "unstructured
data".
2.2.4 New generation big data
The new generation of big data focuses in particular on semi-structured and unstructured
data, often in combination with structured data.
In the BDVA reference model for big data technologies, a distinction is made between six
different big data types.
2.2.4.1 Sensor data
Within the DataBio pilots, several key parameters will be monitored through sensorial
platforms, and sensor data will be collected along the way to support the project activities.
Three types of sensor data have already been identified, namely: a) IoT data from in-situ
sensors and telemetric stations, b) imagery data from unmanned aerial sensing platforms
(drones), and c) imagery from hand-held or mounted optical sensors.
2.2.4.1.1 Internet of Things data
The IoT data are a major subgroup of sensor data, involved in multiple pilot activities in the
DataBio project. IoT data are sent via TCP/UDP protocols in various formats (e.g. text files
with time-series data, JSON strings) and can be further divided into the following categories:
• Agro-climatic/field telemetry stations, which contribute raw data (numerical values) related to several parameters. As different pilots focus on different application scenarios, the following table summarizes several of the IoT-based monitoring approaches to be followed.
Table 2: Sensor data tools, resolution and spatial density
Pilots A1.1, B1.2, C1.1, C2.2
Mission, instrument: NP's GAIAtrons, telemetry IoT stations with a modular/expandable design, will be used to monitor ambient temperature, humidity, solar radiation, leaf wetness, rainfall volume, wind speed and direction and barometric pressure (GAIAtron atmo), as well as soil temperature and humidity at multiple depths (GAIAtron soil).
Data resolution and spatial density: time step for data collection every 10 minutes; one station per microclimate zone (300-1100 ha for atmo, 300-3300 ha for soil).

Pilots A1.2, B1.3
Mission, instrument: Field-bound sensors will be used to monitor air temperature, air moisture, solar radiation, leaf wetness, rainfall, wind speed and direction, soil moisture, soil temperature, soil EC/salinity, PAR and barometric pressure. These sensors build on a technology platform of retriever-and-pups wireless sensor networks and SpecConnect, a cloud-based crop data management solution.
Data resolution and spatial density: time step for data collection customizable from 1 to 60 minutes; field sensors will be used to monitor 5 tandemly located sites at the following densities: a) air temperature, air moisture, rainfall, wind data and solar radiation: one block of sensors per 5 ha; b) leaf wetness: two sensors per ha; c) soil moisture, soil temperature and soil EC/salinity: one combined sensor per ha.

Pilot A2.1
Mission, instrument: Environmental indoor: air temperature, air relative humidity, solar radiation, crop leaf temperature (remotely and in contact), soil/substrate water content. Environmental outdoor: wind speed and direction, evaporation, rain, UVA, UVB.
Data resolution and spatial density: to be determined.

Pilot B1.1
Mission, instrument: Agro-climatic IoT stations monitoring temperature, relative and absolute humidity and wind parameters.
Data resolution and spatial density: to be determined.
• Control data in the parcels/fields, measuring sprinklers, drippers, metering devices, valves, alarm settings, heating, pumping state, pressure switches, etc.
• Contact sensing data that pinpoint problems with great precision, speeding up the application of techniques which help to solve them.
• Vessel- and buoy-based stations, which contribute raw data (numerical values), typically hydro-acoustic and machinery data.
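As noted above, IoT telemetry can arrive as JSON strings over TCP/UDP and be stored as time-series records. A minimal decoding sketch follows; the message layout and field names are assumptions for illustration, not a DataBio wire format:

```python
import json
from datetime import datetime

# Hypothetical telemetry message as it might arrive over TCP/UDP;
# station name and field names are invented for the example.
message = b'{"station": "atmo-07", "ts": "2017-06-30T10:10:00", "temp_c": 24.1, "rh_pct": 48.0}'

reading = json.loads(message)
ts = datetime.strptime(reading["ts"], "%Y-%m-%dT%H:%M:%S")

# Flatten into (timestamp, station, parameter, value) time-series tuples,
# the shape a time-series database table would typically expect.
rows = [
    (ts, reading["station"], "temperature", reading["temp_c"]),
    (ts, reading["station"], "relative_humidity", reading["rh_pct"]),
]
for row in rows:
    print(row)
```

Each incoming message thus yields one row per measured parameter, which keeps the storage schema stable even when different station types report different parameter sets.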
2.2.4.1.2 Drone data
A specific subset of the sensor data generated and processed within the DataBio project is
the images produced by cameras on board drones, or RPAS (Remotely Piloted Aircraft
Systems). In particular, some DataBio pilots will use optical (RGB), thermal or multispectral images and 3D
point-clouds acquired from RPAS. The information generated by drone-airborne cameras is
usually Image Data (JPEG or JPEG2000). A general description of the workflow is provided
below.
Data acquired by the RGB sensor
The RGB sensor acquires individual pictures in .JPG format, together with their ‘geotag’ files,
which are downloaded from the RPAS and processed into:
• .LAS files: 3D point clouds (x, y, z), which are then processed to produce Digital Models (Terrain/DTM, Surface/DSM, Elevation/DEM, Vegetation/DVM)
• .TIF files: which are then processed into an orthorectified mosaic. In order to obtain smaller files, mosaics are usually exported to compressed .ECW format.
Data acquired by the thermal sensor
The Thermal sensor acquires a video file which is downloaded from the RPAS and:
• split into frames in .TIF format (pixels contain Digital Numbers: 0-255)
• 1 of every 10 frames is selected (with an overlap of about 80%, so as not to process an excessive amount of information)
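The frame-selection step above (keeping 1 of every 10 frames) amounts to a simple subsampling pass. The sketch below uses dummy frame identifiers in place of real .TIF frames; an actual pipeline would read the video with a dedicated imaging library:

```python
def select_frames(frames, step=10):
    """Keep one of every `step` frames, as in the thermal workflow above."""
    return frames[::step]

# Dummy frame identifiers standing in for .TIF frames split from the video:
all_frames = [f"frame_{i:04d}.tif" for i in range(100)]
kept = select_frames(all_frames)
print(len(kept))   # 10
print(kept[:2])    # ['frame_0000.tif', 'frame_0010.tif']
```

With successive frames overlapping heavily, keeping every tenth frame preserves roughly the 80% overlap needed for mosaicking while cutting the processing load tenfold.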
Data acquired by the multispectral sensor
The multispectral sensor acquires individual pictures from the 6 spectral channels in .RAW
format, which are downloaded from the RPAS and processed into:
• .TIF files (16 bits), which are then processed to produce a 6-bands .TIF mosaic (pixels contain Digital Numbers: 0-255)
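The reduction from 16-bit samples to 8-bit Digital Numbers (0-255) mentioned above is, at its core, a rescaling step. A minimal sketch without image libraries, assuming a simple linear mapping over the full 16-bit range:

```python
def to_digital_numbers(samples, max_value=65535):
    """Linearly rescale 16-bit samples (0..max_value) to 8-bit
    Digital Numbers in the range 0-255."""
    return [round(s * 255 / max_value) for s in samples]

print(to_digital_numbers([0, 32768, 65535]))  # [0, 128, 255]
```

Real processing chains may instead stretch between per-band minimum and maximum values to preserve contrast; the linear mapping here is only the simplest choice.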
2.2.4.1.3 Data from hand-held or mounted optical sensors
Images from hand-held or mounted cameras will be collected using a truck-mounted or
hand-held full-range / high-resolution UV-VIS-NIR-SWIR spectroradiometer.
2.2.4.2 Machine-generated data
Machine-generated data in the DataBio project are data produced by ships, boats and
machinery used in agriculture and in forestry (such as tractors). These data will serve for
further analysis and optimisation of processes in the bio-economy sector.
For illustration purposes, examples of the data collected by tractors in agriculture are
described below. Tractors are equipped with the following units:
• Control units for data control, data collection and analyses including dashboards, transmission control unit, hydrostatic or hydrodynamic system control unit, engine control unit.
• Global Positioning System (GPS) units or Global System for Mobile Communications (GSM) units for tractor tracking.
• Unit for displaying field/soil characteristics, including area, quality, boundaries and yields.
These units generate the following data:
• Identification of tractor + identification of driver by code or by RFID module.
• Identification of the current operation status.
• Time identification by the date and the current time.
• Precise tractor location tracking (daily route, starts, stops, speed).
• Tractor hours - monitoring working hours in time and place.
• Information from tachometer [Σ km] and [Σ working hrs and min].
• Identification of the current maintenance status.
• Tractor diagnostics: failure modes or failure codes.
• Information about the date of the last calibration of each tractor systems + information about setting, information about SW version, last update, etc.
• The amount of fuel in the fuel tank [L].
• Online information about sudden loss of fuel in the fuel tank.
• Fuel consumption per trip / per time period / per kilometer (monitoring of fuel consumption in various dependencies e.g. motor load).
• Total fuel consumption per day [L/day].
• Engine speed [rev/min].
• Possibility to set the engine speed range online [rev/min, from-to], with signalling when the limits are exceeded.
• Current position of the accelerator pedal [% of scale 0-100 %].
• Charging level of the main battery [V].
• Current temperature of the cooling water [°C or °F].
• Current temperature of the motor oil [°C or °F].
• Current temperature of the after-treatment system [°C or °F].
• Current temperature of the transmission oil [°C or °F].
• Gear shift diagnosis [grades backward and forward].
• Current engine load [% of scale 0-100 %]
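A subset of the fields listed above can be sketched as a single telemetry record. The field names and units are illustrative; the actual on-board message formats are vendor-specific:

```python
from dataclasses import dataclass

@dataclass
class TractorTelemetry:
    """Illustrative subset of the machine-generated fields listed above."""
    tractor_id: str
    driver_id: str          # from code entry or RFID module
    timestamp: str          # date and current time
    lat: float              # tractor location (latitude)
    lon: float              # tractor location (longitude)
    fuel_level_l: float     # amount of fuel in the tank [L]
    engine_speed_rpm: int   # engine speed [rev/min]
    engine_load_pct: float  # current engine load [0-100 %]

rec = TractorTelemetry("TR-042", "driver-7", "2017-06-30T06:15:00",
                       49.20, 16.61, 180.5, 1900, 62.0)
print(rec.tractor_id, rec.engine_speed_rpm)
```

Collecting such records over time gives the route, fuel-consumption and engine-load histories that the analysis and optimisation services described above would consume.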
2.2.4.3 Geospatial data
The DataBio pilots will collect earth observation (EO) data from a number of sources which
will be refined during the project. Currently, it is confirmed that the following EO data will be
collected and used as input data:
Table 3: Geospatial data tools, format and origin
Mission, instrument: Sentinel-1, C-SAR
Format: SLC, GRD
Origin: Copernicus Open Access Hub (https://scihub.copernicus.eu/)

Mission, instrument: Sentinel-2, MSI
Format: L1C
Origin: Copernicus Open Access Hub (https://scihub.copernicus.eu/)
Information about the expected sizes will be added when it becomes available.
In addition to EO data, DataBio will utilise other geospatial data from EU, national, local,
private and open repositories including Land Parcel Identification System data, cadastral data,
Open Land Use map (http://sdi4apps.eu/open_land_use/), Urban Atlas and Corine Land
Cover, Proba-V data (www.vito-eodata.be).
The meteo-data will be collected mainly from EO-based systems and from European data
sources such as Copernicus products and EUMETSAT H-SAF products; other EO data sources
such as VIIRS, MODIS and ASTER will also be considered. As complementary data sources, the
output of weather forecast models (ECMWF) and of regional weather services, usually based
on ground weather stations, can be considered according to the specific target areas of the
pilots.
2.2.4.4 Genomics data
Within DataBio Pilot 1.1.2, different data will be collected and produced. Three categories
of data have already been identified for the pilot, namely: a) in-situ sensor data (including
image capture) and farm data, b) genomic data from plant-breeding efforts in greenhouses,
produced using Next Generation Sequencers (NGS), and c) biochemical data of tomato fruits
produced by chromatographs (LC/MS/MS, GC/MS, HPLC).
In-situ sensors/Environmental outdoor: Wind speed and direction, Evaporation, Rain, Light
intensity, UVA, UVB.
In-situ sensors/Environmental indoor: Air temperature, Air relative humidity, Crop leaf
temperature (remotely and in contact), Soil/substrate water content, crop type, etc.
Farm Data:
• In-Situ measurements: Soil nutritional status.
• Farm logs (work calendar, technical practices at farm level, irrigation information).
• Farm profile (static farm information, such as size).
Table 4: Genomic, biochemical and metabolomic data tools, description and acquisition
Pilot A1.1.2

Genomic data
Mission, instrument: To characterize the genetic diversity of the local tomato varieties used for breeding; to use the genetic-genomic information to guide the breeding efforts (as a selection tool for higher performance); and to develop a model that predicts the final breeding result, in order to achieve higher-performance varieties rapidly and with less financial burden. Data will be produced using two Illumina NGS machines.
Data description and acquisition: Data produced by the Illumina machines are stored in compressed text files (FASTQ). Data will be produced from plant biological samples (leaf and fruit), collected at two different plant stages (plantlets and mature plants). Genomic data will be produced using standard and customized protocols at CERTH. Genomic data, although plain text in format, are big-volume data and pose challenges in their storage, handling and processing. Preliminary analysis will be performed using the local HPC computational facility.

Biochemical, metabolomic data
Mission, instrument: To characterize the biochemical profile of fruits from the tomato varieties used for breeding. Data will be produced from different chromatographs and mass spectrometers.
Data description and acquisition: Data will mainly be proprietary binary-based archives converted to XML or other open formats. Data will be acquired from biological samples of tomato fruits.
While genomic data are stored in raw format as files, environmental data, which are
generated using a network of sensors, will be stored in a database along with the time
information and will be processed as time-series data.
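FASTQ files of the kind mentioned above group each read into four lines (identifier, sequence, separator, per-base qualities). A minimal reader for such compressed text files, using only the standard library, can be sketched as:

```python
import gzip
import io

def read_fastq(handle):
    """Yield (identifier, sequence, quality) triples from a FASTQ text stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()                 # '+' separator line (ignored)
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual       # drop the leading '@'

# Tiny in-memory example standing in for a gzip-compressed FASTQ file:
data = gzip.compress(b"@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIHH\n")
with gzip.open(io.BytesIO(data), "rt") as fh:
    reads = list(read_fastq(fh))
print(reads[0])  # ('read1', 'ACGT', 'IIII')
```

Because the format is plain text, files of this kind compress well but still reach large volumes, which is exactly the storage and processing challenge noted in the table above.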
2.3 Historical data
In the context of machine learning and predictive and prescriptive analytics, it is important
to be able to use historical data for training and validation purposes. Machine learning
algorithms will use existing historical data as training data for both supervised and
unsupervised learning. Information about the datasets and the time periods covered by the
historical datasets to be used in DataBio can be found in Appendix A. Historical data can also
serve to train complex event processing applications. In this case, historical data is injected
as if it were "happening in real time", serving to test the complex event-driven application
at hand before running it in a real environment.
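Injecting historical data as if it were happening in real time can be sketched as a replay loop that restores the original inter-event timing (optionally accelerated) before handing each event to the event-processing application. The event shape and speed-up factor below are illustrative assumptions:

```python
import time

# Historical events: (seconds offset from the first event, payload).
history = [(0.0, "temp=21.4"), (0.5, "temp=21.9"), (1.0, "temp=23.0")]

def replay(events, handler, speedup=10.0):
    """Re-deliver stored events with their original relative timing,
    optionally accelerated, so the CEP application sees a live stream."""
    start = time.monotonic()
    for offset, payload in events:
        delay = offset / speedup - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        handler(payload)

seen = []
replay(history, seen.append)
print(seen)  # ['temp=21.4', 'temp=21.9', 'temp=23.0']
```

In a real test, `handler` would be the complex event-driven application's input callback rather than a list append, and `speedup=1.0` would reproduce the original timing exactly.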
2.4 Expected data size and velocity
The big data "V" characteristics of Volume and Velocity are described for each of the
identified data sets in the DataBio project, typically as measurements of total historical
volumes and new/additional data per time unit. The DataBio-specific data volumes and
velocities (or ingestion rates) can be found in Appendix A.
2.5 Data beneficiaries
This section analyses the key data beneficiaries who will benefit from the use of big data in
several fields, such as analytics, data sets, business value, sales or marketing, considering
both tangible and intangible concepts.
In examining the value of big data, it is necessary to evaluate who is affected by the data and
their usage. In some cases, the individual whose data is processed directly receives a benefit.
Nevertheless, regarding the data-driven bioeconomy, the benefit to the individual can be
considered indirect. In other cases, the relevant individual receives no directly attributable
benefit, with big data value reaped by business, government, or society at large.
Concerning the general community, the collection and use of an individual's data benefit not
only that individual, but also members of a proximate class, such as users of a similar product
or residents of a geographical area. In the case of organizations, big data analysis often
benefits those organizations that collect and harness the data. Data-driven profits may be
viewed as enhancing allocative efficiency by facilitating the free economy. The emergence,
expansion, and widespread use of innovative products and services at decreasing marginal
costs have revolutionized global economies and societal structures, facilitating access to
technology and knowledge and fomenting social change. With more data, businesses can
optimize distribution methods, efficiently allocate credit, and robustly combat fraud,
benefitting consumers as a whole.
On the other hand, big data analysis can provide a direct benefit to those individuals whose
information is being used. However, the DataBio project is not directly involved in those
specific cases (see Chapter 6 on ethical issues).
Regarding general benefits, big data is creating enormous value for the global economy,
driving innovation, productivity, efficiency, and growth. Data has become the driving force
behind almost every interaction between individuals, businesses, and governments. The uses
of big data can be transformative and are sometimes difficult to anticipate at the time of initial
collection.
This section does not provide a comprehensive taxonomy of big data benefits; it would be
pretentious to do so, ranking the relative importance of weighty social goals. Rather, it posits
that such benefits must be accounted for by rigorous analysis considering the priorities of a
nation, society, or economy. Only then can benefits be assessed within an economic
framework.
Besides these general concepts of big data beneficiaries, it is possible to analyse the impact
of DataBio project results with regard to the final users of the different technologies, tools
and services to be developed. Using this approach, and taking into account that more
detailed information on the definition of the agricultural, forestry and fishery pilots is
available in Deliverables D1.1, D2.1 and D3.1, the main beneficiaries of big data are described
in the following sections.
2.5.1 Agricultural Sector
One of the proposed agricultural pilots concerns the use of tractor units able to send
information about current operations online to the driver or farmer. The prototypes will be
equipped with units for tracking and tracing (GPS - Global Positioning System, or GSM -
Global System for Mobile Communications) and a unit for displaying soil characteristics.
The proposed solution will meet farmers' requests for cost reduction and improved
productivity, in order to increase their economic benefits while also following sustainable
agriculture practices.
In another case, smart farming services such as irrigation, provided through flexible
mechanisms and UIs (web, mobile and tablet compatible), will promote the adoption of
technological tools (IoT, data analytics) and collaboration with certified professionals to
optimize farm productivity. Therefore, farming cooperatives will again obtain cost reduction
and improved productivity by migrating from standard to sustainable smart-agriculture
practices. In summary, the main beneficiaries of DataBio will be farming cooperatives,
farmers and land owners.
2.5.2 Forestry Sector
Data sharing and a collaborative environment enable improved tools for sustainable forest
management decisions and operations. Forest management services make data accessible for
forest owners, and other end users, and integrate this data for e-contracting, online purchase
and sales of timber and biomass. Higher data volumes and better data accessibility increase
the probability that the data will be updated and maintained.
DataBio WP2 will develop and pilot standardized procedures, based on the DataBio WP4
platform, for collecting and transferring big data from silvicultural activities executed in the
forest. In summary, the big data beneficiaries related to the WP2 Forestry pilot activities will be:
• Forest owners (private, public, timberland investors)
• Forest authority experts
• Forest companies
• Contractors and service providers
2.5.3 Fishery Sector
Regarding WP3 – Fisheries Pilots, in Pilot A2 (Small pelagic fisheries immediate operational
choices) the main users and beneficiaries will be the ship owners and masters on board small
pelagic vessels. Modern pelagic vessels are equipped with increasingly complex machinery
systems for propulsion, manoeuvring and power generation. As a result, the vessel is always
in an operational state, but the configuration of the vessel systems imposes constraints on
operation. The captain is tasked with safe operation of the vessel, while the efficiency of the
vessel systems may be increased if the captain is informed about the actual operational state,
the potential for improvement and the expected results of available actions.
The goal of Pilot B2 (Oceanic tuna fisheries planning) is to create tools that aid trip planning
by presenting historical catch data as well as attempting to forecast where the fish might be
in the near future. The forecast model will be constructed from historical catch data combined
with the data available to the skippers at that moment (oceanographic data, buoy data, etc.).
In that case, the main beneficiaries of DataBio development will be tuna fishing companies.
In summary, DataBio WP3 beneficiaries will be the broad range of fisheries stakeholders, from
companies to captains and vessel owners.
2.5.4 Technical Staff
Adoption rates aside, the potential benefits of utilising big data and related technologies are
significant both in scale and scope and include, for example: better/more targeted marketing
activities, improved business decision making, cost reduction and generation of operational
efficiencies, enhanced planning and strategic decision making and increased business agility,
fraud detection, waste reduction and customer retention to name but a few. Obviously, the
ability of firms to realize business benefits will be dependent on company characteristics such
as size, data dependency and nature of business activity.
A core concern voiced by many of those participating in big data focused studies is the ability
of employers to find and attract the talent needed for both a) the successful implementation
of big data solutions and b) the subsequent realisation of associated business benefits.
Although ‘Data Scientist’ may currently be the most requested profile in big data, the
recruitment of Data Scientists (in volume terms at least) appears relatively low down the wish
list of recruiters. Instead, the openings most commonly arising in the big data field (as is the
case for IT recruitment) are development positions.
2.5.5 ICT sector
2.5.5.1 Developers
The generic title of developer is normally employed together with a detailed description of
the specific technical skills required for the post, and it is this description that defines
the specific type of development activity undertaken. The technical skills most often cited by
recruiters in adverts for big data developers are: NoSQL (MongoDB in particular), Java, SQL,
JavaScript, MySQL, Linux, Oracle, Hadoop, Cassandra, HTML and Spring.
2.5.5.2 Architects
More specifically, however, applicants for these positions are required to hold skills in a range
of technical disciplines including Oracle (in particular BI EE), Java, SQL, Hadoop and SQL
Server, whilst the main generic areas of technical knowledge and competence required are:
Data Modelling, ETL, Enterprise Architecture, Open Source and Analytics.
2.5.5.3 Analysts
Particular process/methodological skills required from applicants for analyst positions were
primarily in respect of: Data Modelling, ETL, Analytics and Data.
2.5.5.4 Administrators
In general, the technical skills most often requested by employers from big data
Administrators at that time were: Linux, MySQL and Puppet, Hadoop and Oracle, whilst the
process and methodological competences most often requested were in the areas of
Configuration Management, Disaster Recovery, Clustering and ETL.
2.5.5.5 Project Managers
The specific types of Project Manager most often required by big data recruiters are Oracle
Project Managers, Technical Project Managers and Business Intelligence Project Managers.
Aside from Oracle (and in particular BI EE, EBS and EBS R12), which was specified in over two-
thirds of all adverts for big data related Project Management posts, other technical skills often
needed by applicants for this type of position were: Netezza, Business Objects and Hyperion.
Process and methodological skills commonly required included ETL and Agile Software
Development together with a range of more ‘business focused’ skills, e.g. PRINCE2 and
Stakeholder Management.
2.5.5.6 Data Designers
The technical skills most commonly requested for these posts are found to have been
Oracle (particularly BI EE) and SQL, followed by Netezza, SQL Server, MySQL and UNIX.
Common process and methodological skills needed were: ETL, Data Modelling, Analytics, CSS,
Unit Testing, Data Integration and Data Mining, whilst more general knowledge requirements
related to the need for experience and understanding of Business Intelligence, Data
Warehousing, Big Data, Migration and Middleware.
2.5.5.7 Data Scientists
The core technical skills needed to secure a position as a Data Scientist are found to be:
Hadoop, Java, NoSQL and C++. As was the case for other big data positions, adverts for Data
Scientists often made reference to a need for various process and methodological skills and
competences. Interestingly, however, in this case such references were found to be much
more commonplace and (perhaps as would be expected) most often focused upon data
and/or statistical themes, e.g. Statistics, Analytics and Mathematics.
2.5.6 Research and education
Researchers, scientists and academics are one of the largest groups for data reuse. DataBio
data published as open data will be used for further research and for educational purposes
(e.g. theses).
2.5.7 Policy making bodies
The DataBio data and results will serve as a basis for policy-making bodies, especially for
policy evaluation and feedback on policy implementation. This mainly includes the European
Commission and national and regional public authorities.
3 FAIR Data
The FAIR principle ensures that data can be discovered through catalogues or search engines,
is accessible through open interfaces, complies with standards for interoperable processing,
and can therefore be easily reused.
3.1 Data findability
3.1.1 Data discoverability and metadata provision
Metadata is, as its name implies, data about data. It describes the properties of a dataset.
Metadata can cover various types of information. Descriptive metadata includes elements
such as the title, abstract, author and keywords, and is mostly used to discover and identify a
dataset. Another type is administrative metadata with elements such as the license,
intellectual property rights, when and how the dataset was created, who has access to it, etc.
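To make the distinction concrete, the sketch below splits a dataset record into the two metadata types described above. The element names are illustrative, loosely following Dublin Core/DCAT vocabularies, and do not represent a normative DataBio schema.

```python
# Illustrative dataset metadata record, split into descriptive metadata
# (used for discovery) and administrative metadata (rights, access, history).
descriptive = {
    "title": "Sentinel-2 NDVI time series, pilot area X",
    "abstract": "Weekly NDVI composites derived from Sentinel-2 imagery.",
    "author": "DataBio partner (example)",
    "keywords": ["agriculture", "NDVI", "earth observation"],
}

administrative = {
    "license": "CC-BY-4.0",
    "created": "2017-06-30",
    "access": "public",
}

def missing_descriptive_elements(record, required=("title", "abstract", "author", "keywords")):
    """Return the discovery-oriented elements a record still lacks."""
    return [name for name in required if not record.get(name)]

print(missing_descriptive_elements(descriptive))  # → []
```

A catalogue ingest step could run such a check before accepting a record, so that incomplete descriptive metadata never reaches the discovery index.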
The datasets on the DataBio infrastructure are either added locally by a user, harvested from
existing data portals, or fetched from operational systems or IoT ecosystems. In DataBio, the
definition of a set of metadata elements is necessary to allow, for the vast amount of
information resources managed, identification of the resource for which metadata is created,
its classification, the identification of its geographic location and temporal reference, its
quality and validity, its conformity with implementing rules on the interoperability of spatial
data sets and services, constraints related to access and use, and the organization responsible
for the resource.
In addition, metadata elements related to the metadata record itself are also necessary to
monitor that the metadata created are kept up to date, and for identifying the organization
responsible for the creation and maintenance of the metadata. Such minimum set of
metadata elements is also necessary to comply with Directive 2007/2/EC and does not
preclude the possibility for organizations to document the information resources more
extensively with additional elements derived from international standards or working
practices in their community of interest.
Metadata referred to datasets and dataset series (particularly relevant for DataBio will be the
EO products derived from satellite imagery) should adhere to the profile originating from the
INSPIRE Metadata regulation with added theme-specific metadata elements for the
agriculture, forestry and fishery domains if necessary. This approach will ensure that
metadata created for the datasets, dataset series and services will be compliant with the
INSPIRE requirements as well as with the international standards EN ISO 19115 (Geographic
Information – Metadata; with special emphasis on ISO 19115-2:2009 Geographic information
– Metadata – Part 2: Extensions for imagery and gridded data), EN ISO 19119 (Geographic
Information – Services), EN ISO 19139 (Geographic Information – Metadata – Metadata XML
Schema) and EN ISO 19156 (Earth Observation Metadata profile of Observations & Measurements).
Besides, INSPIRE conformant metadata may be expressed also through the DCAT Application
D6.2 – Data Management Plan H2020 Contract No. 732064 Final – v1.0, 30/6/2017
Dissemination level: PU -Public Page 32
Profile (DCAT-AP), which defines a minimum set of metadata elements to ensure cross-domain and
cross-border interoperability between metadata schemas used in European data portals. If
adopted by DataBio, such a mapping could support the inclusion of INSPIRE metadata in the
Pan-European Open Data Portal for wider discovery across sectors beyond the geospatial
domain.
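The mapping idea can be sketched as a simple element-name translation. The pairs below are illustrative only, in the spirit of the GeoDCAT-AP alignment; the authoritative correspondence is defined in the GeoDCAT-AP specification itself, and the `databio:` fallback prefix is a hypothetical extension.

```python
# Illustrative (not normative) mapping of a few INSPIRE metadata elements
# to DCAT-AP properties.
INSPIRE_TO_DCAT_AP = {
    "Resource title": "dct:title",
    "Resource abstract": "dct:description",
    "Keyword": "dcat:keyword",
    "Temporal extent": "dct:temporal",
    "Conditions for access and use": "dct:license",
    "Responsible organisation": "dct:publisher",
}

def to_dcat_ap(inspire_record):
    """Translate element names; unmapped elements are kept under a
    hypothetical extension prefix so nothing is silently dropped."""
    out = {}
    for element, value in inspire_record.items():
        out[INSPIRE_TO_DCAT_AP.get(element, "databio:" + element)] = value
    return out

record = {"Resource title": "Forest inventory 2017", "Keyword": "forestry"}
print(to_dcat_ap(record))
```

Keeping unmapped elements under an explicit extension prefix makes it easy to spot which INSPIRE elements still lack a DCAT-AP home after harvesting.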
A Distribution represents a way in which the data is made available. DCAT is a rather small
vocabulary and deliberately leaves many details open; it welcomes “application profiles”,
more specific specifications built on top of DCAT, such as GeoDCAT-AP, its geospatial extension.
For sensors we will focus on SensorML, which can be used to describe a wide range of
sensors, including both dynamic and stationary platforms and both in-situ and remote
sensors. Another possibility is the Semantic Sensor Network (SSN) Ontology, which describes
sensors and observations and related concepts. It does not describe domain concepts, time,
locations, etc.; these are intended to be included from other ontologies via OWL imports. This
ontology was developed by the W3C Semantic Sensor Networks Incubator Group (SSN-XG).
In DataBio, there is a need for metadata harmonization of the spatial and non-spatial datasets
and services. GeoDCAT-AP was an obvious choice due to the strong focus on geographic
datasets. The main advantage is that it enables users to query all datasets in a uniform way.
GeoDCAT-AP is still very new, and the implementation of the new standard within DataBio
can provide feedback to OGC, W3C and JRC from both a technical and an end-user point of
view. Several software components available in the DataBio architecture have varying
support for GeoDCAT-AP, namely Micka, CKAN and GeoNetwork. For DataBio purposes we
will also need to integrate the Semantic Sensor Network Ontology and SensorML.
To enable compatibility with Copernicus, INSPIRE and GEOSS, the DataBio project will
make three extensions: (i) a module for extended harvesting of INSPIRE metadata to DCAT,
based on XSLT and easy configuration; (ii) a module for user-friendly visualisation of INSPIRE
metadata in CKAN; and (iii) a module to output metadata in GeoDCAT-AP or SensorDCAT. We
plan to use the Micka and CKAN systems. Micka is a complex system for metadata
management used for building Spatial Data Infrastructure (SDI) and geoportal solutions. It
contains tools for editing and managing metadata for spatial data, services and other sources
(documents, websites, etc.). CKAN supports DCAT to import or export its datasets. CKAN
enables harvesting data from OGC CSW catalogues, but not all mandatory INSPIRE metadata
elements are supported. Unfortunately, the DCAT output does not fulfil all INSPIRE
requirements, nor is GeoDCAT-AP fully supported.
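The core of the harvesting module described in (i) is extracting elements from ISO 19139 records and re-expressing them in DCAT terms. The sketch below shows only that core idea on a heavily trimmed, invented ISO 19139 fragment; the real module would be XSLT-based and handle the full INSPIRE element set.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative ISO 19139 record: only the dataset title survives.
ISO_SNIPPET = """
<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:identificationInfo>
    <gmd:MD_DataIdentification>
      <gmd:citation><gmd:CI_Citation>
        <gmd:title><gco:CharacterString>Crop map 2017</gco:CharacterString></gmd:title>
      </gmd:CI_Citation></gmd:citation>
    </gmd:MD_DataIdentification>
  </gmd:identificationInfo>
</gmd:MD_Metadata>
"""

NS = {"gmd": "http://www.isotc211.org/2005/gmd",
      "gco": "http://www.isotc211.org/2005/gco"}

root = ET.fromstring(ISO_SNIPPET)
title = root.find(".//gmd:title/gco:CharacterString", NS).text

# Re-express the harvested value as a DCAT-style statement (subject is a
# placeholder identifier, not a real URI).
dcat_triple = ("<ex:dataset>", "dct:title", title)
print(dcat_triple)
```

The same pattern, repeated per element and driven by a mapping table or stylesheet, is essentially what an ISO-to-DCAT harvester does.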
1 https://joinup.ec.europa.eu/asset/dcat_application_profile/description
2 http://micka.bnhelp.cz/
3 https://ckan.org/
4 http://geonetwork-opensource.org/
An ongoing programme of spatial data infrastructure projects, undertaken with academic and
commercial partners, enables DataBio to contribute to the creation of standard data
specifications and policies. This ensures that the partners' databases remain of high quality,
stay compatible and can interact with one another to deliver data that provides practical and
tangible benefits for European society.
3.1.2 Data identification, naming mechanisms and search keyword approaches
For data identification, naming and search keywords we will use INSPIRE data registry. The
INSPIRE infrastructure involves a number of items, which require clear descriptions and the
possibility to be referenced through unique identifiers. Examples for such items include
INSPIRE themes, code lists, application schemas or discovery services. Registers provide a
means to assign identifiers to items and their labels, definitions and descriptions (in different
languages). The INSPIRE Registry is a service giving access to INSPIRE semantic assets (e.g.
application schemas, meta/data codelists, themes), and assigning to each of them a persistent
URI. As such, this service can be considered also as a metadata directory/catalogue for
INSPIRE, as well as a registry for the INSPIRE "terminology". Starting from June 2013, when
the INSPIRE Registry was first published, a number of version have been released,
implementing new features based on the community's feedback. Now, recently, a new
version of the INSPIRE Registry has been published, which, among other features, makes
available its content also in RDF/XML:
http://inspire.ec.europa.eu/registry/5
The INSPIRE registry provides a central access point to a number of centrally managed
INSPIRE registers, including:
● INSPIRE application schema register
● INSPIRE code list register
● INSPIRE enumeration register
● INSPIRE feature concept dictionary
● INSPIRE glossary
● INSPIRE layer register
● INSPIRE media-types register
● INSPIRE metadata code list register
● INSPIRE reference document register
● INSPIRE theme register
5 https://www.rd-alliance.org/group/metadata-ig/post/inspire-registry-rdf-representation-now-supported.html
6 http://inspire.ec.europa.eu/registry/
Most relevant for naming in metadata is the INSPIRE metadata code list register, which
contains the code lists and their values as defined in the INSPIRE implementing rules on
metadata (http://inspire.ec.europa.eu/metadata-codelist).
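Because every registry item gets a persistent URI of the form `http://inspire.ec.europa.eu/<register>/<item>`, identifiers can be constructed mechanically. The helper below only concatenates those parts; the register and item names in the example follow the published metadata code list register, but should be verified against the registry itself before use.

```python
# Minimal sketch: build the persistent URI for an item in an INSPIRE register.
BASE = "http://inspire.ec.europa.eu"

def registry_uri(register, item):
    """Join the registry base, register name and item name into one URI."""
    return "/".join((BASE, register, item))

print(registry_uri("metadata-codelist", "SpatialDataServiceType"))
```

Such URIs can then be stored in dataset metadata as stable, language-independent keys, with labels and definitions resolved from the registry when needed.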
3.1.3 Data lineage
Data lineage refers to the sources of information, such as entities and processes, involved in
producing or delivering an artifact. Data lineage records the derivation history of a data
product. The history could include the algorithms used, the process steps taken, the
computing environment run, data sources input to the processes, the organization/person
responsible for the product, etc. Provenance provides important information to data users
for them to determine the usability and reliability of the product. In the science domain, the
data provenance is especially important since scientists need to use the information to
determine the scientific validity of a data product and to decide if such a product can be used
as the basis for further scientific analysis. The provenance of information is crucial to making
determinations about whether information is trusted, how to integrate diverse information
sources, and how to give credit to originators when reusing information [REF-02]. In an open
and inclusive environment such as the Web, users find information that is often contradictory
or questionable. Reasoners in the Semantic Web will need explicit representations of
provenance information in order to make trust judgments about the information they use.
With the arrival of massive amounts of Semantic Web data (e.g., via the Linked Open Data
community), information about the origin of that data, i.e., provenance, becomes an important
factor in developing new Semantic Web applications. Therefore, a crucial enabler of
Semantic Web deployment is the explicit representation of provenance information that is
accessible to machines, not just to humans. Data provenance is the information about how
data was derived; together with lineage, it is critical to the ability to interpret a particular
data item. Provenance is often conflated with metadata and trust. Metadata is used to represent
properties of objects. Many of those properties have to do with provenance, so the two are
often equated. Trust is derived from provenance information, and is typically a subjective
judgment that depends on context and use [REF-03].
W3C PROV Family of Documents defines a model, corresponding serializations and other
supporting definitions to enable the interoperable interchange of provenance information in
heterogeneous environments such as the Web [REF-04]. Current standards include [REF-05]:
PROV-DM: The PROV Data Model [REF-06] - PROV-DM is a core data model for provenance
for building representations of the entities, people and processes involved in producing a
piece of data or thing in the world. PROV-DM is domain-agnostic, but with well-defined
extensibility points allowing further domain-specific and application-specific extensions to be
defined. It is accompanied by PROV-ASN, a technology-independent abstract syntax notation,
which allows serializations of PROV-DM instances to be created for human consumption,
which facilitates its mapping to concrete syntax, and which is used as the basis for a formal
semantics.
PROV-O: The PROV Ontology [REF-07] - This specification defines the PROV Ontology as the
normative representation of the PROV Data Model using the Web Ontology Language
(OWL2). This document is part of a set of specifications being created to address the issue of
provenance interchange in Web applications.
Constraints of the PROV Data Model [REF-08] - PROV-DM, the PROV data model, is a data
model for provenance that describes the entities, people and activities involved in producing
a piece of data or thing. PROV-DM is structured in six components, dealing with: (1) entities
and activities, and the time at which they were created, used, or ended; (2) agents bearing
responsibility for entities that were generated and activities that happened; (3) derivations of
entities from entities; (4) properties to link entities that refer to the same thing; (5) collections
forming a logical structure for its members; (6) a simple annotation mechanism.
PROV-N: The Provenance Notation [REF-09] - PROV-DM, the PROV data model, is a data
model for provenance that describes the entities, people and activities involved in producing
a piece of data or thing. PROV-DM is structured in six components, dealing with: (1) entities
and activities, and the time at which they were created, used, or ended; (2) agents bearing
responsibility for entities that were generated and activities that happened; (3) derivations of
entities from entities; (4) properties to link entities that refer to the same thing; (5) collections
forming a logical structure for its members; (6) a simple annotation mechanism.
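A toy, PROV-flavoured illustration of component (3), derivations of entities from entities: record `wasDerivedFrom` edges and walk them to recover the full lineage of a data product. The entity names are invented, and a real system would use PROV-O or PROV-N rather than Python dictionaries.

```python
# entity -> entities it was derived from (PROV "wasDerivedFrom" relation)
derived_from = {
    "ndvi_map": ["s2_scene"],
    "yield_forecast": ["ndvi_map", "weather_obs"],
}

def lineage(entity):
    """All upstream entities of `entity`, following wasDerivedFrom edges."""
    seen, stack = set(), [entity]
    while stack:
        for src in derived_from.get(stack.pop(), []):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen

print(sorted(lineage("yield_forecast")))  # → ['ndvi_map', 's2_scene', 'weather_obs']
```

A user deciding whether to trust `yield_forecast` can inspect this upstream set and the activities and agents attached to each entity, which is exactly the determination of usability and reliability described above.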
Figure 2 [REF-10] is a generic data lifecycle in the context of a data processing environment
where data are first discovered by the user with the help of metadata and provenance
catalogues.
Figure 2: The processing data lifecycle
During the data processing phase, data replica information may be entered in replica
catalogues (which contain metadata about the data location), data may be transferred
between storage and execution sites, and software components may be staged to the
execution sites as well. While data are being processed, provenance information can be
automatically captured and then stored in a provenance store. The resulting derived data
products (both intermediate and final) can also be stored in an archive, with metadata about
them stored in a metadata catalogue and location information stored in a replica catalogue.
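A replica catalogue of the kind mentioned above can be reduced to a mapping from a logical file name to its physical locations; descriptive metadata lives elsewhere, in the metadata catalogue. The identifiers, sites and URL schemes below are invented for illustration.

```python
# Minimal replica catalogue sketch: logical file name -> physical copies.
replica_catalogue = {
    "lfn://databio/ndvi_map_2017": [
        "gsiftp://storage-a.example.org/data/ndvi_2017.tif",
        "https://archive-b.example.org/ndvi_2017.tif",
    ],
}

def register_replica(lfn, physical_url):
    """Record a new physical copy of a logical file."""
    replica_catalogue.setdefault(lfn, []).append(physical_url)

def locate(lfn):
    """Return all known physical locations of a logical file name."""
    return replica_catalogue.get(lfn, [])

register_replica("lfn://databio/ndvi_map_2017",
                 "s3://cache-c.example.org/ndvi_2017.tif")
print(len(locate("lfn://databio/ndvi_map_2017")))  # → 3
```

During processing, a workflow engine would call `register_replica` whenever data is staged to an execution site, so any later step can pick the closest copy.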
Data Provenance is also addressed in W3C DCAT Metadata model [REF-11].
dcat:CatalogRecord describes a dataset entry in the catalog. It is used to capture provenance
information about dataset entries in a catalog. This class is optional and not all catalogs will
use it. It exists for catalogs where a distinction is made between metadata about a dataset
and metadata about the dataset's entry in the catalog. For example, the publication date
property of the dataset reflects the date when the information was originally made available
by the publishing agency, while the publication date of the catalog record is the date when
the dataset was added to the catalog. In cases where both dates differ, or where only the
latter is known, the publication date should only be specified for the catalog record. The W3C
PROV Ontology [prov-o] allows describing further provenance information, such as the details
of the process and the agent involved in a particular change to a dataset. Detailed
specification of data provenance is also among the additional requirements for the DCAT-AP
specification effort [REF-12].
3.2 Data accessibility
Through DataBio experiments with the large number of tools and technologies identified in WP4
and WP5, a common data access pattern shall be developed. Ideally, this pattern is based on
internationally adopted standards, such as OGC WFS for feature data, OGC WCS for coverage
data, OGC WMS for maps, or OGC SOS for sensor data.
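As a small illustration of what such standardized access looks like in practice, the sketch below builds an OGC WFS 2.0 GetFeature request. The endpoint and feature type name are placeholders; the query parameter names (`service`, `version`, `request`, `typeNames`, `count`) follow the WFS 2.0 specification.

```python
from urllib.parse import urlencode

def wfs_get_feature(endpoint, type_name, count=10):
    """Build a key-value-pair encoded WFS 2.0 GetFeature request URL."""
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": type_name,
        "count": count,
    }
    return endpoint + "?" + urlencode(params)

# Hypothetical endpoint and layer, for illustration only.
url = wfs_get_feature("https://geoserver.example.org/wfs", "databio:parcels")
print(url)
```

Because the request is fully standardized, the same client code works against any conformant WFS server, which is precisely the point of adopting a common data access pattern.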
3.2.1 Open data and closed data
Everyone from citizens to civil servants, researchers and entrepreneurs can benefit from open
data. In this respect, the aim is to make effective use of Open Data. This data is already
available in public domains and is not within the control of the DataBio project.
All data rests on a scale between closed and open because there are variances in how
information is shared between the two points in the continuum. Closed data might be shared
with specific individuals within a corporate setting. Open data may require attribution to the
contributing source, but still be completely available to the end user.
Generally, open data differs from closed data in three key ways:
1. Open data is accessible, usually via a data warehouse on the internet.
2. It is available in a readable format.
3. It is openly licensed, allowing anyone to use the data or share it for non-commercial or commercial gain.
Closed data restricts access to the information in several potential ways:
1. It is only available to certain individuals within an organization.
2. The data is patented or proprietary.
3. The data is semi-restricted to certain groups.
4. The data is open to the public only through a licensing fee or other prerequisite.

5. The data is difficult to access, for example paper records that have not been digitized.
Typical examples of closed data include information that requires a security clearance,
health-related information collected by a hospital or insurance carrier or, on a smaller scale,
your own personal tax returns.
There are also other datasets used for the pilots, such as cartography, 3D or land-use data,
but those are stored in databases which are not available through the Open Data portals.
Once the use case specification and requirements have been completed these data may also
be needed for the processing and visualisation within the DataBio applications. However, this
data – in its raw format – may not be made available to external stakeholders for further use
due to licensing and/or privacy issues. Therefore, at this stage, the data management plan
will not cover these datasets.
8 www.opendatasoft.com
3.2.2 Data access mechanisms, software and tools
Data access is the process of entering a database to store or retrieve data. Data access tools
are end-user-oriented tools that allow users to build structured query language (SQL) queries
by pointing and clicking on lists of the tables and fields in the data warehouse.
Throughout computing history, there have been different methods and languages used for
data access, and these varied depending on the type of data warehouse. The data warehouse
contains a rich repository of data pertaining to organizational business rules, policies, events
and histories. These warehouses store data in different and incompatible formats, so several
data access tools have been developed to overcome the resulting data incompatibilities.
Recent advances in information technology have brought about new and innovative
software applications with more standardized languages, formats and methods to serve
as interfaces among different data formats. Some of the more popular standards include
SQL, ODBC, ADO.NET, JDBC, XML, XPath, XQuery and Web Services.
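Python's DB-API plays the same role for Python programs as JDBC or ODBC do elsewhere: one standardized call interface over different engines. The sketch below uses the built-in sqlite3 driver with an in-memory database; the table and rows are invented for the example.

```python
import sqlite3

# One standardized access pattern, regardless of the backend engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catch (vessel TEXT, species TEXT, tonnes REAL)")
conn.executemany(
    "INSERT INTO catch VALUES (?, ?, ?)",
    [("V1", "tuna", 12.5), ("V2", "tuna", 7.0), ("V1", "sardine", 3.2)],
)

# Parameterized query: the '?' placeholders keep data separate from SQL text.
total = conn.execute(
    "SELECT SUM(tonnes) FROM catch WHERE species = ?", ("tuna",)
).fetchone()[0]
print(total)  # → 19.5
```

Swapping sqlite3 for another DB-API driver (e.g. a PostgreSQL one) leaves the query-building and fetching code essentially unchanged, which is the point of such standardized interfaces.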
3.2.3 Big data warehouse architectures and database management systems
Depending on the project needs, there are different possibilities to store data:
3.2.3.1 Relational Database
This is a digital database whose organization is based on the relational model of data. The
various software systems used to maintain relational databases are known as a relational
database management system (RDBMS). Virtually all relational database systems use SQL
(Structured Query Language) as the language for querying and maintaining the database. A
relational database has the important advantage of being easy to extend. After the original
database creation, a new data category can be added without requiring that all existing
applications be modified.
This model organizes data into one or more tables (or "relations") of columns and rows, with
a unique key identifying each row. Rows are also called records or tuples. Generally, each
table/relation represents one "entity type" (such as customer or product), the rows represent
instances of that entity type, and the columns represent values attributed to each instance.
The definition of a relational database results in a table of metadata or formal descriptions of
the tables, columns, domains, and constraints.
When creating a relational database, the domain of possible values can be defined in a data
column and further constraints that may apply to that data value can be described. For
example, a domain of possible customers could allow up to ten possible customer names, but
one table could be constrained to allow only three of these customer names to be specified.
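The "domain of possible values" idea can be expressed directly as column constraints. The sketch below uses sqlite3 with an invented schema: the `CHECK` clause restricts a column to a fixed domain, and the engine rejects any row outside it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        id      INTEGER PRIMARY KEY,   -- unique key identifying each row
        name    TEXT NOT NULL,
        segment TEXT CHECK (segment IN ('farm', 'coop', 'other'))
    )
""")
conn.execute("INSERT INTO customer (name, segment) VALUES ('Acme Coop', 'coop')")

try:
    # 'factory' is outside the declared domain of the segment column.
    conn.execute("INSERT INTO customer (name, segment) VALUES ('X', 'factory')")
    violated = False
except sqlite3.IntegrityError:
    violated = True  # the domain constraint rejected the out-of-range value
print(violated)  # → True
```

Declaring domains in the schema rather than in application code means every application sharing the database gets the same guarantees for free.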
An example of a relational database management system is the Microsoft SQL Server,
developed by Microsoft. As a database server, it is a software product with the primary
function of storing and retrieving data as requested by other software applications—which
may run either on the same computer or on another computer across a network (including
the Internet). Microsoft makes SQL Server available in multiple editions, with different feature
sets and targeting different users.
PostgreSQL – for specific domains: PostgreSQL, often simply Postgres, is an object-relational
database management system (ORDBMS) with an emphasis on extensibility and standards
compliance. As a database server, its primary functions are to store data securely and return
that data in response to requests from other software applications. It can handle workloads
ranging from small single-machine applications to large Internet-facing applications (or for
data warehousing) with many concurrent users; on macOS Server, PostgreSQL is the default
database. It is also available for Microsoft Windows and Linux.
PostgreSQL is developed by the PostgreSQL Global Development Group, a diverse group of
many companies and individual contributors. It is free and open-source, released under the
terms of the PostgreSQL License, a permissive software license. Furthermore, it is ACID-compliant
and transactional. PostgreSQL has updatable views and materialized views,
triggers and foreign keys; it also supports functions, stored procedures and other
extensibility features.
3.2.3.2 Big Data storage solutions
A NoSQL (originally referring to "non-SQL", "non-relational" or "not only SQL") database
provides a mechanism for storage and retrieval of data which is modeled by means other than
the tabular relations used in relational databases. Such databases have existed since the late
1960s, but did not obtain the "NoSQL" moniker until a surge of popularity in the early twenty-
first century, triggered by the needs of Web 2.0 companies such as Facebook, Google, and
Amazon.com. NoSQL databases are increasingly used in big data and real-time web
applications. NoSQL systems are also sometimes called "Not only SQL" to emphasize that they
may support SQL-like query languages.
Motivations for this approach include: simplicity of design, simpler "horizontal" scaling to
clusters of machines (which is a problem for relational databases), and finer control over
availability. The data structures used by NoSQL databases (e.g. key-value, wide column, graph,
or document) are different from those used by default in relational databases, making some
operations faster in NoSQL. The particular suitability of a given NoSQL database depends on
the problem it must solve. Sometimes the data structures used by NoSQL databases are also
viewed as "more flexible" than relational database tables.
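The "more flexible" document model can be illustrated in a few lines: records in one collection need not share a schema, unlike the rows of a relational table. The API names below are made up and only loosely echo document databases such as MongoDB.

```python
# Toy document store: a collection is just a list of schema-free documents.
collection = []

def insert_one(doc):
    """Store a copy of the document; no schema is enforced."""
    collection.append(dict(doc))

insert_one({"_id": 1, "vessel": "V1", "catch": {"tuna": 12.5}})
insert_one({"_id": 2, "vessel": "V2", "ports": ["Vigo", "Bilbao"]})  # different fields

def find(predicate):
    """Return all documents matching an arbitrary predicate."""
    return [d for d in collection if predicate(d)]

print(len(find(lambda d: "ports" in d)))  # → 1
```

The flexibility comes at a price: because the store enforces no schema, every consumer must handle documents whose fields may vary from record to record.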
MongoDB: MongoDB (from humongous) is a free and open-source cross-platform document-
oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-
like documents with schemas. MongoDB is developed by MongoDB Inc. and is free and open-
source, published under a combination of the GNU Affero General Public License and the
Apache License.
MongoDB supports field queries, range queries and regular-expression searches. Queries can
return specific fields of documents and can also include user-defined JavaScript functions.
Queries can also be configured to return a random sample of results of a given size. MongoDB
can be used as a file system, with load balancing and data replication over multiple machines
for storing files. This function, called GridFS (Grid File System), is included with MongoDB drivers.
MongoDB exposes functions for file manipulation and content to developers. GridFS is used
in plugins for NGINX and lighttpd. GridFS divides a file into parts, or chunks, and stores each
of those chunks as a separate document.
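The chunking behaviour described above can be sketched as follows; the chunk size and field names are illustrative rather than MongoDB's exact GridFS schema (which, for reference, defaults to chunks of roughly 255 kB):

```python
# Sketch of the GridFS idea: a file is split into fixed-size chunks, each
# stored as its own small document, and reassembled by chunk index.
# CHUNK_SIZE and field names are illustrative, not MongoDB's exact schema.

CHUNK_SIZE = 4  # bytes, tiny for demonstration; GridFS uses ~255 kB by default

def split_into_chunks(file_id, data, chunk_size=CHUNK_SIZE):
    """Return one document per chunk, keyed by file id and chunk index n."""
    return [{"files_id": file_id, "n": i // chunk_size, "data": data[i:i + chunk_size]}
            for i in range(0, len(data), chunk_size)]

def reassemble(chunks):
    """Rebuild the original bytes by sorting chunks on their index."""
    return b"".join(c["data"] for c in sorted(chunks, key=lambda c: c["n"]))

payload = b"DataBio raster tile"
chunks = split_into_chunks("tile-001", payload)
restored = reassemble(chunks)
```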
GeoRocket, developed by Fraunhofer IGD, is based on MongoDB (but not restricted to it). It
provides high-performance data storage and is schema-agnostic and format-preserving. For
more information, please refer to D4.1, which describes the components applied in the DataBio
project.
3.3 Data interoperability
Data can be made available in many different formats implementing different information
models. The heterogeneity of these models reduces the level of interoperability that can be
achieved. In principle, the combination of a standardized data access interface, a standardized
transport protocol, and a standardized data model ensures seamless integration of data across
platforms, tools, domains, or communities.
When the amount of data grows, mechanisms have to be explored to ensure interoperability
while handling large volumes of data. Currently, the amount of data can still be handled using
OGC models and data exchange services. We will need to review this element during the
course of the project. For now, data interoperability is envisioned to be ensured through
compliance with internationally adopted standards.
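As an illustration of such standards-based access, a request to an OGC Web Feature Service (WFS) 2.0 endpoint can be expressed as plain key-value parameters; the endpoint URL and feature type below are hypothetical placeholders, while the parameter names follow the WFS 2.0 specification:

```python
from urllib.parse import urlencode

# Hypothetical service endpoint and layer name; only the KVP parameter names
# (service, version, request, typeNames, ...) come from the OGC WFS 2.0 spec.
BASE_URL = "https://example.org/geoserver/wfs"  # placeholder, not a real service

params = {
    "service": "WFS",
    "version": "2.0.0",
    "request": "GetFeature",
    "typeNames": "databio:field_parcels",   # hypothetical feature type
    "outputFormat": "application/json",
    "count": 100,                            # limit the response size
}
request_url = BASE_URL + "?" + urlencode(params)
```

Because both the interface and the parameters are standardized, any compliant client can issue the same request against any compliant server, which is the essence of the interoperability argument above.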
Interoperability ultimately takes different forms when applied in various “disciplinary”
settings. The following figure illustrates this concept (source: Wyborn 2017).
Figure 3: The “disciplinary” data integration platform: where do you sit? (source: Wyborn 2017)
The intra-disciplinary type remains within a single discipline. The level of standardization
needs to cover the discipline needs, but little attention is usually paid to cross-discipline
standards. The multi-disciplinary situation has many people from different domains working
together, but eventually they all remain within their silos and data exchange is limited to the
bare minimum.
The cross-disciplinary setting is what we are experiencing at the beginning of DataBio. All
disciplines are interfacing and reformatting their data to make it fit. The model works as long
as data exchange is minor, but does not scale, as it requires bilateral agreements between
various parties. The interdisciplinary approach is targeted in DataBio. The goal here is to
adhere to a minimum set of standards. Ideally, the specific characteristics are standardized
between all partners upfront. This model adds minimum overhead to all parties, as a single
mapping needs to be implemented per party (or, even better, the new model is used natively
from now on). The transdisciplinary approach starts with data already provided as linked data
with links across the various disciplines, well-defined vocabularies, and a set of mapping rules
to ensure usability of data generated in arbitrary disciplines.
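A minimal sketch of this linked-data idea: once data are expressed as triples over shared vocabularies, cross-discipline links become ordinary graph lookups. The URIs below are illustrative, not an agreed DataBio vocabulary:

```python
# Sketch: linked data reduces cross-discipline integration to shared
# (subject, predicate, object) triples. All "ex:" terms are hypothetical.

triples = {
    ("ex:parcel42", "rdf:type", "ex:FieldParcel"),
    ("ex:parcel42", "ex:observedBy", "ex:sentinel2"),  # agriculture <-> EO link
    ("ex:sentinel2", "rdf:type", "ex:Satellite"),
}

def objects_of(subject, predicate):
    """All objects matching a (subject, predicate, ?) pattern."""
    return {o for s, p, o in triples if s == subject and p == predicate}

sensors = objects_of("ex:parcel42", "ex:observedBy")
```

In a real deployment the triples would live in an RDF store and be queried with SPARQL; the point here is only that links across disciplines need no bilateral format agreements once the vocabulary is shared.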
3.3.1 Interoperability mechanisms
Key to interoperable data exchange are standardized interfaces. Currently, the number of
data processing and exchange tools is extremely large. We expect the number of tools to
consolidate during the first 15 months of the project. We will regularly review the
requirements set by the various pilots and the data sets made available, to ensure that proper
recommendations can be given at any time.
3.3.2 Inter-discipline interoperability and ontologies
A key element of interoperability within and across disciplines is shared semantics, but the
Semantic Web is still in its infancy and it is not clear to what extent it will become widely
accepted within data-intensive communities in the near future. It requires graph structures
for data and/or metadata as well as well-defined vocabularies and ontologies, and it still lacks
the tools necessary to get DataBio data operational within a reasonable amount of time.
Therefore, at this stage it is mainly recommended to observe the topic of vocabularies and
ontologies, but to concentrate on initial base vocabularies and their governance to ensure that
at least base parameters are well defined.
3.4 Promoting data reuse
The reuse of data is a key component of FAIR. It ensures that data can be reused for purposes
other than those for which they were initially created. This reuse improves the cost balance of
the initial data production and allows cross-fertilization across communities. DataBio will
advertise all the data produced to ensure that they are known to a wider audience. In
combination with the standardized models and interfaces described above, complemented
with metadata and a catalog system that allows proper discovery, DataBio data can serve as
valuable input outside of the project.
At this stage, it is not clear which licensing models need to be applied to the various data
products produced in DataBio. Generally, the focus shall be on public domain attribution and
open licenses that maximize reusability in other contexts. All data products produced by
DataBio will be reviewed against the FAIR principles once a year by the data-producing
organization. On the other hand, DataBio is open to any third-party data and process
provisioning. Data quality is a key component of data reuse. Without proper quality
parameters, data cannot be integrated in external processes, as the level of uncertainty of the
remote processes becomes undefined. DataBio will review its data products for quality
information provided as part of the metadata. Currently, ISO quality flags are envisioned to be
used.
Data management support
4.1 FAIR data costs
The DataBio consortium will handle both open data and data with restricted access. These
data will be used by the project and the project pilots to demonstrate the power of big data.
These data will be published through the DataBio infrastructure.
The current list of datasets and their details is described in Appendix A. All data are either
open data or data with restricted access provided free of charge to the consortium partners
for project purposes. DataBio does not foresee purchasing any data.
The consortium has the knowledge and tools to make data FAIR, i.e. findable, accessible,
interoperable and reusable. To make data FAIR is one of the project objectives and
appropriate resources were allocated by each partner to cover costs for data harmonisation,
integration and publication.
The DataBio project has allocated appropriate resources to the sustainability of the project
results. This includes the sustainability of FAIR data that are in the scope of the project.
To satisfy the dataset reusability requirement, DataBio anticipates several strategies for data
storage and preservation. The dataset storage and preservation plan will include, but not be
limited to, disk drives, solid-state drives, in-memory storage and off-premises storage. Insofar
as security concerns are not an issue, DataBio partners will be encouraged to store data in
publicly available certified data repositories.
4.2 Big data managers
Managing big data also requires a specific structure or role system, i.e. defined types of people
who manage or use big data in specific ways. The following sections describe the team
structures for big data management in DataBio.
DataBio will employ a two-layer approach for the management of the data used. On the first
layer, the management of data provided in any of the participating institutions is done locally.
On the second layer, data used in the context of DataBio and needed in the context of data
exchange or integration across organizations will be subject to the methodologies described
within this document. These are enforced by the roles described below.
4.2.1 Project manager
DataBio includes a diverse group of talented professionals who have to be led. Besides the
complex pilot-driven management structure, Intrasoft acts as the main project manager.
4.2.2 Business Analysts
Business analysts are business-oriented domain experts who are comfortable with data
handling. They have deep insight into business requirements and logic and make sure that big
data applications and platforms meet them. Business analysts are the connection between
“non-technical” business users and technical developers. Their work includes techno-economic
analysis as well as advanced visualisation services. DataBio has five business analysts from five
different organizations: Lesprojekt, ATOS, CIAOTECH, IBM and CREA.
4.2.3 Data Scientists
Data scientists represent the data and analysis experts within the DataBio consortium. They
are able to turn raw data into insights and value with data science methods, techniques and
tools. They have strong programming skills and can handle big data as well as linked data
(incl. metadata). Furthermore, they are able to identify datasets for different requirements
and develop solutions with regard to common standards. They are also able to visualise the
results and findings eloquently. Within the DataBio consortium, the following partners act as
data scientists: Lesprojekt, UWB, Fraunhofer IGD, SINTEF, InfAI, INNOVATION ENGINEERING
SRL, OGC and VITO.
4.2.3.1 Data Scientists: Machine Learning Experts
One of the most important parts of DataBio is making sense of, and creating value from, data
in the different bioeconomy sectors. To do so, machine learning methods, techniques and
tools are necessary to handle the huge amounts of data. The DataBio project has several
partners that are capable machine learning experts with different specialisations: PSNC, InfAI,
INNOVATION ENGINEERING SRL, VTT, IBM, CREA, DTU, CSEM, EXUS, Terrasigna and CERTH.
4.2.4 Data Engineer / Architect
Data engineers or architects are data professionals who prepare big data for analysis. This
includes data discovery, integration, processing (and pre-processing), extraction and
exchange, as well as quality control. Furthermore, they focus on design and architecture.
DataBio has thirteen partners who fulfil this important role: UWB, ATOS, SpaceBel, VITO, IBM,
InfAI, MHG, CREA, e-GEOS, DTU, Cybernetica, CERTH and Rikola.
4.2.5 Platform architects
The data platform and its architecture are among the most important parts of DataBio. In
order to ensure a valid platform design, systems integration and platform development, highly
experienced platform architects are needed. This role will be taken by Intrasoft, ATOS,
Fraunhofer IGD, SINTEF and VTT.
4.2.6 IT/Operation manager
Some of the realized pilots will be very processing-intensive, which requires a very good
infrastructure. In order to provide and manage this infrastructure, specific operation
managers are needed. This function will be fulfilled by PSNC and Softeam.
4.2.7 Consultant
Big data consultants are responsible for support, guidance and help within all design and
implementation phases. This includes deep knowledge of, and practice in, designing big data
solutions as well as developing data pipelines that leverage structured and unstructured data
from multiple sources. The DataBio consortium has several partners who fulfil this role,
including SpaceBel, CIAOTECH, InfAI, FMI, Federunacoma, University of St. Gallen, CITOLIVA
and OGC.
4.2.8 Business User
Business users are direct (business) beneficiaries of the developed DataBio solutions.
Furthermore, they are important for specifying detailed domain requirements and
implementing the solutions. These partners are TRAGSA, Neuropublic, Finnish Forest Centre,
MHG, LIMETRI, Kings Bay, Eros, Ervik & Saevik, Liegruppen Fiskeri, Norges Sildesalgslag SA,
GAIA, MEEO, Echebastar, Novamont, Rikola, UPV/EHU, ZETOR and CAC.
4.2.9 Pilot experts
Domain experts are needed in order to specify and prioritize requirements, manage the
different pilots, find synergies and connect the various experts to the pilots. These are
Lesprojekt, FMI, VTT, SINTEF, Finnish Forest Centre and AZTI.
Figure 4: DataBio’s data managers
Data security
5.1 Introduction
In order to address data security properly, one has to identify the various phases of the data
lifecycle, from creation through use, sharing and archiving to deletion. Handling project data
securely throughout their lifecycle lays the foundation of a sensitive-data protection strategy.
In this context, the project consortium will determine specific security controls to apply in
each phase, evaluating their level of compliance during the course of the project. The data
lifecycle phases are featured in the image below and are summarized as follows:
Figure 5: Data lifecycle
1. Phase 1: Create
This first phase includes the creation of structured or unstructured (raw) data. For the needs
of the DataBio project, sensitive data are classified into the following categories: a) enterprise
data (commercially sensitive data), b) personal data (personally sensitive data) and c) other
data that do not fall into one of the previous categories. For enterprise data in particular,
security classification occurs already at creation, based on an enterprise data security policy.
2. Phase 2: Store
Once data are created and included in a file, they are stored somewhere. What needs to be
ensured is that stored data are protected and that the necessary data security controls have
been implemented, so as to secure the data and minimize the risk of information leaks,
ensuring efficient data
privacy. More information about this phase is found in sections 5.2 about data recovery and
5.3 about secure storage.
3. Phase 3: Use
During this phase, when data are viewed, processed, modified and saved, security controls
are applied directly to the data, with a focus on monitoring user activity and preventing data
leaks.
4. Phase 4: Share
Data is constantly being shared between employees, customers and partners, necessitating a
strategy that continuously monitors data stores and users. Data move among a variety of
public and private storage locations, applications and operating environments, and are
accessed by various data owners from different devices and platforms. That can happen at
any stage of the data security lifecycle, which is why it’s important to apply the right security
controls at the right time.
5. Phase 5: Archive
When data leave active use but still need to be available, they should be securely archived in
appropriate storage, normally of low cost and performance, sometimes offline. This may also
cover version control, where older versions of original (raw) data files and data source
processing programs are maintained in archive storage, case by case. These backups are then
stored and can be brought back online within a reasonable timeframe, ensuring that there is
no detrimental effect from data being lost or corrupted.
6. Phase 6: Destroy
Data that are no longer needed should be deleted securely so as to avoid any data leakage.
5.2 Data recovery
A data recovery strategy (also called a disaster recovery plan) is not only a plan but also an
ongoing process of minimizing the risk of data loss resulting from various random events.
Since DataBio is a project dealing with big data scenarios, data recovery focuses mostly on
the management procedures of data centers that are able to store and process significant
amounts of data. The disasters that can occur fall into two categories:
• Natural disasters (floods, hurricanes, tornadoes or earthquakes): because they cannot be avoided, it is only possible to minimize their effects on the IT infrastructure (e.g. distributed backups)
• Man-made disasters (infrastructure failures, software bugs, hacker attacks): besides minimizing their effects, it is possible to prevent them in different ways (regular software updates, good active protection mechanisms, regular testing procedures)
The most important elements of a data recovery plan are:
• Backup management: well-designed automatic procedures for regularly storing copies of datasets on separate machines or even in geographically distributed locations
• Replication of data to an off-site location, which overcomes the need to restore the data (only the systems then need to be restored or synchronized), often making use of storage area network (SAN) technology
• Private Cloud solutions that replicate the management data (VMs, Templates and disks) into the storage domains that are part of the private cloud setup.
• Hybrid Cloud solutions that replicate both on-site and to off-site data centers. These solutions provide the ability to instantly fail-over to local on-site hardware, but in the event of a physical disaster, servers can be brought up in the cloud data centers as well.
• The use of high availability systems which keep both the data and system replicated off-site, enabling continuous access to systems and data, even after a disaster (often associated with cloud storage)
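The backup-management element above can be sketched as a hash-based integrity check between a primary store and its replica; the in-memory stores below are simulated, and a real procedure would stream files from disk and run on a schedule:

```python
import hashlib

def checksum(data: bytes) -> str:
    """SHA-256 digest used as a content fingerprint."""
    return hashlib.sha256(data).hexdigest()

def verify_replica(primary: dict, replica: dict) -> list:
    """Return names of datasets whose replica copy is missing or differs."""
    return [name for name, data in primary.items()
            if checksum(replica.get(name, b"")) != checksum(data)]

# Simulated stores: dataset name -> raw bytes (illustrative file names)
primary = {"yield_2017.csv": b"a,b\n1,2\n", "parcels.geojson": b"{}"}
replica = {"yield_2017.csv": b"a,b\n1,2\n"}  # second dataset not yet replicated

broken = verify_replica(primary, replica)  # datasets needing re-replication
```

Running such a check after each backup cycle catches both silent corruption and incomplete replication before a disaster makes the discrepancy costly.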
Several partners in the project are infrastructure providers. They ensure high quality in terms
of reliability and scalability.
5.3 Privacy and sensitive data management
5.3.1 Introduction
With regard to privacy and sensitive data management, it is confirmed that these activities
will be rigorously implemented in compliance with the privacy and data collection rules and
regulations as applied nationally and in the EU, as well as with the H2020 rules. The next
sections include more specific information regarding those activities, rules and measures,
based on the classification of data made in the introduction of this section (5.1).
5.3.2 Enterprise Data (commercially sensitive data)
This category includes the (raw) data coming from specific sensor nodes and other similar
data management systems and sources from the various project partners in each pilot case.
It also includes data about technologies and other assets protected by IPR that are considered
highly commercially sensitive, belonging to the partner that provides them for the various
research and pilot activities within the DataBio project. Therefore, access to those data will
be controlled, and exchanges will normally take place between the specific end users and
partners involved in their use and management within each pilot case for DataBio-related
activities.
Following the project GA and CA, each partner who provides or otherwise makes shared
information available to any other project partner represents that: (i) it has the authority to
disclose this shared information; (ii) where legally required and relevant, it has obtained
appropriate informed consents from all individuals involved, or from any other applicable
institution, all in compliance with applicable regulations; and (iii) there is no restriction in
place that would prevent any such other project partner from using this shared information
for the purpose of the DataBio project and the exploitation thereof.
The abovementioned rules also apply to any new data stemming from the project activities.
These data will also be anonymised and protected, and only on the basis of the above rules
will our partners be able to make data available to external industry stakeholders for their
own purposes. Related publications will be released and disseminated through the project
dissemination and exploitation channels to make these parties aware of the project and of
the appropriate access to any data (see Appendix A for DataBio-specific data).
On a technical level, data protected by IPR are often accessed as a service, with specific
access rights granted under specific terms. Alternatively, they are shared in encrypted or
similarly protected form, with the keys provided under specific terms.
5.3.3 Personal Data
According to the Grant Agreement, it has been agreed by all partners that any Background,
Results, Confidential Information and/or any and all data and/or information that is provided,
disclosed or otherwise made available between the Parties shall not include personal data.
Accordingly, each Party agreed that it will take all necessary steps to ensure that all Personal
Data is removed from the Shared Information, made illegible, or otherwise made inaccessible
(i.e. de-identify) to the other Parties prior to providing the Shared Information.
Therefore, no personally sensitive data are included in the data exchanged between partners
within DataBio. Where data created within project activities, e.g. some pilot activities, could
initially involve personal and/or sensitive data from human participants, such as location and
ID, DataBio will apply specific security measures for informed consent and data protection, in
line with the legislation and regulations in force in the countries where the research will be
carried out. The rules most relevant to the project are the following:
• The Charter of Fundamental Rights of the EU, specifically the article concerning the protection of personal data
• Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data.
Regarding the procedure required in order to participate in any DataBio activities, we foresee
that all potential participants will have to read and sign an informed consent form before
participating. This form aims to fully inform the participants about the study procedure and
goals, in order to guarantee that they have the basic information needed to decide whether
or not to participate in the project activity. It shall include a summary and schedule of the
study, the objectives, and descriptions of the DataBio system and its components. All
participants have the right to receive a copy of this form. Participants will receive a generic
user ID to identify them in the
system and to anonymise their identities. No full names will be stored anywhere
electronically. All gathered personal data shall be password protected and encrypted. Users’
personal data will be safeguarded from other people not involved in the project. No adults
unable to give informed consent will be involved.
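The generic-user-ID scheme described above can be sketched as a keyed hash that maps a participant's name to an opaque identifier, so that no full name needs to be stored; the secret key and ID format below are illustrative, and a real deployment would also manage the key securely and encrypt the remaining personal data:

```python
import hmac
import hashlib

# Placeholder secret; in practice this would be kept out of source control
# and managed by the responsible project partner.
SECRET_KEY = b"project-secret"

def generic_user_id(full_name: str) -> str:
    """Derive an opaque, reproducible pseudonym from a participant's name."""
    digest = hmac.new(SECRET_KEY, full_name.encode("utf-8"), hashlib.sha256)
    return "user-" + digest.hexdigest()[:12]

# The stored record carries only the pseudonym and coarse attributes.
record = {"user_id": generic_user_id("Jane Doe"), "age_group": "30-39"}
```

A keyed hash (HMAC) rather than a plain hash means the pseudonyms cannot be reversed by simply hashing candidate names without access to the key.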
It should be stated that the protection of participants' privacy is a responsibility of all
persons involved in research with human participants. Privacy means that the participant can
control access to personal information and is able to decide who has access to the collected
data in the future. Due to the principle of autonomy, the participants will be asked for their
agreement before private and personal information is collected. It will be ensured that all
persons involved in the project activities understand and respect the requirement for
confidentiality. The participants will be informed about the confidentiality policy used in this
research project.
5.4 General privacy concerns
Other privacy concerns will be addressed as follows:
• External experts: Any external experts that will be involved in the project shall be required to sign an appropriate non-disclosure agreement prior to participating in any project related meeting, decision or activity.
• Publications: Hints to, or identifiable personal information of, any participant will be omitted from (scientific) publications. Revealing the identity of participants in research, deliberately or inadvertently, without the expressed permission of the participants, will be avoided.
• Dissemination: Dissemination of data between partners. This relates to access to data, data formats, and methods of archiving (electronic and paper), including data handling, data analyses, and research communications. Access to private information will be granted only to DataBio partners for purposes of evaluation of the system and only in an anonymised form, i.e. any personally identifiable information such as name, phone number, location, address, etc. will be omitted.
• Protection: The lead project partner of every pilot case is responsible for the protection of the participants’ privacy throughout the whole project, including procedures such as communications, data exchange, presentation of findings, etc.
• Control: The responsible project partners are not allowed to circulate information without anonymisation. This means that only relevant attributes, e.g. gender, age, etc., are retained.
• Information: As already mentioned above, the protection of confidentiality implies informing the participants about what may be done with their data (i.e. data sharing). Individuals who participate in any study must have the right to request and obtain, free of charge, information on their personal data subjected to processing, on the origin of such data, and on their communication or intended communication.
Ethical issues
In line with the Consortium's commitment in the DataBio proposal, the ethics and
responsibility work in the project is guided by the principles of responsible research and
innovation in the information society
(http://renevonschomberg.wordpress.com/implementing-responsible-research-and-innovation/)
and by the guidelines of the European Group on Ethics
(http://ec.europa.eu/bepa/european-group-ethics).
Since the research activities do not include any human trial, animal intervention or acquisition
of tissues thereof, there are no ethical concerns. Remote sensing of fields, forests or fish
stocks does not cause any ethical concerns.
The Partners agreed that any Background, Results, Confidential Information and/or any and
all data and/or information that is provided, disclosed or otherwise made available between
the Partners during the implementation of the Action and/or for any Exploitation activities
(“Shared Information”), shall not include personal data as defined by Article 2, Section (a) of
the Data Protection Directive (95/46/EC) (hereinafter referred to as “Personal Data”).
Accordingly, each Partner agrees that it will take all necessary steps to ensure that all Personal
Data is removed from the Shared Information, made illegible, or otherwise made inaccessible
(i.e. de-identify) to any other Party prior to providing the Shared Information to such other
Party.
Each Partner who provides or otherwise makes available to any other Partner Shared
Information (“Contributor”) represents that: (i) it has the authority to disclose the Shared
Information, if any, which it provides to the Partner; (ii) where legally required and relevant,
it has obtained appropriate informed consents from all the individuals involved, or from any
other applicable institution, all in compliance with applicable regulations; and (iii) there is no
restriction in place that would prevent any such other Partner from using the Shared
Information for the purpose of the DATABIO Action and the exploitation thereof.
Any Advisory Board member or external expert shall be required to sign an appropriate non-
disclosure agreement prior to participating in any project related meeting, decision or activity.
Conclusions
The DataBio project is an EU lighthouse project with eighteen pilots running at hundreds of
piloting sites across Europe in the three main bioeconomy sectors: agriculture, forestry and
fishery. During the lifecycle of the DataBio project, big data will be collected, consisting of
very large data sets covering a wide range of data types from numerous sources. Most data
will come from farm and forestry machinery, fishing vessels, remote and proximal sensors
and imagery, and many other technologies. In this document, DataBio's D6.2 deliverable
“Data Management Plan” was presented as the key element of good data management. As
DataBio participates in the European Commission H2020 Programme's extended open
research data (ORD) pilot, a DMP is required; as a consequence, the DataBio project's datasets
will be as open as possible and as closed as necessary, focusing on sound big data
management for the sake of best research practice, in order to create value and foster
knowledge and technology out of big datasets for the good of man.
The data management life cycle for the data to be collected, processed and/or generated by
DataBio project was described, accounting also for the necessity to make research data
findable, accessible, interoperable and re-usable, without compromising the security and
ethics requirements. As a part of the project implementation, DataBio’s partners will be
encouraged to adhere to sound data management to ensure that data are well-managed,
archived and preserved. This is the first version of the DataBio DMP; it will be updated over
the course of the project as warranted by significant changes arising during the project
implementation or within the project consortium. The scheduled subsequent releases of this
document will particularly include information on the repositories where the data will be
preserved, the security measures, and several other FAIR aspects.
Appendix A DataBio Datasets
The dataset descriptions have been collected in WP4 (in relation to D4.1 Data Set Descriptions)
in cooperation with WP6 for the Data Management Plan. The dataset descriptions include both
open data sets and data sets with restricted access.
The following provides a general description of each data set (both open and with various
protection/access levels) and a characterisation of each data set according to the following
template:
Name <Name of the data set>
Identifier < e.g. D07.01 : D = "Data" , 07= partner number, 01 = sequential number 01, 02,...>
Owner <Provider of the dataset/data model >
Description <Describe the level of dataset (e.g. Landsat 8 archive) and not other more detailed levels (e.g. image from 2017/01/26)>
Classification(s) <Keywords, e.g. EO data>
Date <Date (range) when the resource will become or did become available >
Area coverage <Geographical area of the dataset>
Time coverage <Period of time that the dataset describes>
Format <e.g. SENTINEL-SAFE format, Excel …>
Licence <Information about rights held in and over the dataset, including specific license name>
Related datasets <Link to the related datasets, identifiers of the descriptions>
Data set size <Indicative data size>
Frequency of update <e.g. daily, yearly..>
Access interfaces <e.g. SQL, REST>
Contact point <Email of the contact person>
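The identifier convention in the template above can be checked mechanically. The sketch below is a hypothetical helper (not part of the DataBio tooling) that validates and decomposes identifiers such as D07.01 or DS13.01:

```python
import re

# Dataset identifiers follow the pattern <type><partner>.<sequence>,
# e.g. D07.01: "D" = data, 07 = partner number, 01 = sequential number.
# "DS" is used for stream data sources (e.g. DS13.01).
_ID_PATTERN = re.compile(r"^(D|DS)(\d{2})\.(\d{2})$")

def parse_dataset_id(identifier: str):
    """Return (kind, partner, sequence) for a valid identifier, else None."""
    match = _ID_PATTERN.match(identifier)
    if not match:
        return None
    kind, partner, sequence = match.groups()
    return kind, int(partner), int(sequence)
```

Such a check could be applied when collecting the template entries, to catch malformed identifiers early.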
A.1 Smart POI data set (UWB - D03.01)
Name Smart POI data set
Identifier D03.01
Owner 03 - UWB
Description
The Smart Points of Interest data set is a seamless, open resource
of POIs that is available for other users to download, search or reuse
in applications and services.
Its principal aim is to provide the information as linked data,
together with another data set containing the road network.
The added value of the Smart approach, compared with similar
solutions, lies in the implementation of linked data, the use of
standardised and well-respected datatype properties, and the
development of a completely harmonised data set with a uniform data
model and common classification.
Classification(s) Points of Interest, VGI
Date 2014-05-01
Area coverage World
Time coverage 2014-present
Format RDF
Licence ODbL licence (http://opendatacommons.org/licenses/odbl/1.0/)
Related datasets
Data set size 20-30 GB
Frequency of
update quarterly
Access interfaces SPARQL Endpoint
Contact point [email protected], [email protected]
The Smart Points of Interest data set (SPOI) is a seamless, open resource of POIs that is
available for all users to download, search or reuse in applications and services. It is available
under the Open Data Commons Open Database License (ODbL,
http://opendatacommons.org/licenses/odbl/).
SPOI's principal aim is to provide the information as linked data, together with another data
set containing the road network. The added value of the Smart approach, compared with
similar solutions, lies in the implementation of linked data, the use of standardised and
well-respected datatype properties, and the development of a completely harmonised data set
with a uniform data model and common classification.
The SPOI data set is created as a combination of global data (selected points from
OpenStreetMap) and local data provided by the SDI4Apps partners or available on the web.
The data set can be reached via a SPARQL endpoint (http://data.plan4all.eu/sparql); for
detailed information see http://sdi4apps.eu/spoi.
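A SPARQL endpoint such as the one above is queried over plain HTTP GET. The sketch below shows how such a request could be assembled; only the endpoint URL comes from the text above, and the example query is illustrative rather than tailored to the SPOI vocabulary:

```python
from urllib.parse import urlencode

# Endpoint URL taken from the SPOI description above.
SPOI_ENDPOINT = "http://data.plan4all.eu/sparql"

def build_sparql_request(query: str, endpoint: str = SPOI_ENDPOINT) -> str:
    """Build a GET request URL for a SPARQL endpoint, asking for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)

# Illustrative query: list a handful of POIs with their labels.
EXAMPLE_QUERY = """
SELECT ?poi ?label WHERE {
  ?poi rdfs:label ?label .
} LIMIT 10
"""

url = build_sparql_request(EXAMPLE_QUERY)
```

The resulting URL can be fetched with any HTTP client; the endpoint then returns the bindings in the requested SPARQL JSON results format.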
Figure 6: The data model of Smart Points Of Interest
A.2 Open Transport Map (UWB - D03.02)
Name Open Transport Map
Identifier D03.02
Owner 03 - UWB
Description
The Open Transport Map displays a road network which:
– is suitable for routing
– visualizes average daily traffic volumes for the whole EU
– visualizes time-related traffic volumes (in the OTN pilot cities: Antwerp, Birmingham, Issy-les-Moulineaux, Liberec region)
Technically speaking, the Open Transport Map:
– can serve as a map itself as well as a layer embedded in your map
– is derived from the most popular open dataset, OpenStreetMap
– is accessible via both GUI and API
– covers the whole European Union
Classification(s) transport map, vector line data
Date 2015-present
Area coverage European Union
Time coverage 2015-present
Format WMS, WFS, shapefile, PostGIS
Licence ODbL licence (http://opendatacommons.org/licenses/odbl/1.0/)
Related datasets OpenStreetMap
Data set size 20 GB
Frequency of update semiannually
Access interfaces GUI, WMS, WFS, shapefile, all described at http://opentransportmap.info
Contact point [email protected]
The Open Transport Map (OTM) displays a road network which is suitable for routing and
visualizes average daily traffic volumes for the whole EU. The data are available under the
Open Data Commons Open Database License (ODbL,
http://opendatacommons.org/licenses/odbl/).
The underlying data come from OpenStreetMap and are accessible in a schema compatible
with the INSPIRE Transport Networks specification. The traffic volumes were calculated in the
EU project OpenTransportNet (http://opentransportnet.eu) on a scalable cloud platform using
Hadoop and Spark. The Open Transport Map can serve as a map itself as well as a layer
embedded in your map, as it is accessible via both GUI and API. For detailed information
see http://opentransportmap.info.
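Because OTM is exposed through standard OGC interfaces (WMS, WFS), a map request can be assembled generically. The sketch below builds a WMS 1.3.0 GetMap URL; the base URL path and layer name are placeholders (the real layer list is published at opentransportmap.info), only the service type comes from the table above:

```python
from urllib.parse import urlencode

def build_wms_getmap(base_url, layer, bbox, width=800, height=600):
    """Assemble an OGC WMS 1.3.0 GetMap request URL.

    bbox is (min_lat, min_lon, max_lat, max_lon) in EPSG:4326 axis order.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "CRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return base_url + "?" + urlencode(params)

# Hypothetical service path and layer name, for illustration only.
otm_url = build_wms_getmap("http://opentransportmap.info/wms", "roads",
                           (49.0, 13.0, 50.0, 14.0))
```

The same pattern applies to any of the WMS layers listed in the dataset tables of this appendix.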
Figure 7: The data model of Open Transport Map
A.3 Sentinels Scientific Hub Datasets via FedEO Gateway (SPACEBEL - D07.01)
Sentinel products available on the Sentinels Scientific Data Hub (Sentinel-1, Sentinel-2) can
be discovered and accessed via the FedEO Gateway (C07.01), which returns Sentinel collection
and dataset metadata (including the product download URL) via an OGC 13-026r8 OpenSearch
interface. The geographical coverage is global, and the temporal coverage starts in April 2014
for Sentinel-1 and June 2015 for Sentinel-2. Access to the dataset metadata and product
downloads requires an account (user/password) that can be obtained at
https://scihub.copernicus.eu/dhus/#/self-registration. Sentinel products and metadata can
also be accessed via the user interface of the FedEO Portlet (C07.05).
A.4 NASA CMR Landsat Datasets via FedEO Gateway (SPACEBEL - D07.02)
All dataset and collection metadata (including the Landsat-8 collections) provided by the NASA
Common Metadata Repository (CMR), around 32,000 collections, are accessible through an
OGC 13-026r8 OpenSearch interface via the FedEO Gateway (C07.01). The geographical and
temporal coverage of the datasets/products is specified in each collection's metadata. In the
case of Landsat-8, the coverage is global, starting in April 2013. To download Landsat-8
products, an account is needed on the EROS Registration System (ERS) at
https://ers.cr.usgs.gov/register/. The download URL is included in the catalogue search
response.
Collection and product metadata, including the product download URL, can be accessed via
the component C07.05 FedEO Portlet, acting as a client of the FedEO Gateway (C07.01). The
following picture illustrates the retrieval of Landsat-8 datasets through the FedEO Portlet
(C07.05).
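An OpenSearch request against a gateway like FedEO is an HTTP GET carrying geographic and temporal parameters. The sketch below shows the general parameter shape only: the base URL is a placeholder, and the actual parameter names for a given deployment come from its OpenSearch description document, so treat these as assumptions:

```python
from urllib.parse import urlencode

def build_opensearch_query(base_url, collection, bbox, start, end, count=10):
    """Build an OpenSearch-style GET request with geo/time parameters."""
    params = {
        "parentIdentifier": collection,               # target collection
        "bbox": ",".join(str(v) for v in bbox),       # west,south,east,north
        "startDate": start,                           # ISO 8601 start of interval
        "endDate": end,                               # ISO 8601 end of interval
        "maximumRecords": count,
    }
    return base_url + "?" + urlencode(params)

query_url = build_opensearch_query(
    "https://fedeo.example/opensearch/request",       # placeholder gateway URL
    "LANDSAT_8", (14.0, 49.5, 15.0, 50.5),
    "2017-01-01", "2017-06-30",
)
```

The gateway's response (Atom or GeoJSON, depending on the requested media type) then carries the per-product metadata, including the download URL mentioned above.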
Figure 8: FedEO Client (C07.05)
A.5 Open Land Use (Lespro - D02.01)
Name Open Land Use
Identifier D02.01
Owner 02 - Lespro
Description
The Open Land Use Map is a composite map intended to provide detailed land-use maps of various regions, based on pan-European datasets such as CORINE Land Cover and Urban Atlas, enriched with available regional data. The dataset is derived from available open data sources at different levels of detail and coverage. These data sources include:
1) digital cadastral maps, if available
2) the Land Parcel Identification System (LPIS), if available
3) Urban Atlas (European Environment Agency)
4) CORINE Land Cover 2006 (European Environment Agency)
5) OpenStreetMap
The data sources are ordered according to their level of detail and, therefore, their priority for data integration.
Classification(s) Land Use, Cadastral parcels, Urban Atlas
Date 2015
Area coverage Europe
Time coverage 2015-present
Format GML
Licence ODbL
Related datasets CLC, Urban Atlas, Cadastre, LPIS
Data set size hundreds of GB
Frequency of update semiannually
Access interfaces REST, OGC WMS, WFS
Contact point [email protected]
Land use involves the management and modification of natural environment or wilderness into
built environment such as fields, pastures, and settlements. It has also been defined as "the
arrangements, activities and inputs people undertake in a certain land cover type to produce,
change or maintain it" (FAO, 1997a; FAO/UNEP, 1999). Land use practices vary considerably
across the world. The United Nations' Food and Agriculture Organization Water Development
Division explains that "Land use concerns the products and/or benefits obtained from use of
the land as well as the land management actions (activities) carried out by humans to produce
those products and benefits"9.
The Open Land Use (OLU) data model joins the two basic data models of the INSPIRE Land Use
specification: existing land use and planned land use. The main difference between the INSPIRE
data models and the OLU model stems from the fact that the OLU data model connects
planned and existing land use data; in OLU, different attributes are used for the two types of
land use data. The OLU model also follows the INSPIRE land use specification (it uses the same
data attributes, and the set of attributes used is larger than in the Land Use Database Schema),
but it takes a simpler view of the data. The two models are mutually transformable, and it is
also possible to migrate data between these models and other data sets that conform to the
INSPIRE specification. The main reason for the above-mentioned differences is the different
usage of the data and data models: OLU is intended for any land use (and land cover) data,
whereas the Land Use Database Schema serves only spatial planning data as a special part of
land use data.
There are several datasets which could be used for creating a harmonised land use dataset.
Land use data are used in many fields, including agriculture, spatial and urban planning,
environmental protection, and the maintenance and restoration of environmental functions.
Currently, Open Land Use covers the whole EU with different levels of accuracy:
Europe
The base European dataset is derived from the set of available data sources that help identify
land use in a particular locality. The list of sources used so far at the pan-European level
includes:
1. Urban Atlas
2. CORINE Land Cover 2012
The sources are listed in the order in which they were combined to create the map (1 has the
highest geometrical and semantic precedence, and so on).
Czech Republic
The dataset is derived from the set of available data sources that help identify land use in a
particular locality. The list of sources used so far includes:
1. Digital Cadastre
2. LPIS (Land Parcel Identification System)
9 FAO Land and Water Division
3. Urban Atlas
4. CORINE Land Cover
Austria
The dataset is derived from the set of available data sources that help identify land use in a
particular locality. The list of sources used so far includes:
1. LPIS (Land Parcel Identification System)
2. Urban Atlas
3. CORINE Land Cover
Flanders
The dataset is derived from the set of available data sources that help identify land use in a
particular locality. The list of sources used so far includes:
1. GRBGis Large Scale Reference Database
2. Urban Atlas
3. CORINE Land Cover
Open Land Use is available at http://sdi4apps.eu/open_land_use/.
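The priority-based integration described above (cadastre first, then LPIS, Urban Atlas, CORINE, OSM) amounts to a first-match-wins lookup per location. The toy sketch below illustrates that logic with made-up sample data; the source names and coverage maps are illustrative only:

```python
# Data sources in priority order: the first source that covers a
# location wins, mirroring the integration order described above.
SOURCE_PRIORITY = ["cadastre", "lpis", "urban_atlas", "corine", "osm"]

def resolve_land_use(location, sources):
    """Return (source_name, land_use_class) from the highest-priority
    source that covers the given location, or None if nothing covers it."""
    for name in SOURCE_PRIORITY:
        coverage = sources.get(name, {})
        if location in coverage:
            return name, coverage[location]
    return None

# Toy coverage maps keyed by a location identifier.
sources = {
    "corine": {"parcel-1": "agricultural", "parcel-2": "forest"},
    "cadastre": {"parcel-1": "arable land"},
}
```

In the real OLU pipeline the "coverage" test is a spatial intersection over parcel geometries rather than a dictionary lookup, but the precedence rule is the same.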
A.6 Forest resource data (METSAK - D18.01)
Name Forest resource data
Identifier D18.01
Owner METSAK
Description Existing METSAK's forest resource data
Classification(s)
Forest resource data, forest resource inventory, tree data, tree strata,
growth places, geometry, compartments.
Date 2013
Area coverage At the moment about 80 % of the area of privately owned forests.
Time coverage Up to date information on forest resources, time coverage not relevant.
Format Standard format in relational database, XML format in data import/export.
Licence NA
Related datasets Customer and forest estate data D18.02
Data set size 200 GB
Frequency of
update
Updates per decade per area through forest resource inventories, yearly
growth calculations, updates according to field measurements and
notifications from forest owners and forestry operators.
Access
interfaces
Metsään.fi user interface; Web Service and SOAP interfaces in the
background.
Contact point [email protected]
The pilot uses METSAK’s forest resource data concerning privately owned Finnish forests from
METSAK’s forest resource data system. The forest resource data consists of basic data of tree
stands (development class, dominant tree species, scanned height, scanned intensity, stand
measurement date), strata of tree stands (mean age, basal area, number of stems, mean
diameter, mean height, total volume, volume of logwood, volume of pulpwood), growth place
data (classification, fertility class, soil type, drainage state, ditching year, accessibility, growth
place data source, growth place data measurement date), geometry and compartment
numbering. The forest resource data is available in a standard format for external use with
consent of a forest owner.
The forest resources are inventoried once per decade in each area using remote sensing and
aerial photographs. The new data are analysed and partly verified by field measurements.
Other updates to the forest resource data include yearly growth calculations, notifications of
forest use or other forestry operations (including so-called Kemera financing operations), and
the interpretation of new aerial photographs.
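Since the forest resource data above are exchanged in an XML forest data standard, a consumer would typically parse stand and stratum elements. The sketch below uses an invented, drastically simplified element structure for illustration; the real schema is defined by the Finnish forest information standard referenced later in this appendix:

```python
import xml.etree.ElementTree as ET

# Simplified, invented structure for illustration only; the actual
# element names are defined by the Finnish forest information standard.
SAMPLE = """
<stand id="123">
  <developmentClass>04</developmentClass>
  <dominantTreeSpecies>pine</dominantTreeSpecies>
  <stratum>
    <meanAge>62</meanAge>
    <totalVolume unit="m3/ha">210.5</totalVolume>
  </stratum>
</stand>
"""

def parse_stand(xml_text):
    """Extract a few basic stand attributes into a plain dict."""
    root = ET.fromstring(xml_text)
    stratum = root.find("stratum")
    return {
        "id": root.get("id"),
        "species": root.findtext("dominantTreeSpecies"),
        "mean_age": int(stratum.findtext("meanAge")),
        "total_volume": float(stratum.findtext("totalVolume")),
    }
```

A real import would validate against the standard's XSD before extracting values.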
A.7 Customer and forest estate data (METSAK - D18.02)
Name Customer and forest estate data
Identifier D18.02
Owner METSAK
Description
Customer and forest estate data from METSAK's customer information
system.
Classification(s) Customer, forest estate, property, ownership, consent
Date 2011
Area coverage Not relevant
Time coverage Not relevant
Format Relational database
Licence NA
Related datasets Forest resource data D18.01
Data set size NA
Frequency of
update Constant updates when needed.
Access interfaces Metsään.fi user interface; Web Service and SOAP interfaces in the background.
Contact point [email protected]
The forest resource data are connected with METSAK's customer and forest estate data. An
essential part of the Metsään.fi eService is the information on who owns a given forest estate
and who has the right to read and use the forest resource data of a given
forest owner. The pilot uses METSAK’s customer information system, which contains all these
data.
A.8 Storm damage observations and possible risk areas (METSAK -
D18.03)
Name Storm damage observations and possible risk areas
Identifier D18.03
Owner METSAK
Description
Storm damage observations and analyzed storm damage risk areas based
on observations.
Classification(s) Forest damage, type of damage, storm, geometry, tree species
Date During 2017-2018
Area coverage TBD
Time coverage NA
Format WMS-maps, standardization is ongoing.
Licence NA
Related datasets Forest resource data D18.01, customer and forest estate data D18.02
Data set size Unknown, NA
Frequency of
update Unknown, NA
Access interfaces Metsään.fi user interface; WMS interfaces in the background.
Contact point [email protected]
One of the new data sets in this pilot is storm damage observations, which are planned to be
crowdsourced. The storm damage observations consist of the location, type of damage, an
evaluation of the extent of the damage, tree species, and distance from the road. The storm
damage data supplement the forest resource data. Possible storm damage areas are evaluated
based on the collected storm damage observations, and the possible risk areas are presented
to users on a map layer.
A.9 Quality control data (METSAK - D18.04)
Name Quality control data
Identifier D18.04
Owner METSAK
Description
Quality control data on forest work done, collected from forestry
operators
Classification(s)
Quality control, best practices guidelines for forest management, sample
plot, compartment, national average.
Date TBD
Area coverage TBD
Time coverage TBD
Format
Relational database, XML format for import/export. Will be part of the
forest data standard during 2017.
Licence NA
Related datasets Forest resource data D18.01
Data set size NA
Frequency of
update NA
Access interfaces Metsään.fi user interface; Web Service in the background.
Contact point [email protected]
The quality control data on work done in forests supports compliance with the Best Practice
Guidelines for Forest Management. The data are already being collected and saved in METSAK's
information systems, but the amount of data needs to be increased. The data are also planned
to be collected through a mobile application.
This pilot is about presenting the quality control data in the Metsään.fi eService for forest
owners and forestry operators, and supporting the requirement specification of a new mobile
application and its interfaces. In Metsään.fi, forest owners should be able to follow the quality
of the work done in their forests and compare it to the national average. Forestry operators
likewise have the quality data of their own forest work in Metsään.fi, with the possibility to
compare it to the national average.
The quality control data consist of the forest estate, the number of the financing conclusion,
the geometry of compartments, the type of forest work, sample plot locations, measured data
per sample plot, measurement averages per compartment, the measurement date and user
information. The quality control data will be added to the existing forest data standard during
2017.
A.10 Ontology for (Precision) Agriculture (PSNC - D09.01)
Name FOODIE ontology
Identifier D09.01
Owner PSNC
Description
The ontology enables the representation of data compliant with the
FOODIE data model in semantic format and their interlinking with
established vocabularies and ontologies (e.g., AGROVOC). Thus, in line
with the FOODIE data model, different agriculture-related concepts can
be described and represented, including agricultural facilities, crop and
soil data, treatments, interventions, agricultural machinery, etc.
Additionally, the ontology (like the model) is based on the INSPIRE
directive, ISO standards (e.g. 19156, 19157) and OGC standards. The
ontology can be used for data semantization tasks, in order to enable
access to (semi-)structured data (e.g., tabular, relational), as well as for
the publication of such data following linked data principles.
Classification(s) Ontology, OWL, INSPIRE, ISO/OGC
Date Aug-15
Area coverage Agnostic
Time coverage Agnostic
Format OWL
Licence Creative Commons Attribution 3.0
Related datasets FOODIE data model
Data set size 100KB
Frequency of
update fixed
Access
interfaces
SPARQL
Contact point [email protected]
The FOODIE ontology enables the representation of data compliant with the FOODIE data
model in semantic format and their interlinking with established vocabularies and ontologies
(e.g., AGROVOC). Thus, in line with the FOODIE data model, different agriculture-related
concepts can be described and represented, including agricultural facilities, crop and soil data,
treatments, interventions, agricultural machinery, etc. Also in line with the FOODIE data
model, the ontology is based on the INSPIRE directive, ISO standards (e.g. 19156, 19157) and
OGC standards. The ontology can be used for different semantic tasks, such as data
semantization for the transformation of (semi-)structured data (e.g., tabular, relational) to
semantic format; ontology-based data access, e.g., accessing relational databases as virtual,
read-only RDF graphs; and the publication of linked data, including the discovery of links with
relevant datasets in the Linked Open Data cloud.
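Data semantization as described above boils down to mapping tabular records onto ontology terms. The toy sketch below emits Turtle triples from one record; the subject URI and predicate names are placeholders, not actual FOODIE ontology terms:

```python
def row_to_turtle(subject_uri, row, prefix="ex"):
    """Serialize one tabular record as Turtle triples using placeholder
    predicate names (real predicates would come from the FOODIE ontology)."""
    lines = []
    for key, value in row.items():
        literal = '"{}"'.format(value)
        lines.append("<{}> {}:{} {} .".format(subject_uri, prefix, key, literal))
    return "\n".join(lines)

turtle = row_to_turtle(
    "http://example.org/plot/1",  # placeholder subject URI
    {"cropSpecies": "wheat", "soilType": "loam"},
)
```

In practice this mapping is done with RDF tooling (e.g. R2RML mappings or an RDF library) rather than string formatting, but the record-to-triples correspondence is the same.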
A.11 Wuudis data (MHGS - D20.01)
Name Wuudis
Identifier D20.01
Owner MHG Systems Oy Ltd
Description Forest data in XML and JSON formats
Classification(s)
Date 21 February 2017 (pilot area data loaded into Wuudis)
Area coverage Hankasalmi and Äänekoski area, Finland
Time coverage
Format JSON application/json, XML application/xml
Licence Owner of the data
Related datasets http://metsatietostandardit.wm.fi/en
Data set size Dynamic
Frequency of
update Dynamic
Access
interfaces
REST
Contact point
[email protected], veli-
Wuudis uses the Finnish forest information standard as its basic data import/export format,
and the Wuudis service data model is based on this standard. All development activities during
the DataBio project that affect the Wuudis data model are based on the Finnish forest
information standard. The forest information standard includes a set of different standardized
schemas (e.g. timber sales, logistics). Some of these schemas can be used in DataBio, and
some new specifications will be developed during the project.
Basic information about the forest information standard is available at
http://www.metsatietostandardit.fi/en. The base forest information standard XML schema
description can be found at
https://extra.bitcomp.fi/metsastandardi_ehdotus/V8/MV/doc/index.html. This schema
includes basic forest property data, stands, operations and tree strata; everything is based on
this basic real estate information. The whole schema repository can be found at
https://www.bitcomp.fi/metsatietostandardit/
Wuudis also has an open REST API that uses plain JSON, which is faster than standard-based
XML data transfer. With the JSON interface, different kinds of query parameters can be used
and data can be fetched in parts (such as a single stand or operation). All available resources
are listed in the WADL documentation: https://wuudis.com/api/application.wadl
Different map layers are another important data source for Wuudis. Wuudis uses global map
services such as Google and Microsoft (Bing) to provide worldwide satellite map layers to end
users, and also provides map layers from the National Land Survey of Finland's WMS/WMTS
service. More information about the National Land Survey of Finland map services can be
found at http://www.maanmittauslaitos.fi/en/maps-and-spatial-data/maps/view-maps
A.12 SigPAC (Tragsa - D11.05)
Name SigPAC
Identifier D11.05
Owner Junta de Castilla y Leon (Autonomic Government)
Description CAP Information System. Land parcel identification system.
Classification(s) Spatial dataset.
Date
Area coverage Pilot Area
Time coverage 2016 - End of the Project
Format ESRI Shape File
Licence CC-BY
Related datasets Cadaster information
Data set size Mb
Frequency of
update Annual
Access interfaces http://www.datosabiertos.jcyl.es/web/jcyl/set/es/urbanismoinfraestructuras/SIGPAC/1284225645888
Contact point
SigPAC is the land parcel identification system of the CAP information system, provided by
the Junta de Castilla y León (autonomous regional government).
A.13 Field data - pilot B2 (Tragsa - D11.07)
Name Field data - pilot B2
Identifier D11.07
Owner TRAGSA Group
Description
Direct observations + Direct & Lab measurements: Chlorophyll content,
morphology, green & dry weight, hydric potential, Leaf Area Index (LAI),
visual classification of damages. Features TBD according to the pilot
needs.
Classification(s) Field data
Date Specific dates TBD according to the pilot needs: 2017-2019
Area coverage Study sites TBD in: Extremadura, Galicia
Time coverage Specific dates TBD according to the pilot needs: 2017-2019
Format TBD
Licence Under agreement. Property of TRAGSA Group
Related datasets RPAS data
Data set size TBD
Frequency of
update NA
Access interfaces NA
Contact point [email protected]
Direct observations + Direct & Lab measurements: Chlorophyll content, morphology, green &
dry weight, hydric potential, Leaf Area Index (LAI), visual classification of damages. Features
TBD according to the pilot needs.
A.14 IACS (NP - D13.01)
Name IACS
Identifier D13.01
Owner NP, GAIA
Description Anonymised IACS data
Classification(s) IACS
Date 2017-2018
Area coverage Greek Pilot Area of T1.4.2
Time coverage 2016-2017
Format GeoJSON
Licence NP private
Related datasets D13.02.xlsx, DS13.01.xlsx
Data set size 100,000 records
Frequency of update Yearly
Access interfaces SQL
Contact point [email protected]
Anonymised IACS data
A.15 Sentinel Data
• Sentinel-2 HR optical data: Sentinel-2 archive, European Space Agency (ESA), global coverage. NP has the data for its pilot areas (T1.2.1, T1.4.1, T1.4.2), corresponding to 6 tiles. Thematic Exploitation Platforms, such as the Forestry TEP (C16.10), are available for online analytics.
• Sentinel-2 L1 data (C14.01): Sentinel-2 L1 data archive, ESA, Czech Republic.
• Sentinel-1 IWS data (C14.02): Sentinel-1 L1 data archive, EO data, Czech Republic.
A.16 Tree species map (FMI - D14.03)
Name Tree species map
Identifier D15.3
Owner ESA
Description Tree species map
Classification(s) Raster dataset
Date 2017
Area coverage Czech Republic
Time coverage 2017
Format GeoTiff
Licence Property of FMI
Related datasets <Link to the related datasets, identifiers of the descriptions>
Data set size Approximately 1Gb
Frequency of update Fixed
Access interfaces <e.g. SQL, REST>
Contact point [email protected]
Tree species map: a raster dataset based on classification of Sentinel-2 multi-temporal data
and the National Forest Inventory of the Czech Republic. 20 m spatial resolution, distinguishing
the six most abundant tree species in the Czech Republic.
A.17 Stand age map (FMI - D14.04)
Name Stand age map
Identifier D15.4
Owner FMI
Description
Stand age map, 20m resolution, pixel value corresponds to the age of
dominant tree species
Classification(s) Raster dataset
Date 2017
Area coverage Czech Republic
Time coverage 2017
Format GeoTiff
Licence Property of FMI
Related datasets Derived from forest management plans of FMI
Data set size Approximately 1Gb
Frequency of
update Fixed
Access interfaces <e.g. SQL, REST>
Contact point [email protected]
Vector layer based on Czech forest management plans; stand age based on detailed forest
inventory; countrywide coverage with a 10-year update interval.
A.18 Canopy height map (FMI - D14.05)
Name Canopy height map
Identifier D15.5
Owner FMI
Description
Canopy height map, 20m resolution, pixel value corresponds to the
height of dominant tree species
Classification(s) Raster dataset
Date 2017
Area coverage Czech Republic
Time coverage 2017
Format GeoTiff
Licence Property of FMI
Related datasets Derived from stereo-ortophoto maps of FMI
Data set size Approximately 4Gb
Frequency of
update Fixed
Access interfaces <e.g. SQL, REST>
Contact point [email protected]
Stand age (growth stages) according to the canopy height model derived from aerial
stereo-orthophoto interpretation of the Czech Land Survey (data available countrywide every
second year). Spatial resolution 5 m; four different growth stages are distinguished, plus
absolute canopy height.
A.19 Leaf area index (FMI - D14.06)
Name Leaf area index
Identifier D15.6
Owner FMI
Description
Leaf area index assessment for national forest inventory of Czech
Republic, digital hemispherical photography
Classification(s) Photography, numeric values
Date 2017
Area coverage Czech Republic
Time coverage 2015-2017
Format GeoTiff, CSV
Licence Property of FMI
Related
datasets Derived from digital hemispherical photography
Data set size Approximately 10Gb
Frequency of
update Based on field campaigns
Access
interfaces
<e.g. SQL, REST>
Contact point [email protected]
Leaf area index and canopy closure for selected National Forest Inventory sites, based on
interpretation of digital hemispherical photography; in total, 100 to 200 sites are available.
Vector point layer (centre of inventory plot and LAI value).
A.20 Forest damage (FMI - D14.07)
Name Forest damage
Identifier D15.7
Owner FMI
Description In-situ observations of forest damage
Classification(s) Photography, numeric values
Date 2017
Area coverage Czech Republic
Time coverage 2017 -
Format GeoTiff, CSV
Licence Property of FMI
Related datasets Derived from Wuudis mobile application
Data set size Gigabytes
Frequency of update Based on field campaigns
Access interfaces <e.g. SQL, REST>
Contact point [email protected]
In-situ observations of forest damage, collected by FMI in the Czech Republic. Forestry
statistics for selected plots, with information about the amount of salvage cutting.
A.21 Hyperspectral image orthomosaic (Senop - D44.02)
Orthorectified hyperspectral mosaic, n-bands, band-matched. Format: ENVI / multipage TIF / single-band TIF.
A.22 GAIATrons IoT (DS13.01)
Name Internet of Things (IoT)
Identifier DS13.01
Owner NP
Description
Measurements collected from ground stations called GAIATrons. There are
two types of stations: the GAIATron Atmo station measures atmospheric
parameters and the GAIATron Soil station measures soil parameters.
Classification(s)
Atmospheric parameters (leaf wetness, wind direction/period/strength,
rain, pressure, temperature, relative humidity)
Soil parameters (humidity and temperature)
Date
Measurements started 1/5/2016 for the Greek pilots in T1.2.1 (Pilot A1.1:
Precision agriculture in olives, fruits, grapes)
Measurements start 1/5/2017 for the Greek pilot in T1.3.1 (Pilot B1.1:
Cereals and biomass crops, Greece)
Measurements started 1/5/2016 for the Greek pilot in T1.4.1 (Pilot C1.1:
Insurance, Greece)
Area coverage
The coverage area of each station varies; in total, the GAIATrons cover
the Greek pilot areas (for more details, please check the Greek pilot
descriptions).
Input format Text files over TCP/UDP
Event format Set of <key><value> pairs describing all the event types (mentioned in the classification)
Licence Visible to pilot. Data owner is NP.
Related
datasets D13.02.xlsx
Stream size
Approximately 3,000 entries per GAIATron Atmo station per day and 1,500
entries per GAIATron Soil station per day
Contact
point [email protected]
Measurements are collected from ground stations called GAIATrons. There are two types of
stations: the GAIATron Atmo station measures atmospheric parameters and the GAIATron Soil
station measures soil parameters. The coverage area of each station varies; in total, the
GAIATrons cover the Greek pilot areas (for more details, please check the Greek pilot
descriptions).
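The GAIATron event format is a set of key/value pairs delivered as text over TCP/UDP. The exact wire format is not specified above, so the sketch below assumes an illustrative `key=value` encoding with semicolon separators; only the key/value structure and the parameter names come from the table:

```python
def parse_gaiatron_event(line):
    """Parse a key=value;key=value event line into a dict, converting
    numeric values to float. The delimiter choice is an assumption,
    not the documented GAIATron wire format."""
    event = {}
    for pair in line.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        try:
            event[key] = float(value)
        except ValueError:
            event[key] = value  # keep non-numeric values as strings
    return event

# Fabricated sample line using parameter names from the classification above.
sample = "temperature=21.4;relative_humidity=63;wind_direction=NW"
event = parse_gaiatron_event(sample)
```

At roughly 3,000 Atmo entries and 1,500 Soil entries per station per day, a parser like this would sit at the ingestion edge of the stream-processing pipeline.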
A.23 Phenomics, metabolomics, genomics and environmental
datasets (CERTH - DS40.01)
Name <Name of the stream data source>
Identifier DS40.01
Owner CERTH
Description
Phenomics, metabolomics, genomics and environmental datasets,
Genomic predictions and selection data
Classification(s) raw txt data; csv data
Date 1M to 12M
Area
coverage Regions of Thessalia
Input Format Fastq, fasta, txt, tsv, json, csv
Event format <Set of pairs of <attribute><value> that describe the event type>
Licence
<Information about rights held in and over the stream data including
specific license name>
Related
datasets <Link to the related datasets, identifiers of the descriptions>
Stream size <Indicative data stream size, e.g 100 Mbps, 100 events/second >
Contact point [email protected], [email protected]
Phenomics, metabolomics, genomics and environmental datasets, Genomic predictions and
selection data.