Not our data, but we use it in research Wietse Dol, LEI-WUR (W.Dol@wur.nl) 6 October 2014

Not our data, but we use it in research

Wietse Dol, LEI-WUR (W.Dol@wur.nl)

6 October 2014

Wietse Dol

PhD Econometrics

10 years University of Groningen (Econometrics, sampling theory)

21 years LEI (many different departments)

Data and models, i.e. use/reuse and quality; troubleshooter + statistical methods + ICT + user interfaces

Not an IT specialist but a researcher (I build software because I use it myself)

Many model projects and user interfaces for models (not only LEI)

Since 2006: data, data quality ≡ MetaBase

LEI: Agricultural Economic Research Institute

Part of Wageningen University & Research center (WUR)

Part of the Social Sciences Group (SSG) within WUR

We are the research part of WUR/SSG (advising the Ministry of Economic Affairs), based in The Hague

Consultancy (applied research): ministries, EU, local government, industry, …

Collecting data (farm data: FADN), building models, and providing agricultural content specialists

University vs. Research center

University: teaching, publications, new theory and technology

Research center:

● applied work/consultancy

● reusing things from the past (e.g. yearly publications)

● sharing knowledge (how to become a content specialist)/teaching for small groups

● working in groups (different disciplines)

● working in (inter)national groups with many different disciplines

Research centers have experience in data management.

Primary vs. Secondary research data

Research data: collected, observed, or created, for the purpose of analysis to produce and validate original research results.

Primary data: data you collect yourself, targeted to answer/validate your questions.

Secondary data: not yours, e.g. downloaded from a website.

• Growing need for secondary data (primary data is expensive and takes a lot of time to collect).

• Quality of data

• Meta-information and versioning are crucial

Production data

Meta-information: source, version, dimension, definitions, etc. Without proper meta-information you risk using the wrong data:

Is FR with or without DOM (the overseas departments)?

Is the production in tons or in euros?

Does the year start on 1-1 and end on 31-12?

What’s the definition of tomato?

Owner of the data / version of the data / conditions of usage …

Product   Country   Year   Production
Tomato    NL        2005   325
Wheat     BE        1999   100
Sugar     FR        2003   450
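
The questions on this slide can be turned into a simple checklist that travels with the table. The sketch below is only an illustration: the field names (`source`, `unit`, `territory_FR`, ...) are hypothetical and are not taken from MetaBase. Every field starts as `UNKNOWN`, because the table itself documents none of them.

```python
# Rows exactly as shown in the production table.
rows = [
    ("Tomato", "NL", 2005, 325),
    ("Wheat", "BE", 1999, 100),
    ("Sugar", "FR", 2003, 450),
]

UNKNOWN = "UNKNOWN"  # marker for meta-information that has not been documented

# Hypothetical field names; the table documents none of them.
meta = {
    "source": UNKNOWN,
    "version": UNKNOWN,
    "owner_and_conditions": UNKNOWN,
    "unit": UNKNOWN,               # tons or euros?
    "year_definition": UNKNOWN,    # 1-1 to 31-12, or something else?
    "territory_FR": UNKNOWN,       # with or without DOM?
    "definition_Tomato": UNKNOWN,
}


def undocumented(meta_info):
    """Return the names of all meta-information fields still marked UNKNOWN."""
    return sorted(name for name, value in meta_info.items() if value == UNKNOWN)


def is_usable(meta_info):
    """A table counts as usable here only when no field is UNKNOWN."""
    return not undocumented(meta_info)


print(is_usable(meta))
print(undocumented(meta))
```

Refusing to use a table until every field is filled in is a deliberately strict choice; in practice a group would agree on which fields are mandatory.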

Lifecycle Model of data

http://www.dcc.ac.uk/resources/curation-lifecycle-model

Data

Use data

How to get the data, filter it and store it

Inspection and Quality checks on the data

How to make it available for others

What scientific actions are done on the data

Curate, preserve, versions, … Lifecycle Model

Don’t do it alone, do it as a GROUP and communicate

Everybody / Not often / Seldom

Types of databases according to MetaBase

Statistical database

Scientific database

Meta-database

Statistical database: secondary data

Databases provided by international organizations like the EU, FAO, OECD and World Bank are in general statistical databases:

● Good web interfaces for downloading data

● Data are stored as they are received

● Data are consistent in their own domain

● No aggregations are made when underlying data are missing

● Little attention to data checking

● No versioning system (data changes

Scientific versus Statistical database

Problems with statistical databases:

● Different definitions of territories and commodities

● Typing errors

● Missing data

● Breaks in series

Scientific database:

● Problems solved

● Transparency (original data sources and underlying assumptions are kept)

● Versioning of the data

● Essential for modeling and research

Structural design of a scientific database

Key words for structural design (HarDFACTS project, IPTS 2007, carried out by vTI/LEI):

● Transparent

● Harmonised

● Complete

● Consistent

HarDFACTS: Harmonised Database for Agricultural Commodity Time Series

=> The amount of effort/cost scares institutes, but it is often a “hidden” cost.

Transparent

Original data from statistical database are stored

Complete and consistent data are stored

Original and completed data can be compared

Calculation procedures are stored and can be repeated (scripting language)

Harmonised

The definition used here is to bring together the different international databases in one framework and to link the data through a unique coding system (keywords are classifications, tree structures and super-classifications).
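
A minimal sketch of what linking data "through a unique coding system" could look like. The source names, source codes, harmonised codes and the tree below are all hypothetical placeholders, not real FAO or Eurostat codes and not the MetaBase data model. The original record is kept untouched next to the harmonised code, in line with the transparency requirement.

```python
# Hypothetical concordance: (source, source_code) -> harmonised code.
CONCORDANCE = {
    ("source_a", "A-001"): "TOMATO",
    ("source_b", "B-17"): "TOMATO",
    ("source_a", "A-002"): "WHEAT",
}

# Hypothetical tree structure: harmonised code -> parent in the classification.
PARENT = {
    "TOMATO": "VEGETABLES",
    "WHEAT": "CEREALS",
    "VEGETABLES": "CROPS",
    "CEREALS": "CROPS",
}


def harmonise(record):
    """Return a copy of the record with a harmonised code added.

    The original fields are kept as they were received; a code that is
    missing from the concordance yields None instead of a guess.
    """
    key = (record["source"], record["code"])
    return {**record, "harmonised_code": CONCORDANCE.get(key)}


def ancestors(code):
    """Walk up the classification tree from a harmonised code to the root."""
    chain = []
    while code in PARENT:
        code = PARENT[code]
        chain.append(code)
    return chain


rec = harmonise({"source": "source_b", "code": "B-17", "value": 1.0})
print(rec)
print(ancestors(rec["harmonised_code"]))
```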

Complete

The definition used in MetaBase is that econometric procedures are proposed to complete the new (time) series in the database (especially needed for models).

Consistent

The definition used here is that the interrelationships of the data in the database hold across classifications (time, territories and variables).
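
As a rough illustration of both ideas, the sketch below fills interior gaps in a time series and checks that a total equals the sum of its parts. Plain linear interpolation stands in for the econometric procedures mentioned above, which the slides do not specify; the function names and the example numbers are made up for this sketch. Estimated values are flagged so that original and completed data can still be told apart.

```python
def complete_series(series):
    """Fill interior gaps (None) by linear interpolation.

    Returns a list of (year, value, estimated) tuples. Gaps at the start or
    end of the series are left open: this sketch does not extrapolate.
    """
    years = sorted(series)
    known = [y for y in years if series[y] is not None]
    completed = []
    for year in years:
        value = series[year]
        if value is not None:
            completed.append((year, value, False))
            continue
        before = [k for k in known if k < year]
        after = [k for k in known if k > year]
        if not before or not after:
            completed.append((year, None, False))
            continue
        y0, y1 = before[-1], after[0]
        v0, v1 = series[y0], series[y1]
        estimate = v0 + (v1 - v0) * (year - y0) / (y1 - y0)
        completed.append((year, estimate, True))
    return completed


def is_consistent(total, parts, tol=1e-9):
    """Check that an aggregate equals the sum of its parts (relative tolerance)."""
    return abs(total - sum(parts)) <= tol * max(1.0, abs(total))


# Made-up numbers, for illustration only.
example = {2000: 10.0, 2001: None, 2002: 14.0, 2003: None}
for row in complete_series(example):
    print(row)
print(is_consistent(30.0, [10.0, 20.0]))
```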

Versioning of your research

Main reason for versioning: Reproducibility

Software you use changes: software versions

Data changes/is updated/corrected: data versions

You discover errors in your research process or you improve the procedure: model versions

Best advice: do not use a spreadsheet but a scripting language (SQL, R, GAMS, …) and store data in a database (with a good data model). The scripts then document how the original data were transformed into the data used in your research.

Store data and scripts in a version control system such as SVN, e.g. with TortoiseSVN: http://tortoisesvn.net/

Do it as a group and (re)use others’ results.

Versioning 2

Try to separate Model (script) from Data

Make generic scripts when possible (re-use)

Store Script and Data in separate SVN repositories

Add meta-information to data as well as your scripts

E.g. register the versions of the software you use

Test whether your data and code also run on other computers
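
The advice to register software versions and to add meta-information to data and scripts could be sketched as follows. This is an assumption-laden example, not the MetaBase implementation: the record layout, the function names and the package entry are hypothetical. It stores the interpreter version, the platform and a checksum of both data and script, so a later run on another computer can be compared against it.

```python
import hashlib
import json
import platform


def sha256_of(content: bytes) -> str:
    """Checksum that changes whenever the data or the script changes."""
    return hashlib.sha256(content).hexdigest()


def version_record(data: bytes, script: bytes, packages: dict) -> dict:
    """Collect meta-information to store next to a versioned result.

    `packages` maps package names to version strings and is supplied by the
    caller; this sketch does not discover installed packages itself.
    """
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
        "data_sha256": sha256_of(data),
        "script_sha256": sha256_of(script),
    }


# Placeholder content and a hypothetical package entry, for illustration only.
record = version_record(
    data=b"Product,Country,Year,Production\n",
    script=b"print('model run')\n",
    packages={"example-package": "0.0.0"},
)
print(json.dumps(record, indent=2))
```

Committing such a record together with the data and the script keeps the three kinds of versions (software, data, model) in one place.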

Example: Outlier testing in MetaBase

Land under permanent crop in Spain by Eurostat
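
The slides do not describe how MetaBase tests for outliers. As a hedged illustration of the general idea, the sketch below flags suspicious values in a series with a robust score based on the median absolute deviation (MAD). The threshold of 3.5 is a common rule of thumb, and the numbers are made up; they are not the Eurostat figures for Spain.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices of values whose robust score exceeds the threshold.

    The score is 0.6745 * |x - median| / MAD; the constant makes it roughly
    comparable to a z-score for normally distributed data. A series with
    MAD == 0 (more than half of the values identical) yields no outliers here.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Made-up series with one obvious jump, for illustration only.
series = [100, 102, 101, 99, 250, 100, 98]
print(mad_outliers(series))
```

A flagged value is only a candidate for inspection: it may be a typing error, a break in the series, or a real change.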

Versioning 3

Versioning looks time-consuming, but when you make mistakes it is easy to go back to an earlier state. It is also a good first step towards sharing data, and it works very well in groups.

Easy to see differences between versions.

Versioning makes it possible to reproduce research, even in 5 years’ time.

Frequency of versioning: some make a version every day. Practical advice: make a version whenever you have a publication.

MetaBase: data management for data

MetaBase

1. Many different data sources (e.g. FAO, Eurostat), all in the same user interface (SDMX, NetCDF)

2. Find data alternatives using meta-information

3. Search data content (e.g. oilseed)

4. All content easily available in research software

5. Recodings, aggregations and concordances are all implemented in GAMS

6. Statistical methods in GAMS and R

7. Versioning: Eurostat (monthly), FAO (twice per year)

8. Example: http://www.agrimatie.nl/

Always play with your data and communicate

Wishes, problems, requests: [email protected]