Webinar@AIMS: INRA's Big Data Perspectives and Implementation Challenges

INRA's Big Data perspectives and implementation challenges

Pascal NeveuUMR MISTEA

INRA - Montpellier

11 avril 2013

Pascal Neveu 2

What is INRA?

French Institute for Agricultural Research

• Specialties:

– Agriculture– Environment– Food

● 10000 Persons (4000 researchers)

● 18 Centres (> 40 sites)

11 avril 2013

Pascal Neveu 3

What is INRA ?

Technical staff (4000) --> data producer

INRA Data characteristics:

- Spread over territory

- Many disciplines

- National and international collaborations

Pascal Neveu / AG MIA 2014 4

Raise integrated issues and challenges:

– How to adapt agriculture to climate change?

– How agriculture impacts environment?

– Agroecology «producing and supplying food in a different way »

– Global food security and needs of adaptation

– Plant treatment and food safety

– ...

Agronomic Sciences

11 avril 2013

Pascal Neveu 5

Data challenges in ScienceModern science must deal with:

● More experimental data

● A lot of datasets available on the Web

● More collaborative and integrative approaches

→ Management, sharing and data analysis play an increasing role in research

Discover, combine and analysethese data

→ Big Data Challenges

11 avril 2013

Pascal Neveu 6

Big Data

A definition: data sets grow so large and complex that it becomes impossible to process using traditional data processing methods (Management and Analysis)

Make data valuable (information, knowledge, decision support)

11 avril 2013

Pascal Neveu 7

Agronomic Big Data V characteristics

● Volume: massive data and growing size → hard to store, manage and analyze

● Variety and Complexity: different sources, scales, disciplines different semantics, schemas and formats etc. → hard to understand, combine, integrate

● Velocity: speed of data generation → have to be processed on line

● Validity, Veracity, Vulnerability, Volatility, Visibility, Visualisation, etc.

11 avril 2013

Pascal Neveu 8

Why Big Data is important in Agronomic Sciences?

Production of a lot of heterogeneous data for understanding

● Open new insights

● Allow to know:

– Which theories are consistent and which ones are not!

– When data did not quite match what we expect…

→ Decision support needs (integrative and predictive approaches)

11 avril 2013

Pascal Neveu 9

Illustration: Plant Phenotyping

Environment

Plant Genotype

Phenotype

Interactions

11 avril 2013

Pascal Neveu 10

Searching for the most adapted genotypes

High throughput

Various Environments

Many Plant Genotypes

High frequency anddynamic trait observations of a lot of Phenotypes

Interactions

11 avril 2013

Pascal Neveu 11

Why high throughput phenotyping is important for agriculture?

● Adaptation to climate change

● More efficient use of natural resources (including water and soil) in our farming practices

● Sustainable management and equity

● Food securityCrop performance (yields are globally decreasing)

● …

Genotyping and Phenotyping

Plant phenotyping has become a bottleneck for progress in plant science and plant breeding

11 avril 2013

Pascal Neveu 12

Phenome High throughput plant phenotyping

French Infrastructure 9 multi-species platforms

● 2 green house platforms

● 5 field platforms

● 2 omics platforms

11 avril 2013

Pascal Neveu 13

5 Field Platforms

Various scales and data types● Cell, organ, plant, population● Images, hyperspectral, spectral, sensors, human readings...

Thousands of micro-plots

time

11 avril 2013

Pascal Neveu 14

2 Green house Platforms

-1

Time (d)

0 20 40 60 80 1000

10

20

30

40

50

60

Pla

nt b

iom

ass

g

Various scales and data types

time

11 avril 2013

Pascal Neveu 15

2 « Omics » platforms

●

Grinding weighting (-80°C)

ExtractionFractionation

PipettingIncubation Reading

Various data complex types

Composition and structure

Quantification of metabolites and enzyme activities

11 avril 2013

Pascal Neveu 16

Data management challengesin Phenome: Volume growth

40 Tbytes in 2013, 100 Tbytes in 2014, …

● Volume is a relative concept– Exponential growth makes hard

● Storage● Management● Analysis

Phenome HPC and Storage→ Grid computing (FranceGrille, EGI)– On-demand infrastructure and Elasticity– Virtualization technologies – Data-Based parallelism

(same operation on different data)

11 avril 2013

Pascal Neveu 17

Data management challengesin Phenome: Variety

● Produced by different communities (geneticians, ecophysiologists, farmers, breeders, etc)

● Data integration needs extensive connections and associations to other types of data

● (environments, individuals, populations, etc.)● Different semantics, data schemas, …

Extremely diverse data

→ Web API, Ontology sets, NoSQL and Semantic Web methods

11 avril 2013

Pascal Neveu 18

Data management in Phenome: Velocity

– Green House platforms produce tens of thousands images/day (200 days/year)

– Field platforms produce tens of thousands images/day(100 days/year)

– Omic platforms produce tens of Gbytes/day(300 days/year)

Scientific Workflow – OpenAlea /provenance module (Virtual Plant INRIA team) – Scifloware (Zenith INRIA team)

11 avril 2013

Pascal Neveu 19

Data management challengesin Phenome: Validity

Data cleaning

● Automatically diagnose and manage:– Consistency? Duplicates? Wrong?– Annotation consistency?– Outliers? – Disguised missing data?– ...

Some approaches – Unsupervised Curve clustering (Zenith INRIA team)– Curve fitting over dynamic constraints– Image Clustering

11 avril 2013

Pascal Neveu 20

Conclusion

High throughput phenotyping data:

– Hard to produce

– Hard to manage

– Hard to analyse

→ We need new approaches

Thank you for your attention

Technology

Webinar@AIMS: INRA's Big Data Perspectives and Implementation Challenges