INRA's Big Data perspectives and implementation challenges
Pascal Neveu, UMR MISTEA
INRA - Montpellier
11 April 2013
What is INRA?
French Institute for Agricultural Research
• Specialties:
– Agriculture
– Environment
– Food
● 10,000 staff (4,000 researchers)
● 18 Centres (> 40 sites)
What is INRA?
Technical staff (4,000) → data producers
INRA Data characteristics:
- Spread across the territory
- Many disciplines
- National and international collaborations
Pascal Neveu / AG MIA 2014
Agronomic Sciences
Raise integrated issues and challenges:
– How can agriculture adapt to climate change?
– How does agriculture impact the environment?
– Agroecology: «producing and supplying food in a different way»
– Global food security and the need for adaptation
– Plant treatment and food safety
– ...
Data challenges in Science
Modern science must deal with:
● More experimental data
● A lot of datasets available on the Web
● More collaborative and integrative approaches
→ Data management, sharing and analysis play an increasing role in research
Discover, combine and analyse these data
→ Big Data Challenges
Big Data
A definition: data sets that grow so large and complex that they become impossible to process with traditional data processing methods (management and analysis)
Make data valuable (information, knowledge, decision support)
Agronomic Big Data: the «V» characteristics
● Volume: massive data and growing size → hard to store, manage and analyze
● Variety and Complexity: different sources, scales and disciplines; different semantics, schemas, formats, etc. → hard to understand, combine, integrate
● Velocity: speed of data generation → data have to be processed online, as they arrive
● Validity, Veracity, Vulnerability, Volatility, Visibility, Visualisation, etc.
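The Velocity point can be illustrated with an online algorithm, one that updates its result per reading instead of storing the whole stream. The sketch below uses Welford's method for a running mean and variance; the class name, the readings and their meaning are assumptions made for illustration, not part of the talk.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running mean and variance, updated one reading at a time (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        # Constant work and memory per reading: nothing from the stream is kept.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; undefined for fewer than two readings.
        return self.m2 / (self.count - 1) if self.count > 1 else float("nan")


# Hypothetical stream of sensor readings (arbitrary units).
stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)

print(stats.count, stats.mean, stats.variance)
```

Because each update touches only three numbers, the same object can follow a sensor for a whole season without the memory footprint growing.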
Why is Big Data important in Agronomic Sciences?
A lot of heterogeneous data is produced to support understanding
● Opens up new insights
● Lets us find out:
– Which theories are consistent and which ones are not!
– When data do not quite match what we expect…
→ Decision support needs (integrative and predictive approaches)
Illustration: Plant Phenotyping
[Diagram: the Phenotype results from the Plant Genotype, the Environment and their Interactions]
Searching for the best-adapted genotypes
High throughput
Various Environments
Many Plant Genotypes
High-frequency and dynamic trait observations of many phenotypes
Interactions
Why is high-throughput phenotyping important for agriculture?
● Adaptation to climate change
● More efficient use of natural resources (including water and soil) in our farming practices
● Sustainable management and equity
● Food security
● Crop performance (yields are globally decreasing)
● …
Genotyping and Phenotyping
Plant phenotyping has become a bottleneck for progress in plant science and plant breeding
Phenome: high-throughput plant phenotyping
French infrastructure: 9 multi-species platforms
● 2 greenhouse platforms
● 5 field platforms
● 2 omics platforms
5 Field Platforms
Various scales and data types
● Cell, organ, plant, population
● Images, hyperspectral, spectral, sensors, human readings...
Thousands of micro-plots, observed over time
2 Greenhouse Platforms
Various scales and data types
[Figure: plant biomass (g, axis 0 to 60) against time (d, axis 0 to 100)]
2 «Omics» platforms
Grinding, weighing (-80°C)
Extraction, fractionation
Pipetting, incubation, reading
Various complex data types
Composition and structure
Quantification of metabolites and enzyme activities
Data management challenges in Phenome: Volume growth
40 Tbytes in 2013, 100 Tbytes in 2014, …
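As a quick check on what these two figures imply, the sketch below computes the year-on-year growth factor and the doubling time that would follow if growth were exponential at that rate. Two data points cannot establish a trend, so the constant-rate assumption is illustrative only; the function names are hypothetical.

```python
import math


def growth_factor(previous_tb: float, current_tb: float) -> float:
    """Year-on-year multiplication factor between two yearly volumes."""
    return current_tb / previous_tb


def doubling_time_years(factor: float) -> float:
    """Years needed to double, ASSUMING the yearly factor stays constant."""
    return math.log(2) / math.log(factor)


factor = growth_factor(40, 100)         # 2013 -> 2014, figures from the slide
doubling = doubling_time_years(factor)  # illustrative, not a forecast

print(f"yearly factor: {factor:.2f}")
print(f"doubling time: {doubling:.2f} years")
```

Under that assumption the volume multiplies by 2.5 each year, which corresponds to doubling roughly every nine months.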
● Volume is a relative concept
– Exponential growth makes storage, management and analysis hard
Phenome HPC and Storage
→ Grid computing (FranceGrille, EGI)
– On-demand infrastructure and elasticity
– Virtualization technologies
– Data-based parallelism (same operation on different data)
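Data-based parallelism, the same operation applied independently to different data, can be sketched with Python's standard library. This is a local illustration of the pattern only, not the Phenome grid setup: the image names, pixel values and `mean_intensity` function are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def mean_intensity(pixels: list[int]) -> float:
    """The single operation applied to every image: average pixel value."""
    return sum(pixels) / len(pixels)


# Hypothetical images, reduced to flat lists of grey levels for brevity.
images = {
    "plot_001": [10, 20, 30],
    "plot_002": [0, 255],
    "plot_003": [5, 5, 5, 5],
}

# Each image is independent of the others, so the work can be spread over workers.
# Threads keep the sketch portable; ProcessPoolExecutor offers the same map()
# interface and is the usual choice for CPU-bound work.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(zip(images, pool.map(mean_intensity, images.values())))

print(results)
```

Because no task depends on another, adding workers or machines requires no change to the operation itself, which is what makes this pattern a natural fit for on-demand infrastructure.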
Data management challenges in Phenome: Variety
● Produced by different communities (geneticists, ecophysiologists, farmers, breeders, etc.)
● Data integration needs extensive connections and associations to other types of data (environments, individuals, populations, etc.)
● Different semantics, data schemas, …
Extremely diverse data
→ Web API, Ontology sets, NoSQL and Semantic Web methods
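To make the integration problem concrete, the sketch below maps records from two communities, each with its own field names and units, onto one shared vocabulary. Plain dictionaries stand in for an ontology here; every field name, term and value is hypothetical and none comes from Phenome.

```python
# Hypothetical mappings: source field -> (shared term, conversion to the shared unit).
FIELD_MAPPINGS = {
    "breeder": {
        "variety": ("genotype", str),
        "height_cm": ("plant_height_m", lambda cm: cm / 100),
    },
    "ecophysiologist": {
        "accession": ("genotype", str),
        "plant_height_mm": ("plant_height_m", lambda mm: mm / 1000),
    },
}


def harmonise(source: str, record: dict) -> dict:
    """Rewrite one record using the shared terms; unmapped fields are dropped."""
    mapping = FIELD_MAPPINGS[source]
    harmonised = {"source": source}
    for field, value in record.items():
        if field in mapping:
            term, convert = mapping[field]
            harmonised[term] = convert(value)
    return harmonised


records = [
    harmonise("breeder", {"variety": "G001", "height_cm": 180}),
    harmonise("ecophysiologist", {"accession": "G001", "plant_height_mm": 1750}),
]

for record in records:
    print(record)
```

Once both records use the same terms and units they can be compared or joined directly, which is the role shared ontologies play at a much larger scale.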
Data management in Phenome: Velocity
– Greenhouse platforms produce tens of thousands of images/day (200 days/year)
– Field platforms produce tens of thousands of images/day (100 days/year)
– Omics platforms produce tens of Gbytes/day (300 days/year)
Scientific Workflow
– OpenAlea / provenance module (Virtual Plant INRIA team)
– Scifloware (Zenith INRIA team)
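The idea of a workflow that records provenance can be sketched in a few lines: each step logs a fingerprint of what went in and what came out. This is a generic illustration and does not use or reproduce the OpenAlea or Scifloware APIs; all names and readings are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(data) -> str:
    """Short, stable hash of any JSON-serialisable value."""
    encoded = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def run_workflow(steps, data):
    """Apply each (name, function) step in order, logging its input and output."""
    provenance = []
    for name, func in steps:
        result = func(data)
        provenance.append({
            "step": name,
            "input": fingerprint(data),
            "output": fingerprint(result),
            "finished_at": datetime.now(timezone.utc).isoformat(),
        })
        data = result
    return data, provenance


# Hypothetical two-step pipeline on leaf-area readings; None marks a failed reading.
steps = [
    ("drop_missing", lambda xs: [x for x in xs if x is not None]),
    ("mean", lambda xs: sum(xs) / len(xs)),
]
result, provenance = run_workflow(steps, [12.0, None, 15.0, 18.0])

print(result)
for entry in provenance:
    print(entry["step"], entry["input"], "->", entry["output"])
```

The output fingerprint of one step equals the input fingerprint of the next, so any result can be traced back through the chain to the raw data that produced it.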
Data management challenges in Phenome: Validity
Data cleaning
● Automatically diagnose and manage:
– Consistency? Duplicates? Wrong values?
– Annotation consistency?
– Outliers?
– Disguised missing data?
– ...
Some approaches
– Unsupervised curve clustering (Zenith INRIA team)
– Curve fitting over dynamic constraints
– Image clustering
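Three of the diagnoses listed above (duplicates, disguised missing data, outliers) can be sketched with much simpler rules than the clustering and curve-fitting approaches named on the slide. The sentinel codes, threshold and readings below are assumptions; the outlier rule is the common modified z-score based on the median absolute deviation, not the method used in Phenome.

```python
from statistics import median

# Assumed placeholder codes that hide a missing value behind a plausible number.
SENTINELS = {-999.0, 9999.0}


def diagnose(readings: list[tuple[str, float]], threshold: float = 3.5) -> dict:
    """Flag exact repeats, sentinel values and median-based outliers."""
    seen, duplicates, disguised, valid = set(), [], [], []
    for plant_id, value in readings:
        if (plant_id, value) in seen:
            duplicates.append((plant_id, value))
        elif value in SENTINELS:
            disguised.append((plant_id, value))
        else:
            valid.append((plant_id, value))
        seen.add((plant_id, value))

    outliers = []
    if valid:
        values = [v for _, v in valid]
        med = median(values)
        mad = median([abs(v - med) for v in values])
        if mad > 0:
            # Modified z-score: 0.6745 * |x - median| / MAD.
            outliers = [(p, v) for p, v in valid
                        if 0.6745 * abs(v - med) / mad > threshold]
    return {"duplicates": duplicates, "disguised_missing": disguised,
            "outliers": outliers}


# Hypothetical biomass readings (g) keyed by plant identifier.
report = diagnose([
    ("p1", 10.1), ("p2", 9.8), ("p3", -999.0), ("p4", 10.4),
    ("p2", 9.8), ("p5", 42.0), ("p6", 10.0),
])
print(report)
```

Sentinel values are set aside before the outlier test so that a code such as -999 does not distort the median and hide genuine outliers.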
Conclusion
High throughput phenotyping data:
– Hard to produce
– Hard to manage
– Hard to analyse
→ We need new approaches
Thank you for your attention