Big Data and Official Statistics:
Scanner Data from DB to BD
Francesco Altarocca
[email protected] - Istat
CENTRE of EXCELLENCE ON DATA WAREHOUSING
WORKSHOP WARSAW, POLAND 21 & 22 NOVEMBER 2018
Introduction
• In this talk we present the issues and challenges of dealing with very large datasets, such as those involved in the Scanner Data project at Istat
• We illustrate the IT context of the project, some aspects of the database implementation and some issues we encountered
• We also show how we approached the problem in a way that is completely new for official statistics. The solutions introduced are part of a broader effort to modernise the tools and techniques used for data storage and processing at Istat (Big Data and Data Science in NSIs)
About Scanner Data
• All available chains, with 2,000 stores: 30% of all stores available from Nielsen, covering all 107 provinces
• 530,000 GTINs (elementary products)
• 1.3 billion records per year, about 70 million records per month
• Space occupied by 1 year of microdata in the database: 420 GB (some data are sent more than once)
• 20 million elementary and 170K aggregated indices, computed twice a month (provisional and definitive release)
• Scanner data is one of the sources for the calculation of the consumer price index
About Scanner Data and BD
Space occupied by 1 year of microdata in the Cloudera system: about 80 GB (with Cloudera and RDD)
• Some additional space is used during processing and then released
• It is not possible to update records, only to append new ones (see the sketch below)
• Some data are replicated
• Fewer metadata are used compared to the DB
• Not all ACID properties are guaranteed
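As a minimal illustration of this append-only model, the sketch below writes a monthly batch of microdata to HDFS as Parquet with Spark in append mode. It is only a sketch: the paths, column names and file layout are hypothetical, not the actual Istat schema.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object AppendOnlyWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("scanner-data-append")
      .getOrCreate()

    // Hypothetical monthly batch of scanner microdata; Spark reads
    // gzip-compressed text transparently. Assumes year/month columns exist.
    val batch = spark.read
      .option("header", "true")
      .csv("hdfs:///landing/scanner/2018-11/*.csv.gz")

    // Append-only: each batch adds new files under the target directory;
    // existing records are never modified in place
    batch.write
      .mode(SaveMode.Append)
      .partitionBy("year", "month")
      .parquet("hdfs:///warehouse/scanner/microdata")
  }
}
```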
IT Architecture: from DB to BD

[Diagram: data flows through four stages: Acquisition (SFTP server, control dashboard), Storing (Oracle DB, Hadoop), Data Processing, and Data Access and Analysis (views, MicroStrategy BI tool)]
Production Data Platform
• The current platform for the production architecture is based on Big Data tools:
– 7-node Hadoop cluster (Cloudera), growing to 21 nodes in the near future
– Hadoop: parallel storage and processing platform, the de-facto standard for Big Data
• Some features:
– All historical data always online for interactive analysis
– Possibility of retaining historical data indefinitely
– Construction of a global historical data warehouse of price data
– Easier to perform large-scale analysis
– The output of one phase is the input of another step (e.g. building the base for the next year in a moving-base approach)
Data Ingestion

[Diagram: data arrives via SFTP and passes through the Load and Pre-process steps towards SAS, with a control dashboard supervising the flow]
• Data are sent by Nielsen as compressed text files via SFTP
• Received data are handled by programs written in Java:
- Load: performs integrity checks on the received files, loads the data into the DB, logs the received files and estimates discounts
- Pre-process: performs quality checks at record level and discards dirty data (a record-level check is sketched below)
• The whole acquisition process is controlled through a web dashboard
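As a rough idea of what a record-level quality check can look like, here is a sketch in Spark/Scala (the production checks are implemented in Java and their exact rules are not described here, so the field layout and the rules below, non-empty GTIN and strictly positive turnover and quantity, are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object PreProcess {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("scanner-preprocess").getOrCreate()

    // Hypothetical layout: gtin, store_id, year, month, week, turnover, quantity
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///landing/scanner/2018-11/*.csv.gz")

    // Illustrative record-level rules
    val isValid = col("gtin").isNotNull && col("gtin") =!= "" &&
      col("turnover") > 0 && col("quantity") > 0

    // Good records continue to processing; dirty records are set aside for reporting
    raw.filter(isValid).write.mode("append").parquet("hdfs:///warehouse/scanner/clean")
    raw.filter(!isValid).write.mode("append").parquet("hdfs:///warehouse/scanner/discarded")
  }
}
```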
Data Access and Analysis

[Diagram: data flows from SFTP through Load and Pre-process into the DB; views feed extractions into SAS, while Microstrategy produces reports and visualizations]
Data can be accessed in two ways:
• Extraction from the DB
- Materialized views were created to facilitate the import into SAS
• Use of a business analytics tool (MicroStrategy) for reporting, visualization and browsing of the data
In both access modes, the results of common queries were pre-computed at different levels of aggregation and provided as views or reports (an example of such a pre-computation is sketched below)
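A pre-computed aggregation of this kind could look like the following Spark/Scala sketch; the table paths and the province/chain/turnover columns are assumptions for illustration, not the actual Istat views.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{countDistinct, sum}

object PrecomputeViews {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("scanner-views").getOrCreate()

    val micro = spark.read.parquet("hdfs:///warehouse/scanner/clean")

    // Pre-compute a common query at year x month x province x chain level,
    // so analysts and BI tools do not scan the full microdata on every request
    val byProvinceChain = micro
      .groupBy("year", "month", "province", "chain")
      .agg(
        sum("turnover").as("total_turnover"),
        sum("quantity").as("total_quantity"),
        countDistinct("gtin").as("n_products")
      )

    // Persist the result as a materialized-view-like table for SAS/BI access
    byProvinceChain.write.mode("overwrite")
      .parquet("hdfs:///warehouse/scanner/views/turnover_by_province_chain")
  }
}
```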
Check 2: Quality Checks on Loaded Data in Oracle DB

[Flowchart: after loading, formal post-load checks (Check 2) are applied to the stored elementary data and metadata (GTIN turnover and quantity, GTIN information, store information). Valid records are kept as good data and passed on to data processing; invalid records are set aside as bad data. Tables and reports document the outcome of the checks.]
Data Processing

[Flowchart: valid data flow through three steps. Data processing 1: weekly prices calculation, producing the weekly prices. Data processing 2: microindices calculation, producing the monthly microindices, followed by Check 3: outlier detection (identification of inadmissible data; records that fail the check are set aside). Data processing 3: indices aggregation, producing the monthly aggregated indices. Processed data are stored in tables, and tables and reports document each step. A sketch of the first two steps follows.]
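To make the pipeline concrete, here is a Spark/Scala sketch of the first two processing steps. The methodology is an assumption: weekly prices in scanner data are commonly derived as unit values (turnover divided by quantity), and the microindex below is a plain ratio of the monthly average price to a base-period price, which is not necessarily the formula used at Istat.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, sum}

object IndexPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("scanner-indices").getOrCreate()

    val clean = spark.read.parquet("hdfs:///warehouse/scanner/clean")

    // Data processing 1: weekly prices as unit values per GTIN, store and week
    val weeklyPrices = clean
      .groupBy("gtin", "store_id", "year", "month", "week")
      .agg((sum("turnover") / sum("quantity")).as("weekly_price"))

    // Data processing 2: a naive monthly microindex per GTIN and store,
    // as the ratio of the current monthly average price to a base-period price
    val monthlyPrice = weeklyPrices
      .groupBy("gtin", "store_id", "year", "month")
      .agg(avg("weekly_price").as("monthly_price"))

    val basePrice = monthlyPrice
      .filter(col("year") === 2017 && col("month") === 12) // hypothetical base period
      .select(col("gtin"), col("store_id"), col("monthly_price").as("base_price"))

    val microIndices = monthlyPrice
      .join(basePrice, Seq("gtin", "store_id"))
      .withColumn("microindex", col("monthly_price") / col("base_price") * 100)

    microIndices.write.mode("overwrite").parquet("hdfs:///warehouse/scanner/microindices")
  }
}
```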
Comparison: DB vs BD
• Ingestion and preliminary checks: from 12-24 hours down to less than 1 hour (no need to create indexes, constraints, …)
• Processing phases: from hours to minutes
• Many integrated tools for processing and ingestion (high-level: Hive, Impala, Sqoop; low-level: Scala, Java, Python)
• High scalability: new nodes can be added as datasets and workloads grow
• Updates need a workaround (logical update: newer update records override the older ones; or, when possible, creating a new dataset), as in the sketch below
• Requires some knowledge of the platform and of the new data-processing paradigms
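A common way to implement the logical update described above is to append new record versions and select the latest version per key at read time. The sketch below does this with a window function; the load_ts column, which would record when each version was ingested, is a hypothetical addition.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object LogicalUpdate {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("scanner-logical-update").getOrCreate()

    // Append-only history: several versions of the same (gtin, store, week) may coexist
    val history = spark.read.parquet("hdfs:///warehouse/scanner/microdata")

    // Rank the versions of each record by ingestion time, newest first
    val latestFirst = Window
      .partitionBy("gtin", "store_id", "year", "month", "week")
      .orderBy(col("load_ts").desc)

    // Keep only the most recently loaded version of each record:
    // 'current' behaves like an updated table without any in-place modification
    val current = history
      .withColumn("rn", row_number().over(latestFirst))
      .filter(col("rn") === 1)
      .drop("rn")

    current.createOrReplaceTempView("scanner_current")
  }
}
```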
Big Data environment
• Big Data environments offer more flexibility in manipulating data and are less rigid than other approaches because they relax some constraints. This makes them ideal for the trial-and-error paradigm typical of scientific workflows: there are many easy and fast ways to try something out
• Big Data platforms can also support production very well, thanks to their stability, maturity and the wide range of tools and techniques typically used in production scenarios
Data Warehouse Architecture: evolution phase

[Diagram: the enhanced data warehouse. Data are ingested via SFTP under a control dashboard; Hadoop holds both current and historical data; indexes are processed in SAS; data can be extracted for offline analysis and served to reports and visualizations]
Conclusions
• The scanner data project has represented a challenge for experimenting with new approaches to the IT support of analysis and production, and for overcoming some issues
• The objective is to get faster results, more efficient processes and more available data
• The concept of «Big Data» is not merely a matter of size but rather of new opportunities
• Technology can give the answers; now it's time to ask new questions
• As new data and information become available, more complex models can be managed