Big Data and Official Statistics:
Scanner Data from DB to BD
Francesco Altarocca
[email protected] - Istat
CENTRE of EXCELLENCE ON DATA WAREHOUSING
WORKSHOP WARSAW, POLAND 21 & 22 NOVEMBER 2018
Introduction
• In this talk we present the issues and challenges of dealing with very large datasets, such as those involved in the Scanner Data project at Istat
• We illustrate the IT context of the project, some aspects of the database implementation and some issues we encountered
• We also show how we approached the problem in a way that is completely new for official statistics. The solutions introduced are part of a broader effort to modernise the tools and techniques used for data storage and processing at Istat (Big Data and Data Science in NSIs)
About Scanner Data
• All available chains, with 2,000 stores: 30% of all stores available from Nielsen, covering all 107 provinces
• 530,000 GTINs (elementary products)
• 1.3 billion records per year, about 70 million records per month
• Space occupied by 1 year of microdata in the database: 420 GB (some data are sent more than once)
• 20 million elementary and 170K aggregated indices, computed twice a month (provisional and definitive release)
• Scanner data is one of the sources for the calculation of the consumer price index
About Scanner Data and BD
Space occupied by 1 year of microdata in the Cloudera system: about 80 GB (with Cloudera and RDD)
• Some additional space is used during processing and then released
• It is not possible to update records, only to append new ones (see the sketch below)
• Some data are replicated
• Fewer metadata are used compared to the DB
• Not all ACID properties are guaranteed
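As a minimal illustration of this append-only model, the sketch below writes a monthly batch of microdata to HDFS as Parquet with Spark in append mode. It is only a sketch: the paths, column names and file layout are hypothetical, not the actual Istat schema.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object AppendOnlyWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("scanner-data-append")
      .getOrCreate()

    // Hypothetical monthly batch of scanner microdata; Spark reads
    // gzip-compressed text transparently. Assumes year/month columns exist.
    val batch = spark.read
      .option("header", "true")
      .csv("hdfs:///landing/scanner/2018-11/*.csv.gz")

    // Append-only: each batch adds new files under the target directory;
    // existing records are never modified in place
    batch.write
      .mode(SaveMode.Append)
      .partitionBy("year", "month")
      .parquet("hdfs:///warehouse/scanner/microdata")
  }
}
```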
IT Architecture: from DB to BD

[Diagram: data flows through four stages: Acquisition (SFTP server, control dashboard), Storing (Oracle DB, Hadoop), Data Processing, and Data Access and Analysis (views, MicroStrategy BI tool)]
Production Data Platform
• The current platform for the production architecture is based on Big Data tools:
– 7-node Hadoop cluster (Cloudera), growing to 21 nodes in the near future
– Hadoop: parallel storage and processing platform, the de-facto standard for Big Data
• Some features:
– All historical data always online for interactive analysis
– Possibility of retaining historical data indefinitely
– Construction of a global historical data warehouse of price data
– Easier to perform large-scale analysis
– The output of one phase is the input of another step (e.g. building the base for the next year in a moving-base approach)
Data Ingestion

[Diagram: data arrives via SFTP and passes through the Load and Pre-process steps towards SAS, with a control dashboard supervising the flow]
• Data are sent by Nielsen as compressed text files via SFTP
• Received data are handled by programs written in Java:
- Load: performs integrity checks on the received files, loads the data into the DB, logs the received files and estimates discounts
- Pre-process: performs quality checks at record level and discards dirty data (a record-level check is sketched below)
• The whole acquisition process is controlled through a web dashboard
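As a rough idea of what a record-level quality check can look like, here is a sketch in Spark/Scala (the production checks are implemented in Java and their exact rules are not described here, so the field layout and the rules below, non-empty GTIN and strictly positive turnover and quantity, are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object PreProcess {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("scanner-preprocess").getOrCreate()

    // Hypothetical layout: gtin, store_id, year, month, week, turnover, quantity
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///landing/scanner/2018-11/*.csv.gz")

    // Illustrative record-level rules
    val isValid = col("gtin").isNotNull && col("gtin") =!= "" &&
      col("turnover") > 0 && col("quantity") > 0

    // Good records continue to processing; dirty records are set aside for reporting
    raw.filter(isValid).write.mode("append").parquet("hdfs:///warehouse/scanner/clean")
    raw.filter(!isValid).write.mode("append").parquet("hdfs:///warehouse/scanner/discarded")
  }
}
```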
Data Access and Analysis

[Diagram: data flows from SFTP through Load and Pre-process into the DB; views feed extractions into SAS, while Microstrategy produces reports and visualizations]
Data can be accessed in two ways:
• Extraction from the DB
- Materialized views were created to facilitate the import into SAS
• Use of a business analytics tool (MicroStrategy) for reporting, visualization and browsing of the data
In both access modes, the results of common queries were pre-computed at different levels of aggregation and provided as views or reports (an example of such a pre-computation is sketched below)
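A pre-computed aggregation of this kind could look like the following Spark/Scala sketch; the table paths and the province/chain/turnover columns are assumptions for illustration, not the actual Istat views.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{countDistinct, sum}

object PrecomputeViews {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("scanner-views").getOrCreate()

    val micro = spark.read.parquet("hdfs:///warehouse/scanner/clean")

    // Pre-compute a common query at year x month x province x chain level,
    // so analysts and BI tools do not scan the full microdata on every request
    val byProvinceChain = micro
      .groupBy("year", "month", "province", "chain")
      .agg(
        sum("turnover").as("total_turnover"),
        sum("quantity").as("total_quantity"),
        countDistinct("gtin").as("n_products")
      )

    // Persist the result as a materialized-view-like table for SAS/BI access
    byProvinceChain.write.mode("overwrite")
      .parquet("hdfs:///warehouse/scanner/views/turnover_by_province_chain")
  }
}
```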
Check 2: Quality Checks on Loaded Data in Oracle DB

[Flowchart: after loading, formal post-load checks (Check 2) are applied to the stored elementary data and metadata (GTIN turnover and quantity, GTIN information, store information). Valid records are kept as good data and passed on to data processing; invalid records are set aside as bad data. Tables and reports document the outcome of the checks.]
Data Processing

[Flowchart: valid data flow through three steps. Data processing 1: weekly prices calculation, producing the weekly prices. Data processing 2: microindices calculation, producing the monthly microindices, followed by Check 3: outlier detection (identification of inadmissible data; records that fail the check are set aside). Data processing 3: indices aggregation, producing the monthly aggregated indices. Processed data are stored in tables, and tables and reports document each step. A sketch of the first two steps follows.]
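To make the pipeline concrete, here is a Spark/Scala sketch of the first two processing steps. The methodology is an assumption: weekly prices in scanner data are commonly derived as unit values (turnover divided by quantity), and the microindex below is a plain ratio of the monthly average price to a base-period price, which is not necessarily the formula used at Istat.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, sum}

object IndexPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("scanner-indices").getOrCreate()

    val clean = spark.read.parquet("hdfs:///warehouse/scanner/clean")

    // Data processing 1: weekly prices as unit values per GTIN, store and week
    val weeklyPrices = clean
      .groupBy("gtin", "store_id", "year", "month", "week")
      .agg((sum("turnover") / sum("quantity")).as("weekly_price"))

    // Data processing 2: a naive monthly microindex per GTIN and store,
    // as the ratio of the current monthly average price to a base-period price
    val monthlyPrice = weeklyPrices
      .groupBy("gtin", "store_id", "year", "month")
      .agg(avg("weekly_price").as("monthly_price"))

    val basePrice = monthlyPrice
      .filter(col("year") === 2017 && col("month") === 12) // hypothetical base period
      .select(col("gtin"), col("store_id"), col("monthly_price").as("base_price"))

    val microIndices = monthlyPrice
      .join(basePrice, Seq("gtin", "store_id"))
      .withColumn("microindex", col("monthly_price") / col("base_price") * 100)

    microIndices.write.mode("overwrite").parquet("hdfs:///warehouse/scanner/microindices")
  }
}
```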
Comparison: DB vs BD
• Ingestion and preliminary checks: from 12-24 hours down to less than 1 hour (no need to create indexes, constraints, …)
• Processing phases: from hours to minutes
• Many integrated tools for processing and ingestion (high-level: Hive, Impala, Sqoop; low-level: Scala, Java, Python)
• High scalability: new nodes can be added as datasets and workloads grow
• Updates need a workaround (logical update: newer update records override the older ones; or, when possible, creating a new dataset), as in the sketch below
• Requires some knowledge of the platform and of the new data-processing paradigms
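A common way to implement the logical update described above is to append new record versions and select the latest version per key at read time. The sketch below does this with a window function; the load_ts column, which would record when each version was ingested, is a hypothetical addition.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object LogicalUpdate {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("scanner-logical-update").getOrCreate()

    // Append-only history: several versions of the same (gtin, store, week) may coexist
    val history = spark.read.parquet("hdfs:///warehouse/scanner/microdata")

    // Rank the versions of each record by ingestion time, newest first
    val latestFirst = Window
      .partitionBy("gtin", "store_id", "year", "month", "week")
      .orderBy(col("load_ts").desc)

    // Keep only the most recently loaded version of each record:
    // 'current' behaves like an updated table without any in-place modification
    val current = history
      .withColumn("rn", row_number().over(latestFirst))
      .filter(col("rn") === 1)
      .drop("rn")

    current.createOrReplaceTempView("scanner_current")
  }
}
```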
Big Data environment
• Big Data environments offer more flexibility in manipulating data and are less rigid than other approaches because they relax some constraints. This makes them ideal for the trial-and-error paradigm typical of scientific workflows: there are many easy and fast ways to try something out
• Big Data platforms can also support production very well, thanks to their stability, maturity and the wide range of tools and techniques typically used in production scenarios
Data Warehouse Architecture: evolution phase

[Diagram: the enhanced data warehouse. Data are ingested via SFTP under a control dashboard; Hadoop holds both current and historical data; indexes are processed in SAS; data can be extracted for offline analysis and served to reports and visualizations]
Conclusions
• The scanner data project has represented a challenge for experimenting with new approaches to the IT support of analysis and production, and for overcoming some issues
• The objective is to get faster results, more efficient processes and more available data
• The concept of «Big Data» is not merely a matter of size but rather of new opportunities
• Technology can give the answers; now it's time to ask new questions
• As new data and information become available, more complex models can be managed