Multi-technique data analytics workflow using a Logical Data Warehouse architecture: web mining use case
Antonio Laureti Palma, ISTAT, …@istat.it
Summary:
- a Logical Data Warehouse schema
- predictive modelling
- use case: SBS-ICT by web mining
daWos, Amsterdam, 11-12 September 2018
ESS Vision 2020 - Total Quality Management
Data Warehouse 2.0 visions:
B. Inmon: "The next-generation data warehouse, while still building on the founding principles of an enterprise version of the truth and a 'single' data repository, must address the needs of data of new types, new volumes, new data-quality levels, new performance needs, new metadata, and new user requirements."
K. Krishnan: "The next-generation data warehouse architecture will be complex from a physical architecture deployment, consisting of a myriad of technologies, and will be data-driven from an integration perspective, extremely flexible, and scalable from a data architecture perspective."
Logical DWH
New sources increase the complexity of IT components and push DWH architectures toward logical architectures.
The Logical DWH is a new data management architecture that combines the strengths of traditional repository warehouses with alternative data management and access strategies.
A Logical DWH is an evolution and augmentation of DWH practices, not a replacement.
Data Virtualization enables Logical DWH
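As a minimal sketch of the idea, a virtualization layer exposes one logical query interface over heterogeneous physical stores. The example below is hypothetical (an in-memory SQLite table standing in for the RDBMS side, a dict standing in for a NoSQL store); it is an illustration of the principle, not a real virtualization product.

```python
# Hypothetical sketch: one logical query interface over two stores.
import sqlite3

class LogicalDWH:
    def __init__(self):
        # "RDBMS" side: structured survey data
        self.rdbms = sqlite3.connect(":memory:")
        self.rdbms.execute("CREATE TABLE survey (ent_id TEXT, turnover REAL)")
        self.rdbms.executemany("INSERT INTO survey VALUES (?, ?)",
                               [("E1", 120.0), ("E2", 80.0)])
        # "NoSQL" side: unstructured scraped web content keyed by enterprise
        self.nosql = {"E1": "shop online, add to cart", "E2": "company news"}

    def query(self, ent_id):
        """Return one logical record, virtually joined across both stores."""
        row = self.rdbms.execute(
            "SELECT turnover FROM survey WHERE ent_id = ?", (ent_id,)
        ).fetchone()
        return {"ent_id": ent_id,
                "turnover": row[0] if row else None,
                "web_text": self.nosql.get(ent_id, "")}

ldwh = LogicalDWH()
record = ldwh.query("E1")   # combines RDBMS and NoSQL content in one view
```

The consumer never sees which physical store holds which attribute; that abstraction is what lets a Logical DWH augment, rather than replace, existing warehouses.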
Logical DWH example: a possible data virtualization architecture.
[Diagram: a data virtualization layer connects analysis/data mining/reporting tools to an S-DWH (RDBMS), a Stat-DWH (RDBMS), and a distributed data store (NoSQL/Spark/Hadoop) fed by web scraping, data collection and machine learning.]
LSDW - Logical Statistical Data Warehouse
A virtual central statistical data store based on logical layers for managing all available data of interest, in order to: produce the necessary information; (re)use data to create new data and new outputs; perform data analytics; execute analyses; produce reports; support dashboard tools.
LSDW architecture domains: functional domain, technology domain, data domain.
(Picture from: Krish Krishnan, "Data Warehousing in the Age of Big Data".)
LSDW functional domains. Functional layers: processes, actions or tasks.
[Diagram: the Statistical Data Warehouse stacks four functional layers: a sources layer and an integration layer over the operational data, and an interpretation and analysis layer and an access layer over the data warehouse. Each source type (survey, admin, big data) goes through the collect, process, analyze and disseminate phases.]
LSDW - functional layers vs data sources:
Flow diagram example of predictive modelling (preprocessing → learning → prediction):
[Diagram: a labeled dataset is split into a training set and a test set; a learning algorithm is trained on the training set and evaluated on the test set; the resulting final model predicts labels for new, unlabeled data.]
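The predictive-modelling flow can be sketched with scikit-learn, the library named later in the deck. The dataset and model choice below are hypothetical placeholders; only the split/train/evaluate/predict sequence mirrors the diagram.

```python
# Hypothetical sketch of the predictive-modelling flow with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# preprocessing: a labeled dataset (synthetic stand-in)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)     # learning
acc = accuracy_score(y_test, model.predict(X_test))    # evaluation on test set
new_labels = model.predict(X_test[:3])                 # prediction (final model)
```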
LSDW layers: predictive modelling.
[Diagram: sources (surveys, admin data, big data from a scraper) feed the integration layer through ETL; learning, data mining and analysis run in the interpretation layer; the access layer serves data marts, reports and dashboards. The preprocessing, learning, prediction and analysis phases map onto these layers.]
Case Study: SBS-ICT by Web Mining
The case study focuses on the use of survey data as ground truth to build a classification model that predicts variables of the ICT in Enterprises survey.
Items:
- analysis units: ICT enterprises
- ICT variables involved: web ordering, presence on social media, job advertisements
- web-scraped content from a URL list
- predictors of the target variables: terms such as "add to cart", "shop online", "account", "order", "job opportunities", "career", "job", …
- ML supervised learning models for data classification
Web Mining: SBS-ICT data processing
Web Mining on LSDW layers.
[Diagram: in the source layer, URLs are validated and retrieved and pages scraped into text documents; in the integration layer, NLP text mining (tokenization, lemmatization, POS tagging, summarization) prepares the ML data; in the interpretation layer, classification models (LR, SVM, RF) are trained in Python and assessed with a learning-models evaluation matrix; the access layer exposes the ICT data mart, register and thematic DW for analysis in R and SAS. Phases: preprocessing → learning → prediction → analysis.]
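The integration-to-interpretation steps above can be sketched with scikit-learn. The toy documents and labels are hypothetical, and TF-IDF vectorization stands in for the fuller NLP preprocessing (tokenization, lemmatization, POS tagging) named in the diagram; only the three model families (LR, SVM, RF) are taken from the slide.

```python
# Hypothetical sketch: scraped page text -> features -> LR/SVM/RF classifiers.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

docs = ["add to cart shop online order account",
        "our company history and mission",
        "checkout cart online shop",
        "contact us about our services"]
labels = [1, 0, 1, 0]  # 1 = web-ordering vocabulary present (toy ground truth)

X = TfidfVectorizer().fit_transform(docs)   # stand-in for NLP preprocessing
models = {"LR": LogisticRegression(),
          "SVM": LinearSVC(),
          "RF": RandomForestClassifier(random_state=0)}
# fit each model family and predict over the corpus
preds = {name: m.fit(X, labels).predict(X) for name, m in models.items()}
```

In the real use case the predictions would feed the evaluation matrix and the ICT data mart rather than being inspected directly.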
Thank you for your attention
"Multi-technique data analytics workflow using a Logical Data Warehouse architecture: web mining use case"
Antonio Laureti Palma, ISTAT, lauretip@istat.it
Antonio Laureti Palma , Q2018 - Kraków, Poland. 26-29 June 2018
LSDW: flow diagram of predictive modelling.
[Diagram: sources (surveys, admin data, big data from a scraper) are loaded through ETL into an operational data store and data warehouse on a distributed database; preparation, learning, data mining and analysis run in the interpretation layer; the access layer serves data marts, reports and dashboards.]
My question: what is the difference between analytics and analysis?
Analysis is "a careful study of something to learn about its parts, what they do and how they are related to each other".
Analytics is "the method of logical analysis".
Therefore, we do analysis using analytics: Big Data analytics is the method of logical analysis applied to Big Data.
This introduces epistemological changes in the design of possible new official statistical production processes, which could force a relevant infrastructure change.
Big Data processing
Data processing can be defined as the collection, processing, and management of data, resulting in information delivered to end consumers.
Traditional data processing life cycle:
- first, analyze the transactional data and create a set of requirements, which leads to data discovery and data model creation;
- then, create a database structure to process the data.
Big Data processing life cycle:
- first, the data is collected and loaded onto a target platform, where a data structure for the content is created and a metadata layer is applied;
- then, the data is transformed and analyzed to provide insights into the data and any associated context.
Antonio Laureti Palma, CoE S-DWH, CM Rome April 12th 2017
Big Data processing life cycle
The first step after acquisition of big data is to perform “data discovery”; this can be automated using algorithms:
- Text mining
- Data mining
- Pattern processing
- Statistical models
- Mathematical models
Analytics layer
To create the foundational structure for data analysis, you need subject-matter experts who understand the different layers of data being integrated and what granularity levels of integration can be completed to create the holistic picture.
Big Data analytics can be defined as the combination of traditional analytics and data mining techniques applied to large volumes of data.
Data discovery for analytics can be defined in these distinct steps:
- Data tagging: the process of creating an identifying link on the data for metadata integration.
- Data classification: the process of creating subsets of value pairs for data processing and integration.
- Data modeling: the process of creating a model for data visualization or analytics.
Big Data and S-DWH integration: inbound data processing.
Big Data integration strategies
1) S-DWH data-bus based: a data bus is developed using metadata and semantic technologies, creating a data integration environment for data exploration and processing. It can be a simple layer or an overwhelmingly complex layer of processing.
Pros:
- scalable design for RDBMS and Big Data processing
- reduced overload on processing
- heterogeneous physical architecture deployment
Cons:
- the data bus architecture can become increasingly complex
- possibly poor metadata architecture due to multiple layers of data processing
- data integration can become a performance bottleneck
Big Data integration strategies
2) S-DWH data connector: the connector is a bridge to exchange data between the two platforms.
Pros:
- scalable design for RDBMS and Big Data processing
- modular data integration architecture
- heterogeneous physical architecture deployment, providing best-in-class integration at the data processing layer
- metadata and MDM solutions can be held with relative ease across the solution
Cons:
- performance of the Big Data connector is the biggest area of weakness
- data integration and query scalability can become complex
Big Data integration strategies
3) S-DWH based on Big Data appliances: these appliances are configured to handle the rigors of the workloads and complexities of Big Data together with the current RDBMS architecture.
Pros:
- scalable design and modular data integration architecture
- heterogeneous physical architecture deployment, providing best-in-class integration at the data processing layer
- custom-configured to suit the processing rigors required by each organization
Cons:
- customized configuration can be maintenance-heavy
- data integration and query scalability can become complex as the configuration changes over time
Big Data integration strategies
4) S-DWH based on data virtualization: solves the data integration challenge while leveraging all the investments in the current infrastructure through a semantic data integration architecture.
Pros:
- extremely scalable and flexible architecture
- workload-optimized
- easy to maintain
- lower initial cost of deployment
Cons:
- lack of governance can create too many silos and degrade performance
- complex query processing can degrade over time
- performance at the integration layer may need periodic maintenance
Big Data definitions
Big Data can be defined as volumes of data, available in varying degrees of complexity, generated at different velocities and with varying degrees of ambiguity, that cannot be processed using traditional technologies, processing methods, algorithms, or any commercial off-the-shelf solutions.
In statistics we may speak about the "four Vs" (Diego Kuonen): volume, variety, velocity, veracity.
Big Data definitions (the four Vs):
- Volume: the amount of data, with respect to the number of observations (the size of the data) but also to the number of variables (the dimensionality of the data).
- Variety: data in many forms, i.e. different types of data (structured, semi-structured, unstructured), data sources (internal, external, open, public), data resolutions and granularities.
- Velocity: data in motion, i.e. the speed at which data are generated and must be handled (e.g. streaming data from machines, sensors and social data).
- Veracity: data in doubt, i.e. varying levels of noise and processing errors, including the reliability, capability and validity of the data.
New class of challenges and issues on Big Data (1/2):
i. Data does not have a finite architecture.
ii. Data can have multiple formats, semi-structured or unstructured.
iii. Data is self-contained and needs external business context to interpret and process it.
iv. Data has no specificity with volume or complexity.
v. Data is not relational.
vi. Data has a minimal or zero concept of referential integrity.
vii. Data depends on metadata for creating context.
New class of challenges and issues on Big Data (2/2):
viii. Data needs more analytical processing.
ix. Data needs multiple cycles of processing, but each cycle needs to be processed in one pass due to the size of the data.
x. Data needs business rules for processing, as structured data does today, but these rules need to be created in a rules-engine architecture rather than in the database or the ETL tool.
xi. Data needs more governance than data in the database.
xii. Data has no defined quality.
Big Data workloads
The major areas where workload definitions are important include:
- Data is file-based for acquisition and storage.
- Data processing happens in three steps:
  • Discovery: the data is analyzed and categorized; it needs to be processed and computed where it is, not moved across the network.
  • Analytics: the data is converted to metrics and structured formats, and extracted for processing into the data warehouse or analytical engines.
  • Analysis: the data is associated with master data and metadata; this requires minimal transformation and movement of data across the network.
- Maintain file-system-driven consistency, since no database is involved in the processing of Big Data.
- Big Data query workloads are mostly program execution of MapReduce code, the complete opposite of executing SQL and optimizing for SQL performance.
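The map/shuffle/reduce pattern behind such workloads can be illustrated with a word count in plain Python (the records here are hypothetical). Each record is mapped to (key, 1) pairs, the pairs are grouped by key (the shuffle), and each group is reduced by summation.

```python
# MapReduce-style word count, sketched with the standard library only.
from itertools import groupby
from operator import itemgetter

records = ["big data processing", "big data analytics", "data warehouse"]

mapped = [(word, 1) for line in records for word in line.split()]   # map
shuffled = sorted(mapped, key=itemgetter(0))                        # shuffle
counts = {k: sum(v for _, v in grp)                                 # reduce
          for k, grp in groupby(shuffled, key=itemgetter(0))}
# counts == {'analytics': 1, 'big': 2, 'data': 3, 'processing': 1, 'warehouse': 1}
```

On a real cluster the map and reduce steps run in parallel over file splits, which is why such workloads are programmed rather than expressed as SQL.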
New DWH: key IT challenges
The users of a data warehouse and of the downstream business intelligence and analytics applications measure efficiency and effectiveness in units of speed, on both the inbound and outbound sides of the data warehouse.
- Data loading: data quality, slowly changing dimensional data, master data management (MDM), metadata management, transformation and processing.
- Availability: a benchmark for both the loading process and the infrastructure as a whole.
- Data volumes, driven by: analytics, compliance requirements, legal requirements, data security, business users, social media, nonspecific requirements.
- Storage performance: the issue lies both in the data architecture and in the storage architecture.
- Query performance: for ad-hoc and analytical queries, due to their nondeterministic nature.
- Data transport: the aspect of performance governing efficient movement of data from one layer to another and its subsequent availability.
Components of the new DWH: analytics layer, technology layer, data layer.
Data layer (1/2). The data layer in the new platform includes:
i. Legacy data: structured and semi-structured formats of data, stored online or offline (census, socioeconomic, urban planning, etc.).
ii. Transactional (OLTP) data: in the new platform all transactional data can be loaded, and these segments of data can be used to create a powerful back-end data platform that analyzes and organizes data at every processing step.
iii. Unstructured data: the next-generation platform will provide interfaces to investigate the content by navigating it according to user-defined processing rules. The output of content processing will be used to define and design analytics for exploratory mining of unstructured data.
Data layer (2/2). The data layer in the new platform also includes:
iv. Video: a video has three components, the content, the audio and the associated metadata; the new data platforms provide the infrastructure necessary to process this data (e.g. automobile traffic analysis).
v. Audio: extracted data can be processed and stored as contextual data associated with the metadata in the next-generation data warehouse (e.g. data from call centers).
vi. Images: static images carry a lot of data that can be very useful in government agencies (geospatial integration) and other areas.
vii. Numerical data, patterns, graphs: sensor data, stock market data, scientific data, cellular tower data, GPS data and similar data occur and repeat in periodic time intervals; processing such data and integrating the results with the data warehouse provides analytical opportunities for correlation or cluster analysis.
Technology layer
i. RDBMS
ii. Hadoop
iii. NoSQL
iv. MDM solutions (Master Data Management)
v. Metadata solutions
vi. Semantic technologies
vii. Rules engines
viii. Data mining algorithms
ix. Text mining algorithms
x. Data discovery technologies
xi. Data visualization technologies
xii. Reporting and analytical technologies
Case Studies
- population statistics from mobile phone traffic: "Persons and Places" project, OD matrix from mobile phone data
- business statistics produced by web mining: ICT survey, variable estimation using internet data
DWH IT environment (distributed computing platform):
- Oracle Exadata Database Machine
- software: PySpark, Spark MLlib, scikit-learn
- HUE (Hadoop User Experience): editors for Hive, Impala and Spark, SQL browser, and scheduler of jobs and workflows for HDFS, SQL tables, …
- Hadoop/Spark infrastructure based on 8 nodes
Invest in new IT tools and methodology
Case study 1: population statistics from mobile phone traffic
The case study focuses on the ISTAT project "Persons and Places", which compares two approaches to mobility-profile estimation: one based on administrative archives, one based on mobile phone data.
Items:
- analysis units: resident, embedded and daily city users
- OD matrix of daily mobility at municipality level
- calling data from mobile phone CDRs (Call Detail Records)
- classification based on an unsupervised learning process
- comparison of estimates
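The unsupervised classification step can be sketched with scikit-learn: call profiles are clustered with K-Means, the cluster prototypes are labelled (an expert step), and labels are propagated to every profile with a 1-Nearest-Neighbor classifier. The two-dimensional profile data and the label names below are hypothetical stand-ins for the real call-profile features.

```python
# Hypothetical sketch: K-Means prototypes + 1-NN label propagation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# toy call profiles forming two behavioural groups
profiles = np.vstack([rng.normal(0, 0.5, (50, 2)),
                      rng.normal(5, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)
prototype_labels = ["resident", "city_user"]       # expert labelling of prototypes
knn = KNeighborsClassifier(n_neighbors=1).fit(
    km.cluster_centers_, prototype_labels)
propagated = knn.predict(profiles)                 # 1-NN label propagation
```

The real pipeline runs the same steps over HDFS/RDD data on the Hadoop/Spark platform rather than in-memory NumPy arrays.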
Case study 1: Logical S-DWH processing life cycle.
[Diagram: CDR data is collected in the source layer into HDFS/RDD on the distributed computing platform; in the integration (data discovery) stage, individual call profiles are built and prototypes extracted with a K-Means algorithm; in the interpretation/analysis stage, prototypes are labelled against archetype definitions and labels propagated by 1-Nearest-Neighbor; the access layer exposes the ICP DWH, the MPT-OD and P&P-OD matrices and the population DWH/register, with analysis in R, SAS and Plotly (Python). Invest in new IT tools and methodology.]
Flow diagram of predictive modelling: preprocessing → learning → evaluation → prediction.
Logical DWH
Data virtualization enables the Logical DWH: focusing more on the logic of information than on data structures means adding semantic data abstraction based on:
- virtual management of any data
- a high quality level of metadata
- active system self-monitoring
- distributed processes (parallel processing)
- service-level tracking
Logical S-DWH layered architecture: ML flow diagram (preprocessing → learning → prediction → analysis).
[Diagram: combines the LSDW layer diagram of predictive modelling and the flow diagram of predictive modelling shown earlier.]