Multi-technique data analytics workflow using a Logical Data Warehouse architecture: web mining use case
Antonio Laureti Palma, ISTAT, …@istat.it
Summary:
- a Logical Data Warehouse schema
- predictive modelling
- use case: SBS-ICT by web mining
daWos, Amsterdam, 11-12 September 2018
ESS Vision 2020 - Total Quality Management
Data Warehouse 2.0 visions:
B. Inmon: "The next-generation data warehouse, while still building on the founding principles of an enterprise version of the truth and a 'single' data repository, must address the needs of data of new types, new volumes, new data-quality levels, new performance needs, new metadata, and new user requirements."
K. Krishnan: "The next-generation data warehouse architecture will be complex from a physical architecture deployment, consisting of a myriad of technologies, and will be data-driven from an integration perspective, extremely flexible, and scalable from a data architecture perspective."
Logical DWH
New sources increase the complexity of IT components and push DWH architectures toward logical architectures.
The Logical DWH is a new data management architecture that combines the strengths of traditional repository warehouses with alternative data management and access strategies.
A Logical DWH is an evolution and augmentation of DWH practices, not a replacement.
Data Virtualization enables Logical DWH
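As a minimal sketch of the idea, a virtualization layer exposes one logical query interface over heterogeneous physical stores. The example below is hypothetical (an in-memory SQLite table standing in for the RDBMS side, a dict standing in for a NoSQL store); it is an illustration of the principle, not a real virtualization product.

```python
# Hypothetical sketch: one logical query interface over two stores.
import sqlite3

class LogicalDWH:
    def __init__(self):
        # "RDBMS" side: structured survey data
        self.rdbms = sqlite3.connect(":memory:")
        self.rdbms.execute("CREATE TABLE survey (ent_id TEXT, turnover REAL)")
        self.rdbms.executemany("INSERT INTO survey VALUES (?, ?)",
                               [("E1", 120.0), ("E2", 80.0)])
        # "NoSQL" side: unstructured scraped web content keyed by enterprise
        self.nosql = {"E1": "shop online, add to cart", "E2": "company news"}

    def query(self, ent_id):
        """Return one logical record, virtually joined across both stores."""
        row = self.rdbms.execute(
            "SELECT turnover FROM survey WHERE ent_id = ?", (ent_id,)
        ).fetchone()
        return {"ent_id": ent_id,
                "turnover": row[0] if row else None,
                "web_text": self.nosql.get(ent_id, "")}

ldwh = LogicalDWH()
record = ldwh.query("E1")   # combines RDBMS and NoSQL content in one view
```

The consumer never sees which physical store holds which attribute; that abstraction is what lets a Logical DWH augment, rather than replace, existing warehouses.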
Logical DWH example: a possible data virtualization architecture.
[Diagram: a data virtualization layer connects analysis/data mining/reporting tools to an S-DWH (RDBMS), a Stat-DWH (RDBMS), and a distributed data store (NoSQL/Spark/Hadoop) fed by web scraping, data collection and machine learning.]
LSDW - Logical Statistical Data Warehouse
A virtual central statistical data store based on logical layers for managing all available data of interest, in order to: produce the necessary information; (re)use data to create new data and new outputs; perform data analytics; execute analyses; produce reports; support dashboard tools.
LSDW architecture domains: functional domain, technology domain, data domain.
(Picture from: Krish Krishnan, "Data Warehousing in the Age of Big Data".)
LSDW functional domains. Functional layers: processes, actions or tasks.
[Diagram: the Statistical Data Warehouse stacks four functional layers: a sources layer and an integration layer over the operational data, and an interpretation and analysis layer and an access layer over the data warehouse. Each source type (survey, admin, big data) goes through the collect, process, analyze and disseminate phases.]
LSDW - functional layers vs data sources:
Flow diagram example of predictive modelling (preprocessing → learning → prediction):
[Diagram: a labeled dataset is split into a training set and a test set; a learning algorithm is trained on the training set and evaluated on the test set; the resulting final model predicts labels for new, unlabeled data.]
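The predictive-modelling flow can be sketched with scikit-learn, the library named later in the deck. The dataset and model choice below are hypothetical placeholders; only the split/train/evaluate/predict sequence mirrors the diagram.

```python
# Hypothetical sketch of the predictive-modelling flow with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# preprocessing: a labeled dataset (synthetic stand-in)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)     # learning
acc = accuracy_score(y_test, model.predict(X_test))    # evaluation on test set
new_labels = model.predict(X_test[:3])                 # prediction (final model)
```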
LSDW layers: predictive modelling.
[Diagram: sources (surveys, admin data, big data from a scraper) feed the integration layer through ETL; learning, data mining and analysis run in the interpretation layer; the access layer serves data marts, reports and dashboards. The preprocessing, learning, prediction and analysis phases map onto these layers.]
Case Study: SBS-ICT by Web Mining
The case study focuses on the use of survey data as ground truth to build a classification model that predicts variables of the ICT in Enterprises survey.
Items:
- analysis units: ICT enterprises
- ICT variables involved: web ordering, presence on social media, job advertisements
- web-scraped content from a URL list
- predictors of the target variables: terms such as "add to cart", "shop online", "account", "order", "job opportunities", "career", "job", …
- ML supervised learning models for data classification
Web Mining: SBS-ICT data processing
Web Mining on LSDW layers.
[Diagram: in the source layer, URLs are validated and retrieved and pages scraped into text documents; in the integration layer, NLP text mining (tokenization, lemmatization, POS tagging, summarization) prepares the ML data; in the interpretation layer, classification models (LR, SVM, RF) are trained in Python and assessed with a learning-models evaluation matrix; the access layer exposes the ICT data mart, register and thematic DW for analysis in R and SAS. Phases: preprocessing → learning → prediction → analysis.]
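The integration-to-interpretation steps above can be sketched with scikit-learn. The toy documents and labels are hypothetical, and TF-IDF vectorization stands in for the fuller NLP preprocessing (tokenization, lemmatization, POS tagging) named in the diagram; only the three model families (LR, SVM, RF) are taken from the slide.

```python
# Hypothetical sketch: scraped page text -> features -> LR/SVM/RF classifiers.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

docs = ["add to cart shop online order account",
        "our company history and mission",
        "checkout cart online shop",
        "contact us about our services"]
labels = [1, 0, 1, 0]  # 1 = web-ordering vocabulary present (toy ground truth)

X = TfidfVectorizer().fit_transform(docs)   # stand-in for NLP preprocessing
models = {"LR": LogisticRegression(),
          "SVM": LinearSVC(),
          "RF": RandomForestClassifier(random_state=0)}
# fit each model family and predict over the corpus
preds = {name: m.fit(X, labels).predict(X) for name, m in models.items()}
```

In the real use case the predictions would feed the evaluation matrix and the ICT data mart rather than being inspected directly.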
Thank you for your attention
"Multi-technique data analytics workflow using a Logical Data Warehouse architecture: web mining use case"
Antonio Laureti Palma, ISTAT, lauretip@istat.it
Antonio Laureti Palma , Q2018 - Kraków, Poland. 26-29 June 2018
LSDW: flow diagram of predictive modelling.
[Diagram: sources (surveys, admin data, big data from a scraper) are loaded through ETL into an operational data store and data warehouse on a distributed database; preparation, learning, data mining and analysis run in the interpretation layer; the access layer serves data marts, reports and dashboards.]
My question: what is the difference between analytics and analysis?
Analysis is "a careful study of something to learn about its parts, what they do and how they are related to each other".
Analytics is "the method of logical analysis".
Therefore, we do analysis using analytics: Big Data analytics is the method of logical analysis applied to Big Data.
This introduces epistemological changes in the design of possible new official statistical production processes, which could force a relevant infrastructure change.
Big Data processing
Data processing can be defined as the collection, processing, and management of data, resulting in information delivered to end consumers.
Traditional data processing life cycle:
- first, analyze the transactional data and create a set of requirements, which leads to data discovery and data model creation;
- then, create a database structure to process the data.
Big Data processing life cycle:
- first, the data is collected and loaded onto a target platform, where a data structure for the content is created and a metadata layer is applied;
- then, the data is transformed and analyzed to provide insights into the data and any associated context.
Antonio Laureti Palma, CoE S-DWH, CM Rome April 12th 2017
Big Data processing life cycle
The first step after acquisition of big data is to perform “data discovery”; this can be automated using algorithms:
- Text mining
- Data mining
- Pattern processing
- Statistical models
- Mathematical models
Analytics layer
To create the foundational structure for data analysis, you need subject-matter experts who understand the different layers of data being integrated and what granularity levels of integration can be completed to create the holistic picture.
Big Data analytics can be defined as the combination of traditional analytics and data mining techniques applied to large volumes of data.
Data discovery for analytics can be defined in these distinct steps:
- Data tagging: the process of creating an identifying link on the data for metadata integration.
- Data classification: the process of creating subsets of value pairs for data processing and integration.
- Data modeling: the process of creating a model for data visualization or analytics.
Big Data and S-DWH integration: inbound data processing.
Big Data integration strategies
1) S-DWH data-bus based: a data bus is developed using metadata and semantic technologies, creating a data integration environment for data exploration and processing. It can be a simple layer or an overwhelmingly complex layer of processing.
Pros:
- scalable design for RDBMS and Big Data processing
- reduced overload on processing
- heterogeneous physical architecture deployment
Cons:
- the data bus architecture can become increasingly complex
- possibly poor metadata architecture due to multiple layers of data processing
- data integration can become a performance bottleneck
Big Data integration strategies
2) S-DWH data connector: the connector is a bridge to exchange data between the two platforms.
Pros:
- scalable design for RDBMS and Big Data processing
- modular data integration architecture
- heterogeneous physical architecture deployment, providing best-in-class integration at the data processing layer
- metadata and MDM solutions can be held with relative ease across the solution
Cons:
- performance of the Big Data connector is the biggest area of weakness
- data integration and query scalability can become complex
Big Data integration strategies
3) S-DWH based on Big Data appliances: these appliances are configured to handle the rigors of the workloads and complexities of Big Data together with the current RDBMS architecture.
Pros:
- scalable design and modular data integration architecture
- heterogeneous physical architecture deployment, providing best-in-class integration at the data processing layer
- custom-configured to suit the processing rigors required by each organization
Cons:
- customized configuration can be maintenance-heavy
- data integration and query scalability can become complex as the configuration changes over time
Big Data integration strategies
4) S-DWH based on data virtualization: solves the data integration challenge while leveraging all the investments in the current infrastructure through a semantic data integration architecture.
Pros:
- extremely scalable and flexible architecture
- workload-optimized
- easy to maintain
- lower initial cost of deployment
Cons:
- lack of governance can create too many silos and degrade performance
- complex query processing can degrade over time
- performance at the integration layer may need periodic maintenance
Big Data definitions
Big Data can be defined as volumes of data, available in varying degrees of complexity, generated at different velocities and with varying degrees of ambiguity, that cannot be processed using traditional technologies, processing methods, algorithms, or any commercial off-the-shelf solutions.
In statistics we may speak about the "four Vs" (Diego Kuonen): volume, variety, velocity, veracity.
Big Data definitions (the four Vs):
- Volume: the amount of data, with respect to the number of observations (the size of the data) but also to the number of variables (the dimensionality of the data).
- Variety: data in many forms, i.e. different types of data (structured, semi-structured, unstructured), data sources (internal, external, open, public), data resolutions and granularities.
- Velocity: data in motion, i.e. the speed at which data are generated and must be handled (e.g. streaming data from machines, sensors and social data).
- Veracity: data in doubt, i.e. varying levels of noise and processing errors, including the reliability, capability and validity of the data.
New class of challenges and issues on Big Data (1/2):
i. Data does not have a finite architecture.
ii. Data can have multiple formats, semi-structured or unstructured.
iii. Data is self-contained and needs external business context to interpret and process it.
iv. Data has no specificity with volume or complexity.
v. Data is not relational.
vi. Data has a minimal or zero concept of referential integrity.
vii. Data depends on metadata for creating context.
New class of challenges and issues on Big Data (2/2):
viii. Data needs more analytical processing.
ix. Data needs multiple cycles of processing, but each cycle needs to be processed in one pass due to the size of the data.
x. Data needs business rules for processing, as structured data does today, but these rules need to be created in a rules-engine architecture rather than in the database or the ETL tool.
xi. Data needs more governance than data in the database.
xii. Data has no defined quality.
Big Data workloads
The major areas where workload definitions are important include:
- Data is file-based for acquisition and storage.
- Data processing happens in three steps:
  • Discovery: the data is analyzed and categorized; it needs to be processed and computed where it is, not moved across the network.
  • Analytics: the data is converted to metrics and structured formats, and extracted for processing into the data warehouse or analytical engines.
  • Analysis: the data is associated with master data and metadata; this requires minimal transformation and movement of data across the network.
- Maintain file-system-driven consistency, since no database is involved in the processing of Big Data.
- Big Data query workloads are mostly program execution of MapReduce code, the complete opposite of executing SQL and optimizing for SQL performance.
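The map/shuffle/reduce pattern behind such workloads can be illustrated with a word count in plain Python (the records here are hypothetical). Each record is mapped to (key, 1) pairs, the pairs are grouped by key (the shuffle), and each group is reduced by summation.

```python
# MapReduce-style word count, sketched with the standard library only.
from itertools import groupby
from operator import itemgetter

records = ["big data processing", "big data analytics", "data warehouse"]

mapped = [(word, 1) for line in records for word in line.split()]   # map
shuffled = sorted(mapped, key=itemgetter(0))                        # shuffle
counts = {k: sum(v for _, v in grp)                                 # reduce
          for k, grp in groupby(shuffled, key=itemgetter(0))}
# counts == {'analytics': 1, 'big': 2, 'data': 3, 'processing': 1, 'warehouse': 1}
```

On a real cluster the map and reduce steps run in parallel over file splits, which is why such workloads are programmed rather than expressed as SQL.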
New DWH: key IT challenges
The users of a data warehouse and of the downstream business intelligence and analytics applications measure efficiency and effectiveness in units of speed, on both the inbound and outbound sides of the data warehouse.
- Data loading: data quality, slowly changing dimensional data, master data management (MDM), metadata management, transformation and processing.
- Availability: a benchmark for both the loading process and the infrastructure as a whole.
- Data volumes, driven by: analytics, compliance requirements, legal requirements, data security, business users, social media, nonspecific requirements.
- Storage performance: the issue lies both in the data architecture and in the storage architecture.
- Query performance: for ad-hoc and analytical queries, due to their nondeterministic nature.
- Data transport: the aspect of performance governing efficient movement of data from one layer to another and its subsequent availability.
Components of the new DWH: analytics layer, technology layer, data layer.
Data layer (1/2). The data layer in the new platform includes:
i. Legacy data: structured and semi-structured formats of data, stored online or offline (census, socioeconomic, urban planning, etc.).
ii. Transactional (OLTP) data: in the new platform all transactional data can be loaded, and these segments of data can be used to create a powerful back-end data platform that analyzes and organizes data at every processing step.
iii. Unstructured data: the next-generation platform will provide interfaces to investigate the content by navigating it according to user-defined processing rules. The output of content processing will be used to define and design analytics for exploratory mining of unstructured data.
Data layer (2/2). The data layer in the new platform also includes:
iv. Video: a video has three components, the content, the audio and the associated metadata; the new data platforms provide the infrastructure necessary to process this data (e.g. automobile traffic analysis).
v. Audio: extracted data can be processed and stored as contextual data associated with the metadata in the next-generation data warehouse (e.g. data from call centers).
vi. Images: static images carry a lot of data that can be very useful in government agencies (geospatial integration) and other areas.
vii. Numerical data, patterns, graphs: sensor data, stock market data, scientific data, cellular tower data, GPS data and similar data occur and repeat in periodic time intervals; processing such data and integrating the results with the data warehouse provides analytical opportunities for correlation or cluster analysis.
Technology layer
i. RDBMS
ii. Hadoop
iii. NoSQL
iv. MDM solutions (Master Data Management)
v. Metadata solutions
vi. Semantic technologies
vii. Rules engines
viii. Data mining algorithms
ix. Text mining algorithms
x. Data discovery technologies
xi. Data visualization technologies
xii. Reporting and analytical technologies
Case Studies
- population statistics from mobile phone traffic: "Persons and Places" project, OD matrix from mobile phone data
- business statistics produced by web mining: ICT survey, variable estimation using internet data
DWH IT environment (distributed computing platform):
- Oracle Exadata Database Machine
- software: PySpark, Spark MLlib, scikit-learn
- HUE (Hadoop User Experience): editors for Hive, Impala and Spark, SQL browser, and scheduler of jobs and workflows for HDFS, SQL tables, …
- Hadoop/Spark infrastructure based on 8 nodes
Invest in new IT tools and methodology
Case study 1: population statistics from mobile phone traffic
The case study focuses on the ISTAT project "Persons and Places", which compares two approaches to mobility-profile estimation: one based on administrative archives, one based on mobile phone data.
Items:
- analysis units: resident, embedded and daily city users
- OD matrix of daily mobility at municipality level
- calling data from mobile phone CDRs (Call Detail Records)
- classification based on an unsupervised learning process
- comparison of estimates
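The unsupervised classification step can be sketched with scikit-learn: call profiles are clustered with K-Means, the cluster prototypes are labelled (an expert step), and labels are propagated to every profile with a 1-Nearest-Neighbor classifier. The two-dimensional profile data and the label names below are hypothetical stand-ins for the real call-profile features.

```python
# Hypothetical sketch: K-Means prototypes + 1-NN label propagation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# toy call profiles forming two behavioural groups
profiles = np.vstack([rng.normal(0, 0.5, (50, 2)),
                      rng.normal(5, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)
prototype_labels = ["resident", "city_user"]       # expert labelling of prototypes
knn = KNeighborsClassifier(n_neighbors=1).fit(
    km.cluster_centers_, prototype_labels)
propagated = knn.predict(profiles)                 # 1-NN label propagation
```

The real pipeline runs the same steps over HDFS/RDD data on the Hadoop/Spark platform rather than in-memory NumPy arrays.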
Case study 1: Logical S-DWH processing life cycle.
[Diagram: CDR data is collected in the source layer into HDFS/RDD on the distributed computing platform; in the integration (data discovery) stage, individual call profiles are built and prototypes extracted with a K-Means algorithm; in the interpretation/analysis stage, prototypes are labelled against archetype definitions and labels propagated by 1-Nearest-Neighbor; the access layer exposes the ICP DWH, the MPT-OD and P&P-OD matrices and the population DWH/register, with analysis in R, SAS and Plotly (Python). Invest in new IT tools and methodology.]
Flow diagram of predictive modelling: preprocessing → learning → evaluation → prediction.
Logical DWH
Data virtualization enables the Logical DWH: focusing more on the logic of information than on data structures means adding semantic data abstraction based on:
- virtual management of any data
- a high quality level of metadata
- active system self-monitoring
- distributed processes (parallel processing)
- service-level tracking
Logical S-DWH layered architecture: ML flow diagram (preprocessing → learning → prediction → analysis).
[Diagram: combines the LSDW layer diagram of predictive modelling and the flow diagram of predictive modelling shown earlier.]