(Big) Data Management
Storage
Global Concepts in 5 slides 2016
Nicolas SARRAMAGNA
https://fr.linkedin.com/pub/nicolas-sarramagna/19/941/587
COMPAGNIE PLASTIC OMNIUM
CONFIDENTIAL
Storage in Data Management
DATA MANAGEMENT: multiple modules
BIG DATA: Velocity, Volume, Variety, Veracity, Value
Collect
Storage
Data Mining / Machine Learning
Data Viz
Governance
Security
Master Data
Data quality
Storage – What / Why
STORAGE = ABILITY TO store all types of data
structured data (as in an RDBMS)
semi-structured data (XML and JSON formats, mails, HTML pages, logs, sensor data)
unstructured data (text, files, videos, images)
Volume: storage of terabytes to petabytes
Enrich and categorize data with metadata
STORAGE = ALLOWS cross-data exploration, analysis and data mining -> new insights, broken silos
Deliver business data as self-service
Relieve the RDBMS and data warehouse of cold data and binary data
Support the RDBMS and data warehouse as a staging area for unstructured content
STORAGE = USAGE OF A DATA LAKE to complete the data architecture, not replace it
Large-scale storage repository
Volume - Variety
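As an illustration of "enrich and categorize data with metadata", here is a minimal sketch of a catalog entry that tags an incoming file with one of the three categories named on this slide. The extension mapping, the `CatalogEntry` class, the `catalog` helper and the example path are all hypothetical and only serve the illustration; a real data lake would rely on a dedicated catalog tool.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to the three categories on the slide.
STRUCTURE_BY_EXTENSION = {
    ".csv": "structured",        # assumption: tabular export from an RDBMS
    ".xml": "semi-structured",
    ".json": "semi-structured",
    ".html": "semi-structured",
    ".log": "semi-structured",
    ".txt": "unstructured",
    ".mp4": "unstructured",
    ".png": "unstructured",
}


@dataclass
class CatalogEntry:
    """One catalogued object: where it lives, its category, its metadata."""
    path: str
    structure: str
    metadata: dict = field(default_factory=dict)


def catalog(path: str, **metadata: str) -> CatalogEntry:
    """Categorize a file by extension and attach free-form metadata to it."""
    extension = PurePosixPath(path).suffix.lower()
    # Unknown extensions fall back to "unstructured" (an assumption of this sketch).
    structure = STRUCTURE_BY_EXTENSION.get(extension, "unstructured")
    return CatalogEntry(path=path, structure=structure, metadata=dict(metadata))


entry = catalog("sensors/2016-03-01.json", source="sensors", owner="manufacturing")
print(entry.structure)  # semi-structured
```

The metadata travels with the entry, so later exploration can filter by source or owner without opening the file itself.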
Storage – Data lake location
Existing tools & solutions: complete the architecture
[Diagram: position of the data lake next to the existing data warehouse. Labels recoverable from the figure:]
Traditional (schema on write, RDBMS): Operational Data Sources (structured data), Data Warehouse, DataMarts dedicated to specific needs (analysis, performance, security), dashboards. Flows: feed, query.
Data Lake (schema on read, RDBMS / NoSQL): data sources not yet used for BI, Data Lake (unstructured, semi-structured, structured, metadata), DataMarts, new business applications, dashboards. Flows: feed (push / pull); feed, bind, archive; query.
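The "schema on write" and "schema on read" labels of this slide can be contrasted with a minimal sketch. The column names, the sample records and both helper functions are hypothetical; the point is only that the warehouse side validates before storing, while the lake side stores raw lines and applies a schema at query time.

```python
import json

REQUIRED_COLUMNS = ("id", "temperature")  # hypothetical warehouse table schema


def write_with_schema(table: list, record: dict) -> None:
    """Schema on write: validate before storing and reject what does not fit."""
    missing = [c for c in REQUIRED_COLUMNS if c not in record]
    if missing:
        raise ValueError(f"record rejected, missing columns: {missing}")
    # Only the declared columns are kept; anything else is dropped at load time.
    table.append({c: record[c] for c in REQUIRED_COLUMNS})


def read_with_schema(raw_lines: list, columns: tuple) -> list:
    """Schema on read: raw lines were stored as-is; each query picks its columns."""
    rows = []
    for line in raw_lines:
        document = json.loads(line)
        rows.append({c: document.get(c) for c in columns})
    return rows


warehouse_table = []
write_with_schema(warehouse_table, {"id": 1, "temperature": 21.5, "unit": "C"})

lake_files = ['{"id": 1, "temperature": 21.5, "unit": "C"}', '{"id": 2, "humidity": 40}']
print(read_with_schema(lake_files, ("id", "humidity")))
# [{'id': 1, 'humidity': None}, {'id': 2, 'humidity': 40}]
```

The second raw record would have been rejected by `write_with_schema`, yet it remains available in the lake for a query that asks for a different set of columns.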
Storage – How – Technical perspective
USAGE OF HADOOP: standard product for putting a data lake in place
HADOOP = distributed file system (cluster) + parallel processing engine + additional tools (see image below)
Uses not a collection of dedicated servers but a collection of co-located CPU, RAM and local disks: low-cost commodity hardware
Horizontal elasticity: master node / data nodes architecture
Shared nothing: each data node is independent, with its own CPU, RAM and disks. Because HDFS replicates every data block on several nodes, no data is lost when a node breaks down.
Design for failure: when a node breaks down, the cluster continues to work
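Why a node failure loses no data can be shown with a toy simulation of block replication. The factor of 3 is the HDFS default (`dfs.replication`); the round-robin placement, the node names and both helper functions are assumptions of this sketch, since real HDFS placement is rack-aware and more elaborate.

```python
REPLICATION = 3  # HDFS default replication factor (dfs.replication)


def place_blocks(num_blocks: int, nodes: list, replication: int = REPLICATION) -> dict:
    """Toy placement: put each block on `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes), otherwise the copies would not be distinct.
    """
    placement = {}
    for block in range(num_blocks):
        placement[block] = {nodes[(block + i) % len(nodes)] for i in range(replication)}
    return placement


def readable_blocks(placement: dict, live_nodes: set) -> set:
    """A block stays readable while at least one node holding a copy is alive."""
    return {block for block, holders in placement.items() if holders & live_nodes}


nodes = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical data node names
placement = place_blocks(8, nodes)
live = set(nodes) - {"dn2"}            # one data node breaks down
print(len(readable_blocks(placement, live)))  # 8
```

With a single copy per block (`replication=1`), the same failure would make the blocks stored on `dn2` unreadable, which is the case replication is there to prevent.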
Storage – How – Functional perspective
MANAGE THE DATA LAKE SO IT DOES NOT BECOME A SWAMP: start small and smart, do not bring everything in
Classification of data (by metadata) is mandatory
Data can be queried in all layers
[Diagram: functional layers of the data lake. Labels recoverable from the figure:]
Layers: Raw data, Qualified Data, Integrated Data, Collaborative Data
Cross-cutting: ILM (Information Lifecycle Management), Metadata (operational metadata, contextual metadata), Governance & Security, Data quality
Outputs (query): Operational Reports; aggregated, summarized, classification data; BI Self Service
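The rule that classification by metadata is mandatory can be sketched as a gate at ingestion time, followed by promotion through the layers named on this slide. The required metadata keys, the linear order of the layers and the `ingest` and `promote` helpers are assumptions made for illustration; the slide does not prescribe them.

```python
# Assumed linear order of the layers named on the slide.
LAYERS = ["raw", "qualified", "integrated", "collaborative"]
# Hypothetical minimum metadata required before anything enters the lake.
REQUIRED_METADATA = {"source", "owner", "classification"}


def ingest(lake: dict, name: str, metadata: dict) -> None:
    """Accept a dataset into the raw layer only if it is fully classified."""
    missing = REQUIRED_METADATA - metadata.keys()
    if missing:
        raise ValueError(f"{name}: missing metadata {sorted(missing)}")
    lake[name] = {"layer": "raw", "metadata": dict(metadata)}


def promote(lake: dict, name: str) -> str:
    """Move a dataset to the next layer and return the new layer name."""
    position = LAYERS.index(lake[name]["layer"])
    if position + 1 >= len(LAYERS):
        raise ValueError(f"{name} is already in the last layer")
    lake[name]["layer"] = LAYERS[position + 1]
    return lake[name]["layer"]


lake = {}
ingest(lake, "machine_logs", {"source": "plant", "owner": "ops", "classification": "internal"})
print(promote(lake, "machine_logs"))  # qualified
```

An unclassified dataset never reaches the raw layer, which is one simple way to keep the lake from turning into a swamp.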