natalino-busa
DESCRIPTION
An overview of several technologies that shape the Big Data landscape. It opens with the technology challenges of Big Data, followed by key open-source components that help with various big data aspects such as OLAP, real-time online analytics, and machine learning on Map-Reduce. I conclude with an enumeration of the key areas where those technologies are most likely to unlock new opportunities for businesses.
Big data
The technology landscape and its applications.
Natalino Busa - 12 Feb. 2013
Outline
● Big Data: Who art thou?
● Big Data: The technology landscape
● Hadoop: Overview
● Analytics & Machine Learning
● Opportunities
Hype cycle on new IT technologies
Gartner 2012
What is big data?
Velocity Diversity Volume
Hardware Software Services
BIG DATA
DATA (structured and un-structured, Logs, ETL, social)
Marketing (e.g. Unica), Analytics (Tableau), Modeling (SAS)
RDBMS, OLAP, Messaging
Infrastructure, (Private) Cloud, Networking
Big Data Heat map
How big is big?
SkyTree (tm) defines the Analytics Requirements Index (ARI):
ARI = (# Rows × # Columns) / Time (secs)
where
# Rows = Number of records being analyzed
# Columns = Number of variables captured in each record
Time (secs) = The timeframe within which to complete the analysis
Example: to produce a personalized banner for each view (1000 views/sec), I need to analyze 100 variables on 1000 records (historic data) every 1 ms.
ARI = (1000*100)/0.001 = 100 M values/sec
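The ARI arithmetic can be checked with a few lines of Python. This is only a sketch of the formula as stated on the slide; the function name `analytics_requirements_index` is made up for illustration and is not part of any SkyTree product.

```python
def analytics_requirements_index(rows: int, columns: int, seconds: float) -> float:
    """Return the ARI: how many values must be analyzed per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return rows * columns / seconds


# Banner example from the slide: 1000 records, 100 variables, 1 ms budget.
ari = analytics_requirements_index(rows=1000, columns=100, seconds=0.001)
print(f"ARI = {ari / 1e6:.0f} M values/sec")  # prints: ARI = 100 M values/sec
```

Shrinking the time budget raises the index just as much as growing the data does, which is why real-time use cases count as "big" even on modest tables.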
What data?
Big Data can imply:
● Complex data refactoring in batch (lots of rows)
● Real-time event processing (high-speed responses)
● Multidimensional analysis (lots of parameters)
● ... or any combination of those three
[Figure axes: Parameters, Entities, Response time]
More data
Database Databases Federated Data Aggregated Data Linked Data Just Data
Structured Unstructured
customers
customers + products
customers + products + surveys
customers + products + surveys + transactions
customers + products + surveys + transactions + social messages
● In today's IT environments there is a gradual shift from structured data to unstructured data.
● RDBMSs are well suited to structured data, but they need more, and more complex, ETL; and how do they deal with new data (structures)?
● Map-Reduce and NoSQL systems are good with unstructured data, but how do we query and analyze this data?
Big Data: how to deal with it
● Big Data at rest (storage, access)
● Big Data in motion (streaming, dataflows)
● Big Data analytics (OLAP, OTAP, BI)
● Big Data modeling (predictive, machine learning)
Big Data at rest
Analytical RDBMSs (EDW): Oracle, IBM, and various MPPs
Hadoop Distributed Systems: HDFS (distributed file system), HBase (Big Table)
[Diagram: Logs, HDFS, EDW, Cassandra, HBase, Analytics; Batch vs. Real-time]
● Traditional EDW and distributed Big Data / NoSQL solutions are complementary to each other.
● These systems do not exclude each other and can coexist to form a full enterprise-level solution.
Big Data at rest
No need to get everything out of the Hadoop ecosystem:
NoSQL DBMSs:
Couchbase (++ reads, caching)
Cassandra (++ writes, OLAP)
... hybrid solutions are also possible:
HDFS + Cassandra: in-memory analytics + large DFS
HDFS + Solr/Lucene: fast text search on a distributed file system
Big Data in motion
Stream processing // Dataflow architectures
Used to support the automatic analysis of data-in-motion in real-time or near real-time.
- Identify meaningful patterns
- Trigger action to respond to them as quickly as possible
- Storm (from Twitter): dataflow processing framework, ++ multi-language
- Akka (from Typesafe): dataflow actor framework, ++ speed
Both are: distributed, fault-tolerant, streaming
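To make the "identify a pattern, then trigger an action" idea concrete, here is a minimal single-process sketch in plain Python. It does not use the Storm or Akka APIs; the function name `burst_alerts`, the window size, and the threshold are assumptions chosen for illustration only.

```python
from collections import deque
from typing import Iterable, Iterator


def burst_alerts(events: Iterable[int], window: int, threshold: int) -> Iterator[int]:
    """Yield the position of every event at which the sum of the last
    `window` values exceeds `threshold` (a toy 'meaningful pattern')."""
    recent = deque(maxlen=window)  # keeps only the newest `window` values
    for position, value in enumerate(events):
        recent.append(value)
        if sum(recent) > threshold:
            yield position  # the caller decides which action to trigger


# 1 marks an event of interest (e.g. a failed login), 0 anything else.
stream = [1, 0, 0, 1, 1, 1, 0, 0]
for position in burst_alerts(stream, window=3, threshold=2):
    print(f"burst detected at event {position}")
```

Because the generator consumes one event at a time and keeps only a bounded window, it never needs the whole stream in memory. A distributed framework adds partitioning and fault tolerance on top of the same basic shape.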
Big Data Landscape
[Diagram components: HDFS, Logs, HBase, EDW, sqoop, hiho, flume, REST, scribe, Cassandra, Hive, Pig, MapR, OTAP, Impala, SAS and R over HDFS, Mahout, OLAP, BI, Storm]
● Real-Time Analytics
● Streaming
● Batch Analytics
● Visualization
● Monitoring
● Marketing
Machine Learning on Big Data
[Diagram labels: FS, Unstructured, Unstructured, Data Interfaces]
Lambda Architecture
Logic layer: Software as a Service, e.g. real-time predictor
From http://www.manning.com/marz/
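As a rough illustration of the Lambda Architecture idea, the sketch below keeps a batch view that is recomputed from the full dataset and a real-time view that is updated per event, and answers queries by merging the two. All names (`build_batch_view`, `SpeedLayer`, `query`) are hypothetical, and page-view counting is just an assumed example workload.

```python
from collections import Counter
from typing import Iterable


def build_batch_view(master_dataset: Iterable[str]) -> Counter:
    """Batch layer: recompute page-view counts from the full, immutable dataset."""
    return Counter(master_dataset)


class SpeedLayer:
    """Speed layer: incrementally counts events not yet covered by a batch run."""

    def __init__(self) -> None:
        self.realtime_view: Counter = Counter()

    def on_event(self, page: str) -> None:
        self.realtime_view[page] += 1


def query(page: str, batch_view: Counter, speed_layer: SpeedLayer) -> int:
    """Serving side: merge the batch view with the real-time view."""
    return batch_view[page] + speed_layer.realtime_view[page]


batch_view = build_batch_view(["home", "home", "cart"])
speed = SpeedLayer()
speed.on_event("home")  # arrived after the last batch run
print(query("home", batch_view, speed))
```

In a real deployment the real-time view would be discarded each time a fresh batch view catches up; that bookkeeping is left out here to keep the sketch short.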
Why do machine learning on big data
http://www.skytree.net/why-do-machine-learning-on-big-data/
Machine Learning: What?
SIMILARITY SEARCH
Similarity search provides a way to find the objects that are the most similar, in an overall sense, to the object(s) of interest.
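One simple way to realize similarity search is to rank objects by cosine similarity between feature vectors. The sketch below is a brute-force version under that assumption; the function names and the toy catalog are illustrative and not taken from any particular product.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target: Sequence[float],
                 catalog: Dict[str, Sequence[float]],
                 k: int) -> List[Tuple[str, float]]:
    """Return the k catalog entries most similar to `target`, best first."""
    scored = [(name, cosine_similarity(target, vec)) for name, vec in catalog.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


catalog = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print([name for name, _ in most_similar([1.0, 0.1], catalog, k=2)])  # prints: ['a', 'c']
```

The brute-force scan touches every object per query, which is exactly the cost that becomes a problem at big-data scale and motivates indexed or distributed approaches.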
PREDICTIVE ANALYTICS
Predictive analytics is the science of analyzing current and historical facts/data to make predictions about future events.
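As a minimal illustration of predicting from historical data, the sketch below fits a straight line by ordinary least squares and extrapolates one step ahead. Linear regression is only one of many predictive techniques and is an assumption here, as are the function name `fit_line` and the toy numbers.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Historical facts: sales observed in weeks 1..4 (made-up numbers).
weeks = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(weeks, sales)
print(f"predicted sales for week 5: {slope * 5 + intercept:.1f}")  # prints: predicted sales for week 5: 11.0
```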
CLUSTERING AND SEGMENTATION
Cluster analysis and segmentation represent a purely data-driven approach to grouping similar objects, behaviors, or whatever else the data represents.
From http://www.skytree.net/why-do-machine-learning-on-big-data/use-cases/
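For the clustering case, a small k-means sketch shows what "purely data driven" grouping looks like in practice. K-means is an assumed choice of algorithm, the starting centroids are passed in explicitly so the run is deterministic, and all names are illustrative.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def kmeans(points: Sequence[Point], centroids: Sequence[Point],
           iterations: int = 10) -> Tuple[List[Point], List[int]]:
    """Run a fixed number of k-means rounds on 2-D points.

    Returns the final centroids and, for each point, the cluster index
    chosen in the last assignment step."""
    centroids = list(centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return centroids, labels


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # prints: [0, 0, 1, 1]
```

No labels are supplied up front: the two segments emerge from the distances alone.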
Word Counting on Map Reduce
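The word-count pattern can be sketched in plain Python as three phases: map, shuffle, reduce. This runs in a single process and does not use the Hadoop API; the function names are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: sum the values collected for one key."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


print(word_count(["big data", "Big Data landscape"]))
# prints: {'big': 2, 'data': 2, 'landscape': 1}
```

On a cluster, the map calls run in parallel on different file blocks and the shuffle moves each key to one reducer; the per-key logic stays as small as shown here.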
Machine learning on Map Reduce
From http://www.slideshare.net/hadoop/modeling-with-hadoop-kdd2011
Machine learning on Map Reduce
From http://www.slideshare.net/hadoop/modeling-with-hadoop-kdd2011
Machine Learning: Use Cases
E-Commerce / E-Tailing
● Product Recommendation Engines
● Cross Channel Analytics
● Events/Activity Behavior Segmentation
Product Marketing
● Campaign management and optimization
● Market and consumer segmentations
● Pricing Optimization
Customer Marketing
● Customer Churn Management
● (Mobile) User Behavior Prediction
● Offer Personalization
Big Data: Opportunities
Unstructured Data
● Clustering
● Distributed processing
● Distributed Storage
Modeling & Analytics
● Distributed Machine Learning
● Fast Online Analytics Cubes
Streaming and Real-Time processing
● Build RT profiles
● Decision trees and Predictions
● Offer Personalization
Thanks
linkedin:
www.linkedin.com/in/natalinobusa
blog:
www.natalinobusa.com