Big data The technology landscape and its applications. Natalino Busa - 12 Feb. 2013

Big data landscape


DESCRIPTION

An overview of several technologies that contribute to the Big Data landscape. It opens with the technology challenges of Big Data, followed by key open-source components that help with aspects such as OLAP, real-time online analytics, and machine learning on Map-Reduce. I conclude with an enumeration of the key areas where these technologies are most likely to unlock new opportunities for businesses.


Page 1: Big data landscape

Big data: The technology landscape and its applications.

Natalino Busa - 12 Feb. 2013

Page 2: Big data landscape

Outline

● Big Data: Who art thou?
● Big Data: The technology landscape
● Hadoop: Overview
● Analytics & Machine Learning
● Opportunities


Page 3: Big data landscape

Hype cycle on new IT technologies

Gartner 2012


Page 4: Big data landscape

What is big data?

Velocity Diversity Volume

Hardware Software Services

BIG DATA

DATA (structured and un-structured, Logs, ETL, social)

Marketing (e.g. Unica), Analytics (Tableau), Modeling (SAS)

RDBMS, OLAP, Messaging

Infrastructure, (Private) Cloud, Networking


Page 5: Big data landscape

Big Data Heat map


Page 6: Big data landscape

How big is big?

ARI = (# Rows × # Columns) / Time (secs)

Where # Rows = Number of records being analyzed

# Columns = Number of variables captured in each record

Time (secs) = The timeframe within which to complete the analysis

SkyTree (tm) defines: Analytics Requirements Index (ARI)

Example: to produce a personalized banner for each view (1000 views/sec), I need to analyze 100 variables on 1000 records (historic data) every 1 ms.

ARI = (1000*100)/0.001 = 100 M values/sec
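The ARI arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `analytics_requirements_index` and its parameter names are illustrative choices, not part of SkyTree's definition.

```python
def analytics_requirements_index(rows: int, columns: int, seconds: float) -> float:
    """ARI = (rows * columns) / time, expressed in values per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (rows * columns) / seconds


# The banner example: 100 variables on 1000 records, every 1 ms.
ari = analytics_requirements_index(rows=1000, columns=100, seconds=0.001)
print(f"{ari / 1e6:.0f} M values/sec")  # prints "100 M values/sec"
```

Doubling either the number of records or the number of variables, or halving the time budget, doubles the index.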


Page 7: Big data landscape

What data?

Big Data can imply:

● Complex Data refactoring in Batch (lots of rows)
● Real-Time Event Processing (high-speed responses)
● Multidimensional analysis (lots of parameters)
● ... or any of those three


[Figure: three axes labeled Parameters, Entities, and Response time]

Page 8: Big data landscape

More data

Database Databases Federated Data Aggregated Data Linked Data Just Data

Structured Unstructured

customers

customers + products

customers + products + surveys

customers + products + surveys + transactions

customers + products + surveys + transactions + social messages

● In today's IT environments there is a gradual shift from structured data to unstructured data.

RDBMSs are well suited to structured data -> but: ETL grows larger and more complex; how do we deal with new data (structures)?

Map-Reduce and NoSQL systems are good with unstructured data -> but: how do we query and analyze this data?


Page 9: Big data landscape

Big Data: how to deal with it

● Big Data at rest (storage, access)
● Big Data in motion (streaming, dataflows)
● Big Data analytics (OLAP, OTAP, BI)
● Big Data modeling (predictive, machine learning)


Page 10: Big data landscape

Big Data at rest

Analytical RDBMSs (EDW): Oracle, IBM, and various MPPs

Hadoop distributed systems: HDFS (distributed file system), HBase (Bigtable)

[Diagram labels: HDFS, Logs, EDW, Cassandra, HBase, Analytics; axis: Batch to Real-time]


● Traditional EDW and distributed Big Data / NoSQL solutions are complementary to each other.

● These systems do not exclude each other and can coexist to form a full enterprise-level solution.

Page 11: Big data landscape

Big Data at rest

Not everything has to come from the Hadoop ecosystem.

NoSQL DBMSs:
● Couchbase (++ reads, caching)
● Cassandra (++ writes, OLAP)

... hybrid solutions are also possible:
● HDFS + Cassandra: in-memory analytics + large DFS
● HDFS + Solr/Lucene: fast text search on a distributed file system


Page 12: Big data landscape

Big Data in motion

Stream processing / dataflow architectures

Used to support the automatic analysis of data in motion in real time or near real time:

- Identify meaningful patterns
- Trigger actions to respond to them as quickly as possible

- Storm (from Twitter): dataflow processing framework, ++ multi-language
- Akka (from Typesafe): dataflow actor framework, ++ speed

Both are distributed, fault-tolerant, and streaming.
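The "identify a pattern, then trigger an action" idea can be sketched in plain Python with a sliding time window. This is a single-process illustration only: it does not use the Storm or Akka APIs, and the event shape `(timestamp, user)`, the function name `detect_bursts`, and the thresholds are all assumptions made for the example.

```python
from collections import deque


def detect_bursts(events, window_secs, threshold, on_burst):
    """Call on_burst(user, ts) whenever a user reaches `threshold`
    events within the last `window_secs` seconds.

    `events` is an iterable of (timestamp_in_seconds, user_id) tuples,
    assumed to arrive in timestamp order (hypothetical event shape).
    """
    recent = {}  # user_id -> deque of timestamps still inside the window
    for ts, user in events:
        window = recent.setdefault(user, deque())
        window.append(ts)
        # Evict timestamps that fell out of the sliding window.
        while window and ts - window[0] > window_secs:
            window.popleft()
        if len(window) >= threshold:
            on_burst(user, ts)  # the "trigger action" step
            window.clear()      # reset so one burst triggers one action


alerts = []
stream = [(0.0, "a"), (0.2, "b"), (0.4, "a"), (0.5, "a"), (5.0, "a"), (9.0, "a")]
detect_bursts(stream, window_secs=1.0, threshold=3,
              on_burst=lambda user, ts: alerts.append((user, ts)))
print(alerts)
```

In a distributed framework the per-user state would typically be partitioned by key across workers; here it simply lives in one dictionary.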


Page 13: Big data landscape

Big Data Landscape

[Diagram: the Big Data landscape]

Components: HDFS, Logs, HBase, EDW, sqoop, hiho, flume, REST, scribe, Cassandra, Hive, Pig, MapR, Impala, SAS, R over HDFS, Mahout, Storm

Labels: Data Interfaces, FS, Unstructured, OTAP, OLAP, BI, Machine Learning on Big Data

● Real-Time Analytics
● Streaming
● Batch Analytics
● Visualization
● Monitoring
● Marketing

Page 14: Big data landscape

Lambda Architecture

Logic layer

Software as a Service

e.g. real-time predictor

From http://www.manning.com/marz/
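As commonly described, the Lambda Architecture answers a query by merging a periodically recomputed batch view with an incrementally updated real-time view. The sketch below illustrates that merge for page-view counts. It is a toy, in-memory version under stated assumptions: the names `batch_view`, `SpeedLayer`, and `query`, and the page-view data, are hypothetical and not taken from the slide or the referenced book.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute counts from the full, immutable dataset."""
    return Counter(master_dataset)


class SpeedLayer:
    """Speed layer: count only events that arrived after the last batch run."""

    def __init__(self):
        self.view = Counter()

    def update(self, page):
        self.view[page] += 1


def query(page, batch, realtime):
    """Serving step: merge the batch view with the real-time view."""
    return batch[page] + realtime[page]


master = ["/home", "/home", "/cart"]   # already absorbed by the batch layer
batch = batch_view(master)

speed = SpeedLayer()
for page in ["/home", "/checkout"]:    # events newer than the last batch run
    speed.update(page)

print(query("/home", batch, speed.view))
```

Once a new batch run has absorbed the recent events, the real-time view can be discarded and rebuilt from that point on.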

Page 15: Big data landscape

Why do machine learning on big data


http://www.skytree.net/why-do-machine-learning-on-big-data/

Page 16: Big data landscape

Machine Learning: What?

SIMILARITY SEARCH

Similarity search provides a way to find the objects that are the most similar, in an overall sense, to the object(s) of interest.


PREDICTIVE ANALYTICS

Predictive analytics is the science of analyzing current and historical facts/data to make predictions about future events.

CLUSTERING AND SEGMENTATION

Cluster analysis and segmentation represent a purely data-driven approach to grouping similar objects, behaviors, or whatever is represented by the data.

From http://www.skytree.net/why-do-machine-learning-on-big-data/use-cases/
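To make the similarity search definition on this slide concrete, here is a minimal brute-force sketch that ranks items by cosine similarity to a query vector. The choice of cosine similarity, the function names, and the tiny `catalog` of feature vectors are assumptions for illustration; large-scale systems generally rely on indexing or approximation rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, items, k=1):
    """Return the names of the k items most similar to `query`."""
    ranked = sorted(items,
                    key=lambda name: cosine_similarity(query, items[name]),
                    reverse=True)
    return ranked[:k]


# Hypothetical feature vectors for three products.
catalog = {
    "laptop": [1.0, 0.9, 0.0],
    "tablet": [0.9, 1.0, 0.1],
    "blender": [0.0, 0.1, 1.0],
}
print(most_similar([1.0, 0.8, 0.0], catalog, k=2))
```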

Page 17: Big data landscape

Word Counting on Map Reduce
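A minimal single-process sketch of word counting in the map-reduce style, with the map, shuffle, and reduce phases written as separate functions. The function names and the two sample documents are illustrative; a real Hadoop job would distribute each phase across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


documents = ["big data at rest", "big data in motion"]
mapped = (pair for doc in documents for pair in map_phase(doc))
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each document is mapped independently and each word is reduced independently, both phases can run in parallel; only the shuffle needs to move data between workers.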


Page 18: Big data landscape

Machine learning on Map Reduce


From http://www.slideshare.net/hadoop/modeling-with-hadoop-kdd2011
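One common way to fit a model with map-reduce is to express each training step as a sum over the data: every mapper computes a partial gradient on its own partition and a reducer adds the partials together. The sketch below applies that pattern to a one-variable linear regression. It is an assumption-laden illustration, not the method of the referenced talk: the function names, learning rate, step count, and the four data points are all made up for the example.

```python
from functools import reduce


def partial_gradient(partition, w, b):
    """Map step: gradient of half the squared error, summed over one partition."""
    grad_w = grad_b = 0.0
    for x, y in partition:
        err = (w * x + b) - y
        grad_w += err * x
        grad_b += err
    return grad_w, grad_b, len(partition)


def combine(left, right):
    """Reduce step: add partial gradients and example counts."""
    return left[0] + right[0], left[1] + right[1], left[2] + right[2]


def fit(partitions, steps=2000, lr=0.05):
    """Batch gradient descent; each step is one map-reduce pass over the data."""
    w = b = 0.0
    for _ in range(steps):
        grad_w, grad_b, n = reduce(
            combine, (partial_gradient(p, w, b) for p in partitions))
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Two partitions of points lying on y = 2x + 1 (hypothetical data).
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
w, b = fit(partitions)
print(round(w, 3), round(b, 3))
```

Only the small gradient tuples cross partition boundaries, which is what makes this pattern a reasonable fit for data that is too large to move.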

Page 19: Big data landscape

Machine learning on Map Reduce

From http://www.slideshare.net/hadoop/modeling-with-hadoop-kdd2011

Page 20: Big data landscape

Machine Learning: Use Cases

E-Commerce / E-Tailing
● Product Recommendation Engines
● Cross Channel Analytics
● Events/Activity Behavior Segmentation

Product Marketing
● Campaign management and optimization
● Market and consumer segmentations
● Pricing Optimization

Customer Marketing
● Customer Churn Management
● (Mobile) User Behavior Prediction
● Offer Personalization


Page 21: Big data landscape

Big Data: Opportunities

Unstructured Data
● Clustering
● Distributed processing
● Distributed Storage

Modeling & Analytics
● Distributed Machine Learning
● Fast Online Analytics Cubes

Streaming and Real-Time processing
● Build RT profiles
● Decision trees and Predictions
● Offer Personalization


Page 22: Big data landscape

Thanks

linkedin:

www.linkedin.com/in/natalinobusa

blog:

www.natalinobusa.com