The Holy Grail of Data Analytics

THE HOLY GRAIL OF DATA ANALYTICS

Dan Lynn, CEO

• Data Services • Data Strategy • Data Integration / BI / Analytics • Modernize Data Infrastructures • Custom Applications & APIs

• Distributed over 6 states! • Fully-virtualized staff

www.agildata.com

Dan Lynn, CEO

• Co-Founder @ FullContact • 15 years building data systems • Techstars 2011 • dan@agildata.com

All product names, logos, and brands are property of their respective owners. All company, product and service names used are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.

Free MySQL Performance Analyzer

www.agildata.com/gibbs

AgilData Scalable Cluster

TRADE-OFFS

OLTP vs OLAP

OLTP OVERVIEW

• Online Transaction Processing

• Database is optimized for low latency access to current data

• Short transactions (INSERT, UPDATE, DELETE)

• High concurrency

• Examples:

• Add item to shopping cart

• Reset password
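As a rough illustration of the "add item to shopping cart" example, the sketch below runs one short transaction against an in-memory SQLite database. The `cart_item` table, its columns, and the `add_item_to_cart` helper are hypothetical names chosen for this example, not something from the talk.

```python
import sqlite3


def add_item_to_cart(conn, cart_id, sku, qty):
    """Insert a cart line, or bump its quantity, in one short transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE cart_item SET qty = qty + ? WHERE cart_id = ? AND sku = ?",
            (qty, cart_id, sku),
        )
        if cur.rowcount == 0:  # no existing line for this item: insert one
            conn.execute(
                "INSERT INTO cart_item (cart_id, sku, qty) VALUES (?, ?, ?)",
                (cart_id, sku, qty),
            )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cart_item ("
    "cart_id INTEGER, sku TEXT, qty INTEGER, PRIMARY KEY (cart_id, sku))"
)

add_item_to_cart(conn, 1, "mug", 1)
add_item_to_cart(conn, 1, "mug", 2)

cart_rows = conn.execute(
    "SELECT sku, qty FROM cart_item WHERE cart_id = 1"
).fetchall()
print(cart_rows)
```

Each call touches a single row located through the primary key, which is the access pattern an OLTP database is tuned for.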

OLAP OVERVIEW

• Online Analytical Processing

• Database is optimized for aggregation of historical data

• Aggregations can span millions or billions of records

• Low(er) concurrency

• Examples:

• What is our average shopping cart size, grouped by week and by affiliate?

• What are the top 5 paths that users take when navigating our website?
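To make the first example question concrete, here is a minimal sketch that computes average cart size grouped by ISO week and affiliate over a handful of in-memory records. The record layout and the sample values are assumptions made up for illustration; a real OLAP system would run the same kind of grouping over millions or billions of rows.

```python
from collections import defaultdict
from datetime import date

# Hypothetical historical records: (checkout date, affiliate, items in cart)
carts = [
    (date(2016, 6, 6), "a", 3),
    (date(2016, 6, 8), "a", 5),
    (date(2016, 6, 14), "a", 2),
    (date(2016, 6, 7), "b", 4),
]


def average_cart_size(records):
    """Average cart size keyed by (ISO year, ISO week, affiliate)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of sizes, cart count]
    for day, affiliate, size in records:
        iso = day.isocalendar()
        key = (iso[0], iso[1], affiliate)
        totals[key][0] += size
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


weekly_averages = average_cart_size(carts)
for key in sorted(weekly_averages):
    print(key, weekly_averages[key])
```

Note that the query has to read every record to answer the question, in contrast to the single-row access of an OLTP transaction.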

HOW DATABASES OPTIMIZE FOR OLTP

• Optimized for reading or updating an entire row • (e.g. the full customer record)

• Data is written to and read from disk on a row-by-row basis.

• Indexes are used to construct a full business object from multiple tables via JOINs. • (e.g. SELECT * FROM order o JOIN customer c ON c.id = o.customer_id)

• Hadoop and NoSQL systems generally behave the same.

• Scan performance is limited
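The index-driven JOIN described above can be sketched with an in-memory SQLite database. The schema and sample data are assumptions for illustration, and the order table is named `orders` here only because `ORDER` is a reserved word in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    -- The index lets the database find one customer's orders without a scan.
    CREATE INDEX idx_orders_customer ON orders (customer_id);
    INSERT INTO customer VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, 40.0), (12, 1, 5.0);
    """
)

# Assemble the "business object" for customer 1 from both tables.
order_rows = conn.execute(
    "SELECT o.id, c.name, o.total "
    "FROM orders o JOIN customer c ON c.id = o.customer_id "
    "WHERE c.id = ? ORDER BY o.id",
    (1,),
).fetchall()
print(order_rows)
```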

HOW DATABASES OPTIMIZE FOR OLAP

• Optimized for aggregating columns • (e.g. SELECT AVG(unit_price * qty) FROM order_line GROUP BY c.id)

• Data is laid out on disk on a per-column basis. • Great for scans, not so good for random row-level access

• Doesn’t support random UPDATEs
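The difference between the two layouts can be sketched with plain Python structures. This is only an in-memory analogy for on-disk layout, and the `order_line` fields and values are invented for the example.

```python
# Row-oriented layout: each record is stored together.
rows = [
    (1, "widget", 2.5, 4),
    (2, "gadget", 10.0, 1),
    (3, "widget", 2.5, 4),
]

# Column-oriented layout: each column is stored together.
columns = {
    "id": [1, 2, 3],
    "product": ["widget", "gadget", "widget"],
    "unit_price": [2.5, 10.0, 2.5],
    "qty": [4, 1, 4],
}


def avg_line_total(cols):
    """AVG(unit_price * qty): reads only the two columns it needs."""
    totals = [p * q for p, q in zip(cols["unit_price"], cols["qty"])]
    return sum(totals) / len(totals)


def fetch_row(cols, position):
    """Rebuilding one full row has to touch every column."""
    return tuple(cols[name][position] for name in ("id", "product", "unit_price", "qty"))


avg_total = avg_line_total(columns)
second_row = fetch_row(columns, 1)
print(avg_total, second_row)
```

The aggregate never looks at `id` or `product`, which is why scans are cheap in a columnar layout, while `fetch_row` shows why row-level access costs more there than in the row-oriented list.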

HOW HADOOP OPTIMIZES FOR OLAP

• Data is partitioned in HDFS into append-only blocks (64MB by default in Hadoop 1.x, 128MB in Hadoop 2.x).

• These blocks are spread out across the cluster.

• Processing (i.e. queries) is sent to the data, instead of bringing the data to the application for processing.

• Columnar data formats like Parquet can be stored on HDFS for very fast scan performance.

• Updates are very expensive.
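The idea of sending the processing to the data can be sketched as a map step that runs once per block and a reduce step that combines the small partial results. This is a single-process toy, not Hadoop code; the function names and the block size of 4 records are assumptions for illustration.

```python
def split_into_blocks(records, block_size):
    """Partition the data set into fixed-size blocks (stand-in for HDFS blocks)."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def map_block(block):
    """Runs where the block is stored; returns a small partial (sum, count)."""
    return (sum(block), len(block))


def reduce_partials(partials):
    """Combines the partials; only these small tuples cross the network."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


values = list(range(1, 11))
blocks = split_into_blocks(values, 4)
partials = [map_block(block) for block in blocks]
average = reduce_partials(partials)
print(blocks, partials, average)
```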

SCAN PERFORMANCE VS UPDATABILITY

[Figure: balanced scales weighing database scan performance against updatability]

THE LAMBDA ARCHITECTURE

• Data Stream (Kafka, etc…)

• Write to HDFS → Batch Computation (MapReduce, Spark) → Batch Views

• Speed Layer (Storm, Spark Streaming, Flink, etc…) → Real-time views

• Serving Layer (HBase, MySQL, PostgreSQL, etc…)
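One way to picture the serving layer is as a query that merges a batch view with a real-time view. The sketch below assumes page-view counts as the metric and uses hypothetical dictionaries in place of HBase, MySQL, or PostgreSQL tables.

```python
# Batch view: counts computed by the last batch run over all data in HDFS.
batch_view = {"page_a": 100, "page_b": 40}

# Real-time view: counts from the speed layer for events since that batch run.
realtime_view = {"page_a": 3, "page_c": 1}


def serve(key):
    """Answer a query by combining the batch and real-time views."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


merged = {key: serve(key) for key in sorted(set(batch_view) | set(realtime_view))}
print(merged)
```

The cost of this design is that the same computation lives in two places, the batch job and the speed layer, and both must be kept consistent.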

APACHE KUDU

• Apache Project (incubating)

• Started at Cloudera, growing industry adoption.

• Currently v0.9.1

• 1.0 release likely coming out in September 2016

Source: http://www.slideshare.net/cloudera/kudu-new-hadoop-storage-for-fast-analytics-on-fast-data

APACHE KUDU USE CASES

• Online Reporting

• Examples: Operational Data Store, Customer-facing analytics, real-time dashboards

• Workload: Inserts, updates, scans, random lookups

• Time Series

• Examples: Market analytics, fraud detection, risk monitoring, message queueing

• Workload: Inserts, updates, scans, random lookups

• Machine Data Analysis

• Examples: Network threat detection, devops monitoring and alerting

• Workload: Inserts, scans, random lookups
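The mixed workload in these use cases (inserts, updates, scans, random lookups) can be sketched with a toy in-memory store that keeps column arrays for scans plus a primary-key index for lookups and updates. This is purely illustrative: `ToyMutableColumnStore` and its methods are hypothetical and do not represent the Kudu API or its storage design.

```python
class ToyMutableColumnStore:
    """Hypothetical store: per-column arrays plus a primary-key index."""

    def __init__(self, column_names):
        self.columns = {name: [] for name in column_names}
        self.position_of_key = {}  # primary key -> position in every column
        self.num_rows = 0

    def upsert(self, key, **values):
        """Insert a new row, or update the given columns of an existing row."""
        position = self.position_of_key.get(key)
        if position is None:
            self.position_of_key[key] = self.num_rows
            for name, column in self.columns.items():
                column.append(values.get(name))
            self.num_rows += 1
        else:
            for name, value in values.items():
                self.columns[name][position] = value

    def lookup(self, key):
        """Random lookup: rebuild one row through the primary-key index."""
        position = self.position_of_key[key]
        return {name: column[position] for name, column in self.columns.items()}

    def scan_sum(self, column_name):
        """Scan: aggregate a single column without touching the others."""
        return sum(self.columns[column_name])


store = ToyMutableColumnStore(["host", "cpu"])
store.upsert("m1", host="web-1", cpu=40)
store.upsert("m2", host="web-2", cpu=70)
store.upsert("m1", cpu=55)  # update in place

looked_up = store.lookup("m1")
cpu_total = store.scan_sum("cpu")
print(looked_up, cpu_total)
```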

THE ROAD AHEAD

• Reactive processing

• Dynamic / intelligent indexing

• High performance mutable message queueing

LINKS

• Kudu project website: http://kudu.apache.org/

• Details about OLTP vs OLAP workloads: http://datawarehouse4u.info/OLTP-vs-OLAP.html

• Analyst perspective on Kudu: http://www.dbms2.com/2015/09/28/introduction-to-cloudera-kudu/

www.agildata.com

dan@agildata.com

@danklynn

Thanks!

CREDITS

• Grail image: https://upload.wikimedia.org/wikipedia/commons/1/10/London-Victoria_and_Albert_Museum-Grail-02.jpg

• Balanced scales: https://commons.wikimedia.org/wiki/File:Balanced_scale_of_Justice.svg
