© comScore, Inc. Proprietary. Syncsort & MapR @ comScore. Michael Brown, CTO | July 9th, 2014

How to Succeed in Hadoop


Page 1: How to Succeed in Hadoop

Syncsort & MapR @ comScore

Michael Brown, CTO | July 9th, 2014

Page 2: How to Succeed in Hadoop

The comScore Story

Analytics for a Digital World™

Page 3: How to Succeed in Hadoop

The Digital World is Complex


Page 4: How to Succeed in Hadoop

comScore’s Mission

Be the Leader in Digital Media Analytics.

Measure all forms of media—content and advertising—at scale, across all platforms, in real-time, globally.

Page 5: How to Succeed in Hadoop

comScore Brings it Together

Tablet, PC/Mac, TV, Smartphone, Gaming

Page 6: How to Succeed in Hadoop

comScore is a leading internet technology company that provides Analytics for a Digital World™

NASDAQ: SCOR

Clients: 2,400+ worldwide

Employees: 1,200+

Headquarters: Reston, Virginia, USA

Global Coverage: Measurement from 172 countries; 44 markets reported

Local Presence: 32 locations in 23 countries

Page 7: How to Succeed in Hadoop

Providing Analytics for More Than 2,400 Clients Globally

Media, Agencies, Telecom/Mobile, Financial, Retail, Travel, CPG, Health, Technology

Page 8: How to Succeed in Hadoop

Turning Big Data into Powerful Insight

Platform: Collection, Calibration, Delivery

Census: Tags & Data Feeds

Panels: PC, iOS, Android

Survey: Non-behavioral elements

Methods: Aggregation, Dictionaries, Taxonomies

Models: Weighting, Projection, De-Duplication, Attribution

Syndicated Data: Media Metrix, vCE

Client Analytics Platform: Digital Analytix

Consulting

Analysis

Page 9: How to Succeed in Hadoop

Page 10: How to Succeed in Hadoop

Panel Heat Map

Page 11: How to Succeed in Hadoop

Average Records Captured per Day (2005-2009)

[Chart: average records captured per day, 9/26/2005 through 3/26/2009, with monthly ticks; y-axis from 0 to 1,800,000,000 records]

Page 12: How to Succeed in Hadoop

Unified Digital Measurement™ (UDM) Establishes Platform For Panel + Census Data Integration

Adopted by 90% of Top 100 U.S. Media Properties

PANEL

CENSUS

Unified Digital Measurement (UDM): Patent-Pending Methodology

Global PERSON Measurement

Global DEVICE Measurement

Page 13: How to Succeed in Hadoop

Beacon Heat Map

Page 14: How to Succeed in Hadoop

Monthly Records Collection

[Chart: monthly records collected, split into Beacon Records and Panel Records; y-axis from 0 to 2,000 billion records]

Total records collected in June 2014 = 1,726,563,202,649

Total records collected YTD 2014 = 10,037,131,368,475

Page 15: How to Succeed in Hadoop

DMX @ comScore

Page 16: How to Succeed in Hadoop

DMX use at comScore

Purchased our first 4 licenses in 2000!

We use DMX from Syncsort across hundreds of servers for efficient data processing and aggregation.

We currently run more than 100 unique jobs every day.

With these jobs we process over 150 billion rows of data through DMX!

Connect, Design, Process, Accelerate

Page 17: How to Succeed in Hadoop

Compression w/Sorting

Compress log files when processing large volumes of log data.

Several advantages to sorting data first:
• Reduces the size of the data
• Improves application performance

Example: 1 hour of one source of our data
• 2,315 GB raw (2.9 billion rows)
• Standard compression of time-ordered data is 509 GB (22% of original)
• Standard compression on a sorted set is 324 GB (14% of original)

When applied to all our sources we save:
• 5.0 TB per day
• 155 TB per month
• 460 TB per quarter
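The effect of sorting before compressing can be reproduced on a small scale. The sketch below is only an illustration under stated assumptions: the row layout (a cookie id and a URL), the row counts, and the use of `zlib` as the "standard compression" are all hypothetical and are not taken from comScore's pipeline.

```python
import random
import zlib


def compressed_size(rows):
    """Return the zlib-compressed size in bytes of the concatenated rows."""
    return len(zlib.compress("".join(rows).encode("utf-8"), 6))


def compare_orderings(n_rows=50_000, seed=7):
    """Build a synthetic log and compress it in arrival order and in sorted order."""
    rng = random.Random(seed)
    # Hypothetical log shape: one cookie id and one URL per row.
    cookies = [f"{rng.getrandbits(64):016x}" for _ in range(2_000)]
    urls = [f"/section/{i}/page.html" for i in range(200)]
    rows = [f"{rng.choice(cookies)}\t{rng.choice(urls)}\n" for _ in range(n_rows)]
    raw = sum(len(r) for r in rows)
    return raw, compressed_size(rows), compressed_size(sorted(rows))


raw, arrival, sorted_ = compare_orderings()
print(f"raw={raw}  arrival-order={arrival}  sorted={sorted_}")
```

Sorting places rows with the same cookie next to each other, so the compressor finds long repeated prefixes inside its window; the exact ratios depend on the data and will differ from the figures on the slide.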

Page 18: How to Succeed in Hadoop

Hadoop @ comScore

Page 19: How to Succeed in Hadoop

Why Hadoop?

• comScore built our own distributed computing stack in 2002.

• In 2009 we decided it was better to leverage the efforts of the Hadoop community instead of building our own stack.

• We recognized the benefit of switching to Hadoop, which would allow for seamless scaling of our infrastructure to meet the needs of the business.

• Hadoop allows us to add compute, storage and memory linearly and to process data at tremendous scale.

• Partnered with Syncsort on their Hadoop efforts since October 2010

• Evaluated the beta of MapR in the fall of 2011

Page 20: How to Succeed in Hadoop

90 Days of Data

[Chart: 90 days of data by year; x-axis labels 2009, 2010, 2011, 2012, 2013, 2014, 2016; y-axis from 0 to 6,000 trillion; data labels 1,148 / 1,919 / 3,049 / 4,862 / 5,084]

Page 21: How to Succeed in Hadoop

High Level Data Flow

Panel

Census

Custom Code +

ADW

EDW

Delivery

Page 22: How to Succeed in Hadoop

Our Cluster

Production Hadoop Cluster
• 400+ nodes: mix of Dell R720xd, R710 and R510 servers
• Each R720xd has 24 x 1.2 TB drives, 128 GB RAM, and 32 cores
• 13,800+ total CPUs
• 31.6 TB total memory
• 8.2 PB total disk space
• Our distro is MapR M5 2.1.3

Page 23: How to Succeed in Hadoop

Leveraging Partitions from MapR

Page 24: How to Succeed in Hadoop

Page 25: How to Succeed in Hadoop

Validation Funnel & Target Effectiveness

Page 26: How to Succeed in Hadoop

Our growth

As our volume has grown we have the following stats:
• Over 683 billion events per month
• Daily aggregate: 1.8 billion
• 160 billion aggregate records for 92 days
• 146K campaigns
• Over 50 countries
• We see 15 billion distinct cookies in a month
• We only need to output 26 million rows

Page 27: How to Succeed in Hadoop

Solution to reduce the shuffle

The Problem: Most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues.

The Idea:
• Partition and sort the data by cookie on a daily basis
• Create a custom InputFormat to merge daily partitions for monthly aggregations
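The idea of merging pre-sorted daily partitions can be sketched outside Hadoop. The following is a minimal, single-process illustration in Python, not the custom InputFormat itself; the `(cookie, count)` record shape and the function names are assumptions made for the example.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def sort_daily_partition(events):
    """Sort one day's (cookie, count) events by cookie, as a daily job would."""
    return sorted(events, key=itemgetter(0))


def merge_monthly(daily_partitions):
    """Stream-merge pre-sorted daily partitions and aggregate per cookie.

    Because every input is already sorted by cookie, heapq.merge yields all
    rows for a cookie consecutively, so each cookie is totaled in a single
    pass without a global re-sort.
    """
    merged = heapq.merge(*daily_partitions, key=itemgetter(0))
    for cookie, rows in groupby(merged, key=itemgetter(0)):
        yield cookie, sum(count for _, count in rows)


day1 = sort_daily_partition([("c2", 1), ("c1", 3), ("c2", 2)])
day2 = sort_daily_partition([("c3", 5), ("c1", 1)])
monthly = list(merge_monthly([day1, day2]))
print(monthly)  # [('c1', 4), ('c2', 3), ('c3', 5)]
```

The sort cost is paid once per day; the monthly pass is then a linear merge, which is what lets the aggregation happen on the map side.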

Page 28: How to Succeed in Hadoop

Custom Input Format with Map Side Aggregation

[Diagram: map, combiner, and reduce stages processing keys A, B, and C across three mappers, three combiners, and three reducers]

Page 29: How to Succeed in Hadoop

Risks for Partitioning

Data locality
• Custom InputFormat requires reading blocks of the partitioned data over the network
• This was solved using a feature of the MapR file system. We created volumes and set the chunk size to zero, which guarantees that the data written to a volume will stay on one node

Map failures might result in long run times
• Size of the map inputs is no longer set by block size
• This was solved by creating a large number (10K) of volumes to limit the size of data processed by each mapper
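One way to spread cookies over a fixed number of volumes is a stable hash. The slides do not say how cookies are assigned to volumes, so the sketch below is an assumption throughout: `volume_for`, the CRC32 hash, and the synthetic cookie ids are hypothetical and only show how a fixed volume count bounds the input per mapper.

```python
import zlib
from collections import Counter

NUM_VOLUMES = 10_000  # the slide mentions roughly 10K volumes


def volume_for(cookie: str, num_volumes: int = NUM_VOLUMES) -> int:
    """Map a cookie id to a volume index with a stable hash.

    zlib.crc32 is used instead of the built-in hash() because the built-in
    string hash is salted per process and would not repeat from day to day.
    """
    return zlib.crc32(cookie.encode("utf-8")) % num_volumes


cookies = [f"cookie-{i}" for i in range(200_000)]
load = Counter(volume_for(c) for c in cookies)
print(f"volumes used={len(load)}  largest={max(load.values())}  "
      f"mean={len(cookies) / NUM_VOLUMES:.1f}")
```

With a stable assignment, the same cookie lands in the same volume every day, which is what allows daily partitions to be merged per volume later.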

Page 30: How to Succeed in Hadoop

Partitioning Summary

Benefits:
• A large portion of the aggregation can be completed in the map phase
• Applications can now take advantage of combiners
• Shuffle sizes are minimal

Results:
• Took a job from 35 hours to 3 hours with no hardware changes
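Why partitioning shrinks the shuffle can be seen with a small simulation. This is a toy model under stated assumptions, not a measurement of the production job: the key names, split count, and event volume are hypothetical, and a "combiner" is modeled simply as emitting one record per distinct key in a split.

```python
import random
from collections import Counter


def shuffle_records(splits):
    """Count records leaving the map phase when each split runs a combiner.

    A combiner collapses a split to one record per distinct key, so the
    shuffle carries the sum of distinct keys over all splits.
    """
    return sum(len(Counter(key for key, _ in split)) for split in splits)


rng = random.Random(1)
keys = [f"cookie-{i}" for i in range(1_000)]
events = [(rng.choice(keys), 1) for _ in range(100_000)]
n_splits = 20

# Arrival order: every split sees almost every key.
arrival_splits = [events[i::n_splits] for i in range(n_splits)]

# Partitioned by key: each key lands in exactly one split.
partitioned = [[] for _ in range(n_splits)]
for key, value in events:
    partitioned[int(key.split("-")[1]) % n_splits].append((key, value))

print("arrival-order shuffle records:", shuffle_records(arrival_splits))
print("partitioned shuffle records:  ", shuffle_records(partitioned))
```

When each key lives in a single split, the combiner output is already the final per-key total, so the shuffle carries one record per key instead of one per key per split.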

Page 31: How to Succeed in Hadoop

DMX-h @ comScore

Page 32: How to Succeed in Hadoop

Reasons for comScore selecting DMX-h

Performance

• DMX-h as the pluggable sort in Hadoop allows us to increase throughput on our existing platform; this reduces capital and ongoing operational expenses

• The increase in throughput allows us to also deliver our data more quickly to our customers. These things make the data more valuable to our clients.

Speed of Development

• The ability to quickly build out applications in the DMX-h GUI allows us to iterate and respond more quickly to the needs of the business.

• The ease of development also allows us to democratize the access to the Hadoop platform by leveraging a point and click GUI.

Page 33: How to Succeed in Hadoop

Performance - DMX-h Pluggable Sort Testing Results

First comparison, run on our dev cluster

Pig scripts called with the Syncsort plug-in

GroupBy / Distinct Operations
• Counting uniques
• These have large shuffle steps, which leads to more data to sort.
• Observed up to a 20% decrease in job runtime

Filter Operations
• Searching for a specific value
• Observed a 5% – 10% decrease in job runtime
• Dependent on type of filter and size of job output

40 GB compressed data: base run is 86 min, test run is 68 min; savings of about 20%

Results from 7 nodes; 56 cores; 433 GB RAM; 28 TB disk; MapR M5 3.0.2; DMX-h 7.12
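The savings figure follows directly from the two runtimes quoted above. A small helper (the function name is hypothetical) makes the arithmetic explicit:

```python
def runtime_reduction(base_minutes: float, test_minutes: float) -> float:
    """Return the fractional decrease in runtime relative to the base run."""
    if base_minutes <= 0:
        raise ValueError("base_minutes must be positive")
    return (base_minutes - test_minutes) / base_minutes


saving = runtime_reduction(86, 68)
print(f"{saving:.1%}")  # 20.9%
```

The base run of 86 minutes and the test run of 68 minutes give a reduction of 18/86, which the slide rounds to 20%.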

Page 34: How to Succeed in Hadoop

Speed of Development - POC

We took an existing process that runs in our Hadoop cluster and converted that to DMX-h to validate the new capabilities.

The existing process:

• Written in 75 lines of Pig with 3 Java UDFs

• Developed in about 25 hours

• Processes 3.5 billion input rows per day

• Takes 35 minutes to run on a daily basis

Page 35: How to Succeed in Hadoop

DMX-h Process

Page 36: How to Succeed in Hadoop

Speed of Development - POC

The new process in DMX-h:

• Developed a new job with 13 tasks

• No Java UDF required

• Runs on the same data and in the same environment.

• Developed in 12 hours.

• Runs in 11 minutes! 1/3 of the time of the Pig & Java code.

Page 37: How to Succeed in Hadoop

Useful Factoids

Visit www.comscoredatamine.com or follow @datagems for the latest gems.

Colorful, bite-sized graphical representations of the best discoveries we unearth.

Page 38: How to Succeed in Hadoop

Thank You!

Michael Brown, CTO, comScore, Inc.

[email protected]

Page 39: How to Succeed in Hadoop

© 2014 MapR Technologies

Page 40: How to Succeed in Hadoop

Today’s Presenters

Steve Wooledge, VP - Product Marketing

@swooledge

Jorge Lopez, Director - Product Marketing

@zanilli

Mike Brown, CTO

Page 41: How to Succeed in Hadoop

comScore

Page 42: How to Succeed in Hadoop

Syncsort & MapR @ comScore

• Michael Brown, CTO | July 9th, 2014

Page 43: How to Succeed in Hadoop

Leveraging MapR and Syncsort

Page 44: How to Succeed in Hadoop

Trend 1: Big Data is Overwhelming Traditional Systems

Enterprise Data Architecture: enterprise users, operational systems, analytical systems, outside sources

Operational systems, production requirements:
• Mission-critical reliability
• Transaction guarantees
• Deep security
• Real-time performance
• Backup and recovery

Analytical systems, production requirements:
• Interactive SQL
• Rich analytics
• Workload management
• Data governance
• Backup and recovery

Page 45: How to Succeed in Hadoop

Trend 2: Hadoop: The Disruptive Technology at the Core of Big Data

[Chart: job trends from Indeed.com, Jan ’06 through Jan ’14]

Page 46: How to Succeed in Hadoop

Reality 1: Hadoop Relieves the Pressure from Enterprise Systems

Enterprise users, operational systems, analytical systems

• Data staging
• Archive
• Data transformation
• Data exploration
• Streaming, interactions

Keys for Production Success
1 Reliability and DR
2 Interoperability
3 High performance
4 Supports operations and analytics

Page 47: How to Succeed in Hadoop

Reality 2: Architecture Matters for Success

FOUNDATION
• Data protection & security
• High performance
• Multi-tenancy
• Operational & analytical workloads
• Open standards for integration

NEW APPLICATIONS, SLAs, TRUSTED INFORMATION, LOWER TCO

Page 48: How to Succeed in Hadoop

The Power of the Open Source Community

[Diagram: Apache Hadoop and OSS ecosystem on top of the MapR Data Platform, with Management alongside]

Categories: Execution Engines; Data Governance and Operations; Batch; Streaming; NoSQL & Search; ML, Graph; SQL; Data Integration & Access; Security; Workflow & Data Governance; Provisioning & Coordination

Components: YARN, MapReduce v1 & v2, Tez*, Pig, Cascading, Spark, Spark Streaming, Storm*, HBase, Solr, Accumulo*, Mahout, MLLib, GraphX, Hive, Impala, Shark, Drill*, Sentry*, Knox*, Oozie, Falcon*, ZooKeeper, Sqoop, Flume, Whirr, HttpFS, Hue, Juju, Savannah*

* Certification/support planned for 2014

Page 49: How to Succeed in Hadoop

MapR Distribution for Hadoop

[Diagram: the same Apache Hadoop and OSS ecosystem stack on the MapR Data Platform as on the previous slide]

Enterprise-grade
• High availability
• Data protection
• Disaster recovery

Interoperability
• Standard file access
• Standard database access
• Pluggable services
• Broad developer support

Security
• Enterprise security authorization
• Wire-level authentication
• Data governance

Operational
• Ability to support predictive analytics, real-time database operations, and high arrival rate data

Multi-tenancy
• Ability to logically divide a cluster to support different use cases, job types, user groups, and administrators

Performance
• 2X to 7X higher performance
• Consistent, low latency

Page 50: How to Succeed in Hadoop

MapR: Best Solution for Customer Success

Top Ranked

Exponential Growth

500+ Customers

Premier Investors

• 3X bookings Q1 ’13 – Q1 ’14
• 80% of accounts expand 3X
• 90% software licenses
• < 1% lifetime churn
• > $1B in incremental revenue generated by 1 customer

Page 51: How to Succeed in Hadoop

MapR and Syncsort Reference Architecture

Sources: relational, SaaS, mainframe; documents, emails; log files, clickstreams; blogs, tweets, link data

MapR Data Platform: MapR-DB, MapR-FS

MapR Distribution for Hadoop:
• Batch (MR, Spark, Hive, Pig, …)
• Interactive (Impala, Drill, …)
• Streaming (Spark Streaming, Storm…)

Data marts, data warehouse

Business Intelligence / Visualization

Page 52: How to Succeed in Hadoop

Do You Know Syncsort?

• Syncsort provides fast, secure, enterprise-grade software spanning “Big Iron to Big Data”

• Fastest sort technology in the market
• Powering 50% of mainframes’ sort

• A history of innovation
• 25+ issued & pending patents

• Large global customer base
• 12,000+ deployments in 80 countries and serving 87 of the Fortune 100

• First-to-market, fully integrated approach to Hadoop ETL

• A top-7 contributor to Hadoop, based on number of lines of code changed in 2013

Our customers are achieving the impossible, every day!

Key Partners

Page 53: How to Succeed in Hadoop

The Hadoop Challenge

Most organizations use Hadoop to Extract, Transform, Load (ETL)

COLLECT

PROCESS: Sort, Join, Aggregate, Copy, Merge

DISTRIBUTE

Page 54: How to Succeed in Hadoop

Turning Hadoop into a Feature-rich ETL Solution

DMX-h ETL

Collect
• Broad-based connectivity with automated parallelism
• Best-in-class mainframe data access & translation

Process & Distribute
• No manual coding. GUI for developing & maintaining MR jobs
• No code generation. Engine runs natively on each node
• Develop & test locally in Windows; run natively on Hadoop

Optimize & Secure
• Faster throughput per node
• Full support for Kerberos & LDAP
• Web-based monitoring console
• Sort-work compression for storage savings

Page 55: How to Succeed in Hadoop

A Roadmap to Hadoop Success

Agile Data Exploration & Visualization

Next‐gen Analytics

Cheap Storage

Offload Data Warehouse

Enabling The Data-driven Organization

Solving The Intractable IT Problem

Page 56: How to Succeed in Hadoop

MapR + Syncsort Solutions

Data Warehouse Optimization: Shift ELT Workloads to Hadoop

Mainframe Offload: Access, Translate & Analyze Mainframe Data with Hadoop

Click-stream Analysis: Collect, Process & Analyze More Data from Your Website

Page 57: How to Succeed in Hadoop

Q & A

Engage with us!

1. Download the MapR Sandbox for Hadoop: www.mapr.com/sandbox

2. Try Syncsort’s Hadoop ETL in the MapR Sandbox: www.syncsort.com/mapr

3. Learn best practices for Hadoop ETL: www.mapr.com/EDH