How to Succeed in Hadoop

© comScore, Inc. Proprietary.

Syncsort & MapR @ comScore

Michael Brown, CTO | July 9th, 2014


The comScore Story

Analytics for a Digital World™


The Digital World is Complex



comScore’s Mission

Be the Leader in Digital Media Analytics.

Measure all forms of media—content and advertising—at scale, across all platforms, in real-time, globally.


comScore Brings it Together

Tablet, PC/Mac, TV, Smartphone, Gaming



comScore is a leading internet technology company that provides Analytics for a Digital World™

NASDAQ: SCOR
Clients: 2,400+ worldwide
Employees: 1,200+
Headquarters: Reston, Virginia, USA
Global Coverage: Measurement from 172 countries; 44 markets reported
Local Presence: 32 locations in 23 countries



Providing Analytics for More Than 2,400 Clients Globally

Media, Agencies, Telecom/Mobile, Financial, Retail, Travel, CPG, Health, Technology



Turning Big Data into Powerful Insight

Stages: Collection, Calibration, Delivery
• Census: Tags & Data Feeds
• Panels: PC, iOS, Android
• Survey: Non-behavioral elements
• Methods: Aggregation, Dictionaries, Taxonomies
• Models: Weighting, Projection
• De-Duplication, Attribution
• Syndicated Data Platform: Media Metrix, vCE
• Client Analytics Platform: Digital Analytix
• Consulting, Analysis



Panel Heat Map


Average Records Captured per Day (2005-2009)

[Chart: y-axis 0 to 1,800,000,000 records per day; x-axis monthly from 9/26/2005 to 3/26/2009]


CENSUS

Unified Digital Measurement™ (UDM) Establishes Platform For Panel + Census Data Integration

Adopted by 90% of Top 100 U.S. Media Properties

PANEL

Unified Digital Measurement (UDM): Patent-Pending Methodology

Global PERSON Measurement

Global DEVICE Measurement



Beacon Heat Map


Monthly Records Collection

[Chart: number of records per month, y-axis 0 to 2,000 billion; series: Beacon Records, Panel Records]

• Total records collected in June 2014 = 1,726,563,202,649
• Total records collected YTD 2014 = 10,037,131,368,475


DMX @ comScore


DMX use at comScore

Purchased our first 4 licenses in 2000!

We use DMX from Syncsort across hundreds of servers for efficient data processing and aggregation.

We currently run more than 100 unique jobs every day.

With these jobs we process over 150 billion rows of data through DMX!

Connect, Design, Process, Accelerate


Compression w/Sorting

Compress log files when processing large volumes of log data.

Several advantages to sorting data first:
• Reduces the size of the data
• Improves application performance

Example: 1 hour of one source of our data
• 2,315 GB raw (2.9 billion rows)
• Standard compression of time-ordered data is 509 GB (22% of original)
• Standard compression on a sorted set is 324 GB (14% of original)

When applied to all our sources we save:
• 5.0 TB per day
• 155 TB per month
• 460 TB per quarter
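The effect of sorting before compressing can be reproduced on a small scale. The sketch below is an illustration only: it uses synthetic rows and Python's `zlib` rather than DMX or real log data, and the row layout (`cookie_id,url`), the row counts and the helper names are all assumptions made for the example.

```python
import random
import zlib

def make_log_rows(n_rows=20000, n_cookies=2000, seed=7):
    """Build synthetic log rows in arrival (time) order: 'cookie_id,url'."""
    rng = random.Random(seed)
    # Hypothetical 32-character hex cookie ids; real ids will look different.
    cookies = ["%032x" % rng.getrandbits(128) for _ in range(n_cookies)]
    urls = ["/page/%d" % i for i in range(20)]
    return ["%s,%s\n" % (rng.choice(cookies), rng.choice(urls))
            for _ in range(n_rows)]

def compressed_size(rows):
    """Size in bytes of the rows after standard DEFLATE compression."""
    return len(zlib.compress("".join(rows).encode("ascii"), 6))

rows = make_log_rows()
raw_size = sum(len(r) for r in rows)
time_ordered = compressed_size(rows)          # compress in arrival order
sorted_first = compressed_size(sorted(rows))  # sort by cookie, then compress

print("raw bytes:             ", raw_size)
print("time ordered, zlib:    ", time_ordered)
print("sorted by cookie, zlib:", sorted_first)
```

Sorting places rows with the same cookie next to each other, so the compressor finds repeated ids within its limited look-back window. The exact ratios depend on the data and will not match the figures on the slide.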


Hadoop @ comScore


Why Hadoop?

• comScore built our own distributed computing stack in 2002.

• In 2009 we decided it was better to leverage the efforts of the Hadoop community instead of building our own stack.

• We recognized the benefit of switching to Hadoop which would allow for seamless scaling of our infrastructure to meet the needs of the business.

• Hadoop lets us add compute, storage and memory linearly and process data at tremendous scale.

• Partnered with Syncsort on their Hadoop efforts since October 2010

• Evaluated the beta of MapR in the fall of 2011


90 Days of Data

[Chart: y-axis 0 to 6,000 trillion; x-axis years 2009 to 2016; data labels 1,148, 1,919, 3,049, 4,862 and 5,084]


High Level Data Flow

[Diagram labels: Panel, Census, Custom Code +, ADW, EDW, Delivery]


Our Cluster

Production Hadoop Cluster
• 400+ nodes: mix of Dell R720xd, R710 and R510 servers
• Each R720xd has 24 x 1.2 TB drives, 128 GB RAM and 32 cores
• 13,800+ total CPUs
• 31.6 TB total memory
• 8.2 PB total disk space
• Our distro is MapR M5 2.1.3


Leveraging Partitions from MapR



Validation Funnel & Target Effectiveness


Our growth

As our volume has grown we have the following stats:
• Over 683 billion events per month
• Daily aggregate: 1.8 billion
• 160 billion aggregate records for 92 days
• 146K campaigns
• Over 50 countries
• We see 15 billion distinct cookies in a month
• We only need to output 26 million rows


Solution to reduce the shuffle

The Problem:
• Most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues

The Idea:
• Partition and sort the data by cookie on a daily basis
• Create a custom InputFormat to merge daily partitions for monthly aggregations
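The idea can be sketched outside Hadoop. The snippet below is a plain-Python stand-in, not comScore's InputFormat: it assumes each daily partition is a list of `(cookie, event_count)` pairs already sorted by cookie, and the function names and sample data are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_daily_partitions(daily_partitions):
    """Merge per-day lists, each already sorted by cookie, into one cookie-ordered stream."""
    return heapq.merge(*daily_partitions, key=itemgetter(0))

def aggregate_by_cookie(merged):
    """Sum event counts per cookie while streaming; rows for one cookie are adjacent."""
    for cookie, rows in groupby(merged, key=itemgetter(0)):
        yield cookie, sum(count for _, count in rows)

# Hypothetical daily partitions: (cookie, event_count), sorted by cookie within each day.
day1 = [("a", 2), ("b", 1), ("d", 5)]
day2 = [("a", 1), ("c", 4)]
day3 = [("b", 3), ("d", 1)]

monthly = list(aggregate_by_cookie(merge_daily_partitions([day1, day2, day3])))
print(monthly)
```

Because every input is already sorted, the merge never needs to hold more than one row per day in memory, and all rows for a cookie arrive together. That is what lets most of the monthly aggregation finish on the map side instead of in the shuffle.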


Custom Input Format with Map Side Aggregation

[Diagram: Map, Combiner and Reduce stages over key groups A, B and C]


Risks for Partitioning

Data locality
• Custom InputFormat requires reading blocks of the partitioned data over the network
• This was solved using a feature of the MapR file system. We created volumes and set the chunk size to zero, which guarantees that the data written to a volume will stay on one node

Map failures might result in long run times
• Size of the map inputs is no longer set by block size
• This was solved by creating a large number (10K) of volumes to limit the size of data processed by each mapper
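One way to picture the volume layout is a stable hash from cookie to volume index. The slides do not say how cookies are assigned to volumes, so the CRC32 scheme, the function names and the sample data below are assumptions for illustration only.

```python
import zlib
from collections import defaultdict

NUM_VOLUMES = 10_000  # the slide cites roughly 10K volumes

def volume_for(cookie, num_volumes=NUM_VOLUMES):
    """Map a cookie to a volume index with a stable hash (hypothetical scheme)."""
    return zlib.crc32(cookie.encode("utf-8")) % num_volumes

def split_into_volumes(records, num_volumes=NUM_VOLUMES):
    """Group (cookie, payload) records by the volume their cookie hashes to."""
    volumes = defaultdict(list)
    for cookie, payload in records:
        volumes[volume_for(cookie, num_volumes)].append((cookie, payload))
    return volumes

# 5,000 synthetic records spread over 500 distinct cookies, split into 16 volumes.
records = [("cookie-%d" % (i % 500), i) for i in range(5000)]
volumes = split_into_volumes(records, num_volumes=16)

# Each cookie lives in exactly one volume, so per-volume unique counts can be added.
uniques_per_volume = {v: len({c for c, _ in rows}) for v, rows in volumes.items()}
total_uniques = sum(uniques_per_volume.values())
print(total_uniques)
```

With a fixed number of volumes, the amount of data any one mapper reads is bounded by the size of one volume rather than by a file-system block size. Keeping each cookie in a single volume also makes per-volume results safe to combine.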


Partitioning Summary

Benefits:
• A large portion of the aggregation can be completed in the map phase
• Applications can now take advantage of combiners
• Shuffle sizes are minimal

Results:
• Took a job from 35 hours to 3 hours with no hardware changes


DMX-h @ comScore


Reasons for comScore selecting DMX-h

Performance

• DMX-h as the pluggable sort in Hadoop allows us to increase throughput on our existing platform; this reduces capital and ongoing operational expenses

• The increase in throughput allows us to also deliver our data more quickly to our customers. These things make the data more valuable to our clients.

Speed of Development

• The ability to quickly build out applications in the DMX-h GUI allows us to iterate and respond more quickly to the needs of the business.

• The ease of development also allows us to democratize the access to the Hadoop platform by leveraging a point and click GUI.


Performance - DMX-h Pluggable Sort Testing Results

First Comparison Run on our Dev Cluster

Pig scripts run with the Syncsort plug-in

GroupBy / Distinct Operations
• Counting uniques
• These have large shuffle steps, which leads to more data to sort
• Observed up to a 20% decrease in job runtime

Filter Operations
• Searching for a specific value
• Observed a 5% – 10% decrease in job runtime
• Dependent on type of filter and size of job output

40 GB compressed data: base run is 86 min, test run is 68 min; savings of 20%

Results from 7 nodes; 56 cores; 433 GB RAM; 28 TB disk; MapR M5 3.0.2; DMX-h 7.12
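The quoted savings follow from simple arithmetic on the two runtimes. The helper names below are made up for this sketch; the 86 and 68 minute figures are the ones reported above.

```python
def runtime_reduction(base_minutes, test_minutes):
    """Fraction of runtime saved relative to the base run."""
    return (base_minutes - test_minutes) / base_minutes

def speedup(base_minutes, test_minutes):
    """How many times faster the test run is than the base run."""
    return base_minutes / test_minutes

base, test = 86, 68  # minutes, from the dev-cluster comparison
print("reduction: %.1f%%" % (100 * runtime_reduction(base, test)))
print("speedup:   %.2fx" % speedup(base, test))
```

The reduction works out to just under 21%, which the slide rounds to 20%.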


Speed of Development - POC

We took an existing process that runs in our Hadoop cluster and converted that to DMX-h to validate the new capabilities.

The existing process:

• Written in 75 lines of Pig with 3 Java UDFs

• Developed in about 25 hours

• Processes 3.5 billion input rows per day

• Takes 35 minutes to run on a daily basis


DMX-h Process


Speed of Development - POC

The new process in DMX-h:

• Developed a new job with 13 tasks

• No Java UDF required

• Runs on the same data and in the same environment.

• Developed in 12 hours.

• Runs in 11 minutes! About 1/3 of the time of the Pig & Java code.


Useful Factoids

Visit www.comscoredatamine.com or follow @datagems for the latest gems.

Colorful, bite-sized graphical representations of the best discoveries we unearth.


Thank You!

Michael Brown, CTO, comScore, Inc.

[email protected]

© 2014 MapR Technologies


Today’s Presenters

• Steve Wooledge, VP - Product Marketing (@swooledge)
• Jorge Lopez, Director - Product Marketing (@zanilli)
• Mike Brown, CTO, comScore




Leveraging MapR and Syncsort


Trend 1: Big Data is Overwhelming Traditional Systems

Enterprise Data Architecture: ENTERPRISE USERS, OUTSIDE SOURCES

OPERATIONAL SYSTEMS production requirements:
• Mission-critical reliability
• Transaction guarantees
• Deep security
• Real-time performance
• Backup and recovery

ANALYTICAL SYSTEMS production requirements:
• Interactive SQL
• Rich analytics
• Workload management
• Data governance
• Backup and recovery


Trend 2: Hadoop: The Disruptive Technology at the Core of Big Data

[Chart: job trends from Indeed.com, Jan ‘06 to Jan ‘14]


Reality 1: Hadoop Relieves the Pressure from Enterprise Systems

[Diagram labels: OPERATIONAL SYSTEMS, ANALYTICAL SYSTEMS, ENTERPRISE USERS]
• Data staging
• Archive
• Data transformation
• Data exploration
• Streaming, interactions

Keys for Production Success
1 Reliability and DR
2 Interoperability
3 High performance
4 Supports operations and analytics


Reality 2: Architecture Matters for Success

FOUNDATION
• Data protection & security
• High performance
• Multi-tenancy
• Operational & Analytical Workloads
• Open standards for integration

NEW APPLICATIONS, SLAs, TRUSTED INFORMATION, LOWER TCO


The Power of the Open Source Community

[Diagram: Apache Hadoop and OSS ecosystem on the MapR Data Platform, with Management alongside]
• Sections: Execution Engines; Data Governance and Operations
• Categories: Batch; Streaming; NoSQL & Search; ML, Graph; SQL; Data Integration & Access; Workflow & Data Governance; Provisioning & Coordination; Security
• Components: YARN, MapReduce v1 & v2, Pig, Cascading, Spark, Spark Streaming, Storm*, HBase, Solr, Accumulo*, Mahout, MLLib, GraphX, Hive, Impala, Shark, Drill*, Tez*, Sqoop, Flume, HttpFS, Hue, Oozie, Falcon*, Juju, Savannah*, Whirr, ZooKeeper, Sentry*, Knox*

* Certification/support planned for 2014


MapR Distribution for Hadoop


Enterprise-grade
• High availability
• Data protection
• Disaster recovery

Interoperability
• Standard file access
• Standard database access
• Pluggable services
• Broad developer support

Security
• Enterprise security authorization
• Wire-level authentication
• Data governance

Operational
• Ability to support predictive analytics, real-time database operations, and support high arrival rate data

Multi-tenancy
• Ability to logically divide a cluster to support different use cases, job types, user groups, and administrators

Performance
• 2X to 7X higher performance
• Consistent, low latency


MapR: Best Solution for Customer Success

• Top Ranked
• Exponential Growth
• 500+ Customers
• Premier Investors
• 3X bookings Q1 ‘13 – Q1 ‘14
• 80% of accounts expand 3X
• 90% software licenses
• < 1% lifetime churn
• > $1B in incremental revenue generated by 1 customer


MapR and Syncsort Reference Architecture

Sources
• RELATIONAL, SAAS, MAINFRAME
• DOCUMENTS, EMAILS
• LOG FILES, CLICKSTREAMS
• BLOGS, TWEETS, LINK DATA

MapR Data Platform (MAPR DISTRIBUTION FOR HADOOP)
• Batch (MR, Spark, Hive, Pig, …)
• Interactive (Impala, Drill, …)
• Streaming (Spark Streaming, Storm…)
• MapR-DB, MapR-FS

DATA MARTS, DATA WAREHOUSE

Business Intelligence / Visualization


Do You Know Syncsort?

• Syncsort provides fast, secure, enterprise-grade software spanning “Big Iron to Big Data”
• Fastest sort technology in the market
  • Powering 50% of mainframes’ sort
• A history of innovation
  • 25+ issued & pending patents
• Large global customer base
  • 12,000+ deployments in 80 countries and serving 87 of the Fortune 100
• First-to-market, fully integrated approach to Hadoop ETL
• Among the top 7 contributors to Hadoop, based on number of lines of code changed in 2013

Our customers are achieving the impossible, every day!

Key Partners


The Hadoop Challenge

Most organizations use Hadoop to… Extract, Transform, Load (ETL)
• COLLECT
• PROCESS: Sort, Join, Aggregate, Copy, Merge
• DISTRIBUTE


Turning Hadoop into a Feature-rich ETL Solution

DMX-h ETL: Collect, Process & Distribute, Optimize & Secure

Collect
• Broad based connectivity with automated parallelism
• Best in class mainframe data access & translation

Process & Distribute
• No manual coding. GUI for developing & maintaining MR jobs
• No code generation. Engine runs natively on each node
• Develop & test locally in Windows; run natively on Hadoop

Optimize & Secure
• Faster throughput per node
• Full support for Kerberos & LDAP
• Web-based monitoring console
• Sort-work compression for storage savings


A Roadmap to Hadoop Success

• Agile Data Exploration & Visualization
• Next-gen Analytics
• Cheap Storage
• Offload Data Warehouse

Enabling The Data-driven Organization

Solving The Intractable IT Problem


MapR + Syncsort Solutions

• Data Warehouse Optimization: Shift ELT Workloads to Hadoop
• Mainframe Offload: Access, Translate & Analyze Mainframe Data with Hadoop
• Click-stream Analysis: Collect, Process & Analyze More Data from Your Website


Q & A

Engage with us!

1. Download the MapR Sandbox for Hadoop: www.mapr.com/sandbox

2. Try Syncsort’s Hadoop ETL in the MapR Sandbox: www.syncsort.com/mapr

3. Learn best practices for Hadoop ETL: www.mapr.com/EDH
