Steve Totman
Director Of Strategy
Syncsort
March 20th 2013
Making Hadoop Ready for Prime Time
Hadoop Summit Amsterdam March 2013
Photo Credit Aaron Sikkink http://www.flickr.com/people/housequakecom/
Syncsort Confidential and Proprietary - do not copy or distribute 3
The Big Data Continuum
Evolved, Dynamic, Plateauing, Advancing, Traditional
BI
Data Awakening
Big Data Continuum
Early Hadoop adoption prototyping & experimentation
Hand-coding: SQL, JCL. Basic ETL Tools
Standardization & Heavy Platforms. Demand for MF data
Hitting arch limits + exponential costs. Growing MIPS
Big Data is the new standard for both MF & open systems data
Challenges
Long development cycles
Unsustainable costs
Hadoop connectivity & sort gaps
Efficiency, ETL & skills gaps
Hand-coding nightmare
Value: Max, Min
Integrating Big Data… Smarter
DMExpress
MFX
SQL Migration
Hadoop ETL
Hadoop Sort & Connectivity
ETL & Rehosting Optimization
High-performance ETL
Mandatory sort steps in MapReduce processing
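To make the role of the sort step concrete, here is a minimal sketch of a word count in plain Python. It is an illustration of the general map, sort, reduce pattern only, not Hadoop code; the function names (map_phase, sort_phase, reduce_phase, word_count) are hypothetical and chosen for this example. The reduce step relies on equal keys being adjacent, which is why the sort between map and reduce cannot be skipped.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Emit (word, 1) pairs, as a word-count mapper would."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def sort_phase(pairs):
    """Order intermediate pairs by key so that equal keys become adjacent."""
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs):
    """Sum the values of each run of identical keys (requires sorted input)."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (key, sum(value for _, value in group))


def word_count(lines):
    """Chain the three phases: map, then the mandatory sort, then reduce."""
    return list(reduce_phase(sort_phase(map_phase(lines))))


if __name__ == "__main__":
    print(word_count(["big data", "Big sort"]))
```

Because groupby only merges neighbouring items, removing sort_phase would split a key into several partial counts, which is the single-machine analogue of why MapReduce sorts map output before reducers see it.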
Smart Contributions to Improve Hadoop
JIRA    Description
4807    Allow MapOutputBuffer to be pluggable
4808    Allow Reduce-side merge to be pluggable
4809    Make classes required for 2454 public
4812    Create reduce input merger plug-in
4842    Shuffle race can hang reducer
2461    HDFS file name globbing in libhdfs
4482    Backport of 2454 to MapReduce 1 & 1.2
…and more!!
Native Sort:
Not modular
Limited capabilities
Difficult to fine-tune & configure (requires coding & compilation)
[Diagram: two Hadoop nodes, each labeled Native Sort]
Contribution:
Modular
Extensible
Configurable through use of external sorters on MapReduce nodes
[Diagram: two Hadoop nodes, each labeled Native Sort]
First included in a Hadoop distribution, CDH4.2, on February 26th
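The contrast between a fixed native sort and a pluggable one can be sketched in a few lines of Python. This is an illustration of the design idea only and does not use Hadoop's actual plug-in interfaces; builtin_sorter, chunked_merge_sorter and run_reduce are hypothetical names invented for this example. The point is that the job takes the sorter as a parameter, so an alternative implementation can be swapped in without touching the rest of the pipeline.

```python
import heapq
from itertools import groupby
from operator import itemgetter

by_key = itemgetter(0)


def builtin_sorter(pairs):
    """Stand-in for the framework's built-in sort."""
    return sorted(pairs, key=by_key)


def chunked_merge_sorter(pairs, chunk_size=2):
    """Stand-in for an external sorter: sort small chunks, then merge them."""
    items = list(pairs)
    chunks = [
        sorted(items[i:i + chunk_size], key=by_key)
        for i in range(0, len(items), chunk_size)
    ]
    return list(heapq.merge(*chunks, key=by_key))


def run_reduce(pairs, sorter=builtin_sorter):
    """Sum values per key; the sort implementation is supplied by the caller."""
    return [
        (key, sum(value for _, value in group))
        for key, group in groupby(sorter(pairs), key=by_key)
    ]


if __name__ == "__main__":
    data = [("b", 1), ("a", 2), ("b", 3), ("c", 1), ("a", 1)]
    print(run_reduce(data))
    print(run_reduce(data, sorter=chunked_merge_sorter))
```

Both calls produce the same totals; only the sorting strategy differs, which is the property a pluggable sort is meant to preserve.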
[Chart: TeraSort Benchmark. x-axis: File Size (GB), 0 to 5000; y-axis: Elapsed Time (min), 0 to 250]
Benefits to the Community
JOIN
MERGE
AGGREGATION
CDC
COMPRESSION
LOOKUP
RANK
MATCH
50% Data Access:
Today
Run Mainframes
Syncsort. A Bridge to Scalable, Cost-effective Big Data
Connect
• HDFS Connectivity
• Mainframe
• Teradata
• Files
• RDBMS, Appliances
Pre-process
• Sort, Join
• Aggregate
• Compress
• Partition
Facilitate
• Graphical UI
• No Manual Coding
• No Tuning
Optimize
• Up to 6x Faster Load
• Up to 2x Faster Sort
• Faster MapReduce Jobs
• Less Storage
Over 40 Years Solving Big Data Challenges with Fast. Efficient. Simple.
Cost Effective DI Technology
© comScore, Inc. Proprietary. 12
Hourly Load into comScore’s Hadoop Cluster
Syncsort’s DMExpress saves comScore over 4 TB of data per day!
That’s 1,460 TB a year, or about 1.42 petabytes
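The yearly figure follows from simple arithmetic, sketched below. The inputs are assumptions taken from the slide's wording: a flat 4 TB per day (the slide says "over 4TB"), a 365-day year, and 1,024 TB per petabyte. The constant and function names are hypothetical.

```python
TB_PER_DAY = 4        # from the slide: "over 4TB of data per day"
DAYS_PER_YEAR = 365   # assumption: non-leap year
TB_PER_PB = 1024      # assumption: binary petabyte


def yearly_savings_tb(tb_per_day=TB_PER_DAY, days=DAYS_PER_YEAR):
    """Terabytes saved in a year at a constant daily rate."""
    return tb_per_day * days


def tb_to_pb(tb, tb_per_pb=TB_PER_PB):
    """Convert terabytes to petabytes."""
    return tb / tb_per_pb


yearly_tb = yearly_savings_tb()
yearly_pb = tb_to_pb(yearly_tb)

if __name__ == "__main__":
    print(f"{yearly_tb} TB per year is about {yearly_pb:.3f} PB")
```

Under these assumptions the result is 1,460 TB, a little over 1.42 PB, in line with the figure quoted on the slide.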
[Chart: Input Data in Bytes and Output Data in Bytes for hours 1 to 24; y-axis: 0 to 500,000,000,000 bytes]
comScore’s Daily Trend of Event Volume
[Chart: Beacon Records and Panel Records over time; axes: # of panel records and # of census records, with tick ranges 0 to 6,000,000,000 and 0 to 60,000,000,000]
Please attend Mike Brown’s session, Analyzing 1.4 Trillion Events with Hadoop, tomorrow
(No elephants were harmed during the creation of this talk but some are now a lot faster & meaner)
Please visit our booth to register for a free evaluation