Your Trusted Third Party in the Digital Age™
Scalding on Tez
Twitter HQ, July 14th, 2015
Copyright © 2015 Transparency Rights Management. All rights reserved
Agenda
• Who’s this guy?
• How did we come to use Scalding?
• Scalding on Tez: the Mini-HOWTO
• In practice
• Tips and Tricks
• All aboard: how?
• Performance
WHO’S THIS GUY?
Who’s this guy?
• I’m 39
• My oldest computer is 33
• 8-bit Basic(s), Z80 assembly, Turbo Pascal, C++, Python, Java, ISO CNC, C#, Scala
• Still afraid of Shapeless
Images: Amos Evans / « Rama » / Marcin Wichary // Wikipedia
HOW DID WE COME TO SCALDING?
Why Scalding? Transparency Rights Management:
• A Trusted Third Party
  – Data escrow, controlled execution
  – Independent re-computation
  – Privacy & Personal Data compliance assessment
• Big Data Services for Entertainment
  – Metadata enrichment
  – IP use certification
  – Dataset analysis as a service
Why Scalding? « Big Data Services for Entertainment »: a use case
[Diagram labels:]
• Digital Service Provider Report
• Copyright Owners / Collective Management Organizations
• Data Improvement
• Automatic Data Feed (« in your format »)
• Independent Report
• Conformance Report
Why Scalding? Dataset analysis (from YouTube monthly reports)
• September 2013: SQL Server overheats
• October 2013: using Lingual
  – 12 SQL steps + bash scripts
• September 2014: Cascading + Java
• September 28th: tried out Scalding
• November 2014: delivered first results on Scalding
• April 2015: first success on Scalding+Tez
Our system…
[Architecture diagram. Labels: Jenkins, Jenkins Slave, git, Artifactory, 4-way Non-Reg, Ansible, Mesos, Chronos, Marathon, Myriad, YARN 2.6.0, YARN RM, HDFS 2.6.0, Debian (×5), APP (scalding, cascading), APP (WS; Akka, Spray).]
Our system… 7 machines, and still a lot of things to discover
SCALDING ON TEZ, THE MINI-HOWTO
Scalding on Tez, the mini-HOWTO
• Step 0: Prerequisites:
  – A YARN cluster (2.6.0)
  – Cascading 3.0
  – TEZ runtime lib in HDFS (0.6.2-SNAPSHOT)
  – A version of Scalding with fabric selection (0.13.1 + PR1220)
Scalding on Tez, the mini-HOWTO
• Step 1: build.sbt
https://github.com/cchepelov/wcplus/blob/master/build.sbt
Scalding on Tez, the mini-HOWTO
• Step 1: build.sbt (redux)
1. Regain control over which libraries are included
2. Exclude some « long transitive » dependencies that pull in junk
3. Put in the desired fabric, in a configurable way:
   sbt -DCASCADING_FABRIC=hadoop clean assembly
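A minimal sketch of item 3, not the contents of the linked build.sbt: the fabric name is read from the CASCADING_FABRIC system property and turned into the matching Cascading artifact. The fallback fabric, the version number and the value names `cascadingFabric` and `cascadingVersion` are assumptions made for illustration.

```scala
// build.sbt fragment (sketch; assumptions are flagged inline)

// Fabric chosen on the command line, e.g. sbt -DCASCADING_FABRIC=hadoop2-tez clean assembly
// The fallback value is an assumption, not the author's default.
val cascadingFabric: String = sys.props.getOrElse("CASCADING_FABRIC", "hadoop2-tez")

// Assumed version; use whichever Cascading 3.0.x release your setup targets.
val cascadingVersion = "3.0.1"

libraryDependencies ++= Seq(
  "cascading" % "cascading-core" % cascadingVersion,
  // Resolves to cascading-hadoop, cascading-hadoop2-mr1, cascading-hadoop2-tez, ...
  "cascading" % s"cascading-$cascadingFabric" % cascadingVersion
)
```

Keeping the fabric in a single property means the same source tree can produce one fat jar per fabric without editing the build file.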
Scalding on Tez, the mini-HOWTO
• Step 1bis: assembly.sbt
We’re using fat jars to simplify deployment.
Because of jar hell, we « need » a complicated assembly.sbt:
https://github.com/cchepelov/wcplus/blob/master/assembly.sbt
Scalding on Tez, the mini-HOWTO
• Step 2: a few job flags
https://github.com/cchepelov/wcplus/blob/master/src/main/scala/com/transparencyrights/demo/wcplus/CommonJob.scala
• Step 2: a few job flags (continued)
• tez.task.resource.memory.mb
  – As large as you can afford to give, per CPU per node
  – The more memory, the less Tez needs to spill intermediates to disk
• tez.container.max.java.heap.fraction
  – The defaults (1024 MiB container, 0.8 heap fraction) assume the JVM’s native memory requirements don’t exceed about 205 MiB, i.e. 1024 × (1 - 0.8)
  – Scalding + the Scala runtime + Cascading on top of Tez seem to require more. YARN kills offenders swiftly!
  – The 460 MiB figure we’re using, (1024+512) × (1 - 0.7), may be a bit wasteful
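The headroom arithmetic behind these two flags can be checked with a few lines of Scala. This is only a sketch: the object and function names are made up for illustration, and the settings map simply restates the two property names with the values implied by (1024+512) and 0.7. With the default values the headroom works out to 204.8 MiB, and with the tuned values to 460.8 MiB.

```scala
object TezMemorySketch {
  // Native (non-heap) headroom left in a container, in MiB:
  // container size times the fraction NOT given to the Java heap.
  def nativeHeadroomMiB(containerMiB: Int, heapFraction: Double): Double =
    containerMiB * (1.0 - heapFraction)

  // The two properties discussed above, with the tuned values (1024+512 MiB, 0.7).
  val tunedSettings: Map[String, String] = Map(
    "tez.task.resource.memory.mb" -> (1024 + 512).toString,
    "tez.container.max.java.heap.fraction" -> "0.7"
  )

  def main(args: Array[String]): Unit = {
    val defaults = nativeHeadroomMiB(1024, 0.8)
    val tuned = nativeHeadroomMiB(
      tunedSettings("tez.task.resource.memory.mb").toInt,
      tunedSettings("tez.container.max.java.heap.fraction").toDouble)
    println(f"default headroom: $defaults%.1f MiB, tuned headroom: $tuned%.1f MiB")
  }
}
```

If YARN still kills containers, lowering the heap fraction or raising the container size both increase the headroom; the function makes it easy to compare candidate pairs before a run.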
THAT’S IT.
(ALMOST)
IN PRACTICE…
« A VERSION OF SCALDING WITH FABRIC SELECTION »
WAIT, WHAT?
In practice: « A version of Scalding with fabric selection ». Wait, what?
Scalding’s traditional --local and --hdfs flags:
  – Use either LocalFlowConnector or HadoopFlowConnector
  – Types are hard-coded
Cascading 2.5 introduced a new fabric concept. You can run either with cascading-hadoop or with cascading-hadoop2-mr1. But:
  – Incompatible jars (can’t load both)
  – Main types visible to Scalding are different
In practice: « A version of Scalding with fabric selection ». Wait, what?
PR1220:
• No longer hardcodes « either Local or Hadoop 1.X »
• Enables supplying any flow connector implementation, as long as the jar’s around
• --hdfs to be deprecated as an alias to --hadoop1
• Still built against Cascading 2.6
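The idea of picking a flow connector by name, « as long as the jar’s around », can be sketched in plain Scala. This is not the code of PR1220: the fabric keys and the object and method names are hypothetical, and only the connector class names come from the Cascading fabric jars. The classpath check uses reflection, so no Cascading type needs to be referenced at compile time.

```scala
object FabricSketch {
  // Fabric name -> Cascading flow connector class.
  // The keys are hypothetical labels chosen for this sketch.
  val connectors: Map[String, String] = Map(
    "local"       -> "cascading.flow.local.LocalFlowConnector",
    "hadoop"      -> "cascading.flow.hadoop.HadoopFlowConnector",
    "hadoop2-mr1" -> "cascading.flow.hadoop2.Hadoop2MR1FlowConnector",
    "hadoop2-tez" -> "cascading.flow.tez.Hadoop2TezFlowConnector"
  )

  def connectorClassFor(fabric: String): Option[String] = connectors.get(fabric)

  // True only if the class can be found on the classpath (without initializing it).
  def isAvailable(className: String): Boolean =
    try { Class.forName(className, false, getClass.getClassLoader); true }
    catch { case _: ClassNotFoundException => false }

  // Fabrics whose connector jar is actually present at run time.
  def availableFabrics: Seq[String] =
    connectors.collect { case (name, cls) if isAvailable(cls) => name }.toSeq.sorted
}
```

Resolving the connector by class name is what lets a single build stay independent of the mutually incompatible fabric jars.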
« STILL BUILT ON CASCADING 2.6 »
WHY?
In practice: « Still built on Cascading 2.6 »
Cascading 3.0 has carefully updated some argument types to prepare for the future. In Java, this is source- and binary-compatible:
[Diagram: in Java, the same library consumer works against the library and against library V2.]
Scala enforces generic type safety, and the Cascading 3.0 upgrades are not legal with scalac. But they still are with the JVM…
In practice… Going to native Cascading 3.0?
Scalding will require some adjustment to become compatible with the Java-level source upgrades.
Can this happen without breaking Scalding application source code?
GUAVA
GUAVA GUAVA
In practice… Guava.
• Guava is a nice library… of little use in Scala (?)
• In a Scalding/Cascading/Tez JVM, multiple versions of Guava are required: each layer depends on its own version, covering just about every version from 11.0 to 16.0.2
• There have been breaking changes (method renames & removals) in Guava 13
• These happen on really mundane objects (Closeables, Stopwatch), but they’re major troublemakers
In practice… Guava Hell: a temporary solution
• Asking Apache to quickly upgrade to Guava 18, or Google to re-introduce deprecated interfaces… probably not immediate
• Solution: Frankenguava.
[Diagram: the Guava 18.0 JAR, with Stopwatch & Closeables swapped for versions including the deprecated overloads.]
In practice… Frankenguava: howto
• Step 1: Post-prepare the Tez runtime
  – Build Tez from source
  – Unpack the runtime jar from tez-dist
  – Remove guava
  – Put in frankenguava
  – Repack
  – Deploy on HDFS
• Step 2: Enforce the use of the appropriate guava
CASCADING’S TEZ*REGISTRY
In practice… Cascading’s Tez*Registry
• Cascading 3.0 uses a set of mapping registries to convert Cascading patterns into the back-end API. The Tez registries are new, and distinct from the MR registries.
• The Tez registries are hardened against Concurrent’s extensive test library, which is built on years of MR experience. Tez has its own trouble spots: beware of hash joins.
• It works fine now, but getting the Scalding test library on board will go a long way.
In practice… Cascading’s Tez*Registry (last-minute update)
• It works mostly fine now, but getting the Scalding test library on board will go a long way.
• .filterWithValue / .mapWithValue currently crash the Cascading planner (as of 3.0.1); their implementation uses a HashJoin.
AN EXAMPLE
A small test: « wc plus »
• Input: 70 books, 1.1M lines, 10M words, 56M bytes
• ComputeFrequencies, ignoring things that are more frequent than 80% of the max word frequency
• Outputs:
  – Word, relative frequency, deviation from median relative freq
  – Two Words, relative frequency, deviation from median relative freq
  – …
  – Ten Words, relative frequency, deviation from median relative freq
  – All Expressions (1-W to 10-W), relative frequency, deviation from median relative freq
No .filterWithValue / .mapWithValue for now
[Diagram: four « count » steps. Image: Roulex45 / Wikipedia]
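To make the « wc plus » statistics concrete, here is a small in-memory sketch in plain Scala collections. It is not the Scalding job from the wcplus repository. The names are invented for illustration, and two points are assumptions: relative frequency is taken over all expressions before filtering, and the 80% cut-off is applied to raw counts.

```scala
object WcPlusSketch {
  final case class Stat(expression: String, relativeFreq: Double, deviation: Double)

  // All runs of n consecutive words, joined with a space.
  def ngrams(words: Seq[String], n: Int): List[String] =
    if (n <= 0 || words.length < n) Nil
    else words.sliding(n).map(_.mkString(" ")).toList

  def median(values: Seq[Double]): Double = {
    require(values.nonEmpty, "median of an empty sequence is undefined")
    val sorted = values.sorted
    val mid = sorted.length / 2
    if (sorted.length % 2 == 1) sorted(mid) else (sorted(mid - 1) + sorted(mid)) / 2.0
  }

  // Relative frequency and deviation from the median relative frequency,
  // ignoring expressions more frequent than `cutoff` times the top count.
  def computeFrequencies(expressions: Seq[String], cutoff: Double = 0.8): List[Stat] =
    if (expressions.isEmpty) Nil
    else {
      val counts = expressions.groupBy(identity).map { case (e, occ) => e -> occ.size }
      val maxCount = counts.values.max
      val kept = counts.filter { case (_, c) => c <= cutoff * maxCount }
      if (kept.isEmpty) Nil
      else {
        val total = expressions.size.toDouble
        val relative = kept.map { case (e, c) => e -> c / total }
        val med = median(relative.values.toSeq)
        relative.toList.sortBy(_._1).map { case (e, f) => Stat(e, f, f - med) }
      }
    }
}
```

For example, `WcPlusSketch.computeFrequencies(WcPlusSketch.ngrams(words, 2))` gives the « Two Words » output for a tokenized text `words`.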
TIPS & TRICKS
Tips & Tricks: 0000-step-node-sub-graph.dot
Run your job with
  -Dcascading.planner.plan.path=/tmp/path/to/plan.lst
The planner will output a lot of useful files. One of them is
  …/$(Job)/4-final-flow-steps/0000-step-node-sub-graph.dot
Run that file through graphviz:
  dot -O -Tpdf 0000-step-node-sub-graph.dot
or, if the PDF is illegible, Firefox is great at zooming into SVG files:
  dot -O -Tsvg 0000-step-node-sub-graph.dot
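If the plan graph gets rendered often, the graphviz call can be wrapped in a few lines of Scala. This is only a convenience sketch: the object and method names are invented, and it assumes the `dot` executable is on the PATH. Note the plain ASCII hyphens in the options.

```scala
import scala.sys.process._

object PlanGraphSketch {
  // Command line equivalent to: dot -O -T<format> <dotFile>
  def dotCommand(format: String, dotFile: String): Seq[String] =
    Seq("dot", "-O", s"-T$format", dotFile)

  // Runs graphviz and returns its exit code (0 on success).
  // Requires `dot` on the PATH; an IOException is thrown otherwise.
  def render(format: String, dotFile: String): Int =
    dotCommand(format, dotFile).!

  def main(args: Array[String]): Unit = {
    val dotFile = args.headOption.getOrElse("0000-step-node-sub-graph.dot")
    val exit = render("svg", dotFile)
    println(s"dot exited with code $exit")
  }
}
```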
Tips & Tricks: 0000-step-node-sub-graph.dot
This is how TEZ names our stuff!
Tips & Tricks: major differences between how a Cascading job gets mapped to MR and to TEZ
MR
  – One flow, many (MANY) independent steps
  – One or more operators per step
  – Step-to-step communications involve disk (HDFS)
  – Each step is independent as far as MR is concerned
  – Step scheduling managed from outside the cluster, by Cascading
TEZ
  – One flow, one DAG. A DAG includes several nodes.
  – One or more operators per node
  – Node-to-node communications managed by TEZ: memory, direct network or disk as necessary
  – YARN sees one « Application » per flow
  – Node scheduling managed by the TEZ DAG AppMaster
Tips & Tricks: yarn-swimlanes.sh
• A tool included in the Tez source distribution, in tez-tools/swimlanes (bash + python)
• Requires YARN ATS to work; « yarn logs -applicationId application_1345431315_1511 » must work
• Reports the per-container occupation as a Gantt chart
Tips & Tricks: yarn-swimlanes.sh (2)
[Figure: application_1435150225179_0474.svg]
Tips & Tricks: yarn-swimlanes.sh (3)
[Figure: swimlane chart; axes: time, containers]
Tips & Tricks: consider using .forceToDisk to ensure work is balanced within the DAG
[Figure: two runs compared, 890 seconds vs. 160 seconds]
Tips & Tricks
• Consider using .forceToDisk to ensure work is balanced within the DAG
• .forceToDisk really means « don’t merge those two TEZ nodes », which implies « manage appropriate data transmission between these two nodes »
• TextFile & other FixedPathSource friends don’t seem to automatically spread out work as well as they used to (huh?)
• YMMV, WIP.
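As an illustration of where a .forceToDisk call might sit, here is a sketch of a typed Scalding word count with an explicit boundary between the tokenizing stage and the counting stage. It is not taken from the wcplus job: the class name, the `tokenize` helper and the `input` / `output` argument names are placeholders, and whether the boundary helps depends on the job.

```scala
import com.twitter.scalding._

// Sketch of a Scalding job with an explicit boundary between two stages.
class ForceToDiskSketch(args: Args) extends Job(args) {

  // Stage 1: read lines and tokenize.
  val words: TypedPipe[String] =
    TypedPipe.from(TextLine(args("input")))
      .flatMap(line => ForceToDiskSketch.tokenize(line))
      .forceToDisk // ask the planner not to fuse this stage with what follows

  // Stage 2: count occurrences of each word.
  words
    .groupBy(identity)
    .size
    .toTypedPipe
    .write(TypedTsv[(String, Long)](args("output")))
}

object ForceToDiskSketch {
  // Lower-cases a line and splits it on whitespace, dropping empty tokens.
  def tokenize(line: String): List[String] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).toList
}
```

Comparing the swimlane chart of a run with and without the boundary is a quick way to see whether the work ends up better spread across containers.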
ALL ABOARD: HOW?
All aboard: how? Smoothing out the UX for us app developers
• A build of Scalding against Cascading 3.0.x
  – Fabric-switching logic
  – Get the test library to pass also on Tez
  – Some applications might still uncover new mapping issues: increased community test case experience
  – ???
• Getting the « guava mess » fixed
  – Ideally all of Apache goes to recent guavas
  – Enforced shading of Guava across the whole stack?
  – Failing that, automated runtime patcher? (my « build stuff » partner makes me write: OSGI/Java9)
  – ???
• Except for that, Tez is really easy for a YARN shop. Drop it in, and it runs!
PERFORMANCE
Performance: MR vs TEZ
Performance: MR vs TEZ; to scale
Performance: MR vs TEZ; TO SCALE!!!
• MR run time: 14:22 (wall), 12:49 (cluster time), 5:43:26 (total CPU)
• TEZ run time: 4:03 (wall), 2:50 (cluster time), 1:25:35 (total CPU)
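Turning these timings into ratios takes a few lines of Scala. The sketch below (object and function names are made up for illustration) parses the clock strings and divides MR by TEZ, which gives roughly 3.5× on wall time, 4.5× on cluster time and 4.0× on total CPU.

```scala
object SpeedupSketch {
  // Parses "mm:ss" or "h:mm:ss" into seconds.
  def seconds(clock: String): Int =
    clock.split(":").map(_.trim.toInt).foldLeft(0)((acc, part) => acc * 60 + part)

  // How many times longer the MR run took than the TEZ run.
  def speedup(mr: String, tez: String): Double =
    seconds(mr).toDouble / seconds(tez)

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      ("wall", "14:22", "4:03"),
      ("cluster time", "12:49", "2:50"),
      ("total CPU", "5:43:26", "1:25:35")
    )
    for ((label, mr, tez) <- rows)
      println(f"$label%-12s MR/TEZ = ${speedup(mr, tez)}%.2f")
  }
}
```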
Performance: output of tez-tool « yarn-swimlanes.sh »
• 1 « swimlane » per active container
• 1 colour per DAG Vertex (the black dots are actually the Vertex ID)
• Container occupation is pretty good while there is work to do
• (not demonstrated here) containers die when they are idle. This is good!
CONCLUSION
As a conclusion…
A lot of effort so far…
…but worth it!
Images: Nicholas Babaian // Flickr. Marathon du Médoc 2008
THANKS!
• For building that tech
• For helping out
• For your attention today