Upload
william-yetman
View
332
Download
0
Embed Size (px)
DESCRIPTION
How Ancestry supports our DNA results processing with Hadoop, HBase, and Azkaban. This shows how we evolved and learned while supporting the business
Citation preview
Scaling AncestryDNA with the Hadoop EcosystemJune 5, 2014
What Ancestry uses from the Hadoop ecosystem●Hadoop, HDFS, and MapReduce
●HBase●Columnar, NoSQL data store, unlimited rows and columns
●Azkaban●Workflow
2
What will this presentation cover?●Describe the problem
●Discoveries with DNA
●Three key steps in the pipeline process
●Measure everything principle
●Three steps with Hadoop●Hadoop as a job scheduler for the ethnicity step
●Scaling matching step
●MapReduce implementation of phasing
●Performance
●What comes next?
3
Discoveries with DNA●Autosomal DNA test that analyzes 700,000 SNPs
●Over 400,000 DNA samples in our database
●Identified 30 million relationships that connect the genotyped members through shared ancestors
4Jan-12 Apr-12 Jul-12 Oct-12 Jan-13 Apr-13 Jul-13 Oct-13 Jan-14 Apr-14 -
50,000
100,000
150,000
200,000
250,000
300,000
350,000
400,000
450,000
-
5,000,000
10,000,000
15,000,000
20,000,000
25,000,000
30,000,000 DNA DB Size Cousin Matches
Dat
abas
e Si
ze (#
of s
ampl
es)
Cous
in m
atch
es (4
th o
r clo
ser)
Three key steps in the pipeline
●What is a pipeline?
1. Ethnicity (AdMixture)
2. Matching (GERMLINE and Jermline)
3. Phasing (Beagle and Underdog)
●First pipeline executed on a single, beefy box.● Only option is to scale vertically
5
InitResults
Processing Finalize
“Beefy Box”
24 cores, 256G mem
Measure everything principle
●Start time, end time, duration in seconds, and sample count for every step in the pipeline. Also the full end-to-end processing time.
●Put the data in pivot tables and graphed each step
●Normalize the data (sample size was changing)
●Use the data collected to predict future performance
6
Challenges and pain pointsPerformance degrades when DNA pool grows●Static (by batch size)
●Linear (by DNA pool size)
●Quadratic (matching related steps) – time bomb
7
Ethnicity step on HadoopUsing Hadoop as a job scheduler to scale AdMixture
8
First step with Hadoop●What was in place?
●Smart engineers with no Hadoop experience
●Pipeline running on a single computer that would not scale
●New business that needed a scalable solution to grow
First step using Hadoop●Run AdMixture step in parallel on Hadoop
●Self contained program with set inputs and outputs
●Simple MapReduce implementation
●Experience running jobs on Hadoop
●Freed up CPU and memory on the single computer for the other steps 9
What did we do? (Don’t cringe…)
1. Mapper -> Key: User ID, Ethnicity Result
2. Reducer -> Key: User ID, Array [Ethnicity Result]10
DNA Hadoop Cluster(40 nodes)
Pipeline Beefy Box(Process Step Control)
For a batch of 1000Submit 40 jobs with 25 samples per job
AdMixture Job #1
AdMixture Job #2
AdMixture Job #40
...
Mapper that forks AdMixture
sequentually for each sample Reducer #1
Reducer #2
Reducer #3
...
Results go to a simple reducer that merges them into a
single results file
(Shuffle)
Performance results●Went from processing 500 samples in 20 hours
to processing 1,000 samples in 2 ½ hours●Reduced Beagle phasing step by 4 hours
11
010,00020,00030,00040,00050,00060,00070,00080,00090,000
100,000
AdMixture Time (sec)Sum of Run Size Admixture Time
Provided valuable experience and bought us time
H1 ReleaseAdMixture on
Hadoop
GERMLINE to JermlineMoving the matching step to MapReduce and HBase
12
Introducing … GERMLINE!●GERMLINE is an algorithm that finds hidden relationships within a pool of DNA
●Also refers to the reference implementation of that algorithm written in C++. You can find it here:
So what’s the problem?●GERMLINE (the implementation) was not meant to be used in an industrial setting
● Stateless, single threaded, prone to swapping (heavy memory usage)
● GERMLINE performs poorly on large data sets
●Our metrics predicted exactly where the process would slow to a crawl
●Put simply: GERMLINE couldn't scale13
http://www1.cs.columbia.edu/~gusev/germline/
GERMLINE run times (in hours)
14
2500
5000
7500
10000
12500
15000
17500
20000
22500
25000
27500
30000
32500
35000
37500
40000
42500
45000
47500
50000
52500
55000
57500
60000
0
5
10
15
20
25H
ours
Samples
Projected GERMLINE run times (in hours)
15
Hou
rs
Samples
2500
7500
12500
17500
22500
27500
32500
37500
42500
47500
52500
57500
62500
67500
72500
77500
82500
87500
92500
97500
10 10 11 11 12 12 13
0
100
200
300
400
500
600
700
GERMLINE run times
Projected GERMLINE run times
700 hours = 29+ days
DNA matching walkthroughSimplified example of showing how the code works
16
DNA matching : How it works
17
Cersei : ACTGACCTAGTTGACJoffrey : TTAAGCCTAGTTGAC
The Input
Cersei Baratheon• Former queen of
Westeros
• Machiavellian manipulator
• Mostly evil, but occasionally sympathetic
Joffrey Baratheon• Pretty much the human
embodiment of evil
• Needlessly cruel
• Kinda looks like Justin Bieber
DNA matching : How it works
18
0 1 2
Cersei : ACTGA CCTAG TTGACJoffrey : TTAAG CCTAG TTGAC
Separate into words
DNA matching : How it works
19
0 1 2
Cersei : ACTGA CCTAG TTGACJoffrey : TTAAG CCTAG TTGAC
ACTGA_0 : CerseiTTAAG_0 : JoffreyCCTAG_1 : Cersei, JoffreyTTGAC_2 : Cersei, Joffrey
Build the hash table
DNA matching : How it works
20
0 1 2
Cersei : ACTGA CCTAG TTGACJoffrey : TTAAG CCTAG TTGAC
ACTGA_0 : CerseiTTAAG_0 : JoffreyCCTAG_1 : Cersei, JoffreyTTGAC_2 : Cersei, Joffrey
Iterate through genome and find matches
Cersei and Joffrey match from position 1 to position 2
21
Does that mean they’re related?
...maybe
?
IBD to relationship estimation●We use the total length of all shared
segments to estimate the relationship between two genetic relatives
●This is basically a classification problem
22
But wait...what about Jaime?
23
Jaime : TTAAGCCTAGGGGCG
Jaime Lannister• Kind of a has-been
• Killed the Mad King
• Has the hots for his sister, Cersei
The way
24
Step one: Update the hash tableCersei Joffrey
2_ACTGA_0 1
2_TTAAG_0 1
2_CCTAG_1 1 1
2_TTGAC_2 1 1
Already stored in HBase
Jaime : TTAAG CCTAG GGGCG New sample to add
Key : [CHROMOSOME]_[WORD]_[POSITION]Qualifier : [USER ID]Cell value : A byte set to 1, denoting that the user has that word at that position on that chromosome
The way
25
Step two: Find matches, update the results table
Already stored in HBase
Jaime and Joffrey match from position 0 to position 1Jaime and Cersei match at position 1 New matches to add
Key : [CHROMOSOME]_[USER ID]Qualifier : [CHROMOSOME]_[USER ID]Cell value : A list of ranges where the two users match on a chromosome
2_Cersei 2_Joffrey
2_Cersei { (1, 2), ...}
2_Joffrey { (1, 2), ...}
The way
26
Results Table2_Cersei 2_Joffrey 2_Jaime
2_Cersei { (1, 2), ...} { (1), ...}
2_Joffrey { (1, 2), ...} { (0,1), ...}
2_Jaime { (1), ...} { (0,1), ...}
Hash TableCersei Joffrey Jaime
2_ACTGA_0 1
2_TTAAG_0 1 1
2_CCTAG_1 1 1 1
2_TTGAC_2 1 1
2_GGGCG_2 1
27
But wait...what about Daenerys, Tyrion, Arya, and Jon Snow?
Run them in parallel with Hadoop!
28
Parallelism with Hadoop●Batches are usually about a thousand people
●Each mapper takes a single chromosome for a single person
●MapReduce jobs:● Job #1: Match words
- Updates the hash table
● Job #2: Match segments
- Identifies areas where the samples match
29
Matching steps on Hadoop
1. Hash Table Mapper -> Breaks input into words and fills the hash table (HBase Table #1)
2. Hash Table Reducer -> default reducer (does nothing)
3. Results Mapper -> For each new user, read hash table, fill in the results table (HBase Table #2)
4. Results Reducer -> Key: Object(User ID #1, User ID #2), Array[Object(chrom + pos, Matching DNA Segment)] 30
DNA Hadoop Cluster(40 nodes)
Pipeline Beefy Box(Process Step Control)
Hash TableMapper #1
Hash TableMapper #2
Hash Table Mapper #N
...
Hash Table Reducer #1
Hash TableReducer #2
Hash Table Reducer #N
...
Results Mapper #1
ResultsMapper #2
ResultsMapper #N
...
ResultsReducer #1
ResultsReducer #2
ResultsReducer #N
...1 2 3 4
Input:User ID, Phased DNA
Run times for matching with
31
Hou
rs
Samples
A 1700% performance improvement over GERMLINE!
2500
7500
12500
17500
22500
27500
32500
37500
42500
47500
52500
57500
62500
67500
72500
77500
82500
87500
92500
97500
102500
107500
112500
117500
0
5
10
15
20
25
Run times for matching (in hours)
32
Hou
rs
Samples
0
20
40
60
80
100
120
140
160
180
GERMLINE run times
Projected GERMLINE run times
Jermline run times
performance ●Science team is sure the Jermline algorithm is linear
●Improving the accuracy●Found a bug in original C++ reference code
●Balancing false positives and false negatives
●Binary version of Jermline●Use less memory and improve speed
●Paper submitted describing the implementation●Releasing as an Open Source project soon
33
Beagle to UnderdogMoving phasing step from a single process to MapReduce
34
Phasing goes to the dogs●Beagle
●Open source, freely available program
●Multi-threaded process that runs on one computer
●More accurate with a large sample set
●Underdog●Does the same statistical calculations with a larger reference set, which increases accuracy
●Carefully split into a MapReduce implementation that allows parallel processing
●Collaboration between the DNA Science and Pipeline Developer Teams
35
What did we do?
1. Window Mapper -> Key: Window ID, Object(User ID, DNA)
2. Window Reducer -> Key: Window ID, Array[Object(User ID, DNA)]
3. Phase Mapper (loads window data) -> Key: User ID, Object(Window ID, Phased DNA)
4. Phase Reducer -> Key: User ID, Array[Object(Window ID, Phased DNA)] 36
DNA Hadoop Cluster(40 nodes)
Pipeline Beefy Box(Process Step Control)
Input is per User:
ATGCATCGTACGGACT...
Window Mapper #1
Window Mapper #2
Window Mapper #N
...
Window Reducer #1
Window Reducer #2
Window Reducer #N
...
Phase Mapper #1
Phase Mapper #2
Phase Mapper #N
...
PhaseReducer #1
PhaseReducer #2
PhaseReducer #N
...1 2 3 4
Load Window Reference Data Once
Underdog performance●Went from 12 hours to process 1,000 samples
to under 25 minutes with a MapReduce implementation
37With improved accuracy!
Underdog replaces Beagle
0
10,000
20,000
30,000
40,000
50,000
60,000
70,000
80,000
Total Run Size Total Beagle-Underdog Duration
Performance and next stepsIncremental change
38
Pipeline steps and incremental change…● Incremental change over time
●Supporting the business in a “just in time” Agile way
39500 11724 24820 41836 61675 88618 1157581442301737192039182326732611992910173207013508543812500
50000
100000
150000
200000
250000
Beagle-Underdog PhasingPipeline FinalizeRelationship ProcessingGermline-Jermline Results ProcessingGermline-Jermline ProcessingBeagle Post PhasingAdmixturePlink PrepPipeline Initialization
Jermline replaces Germline
Ethnicity V2 Release
Underdog Replaces Beagle
AdMixture on Hadoop
…while the business continues to grow rapidly
40Jan-12 Apr-12 Jul-12 Oct-12 Jan-13 Apr-13 Jul-13 Oct-13 Jan-14 Apr-14 -
50,000
100,000
150,000
200,000
250,000
300,000
350,000
400,000
450,000
DNA Database Size#
of p
roce
ssed
sam
ples
)
What’s next? Building different pipelines●Azkaban
●Allows us to easily tie together steps on Hadoop
●Drop different steps in/out and create different pipelines
●Significant improvement over a hand coded pipeline
●Cloud●New algorithm changes will force a complete re-run of the entire DNA pool
●Best example: New matching or ethnicity algorithm will force us to reprocess 400K+ samples
●Solution: Use the cloud for this processing while the current pipeline keeps chugging along41
What’s next? Other areas for improvement●Admixture as a MapReduce implementation
●Last major algorithm that needs to be addressed
●Expect to get performance improvements similar to Underdog
●Matching growth will cause problems●Matches per run increasing
●Change the handoff
42Keep measuring everything and adjust84337 1039201253171471721696101929072155192375382591642821173126173357613589243822950
1,000,000
2,000,000
3,000,000
4,000,000
5,000,000
6,000,000
7,000,000
8,000,000
9,000,000
Jermline Processing Duration Batch Run Size Total Matches Per Run
Questions?
Ancestry is hiring for the DNA Pipeline Team!
Tech Roots Blog: http://blogs.ancestry.com/techroots
Special thanks to the DNA Science and Pipeline Development Teams at Ancestry