Our Hadoop Journey
Chris Curtin, Head of Technical Research
Atlanta Hadoop Users Group, July 2013
About Me
• 20+ years in technology
• Head of Technical Research at Silverpop (12+ years at Silverpop)
• Built a SaaS platform before the term 'SaaS' was being used
• Prior to Silverpop: real-time control systems, factory automation, and warehouse management
• Always looking for technologies and algorithms to help with our challenges
• Car nut
Silverpop Open Positions
• Senior Software Engineer (Java, Oracle, Spring, Hibernate, MongoDB)
• Senior Software Engineer – MIS (.NET stack)
• Software Engineer
• Software Engineer – Integration Services (PHP, MySQL)
• Delivery Manager – Engineering
• Technical Lead – Engineering
• Technical Project Manager – Integration Services
• http://www.silverpop.com – go to Careers under About
About Silverpop
• Founded in late 1999; Atlanta based, with offices in London, Germany, and Irvine, California
• Digital marketing technology provider, unifying marketing automation, email, mobile, and social
• Track billions of contact events, execute on those events, send billions of emails
• Clients are in marketing departments
Challenge from the business
• Engage allows clients to define their own database schema for contact records
• No two clients' schemas are the same
• Schemas often change weekly/monthly
• A contact's records are 'point in time'
• Users want to report on the value of a contact record when the activity occurred
Example
• How well did my marketing campaign to my loyalty clients do last quarter?
• Easy question, hard answer
  – A contact's 'level' changes throughout the year (Silver to Gold)
  – Some piece of data wasn't known at the time of the email send, but is now
  – What do you want to pivot on? Level? Age? Source code? Time in database?
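The 'point in time' wrinkle is the crux of the example: the pivot must use the level a contact held when the mailing went out, not the level they hold today. A minimal plain-Java sketch of that lookup, with dates and values invented for illustration:

```java
import java.time.LocalDate;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch (not Silverpop's actual code): resolve a contact
// attribute as of a given send date, rather than its current value.
public class PointInTimeLevel {

    // History of 'level' values, keyed by the date each value took effect.
    static final NavigableMap<LocalDate, String> LEVEL_HISTORY = new TreeMap<>(Map.of(
            LocalDate.of(2013, 1, 1), "Silver",
            LocalDate.of(2013, 5, 15), "Gold"));

    // The pivot value is the attribute in effect when the mailing went out,
    // not the attribute the contact holds today.
    static String levelAt(LocalDate sendDate) {
        Map.Entry<LocalDate, String> e = LEVEL_HISTORY.floorEntry(sendDate);
        return e == null ? "(unknown)" : e.getValue();
    }

    public static void main(String[] args) {
        System.out.println(levelAt(LocalDate.of(2013, 3, 1)));  // level as of March
        System.out.println(levelAt(LocalDate.of(2013, 6, 1)));  // level after the upgrade
    }
}
```

A last-quarter report then pivots each contact on `levelAt(sendDate)` rather than on the current record.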
Technical solutions
• Traditional data warehouse
• Queries against OLTP or OLAP stores
• Customer-specific databases
Hadoop
• Started working on an R&D project in 2008
• First raw map/reduce
• Some Pig
• Some Hive/HBase
• (and several start-ups long since dead …)
• Flexible schema caused problems with most of them
First 'real' application
• Pivot reports against flexible schemas
• Per contact, not aggregate
• Let the user select any communication(s) and see which user attributes are available to use as pivots
• Pivot data is as of the time of the communication, not current values (slow-moving data)
• Could be against a few thousand events, or billions
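Because the schema is client-defined, the report code cannot hard-code columns; treating each event as a name-to-value map lets the pivot field be chosen at run time. An illustrative sketch (field names invented, not Silverpop's actual code):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: with a per-client schema, each event is just a map
// of attribute name -> value, and the pivot column is picked at run time
// rather than compiled in.
public class FlexiblePivot {

    // Count events per distinct value of whichever attribute the user picked.
    static Map<String, Integer> pivot(List<Map<String, String>> events, String pivotField) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> event : events) {
            String key = event.getOrDefault(pivotField, "(missing)");
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, String>> events = List.of(
                Map.of("level", "Gold", "source_code", "WEB"),
                Map.of("level", "Silver", "source_code", "WEB"),
                Map.of("level", "Gold"));
        // Same events, two different pivots, no schema change in code.
        System.out.println(pivot(events, "level"));
        System.out.println(pivot(events, "source_code"));
    }
}
```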
First 'real' challenges
• Flexible schema meant HBase, Hive, etc. wouldn't work easily
• Flexible schema meant Pig scripts were difficult to maintain (even when generated on the fly)
• Need to coordinate multiple steps OUTSIDE of the Hadoop process
• UI
• Resource allocation and control
Cascading
• Answered a number of problems
• Allowed integration with other platforms, even between M/R jobs
  – MySQL to find the list of supported columns
  – HDFS to find the actual files on disk
  – JMS for job sourcing/status updates (not implemented)
Cascading Dynamic Schema Solution
• Allows the definition of the schema at run time
• Allows the definition of steps at run time
  – One report may have 10 mailings, another 10,000
  – 10,000 mailings can't be run in parallel, so we programmatically create temporary results
Sample Cascading Code
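The code from this slide is not preserved in this transcript. As a rough stand-in (plain Java, not the actual Cascading API), the following sketches the idea the surrounding slides describe: both the field list and the chain of processing steps are assembled at run time from metadata, instead of being hard-coded per client:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

// Plain-Java stand-in for the idea behind the Cascading solution (this is
// NOT the Cascading API): the pipeline of steps is built at run time, so
// one report can chain 10 steps and another 10,000 without code changes.
public class RuntimePipeline {

    // A "step" is a transformation over a record; real Cascading pipes
    // would run the equivalent stages as map/reduce jobs.
    static List<Map<String, String>> run(List<Map<String, String>> records,
                                         List<UnaryOperator<Map<String, String>>> steps) {
        List<Map<String, String>> out = new ArrayList<>(records);
        for (UnaryOperator<Map<String, String>> step : steps) {
            out.replaceAll(step);
        }
        return out;
    }

    public static void main(String[] args) {
        // Field list discovered at run time (e.g. from the per-client schema in MySQL).
        List<String> fields = List.of("contact_id", "level");

        // Step list also built at run time: here, keep only the discovered fields.
        List<UnaryOperator<Map<String, String>>> steps = new ArrayList<>();
        steps.add(r -> {
            Map<String, String> kept = new TreeMap<>(r);
            kept.keySet().retainAll(fields);
            return kept;
        });

        System.out.println(run(List.of(Map.of(
                "contact_id", "1", "level", "Gold", "internal_flag", "x")), steps));
    }
}
```

In the real system, Cascading's flow planner plays the role of `run`, turning the dynamically assembled pipe graph into Hadoop jobs.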
Client Response
• Either got it immediately or didn't see the need for something this flexible
• Found a reason to talk to others in the organization to find other pivot fields
• Most common use case: behaviors based on source code
• Turned out to be a weekly/monthly report, not a day-to-day tool
• Some used it ad hoc, but mostly to build requirements for their BI teams
Profiling Application
• Retention is a big theme in marketing
• Looking at a single mailing, ad buy, etc. showed aggregates about that slice of time, but those are misleading:
  – Is the 20% who opened this email the same 20% as last week?
  – For people in my database for 6 months, how often do they interact with my marketing?
  – What is a typical interaction rate for my database?
  – How many times on average does a contact interact with me in a month? Who is outside of that rate?
• Instead of looking across a communication, we now needed to look at each contact
New technical challenges
• The previous report could be broken into specific steps to reduce the volume of events before the 'heavy' math was done
• The new report needs to look at all events together
• Quickly overwhelmed the scheduler
Hadoop Challenges
• No schema – external store of mappings
• No appending in HDFS – a daily integration could be 10MM rows for a communication, or 5
• 'Lots of small files' – thousands of clients with thousands of communications means millions of files
• ETL from Oracle meant concatenating files weekly to keep the count down
• Single point of failure (NameNode) took a long time to recover
• Non-batch processes: how to schedule jobs on demand?
• Hadoop job history – memory vs. concurrent-job tradeoffs
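The 'no schema' bullet means raw rows in HDFS are just delimited text; a mapping stored outside the cluster (the Cascading slide mentions MySQL for the column list) names the columns per client. An illustrative sketch with invented names:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch (names invented): decode a raw delimited HDFS row
// into named fields using a column-order mapping stored externally.
public class ExternalSchemaMapping {

    static Map<String, String> decode(String rawRow, List<String> columnOrder) {
        String[] parts = rawRow.split(",", -1);
        Map<String, String> record = new TreeMap<>();
        for (int i = 0; i < columnOrder.size() && i < parts.length; i++) {
            record.put(columnOrder.get(i), parts[i]);
        }
        return record;
    }

    public static void main(String[] args) {
        // Per-client column order, fetched at job-setup time, not stored in HDFS.
        List<String> clientColumns = List.of("contact_id", "level", "source_code");
        System.out.println(decode("42,Gold,WEB", clientColumns));
    }
}
```

When a client's schema changes, only the external mapping changes; the files already in HDFS stay as they are.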
MapR
• Eventually settled on MapR M3
  – The large number of files was the main driver
  – NFS mount is a nice feature
  – Cascading works
• Not without issues
  – Found several bugs around Volumes in HDFS and log retention that we had to work around (later fixed)
  – Can't copy between volumes using HDFS commands
  – More complicated for operations to manage (had a CLDB failure that took a day to recover, mostly us trying to figure out what to do)
Misc. Technical Information
• Fair Scheduler
  – Our scheduling logic knows how many queues there are and controls how many jobs can be submitted at the same time
• MapR ExpressLane is useful for small jobs
  – Our scheduler knows when a job is small, so it lets MapR take it
• MapR's NFS mount is great
  – Write directly to it from Java apps instead of using the HDFS API
  – Concatenating daily files is a simple Java app now
  – (Still don't append to files in HDFS, but could)
• Nagios for monitoring
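The "simple Java app" for concatenation works because MapR exposes the cluster over NFS, so ordinary `java.nio` file calls against the mount point suffice. A sketch with invented paths (the temp directory stands in for the NFS mount):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch of the concatenation idea: with the cluster mounted over NFS,
// daily files can be merged with plain java.nio, no HDFS API required.
// Paths here are invented; a temp directory stands in for the mount.
public class ConcatDailyFiles {

    static void concat(List<Path> dailyFiles, Path weeklyFile) throws IOException {
        try (OutputStream out = Files.newOutputStream(weeklyFile)) {
            for (Path daily : dailyFiles) {
                Files.copy(daily, out);  // append each day's file in order
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("mapr-nfs-demo");
        Path mon = Files.writeString(dir.resolve("mon.csv"), "a,1\n");
        Path tue = Files.writeString(dir.resolve("tue.csv"), "b,2\n");
        Path week = dir.resolve("week.csv");
        concat(List.of(mon, tue), week);
        System.out.print(Files.readString(week));
    }
}
```

This is the slide's point about keeping the file count down: many small daily files become one weekly file with a few lines of Java.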
Cluster details
• 5 nodes
  – 1 admin, 4 workers
  – 8-core Xeon, 16 GB RAM
  – 5 TB usable per box assigned to MapR
• Had 9 nodes, reduced to 5
  – The cluster was mostly idle due to users' submission patterns (heavy on Tuesdays and the 7th day of the month)
  – The delay to end users was minimal when we reduced the number of machines
Closing the loop
• The next logical step was for clients to ask to target the contacts
• The volume of data didn't make that easily possible
• Integrating from Hadoop back to Oracle became an ETL project
  – Export from Oracle was a single dump; import would be a job per client
• Automation of reports (and emailing results) was the second-most-requested feature
• Lots of support required to know what to do with the results
  – No easy 'go do this when you see this in the report'
Current Status
• Dozens of monthly users
• Some optimizations to toss data early in the import step for clients not using the tool
• Packaging and pricing are vexing the product marketing team
• Runs lights-out unless the ETL process breaks
Business Challenges
• Lots of cool ideas we came up with; we even implemented a few
• But end users didn't know what to do with the data
• 'SaaS-ifying' is proving difficult
  – Multi-tenancy resource management is not available
  – How to price? The end report may have 20 rows but processed 1BN rows to get there
• If I hear 'do you do big data' one more time …
Things we are watching
• Real-time tools on top of Hadoop (Drill, Impala)
• Storm inside of YARN
• Storm in general
• Integration of Kafka, Storm, Drill/Impala, Hadoop, and MongoDB
Information
• Slides: http://www.slideshare.net/chriscurtin
• Me: [email protected], @ChrisCurtin on Twitter
Silverpop Open Positions
• Senior Software Engineer (Java, Oracle, Spring, Hibernate, MongoDB)
• Senior Software Engineer – MIS (.NET stack)
• Software Engineer
• Software Engineer – Integration Services (PHP, MySQL)
• Delivery Manager – Engineering
• Technical Lead – Engineering
• Technical Project Manager – Integration Services
• http://www.silverpop.com/marketing-company/careers/open-positions.html