Description
This talk was prepared for the November 2013 DataPhilly Meetup: Data in Practice (http://www.meetup.com/DataPhilly/events/149515412/).

Map Reduce: Beyond Word Count, by Jeff Patti

Have you ever wondered what map reduce can be used for beyond the word count example you see in all the introductory articles about map reduce? Using Python and mrjob, this talk will cover a few simple map reduce algorithms that in part power Monetate's information pipeline.

Bio: Jeff Patti is a backend engineer at Monetate with a passion for algorithms, big data, and long walks on the beach. Prior to working at Monetate he performed software R&D for Lockheed Martin, where he worked on projects ranging from social network analysis to robotics.
MapReduce: Beyond Word Count
Jeff Patti
https://github.com/jepatti/mrjob_recipes
What is MapReduce?
"MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster." - Wikipedia
● Map: given a line of a file, yield key: value pairs
● Reduce: given a key and all values with that key from the prior map phase, yield key: value pairs
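To make that contract concrete, here is a minimal pure-Python simulation of the model (not mrjob; run_job is an illustrative name): the framework's job between the two phases is to group every value emitted by map under its key before handing each group to reduce.

from collections import defaultdict

def run_job(lines, mapper, reducer):
    # Map phase: gather the (key, value) pairs emitted for every input line
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # The "shuffle" is the grouping above; reduce sees each key with all its values
    for key, values in groups.items():
        for result in reducer(key, values):
            yield result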
Word Count
Problem: count frequencies of words in documents
Word Count Using mrjob

def mapper(self, key, line):
    for word in line.split():
        yield word, 1

def reducer(self, word, occurrences):
    yield word, sum(occurrences)
Sample Output

"ligula" 4
"ligula." 2
"lorem" 5
"lorem." 4
"luctus" 3
"magna" 5
"magna," 3
"magnis" 1
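For completeness, the two methods above drop into an MRJob subclass to make a runnable job; a minimal sketch, assuming mrjob is installed (the class and file names here are illustrative):

from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, key, line):
        # Emit one (word, 1) pair per word occurrence
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        # Sum the 1s emitted for this word across all input lines
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCount.run()

Saved as word_count.py, it runs locally with: python word_count.py input.txt, printing key/value pairs like the sample above.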
Monetate Background
● Core products are merchandising, personalization, testing, etc.
● A/B & multivariate testing to determine impact of experiments
● Involved with >20% of ecommerce spend each holiday season for the past 2 years running
Monetate Stack
● Distributed across multiple availability zones and regions for redundancy, scaling, and lower round-trip times
● Real-time decision engine using MySQL
● Nightly processing of each day's data via Hadoop using mrjob, a Python library for writing MapReduce jobs
Beyond Word Count
● Activity stream sessionization
● Product recommendations
● User behavior statistics
Activity Stream Sessionization
Goal: collate user activity, splitting it into separate sessions if the user is inactive for more than 5 minutes
Input format: timestamp, user_id
Collate user activity

def mapper(self, key, line):
    timestamp, user_id = line.split()
    yield user_id, timestamp

def reducer(self, uid, timestamps):
    yield uid, sorted(timestamps)
Sample Output

"998" ["1384389407", "1384389417", "1384389422", "1384389425", "1384390407", "1384390417", "1384391416", "1384392410", "1384392416", "1384395420", "1384396405"]
"999" ["1384388414", "1384388425", "1384389419", "1384389420", "1384390420", "1384391415", "1384391418", "1384393413", "1384393425", "1384394426", "1384395416", "1384396415", "1384396422"]
Segment into Sessions

MAX_SESSION_INACTIVITY = 60 * 5
...

def reducer(self, uid, timestamps):
    # Timestamps arrive as strings from the collate mapper; convert before comparing
    timestamps = sorted(int(t) for t in timestamps)
    start_index = 0
    for index, timestamp in enumerate(timestamps):
        if index > 0:
            # A gap of more than 5 minutes closes the current session
            if timestamp - timestamps[index - 1] > MAX_SESSION_INACTIVITY:
                yield uid, timestamps[start_index:index]
                start_index = index
    yield uid, timestamps[start_index:]
Sample Output

"999" [1384388414, 1384388425]
"999" [1384389419, 1384389420]
"999" [1384390420]
"999" [1384391415, 1384391418]
"999" [1384393413, 1384393425]
"999" [1384394426]
"999" [1384395416]
"999" [1384396415, 1384396422]
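The splitting logic is easy to sanity-check outside Hadoop; a small sketch using the same 5-minute threshold (split_sessions is an illustrative helper, not part of the talk's code):

MAX_SESSION_INACTIVITY = 60 * 5

def split_sessions(timestamps):
    # Mirrors the reducer: a gap longer than 5 minutes starts a new session
    timestamps = sorted(timestamps)
    sessions, start_index = [], 0
    for index in range(1, len(timestamps)):
        if timestamps[index] - timestamps[index - 1] > MAX_SESSION_INACTIVITY:
            sessions.append(timestamps[start_index:index])
            start_index = index
    sessions.append(timestamps[start_index:])
    return sessions

# The 994-second gap exceeds the 300-second cutoff, so two sessions result
print(split_sessions([1384388414, 1384388425, 1384389419]))
# [[1384388414, 1384388425], [1384389419]]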
Product Recommendations
Goal: for each product a client sells, generate a 'people who bought this also bought that' recommendation
Input: product_id_1, product_id_2, ...
Coincident Purchase Frequency

from itertools import permutations

def mapper(self, key, line):
    purchases = set(line.split(','))
    for p1, p2 in permutations(purchases, 2):
        yield (p1, p2), 1

def reducer(self, pair, occurrences):
    p1, p2 = pair
    yield p1, (p2, sum(occurrences))
Sample output

"8" ["5", 11]
"8" ["6", 19]
"8" ["7", 14]
"8" ["9", 11]
"9" ["1", 20]
"9" ["10", 22]
"9" ["11", 21]
"9" ["12", 13]
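Note that permutations emits every ordered pair, so each co-purchase is counted once under each product as the key; a quick illustration using the standard library:

from itertools import permutations

# Both orders appear, e.g. ('5', '8') and ('8', '5'); set iteration order may vary
print(list(permutations({'5', '8'}, 2)))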
Top Recommendations

def reducer(self, purchase_pair, occurrences):
    p1, p2 = purchase_pair
    yield p1, (sum(occurrences), p2)

def reducer_find_best_recos(self, p1, p2_occurrences):
    # (count, product) tuples sort by count first, so the most frequent lead
    top_products = sorted(p2_occurrences, reverse=True)[:5]
    top_products = [p2 for occurrences, p2 in top_products]
    yield p1, top_products

def steps(self):
    return [self.mr(mapper=self.mapper, reducer=self.reducer),
            self.mr(reducer=self.reducer_find_best_recos)]
Sample Output

"7" ["15", "18", "17", "16", "3"]
"8" ["14", "15", "20", "6", "3"]
"9" ["15", "17", "19", "6", "3"]
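Assembled into a single runnable job, the two steps chain via steps(); a sketch using mrjob's self.mr() step syntax from the era of the talk (the class name is illustrative):

from itertools import permutations
from mrjob.job import MRJob

class MRTopRecommendations(MRJob):
    def mapper(self, key, line):
        # Deduplicate the basket, then emit every ordered product pair
        purchases = set(line.split(','))
        for p1, p2 in permutations(purchases, 2):
            yield (p1, p2), 1

    def reducer(self, purchase_pair, occurrences):
        # Re-key by the first product so step 2 sees all of its co-purchases
        p1, p2 = purchase_pair
        yield p1, (sum(occurrences), p2)

    def reducer_find_best_recos(self, p1, p2_occurrences):
        top_products = sorted(p2_occurrences, reverse=True)[:5]
        yield p1, [p2 for occurrences, p2 in top_products]

    def steps(self):
        return [self.mr(mapper=self.mapper, reducer=self.reducer),
                self.mr(reducer=self.reducer_find_best_recos)]

if __name__ == '__main__':
    MRTopRecommendations.run()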
Top Recommendations: Multi-Account

def mapper(self, key, line):
    account_id, purchases = line.split()
    purchases = set(purchases.split(','))
    for p1, p2 in permutations(purchases, 2):
        yield (account_id, p1, p2), 1

def reducer(self, purchase_pair, occurrences):
    account_id, p1, p2 = purchase_pair
    yield (account_id, p1), (sum(occurrences), p2)
The 2nd-step reducer is unchanged.
Sample Output

["9", "20"] ["8", "14", "13", "10", "1"]
["9", "3"] ["2", "4", "16", "11", "17"]
["9", "4"] ["3", "18", "11", "16", "15"]
["9", "5"] ["2", "1", "7", "18", "17"]
["9", "6"] ["12", "3", "2", "17", "16"]
["9", "7"] ["18", "5", "17", "1", "9"]
["9", "8"] ["20", "14", "13", "10", "4"]
["9", "9"] ["18", "7", "6", "5", "4"]
User Behavior Statistics
Goal: compute statistics about user behavior (conversion rate & time on site) by account and experiment in an efficient manner
Input: account_id, campaigns_viewed, user_id, purchased?, session_start_time, session_end_time
Statistics Primer
With the sample count, mean, and variance for each side of an experiment, we can compute all the statistics our analytics package displays.
Statistics Primer (cont.)
y = a session's metric value, e.g. time on site
● Sample count: the number of sessions that viewed the experiment
○ sum(y^0)
● Mean: the sum of the metric / the sample count
○ sum(y^1) / sum(y^0)
Statistics Primer (cont.)
● Variance: the mean of the squares minus the square of the mean
○ Variance = sum(y^2)/sum(y^0) - (sum(y^1)/sum(y^0))^2
For each side of an experiment we only need to generate sum(y^0), sum(y^1), and sum(y^2).
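A minimal sketch of how those power sums could be rolled up per account in mrjob, following the input format above (an illustration, not Monetate's actual statistic_rollup code; only the time-on-site metric is shown, and comma-delimited fields are assumed):

from mrjob.job import MRJob

class MRSessionStats(MRJob):
    def mapper(self, key, line):
        # account_id, campaigns_viewed, user_id, purchased?, start, end
        account_id, _, _, _, start, end = line.split(',')
        y = int(end) - int(start)  # metric: time on site
        # One session contributes (y^0, y^1, y^2) = (1, y, y*y)
        yield (account_id, 'average session length'), (1, y, y * y)

    def reducer(self, key, triples):
        # Componentwise sums give (sum(y^0), sum(y^1), sum(y^2))
        n = s1 = s2 = 0
        for y0, y1, y2 in triples:
            n, s1, s2 = n + y0, s1 + y1, s2 + y2
        yield key, (n, s1, s2)

if __name__ == '__main__':
    MRSessionStats.run()

Mean and variance then fall out of each triple: mean = s1/n and variance = s2/n - (s1/n)^2, matching the [count, sum, sum of squares] format in the samples below.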
Statistics by Account
statistic_rollup/statistic_summarize.py
Sample Output

["8", "average session length"] [99, 24463, 7968891]
["8", "conversion rate"] [99, 45, 45]
["9", "average session length"] [115, 29515, 10071591]
["9", "conversion rate"] [115, 55, 55]
Statistics by Experiment
statistic_rollup_by_experiment/statistic_summarize.py
Sample Output

["9", 0, "average session length"] [32, 8405, 3031009]
["9", 0, "conversion rate"] [32, 20, 20]
["9", 1, "average session length"] [23, 5405, 1770785]
["9", 1, "conversion rate"] [23, 14, 14]
["9", 2, "average session length"] [39, 9481, 2965651]
["9", 2, "conversion rate"] [39, 20, 20]
["9", 3, "average session length"] [25, 6276, 2151014]
["9", 3, "conversion rate"] [25, 13, 13]
["9", 4, "average session length"] [27, 5721, 1797715]
["9", 4, "conversion rate"] [27, 16, 16]
Questions?