End-to-end Hadoop for search suggestions: Guest lecture at RuG



DESCRIPTION

Presented as a guest lecture at the Rijksuniversiteit Groningen as part of the web and cloud computing master course. I presented an architecture for, and a working implementation of, Hadoop-based typeahead-style search suggestions. There is a companion github repo with the code and config at: https://github.com/friso/rug (there's no documentation, though).


1. End-to-end Hadoop for search suggestions. Guest lecture at RuG

2. WEB AND CLOUD COMPUTING

3. http://dilbert.com/strips/comic/2011-01-07/

4. CLOUD COMPUTING FOR RENT (image: flickr user jurvetson)

5. 30+ million users in two years. With 3 engineers!

6. Also,

7. A quick survey...

8. This talk: build working search suggestions (like on bol.com) that update automatically based on user behavior, using a lot of open source (elasticsearch, Hadoop, Redis, Bootstrap, Flume, node.js, jquery), with a code example, not just slides, and a live demo of the working thing in action.

9. http://twitter.github.com/bootstrap/

10. How do we populate this array?

11. [Architecture diagram] The USER issues an HTTP request: /search?q=trisodium+phosphate. The web servers (possibly many of these) write to a log file: 2012-07-01T06:00:02.500Z /search?q=trisodium%20phosphate. The logs (possibly terabytes of these) are transported to a compute cluster, which transforms them into search suggestions. The suggestions are pushed to a DB server, which serves suggestions for web requests; this thing must be really, really fast.

12. Search is important! 98% of landing page visitors go directly to the search box.

13. http://www.elasticsearch.org/

14. http://nodejs.org/

15. Node.js: server-side JavaScript. Built on top of the V8 JavaScript engine. Completely asynchronous / non-blocking / event-driven.

16. Non-blocking means function calls never block.

17. Distributed filesystem. Consists of a single master node and multiple (many) data nodes. Files are split up into blocks (64MB by default). Blocks are spread across data nodes in the cluster. Each block is replicated multiple times to different data nodes in the cluster (typically 3 times). The master node keeps track of which blocks belong to a file.

18. Accessible through a Java API. A FUSE (filesystem in user space) driver is available to mount it as a regular FS. C API available. Basic command line tools in the Hadoop distribution. Web interface.

19. File creation, directory listing and other metadata actions go through the master node (e.g. ls, du, fsck, create file). Data goes directly to and from data nodes (read, write, append). Local read path optimization: clients located on the same machine as a data node will always access the local replica when possible.

20. [HDFS architecture diagram] The Name Node holds the namespace (/some/file, /foo/bar); an HDFS client creates files and resolves block locations through it. Data is written to and read from the Data Nodes, each spreading blocks across its local disks; Data Nodes replicate blocks among each other. A node-local HDFS client reads data from the local replica.

21. NameNode daemon: filesystem master node. Keeps track of directories, files and block locations. Assigns blocks to data nodes. Keeps track of live nodes (through heartbeats). Initiates re-replication in case of data node loss. Block metadata is held in memory; it will run out of memory when too many files exist.

22. DataNode daemon: filesystem worker node / block server. Uses an underlying regular FS for storage (e.g. ext4). Takes care of distribution of blocks across disks (don't use RAID; more disks means more IO throughput). Sends heartbeats to the NameNode. Reports its blocks to the NameNode (on startup). Does not know about the rest of the cluster (shared nothing).

23. HDFS trivia: HDFS is write once, read many, but has append support in newer versions. Has built-in compression at the block level. Does end-to-end checksumming on all data (on read and periodically). HDFS is best used for large, unstructured volumes of raw data in BIG files used for batch operations. Optimized for sequential reads, not random access.

24. Job input comes from files on HDFS. Other formats are possible; this requires a specialized InputFormat implementation. Built-in support for text files (convenient for logs, csv, etc.). Files must be splittable for parallelization to work; not all compression formats have this property (e.g. gzip).
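To make the Java API access mentioned on slide 18 concrete, here is a minimal sketch (an editor's illustration, not part of the talk or the companion repo) of reading a text file through the HDFS Java API. The class name and the log file path are invented for illustration.

    // Minimal sketch: read a text file from HDFS via the Java API (slide 18).
    // The path below is hypothetical; configuration is picked up from the
    // usual core-site.xml / hdfs-site.xml on the classpath.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);                     // metadata calls go through the NameNode
            Path logFile = new Path("/logs/search/2012-07-01.log");   // hypothetical log file

            // The data itself streams directly from the DataNodes holding the blocks.
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(logFile)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }

This matches the read path described on slides 19 and 20: only the metadata lookup touches the master node, while the bytes come from the data nodes.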
25. JobTracker daemon: MapReduce master node. Takes care of scheduling and job submission. Splits jobs into tasks (Mappers and Reducers). Assigns tasks to worker nodes. Reassigns tasks in case of failure. Keeps track of job progress. Keeps track of worker nodes through heartbeats.

26. TaskTracker daemon: MapReduce worker process. Starts Mappers and Reducers assigned by the JobTracker. Sends heartbeats to the JobTracker. Sends task progress to the JobTracker. Does not know about the rest of the cluster (shared nothing).

27. [Cluster layout diagram] The NameNode and JobTracker run as master daemons; every worker machine runs both a DataNode and a TaskTracker.

28. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. http://flume.apache.org/

29. image from http://flume.apache.org

30. image from http://flume.apache.org

31. image from http://flume.apache.org

32. Processing the logs, a simple approach: divide time into 15 minute buckets. Count the number of times a search term appears in each bucket. Divide the counts by (current time - time of start of bucket). Rank terms according to the sum of the divided counts.

33. MapReduce, Job 1:

    FIFTEEN_MINUTES = 1000 * 60 * 15

    map(key, value) {
      term = getTermFromLogLine(value)
      time = parseTimestampFromLogLine(value)
      bucket = time - (time % FIFTEEN_MINUTES)
      outputKey = (bucket, term)
      outputValue = 1
      emit(outputKey, outputValue)
    }

    reduce(key, values) {
      outputKey = getTermFromKey(key)
      outputValue = (getBucketFromKey(key), sum(values))
      emit(outputKey, outputValue)
    }

34. MapReduce, Job 2:

    map(key, value) {
      for each prefix of key {
        outputKey = prefix
        count = getCountFromValue(value)
        score = count / (now - getBucketFromValue(value))
        emit(outputKey, (key, score))
      }
    }

    reduce(key, values) {
      for each term in values {
        storeOrUpdateTermScore(term)
      }
      order terms by descending score
      emit(key, top 10 terms)
    }

35. Processing the logs, a simple solution: run Job 1 on new input (written by Flume). When finished processing the new logs, move the files to another location. Run Job 2 after Job 1 on the new and historic output of Job 1. Push the output of Job 2 to a database. Clean up historic outputs of Job 1 after N days.

36. A database. Sort of... Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets. http://redis.io/

37. Insert into Redis: update all keys after each job. Set a TTL of one hour on all keys; obsolete entries go away automatically.

38. How much history do we need?

39. Improvements: use HTTP keep-alive to make sure the TCP connection stays open. Create a diff in Hadoop and only update the relevant entries in Redis. Production hardening (error handling, packaging, deployment, etc.).

40. Options for better algorithms: use more signals than only search: did a user type the text or click a suggestion? Look at cause-effect relationships within sessions: was it a successful search? Did the user buy the product? Personalize: filtering of irrelevant results, re-ranking based on personal behavior.

41. Q&A. Code is at https://github.com/friso/rug. [email protected] @fzk
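As a rough illustration of slides 35-37 (a sketch under assumptions, not code from the companion repo), pushing the ranked suggestions into Redis could look like the following, using the Jedis client: one sorted set per prefix, with a one hour TTL so stale keys expire on their own. The key name, scores and suggestion strings are made up.

    // Sketch only: push suggestion scores for one prefix into a Redis sorted set
    // and give the key a one hour TTL, as described on slides 35-37.
    // Key naming, scores and host settings are assumptions, not the talk's actual code.
    import redis.clients.jedis.Jedis;

    public class SuggestionPusher {
        public static void main(String[] args) {
            try (Jedis redis = new Jedis("localhost", 6379)) {
                String key = "suggest:triso";                      // hypothetical prefix key
                redis.zadd(key, 42.0, "trisodium phosphate");      // score computed by Job 2
                redis.zadd(key, 17.0, "trisodium citrate");
                redis.expire(key, 3600);                           // one hour TTL (slide 37)
            }
        }
    }

The node.js front end could then answer a suggestion request for a typed prefix with a single ZREVRANGE on the corresponding key.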