cloudera-inc
Solution
Wouldn’t it be cool to use lots of EC2 instances
(it’s cheap; nobody will notice)
Wouldn’t it be cool to use Hadoop
(MapReduce Google style is awesome)
Problem Bits
Articles are served as PDFs
Really need PDFs from 1851-1981
PDFs are dynamically generated
Free = more traffic
Real deadline
Background
What goes into making a PDF of a NYTimes.com article?
Each article is made up of many different pieces - multiple columns, different-sized headings, multiple pages, and photos.
Solution
Copy all the source data to S3
Use a bunch of EC2 instances and some Hadoop code to generate all the PDFs
Store the output PDFs in S3
Serve the PDFs out of S3 w/ a signed query string
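A minimal sketch of what "a signed query string" can look like. The slides do not say which signing scheme was used; this assumes the legacy S3 query-string authentication (Signature Version 2, HMAC-SHA1), which has since been superseded by Signature Version 4 presigned URLs. The function name, bucket, key, and credentials are hypothetical.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def signed_s3_url(bucket, key, access_key_id, secret_key, expires_at):
    """Build a query-string-authenticated GET URL (legacy Signature Version 2).

    expires_at is an absolute Unix timestamp in seconds (assumed input).
    """
    resource = "/{}/{}".format(bucket, quote(key))
    # Fields in order: verb, Content-MD5, Content-Type, Expires, resource.
    string_to_sign = "GET\n\n\n{}\n{}".format(expires_at, resource)
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    # The base64 signature must be URL-encoded before it goes in the query.
    signature = quote(base64.b64encode(digest).decode("ascii"), safe="")
    return ("https://{}.s3.amazonaws.com/{}"
            "?AWSAccessKeyId={}&Expires={}&Signature={}").format(
                bucket, quote(key), access_key_id, expires_at, signature)
```

The point of the design is that the PDFs stay private in S3 while the web tier hands out short-lived links, so S3 serves the bytes directly.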
A Few Details
Limited HDFS - everything loaded in and out of S3
Reduce = 0 - only used for some stats and error reporting
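The map-only shape described above can be sketched roughly as follows. This is plain Python rather than actual Hadoop code, and it assumes one reading of the slide: each map task renders and stores a PDF as a side effect, and the only thing emitted is a status used for stats and error reporting. `map_article`, `run_map_only`, `render_pdf`, and `store` are hypothetical names.

```python
def map_article(record, render_pdf, store):
    """Render one article and store it; return a status string.

    render_pdf and store are caller-supplied callables (hypothetical):
    render_pdf(record) -> bytes, store(article_id, pdf_bytes) -> None.
    """
    try:
        pdf_bytes = render_pdf(record)
        store(record["id"], pdf_bytes)
    except Exception as exc:
        # Errors are reported per record instead of failing the whole job.
        return "error:" + type(exc).__name__
    return "ok"


def run_map_only(records, render_pdf, store):
    """Apply the mapper to every record and tally statuses (no reduce step)."""
    counts = {}
    for record in records:
        status = map_article(record, render_pdf, store)
        counts[status] = counts.get(status, 0) + 1
    return counts
```

Because each article is independent, no shuffle or aggregation of the PDFs themselves is needed, which is why the work parallelizes across many machines so easily.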
Breakdown
4.3 TB of source data into S3
11M PDFs - 1.5 TB output
$240 for EC2 - 24 hrs x 100 machines
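The figures above imply a few derived numbers, worked out here as a quick check. Only the inputs come from the slide; the variable names are illustrative, 1 TB is taken as 10**12 bytes, and the $240 covers EC2 only (S3 storage and transfer are not included).

```python
machines = 100
hours = 24
ec2_cost_usd = 240.0
pdf_count = 11_000_000
output_bytes = 1.5e12  # 1.5 TB, assuming 1 TB = 10**12 bytes

machine_hours = machines * hours                    # total machine-hours
implied_rate = ec2_cost_usd / machine_hours         # USD per machine-hour
pdfs_per_machine_hour = pdf_count / machine_hours   # average throughput
avg_pdf_kb = output_bytes / pdf_count / 1000        # average PDF size in KB
cost_per_1000_pdfs = ec2_cost_usd / pdf_count * 1000
```

That works out to 2,400 machine-hours at about $0.10 each, roughly 4,583 PDFs per machine-hour, an average PDF of about 136 KB, and a little over 2 cents of EC2 time per thousand PDFs.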
Clustering
Moving beyond simple counting and joining
Join usage data, demographic information, and article metadata
Apply simple k-means clustering
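A minimal sketch of the join-then-cluster idea, in plain Python rather than as a Hadoop job. It assumes every input table is a dict keyed by the same id with numeric feature lists as values, which is a simplification of joining usage, demographic, and article data. `join_features` and `kmeans` are hypothetical names, and seeding the centroids with the first k points is chosen only to keep the example deterministic.

```python
def join_features(*tables):
    """Inner-join dicts on their shared keys, concatenating feature lists."""
    shared = set(tables[0]).intersection(*tables[1:])
    return {key: [x for table in tables for x in table[key]]
            for key in sorted(shared)}


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[c])))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, assignment
```

With a poor seed this can settle into a bad local minimum, so a real run would more likely use random restarts or k-means++ seeding, and would scale the features first so that no single column dominates the distance.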