Yahoo & Hadoop: Using and Improving Apache Hadoop at Yahoo!

Eric Baldeschwieler, VP, Hadoop Software

© 2011 IBM Corporation



AGENDA

- Brief Overview
- Hadoop @ Yahoo!
- Hadoop Momentum
- The Future of Hadoop

What's happening

- Big Data is here!
- unstructured data
- petabyte scale
- operationally critical

Photo: Flickr / sub_lime79

turning data into insights

- machine learning
- time series
- content clustering
- factorization models
- logic regression
- algorithms
- user interest prediction
- ad inventory modeling

Photo: Flickr / NASA Goddard Photo and Video

making YAHOO relevant

Photo: Flickr / ogimogi

Hadoop: Powering Yahoo!

science + big data + insight = personal relevance = VALUE

Photo: Flickr / DDFic

WHAT IS HADOOP?

[Diagram labels: HDFS, MapReduce, Pig, Hive, Commodity Computers, Network]

Focus on:
- Simplicity
- Redundancy
- Scale
- Availability

Transforms commodity equipment into a service that:
- HDFS: Stores petabytes of data reliably
- MapReduce: Allows huge distributed computations

Key Attributes
- Redundant and reliable: Doesn't stop or lose data even as hardware fails
- Easy to program: Our rocket scientists use it directly!
- Very powerful: Allows the development of big data algorithms & tools
- Batch processing centric
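As a rough illustration of the MapReduce programming model named on this slide, the sketch below simulates the map, shuffle, and reduce phases inside a single Python process. It is not Hadoop code and calls no Hadoop API; the word-count task and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are assumptions chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as a framework would between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce over the input lines in one process."""
    emitted = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(emitted)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    sample = ["big data is here", "big data at petabyte scale"]
    print(word_count(sample))
```

In a real cluster the map and reduce functions would run in parallel on many machines over data stored in HDFS; the single-process version only shows how the two user-written functions fit together.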

WHAT HADOOP ISN'T

- A replacement for relational and data warehouse systems
- A transactional / online / serving system
- A low latency or streaming solution

HADOOP IN THE ENTERPRISE

[Diagram labels]
- Business Applications: Transactions, Structured Data
- Web Logs, Server Logs, Social Media, etc.: Interactions, Semi-Structured or Unstructured Data
- RDBMS, EDW, Data Marts
- Hadoop Cluster(s)
- Business Intelligence Applications

HADOOP @ YAHOO!

Where Science meets Data

HADOOP CLUSTERS
- Tens of thousands of servers
- Terabytes / day (compressed)
- 10s of petabytes

DATA PIPELINES, CONTENT, DIMENSIONAL DATA

PRODUCTS
- Data Analytics
- Content Optimization
- Content Enrichment
- Yahoo! Mail Anti-Spam
- Advertising Products
- Ad Optimization
- Ad Selection
- Big Data Processing & ETL

APPLIED SCIENCE
- User interest prediction
- Ad inventory prediction
- Machine learning: search ranking
- Machine learning: ad targeting
- Machine learning: spam filtering

FROM PROJECT TO CORE PLATFORM

[Chart: growth from 2006 to 2010, with axes "Thousands of Servers" and "Petabytes"]

- 170 PB Storage
- 40K+ Servers
- 5M+ Monthly Jobs

HADOOP POWERS THE YAHOO! NETWORK

- advertising optimization
- ad selection
- Yahoo! Homepage
- machine learning search ranking
- ad inventory prediction
- Yahoo! Mail anti-spam
- user interest prediction
- audience, ad and search pipelines
- advertising data systems
- Content Optimization
- data analytics

CASE STUDY: YAHOO! HOMEPAGE

Personalized for each visitor

Result: twice the engagement
- +160% clicks vs. one size fits all
- +79% clicks vs. randomly selected
- +43% clicks vs. editor selected

[Page modules shown: Recommended links, News Interests, Top Searches]

CASE STUDY: YAHOO! HOMEPAGE

[Diagram labels: SCIENCE HADOOP CLUSTER, PRODUCTION HADOOP CLUSTER, SERVING SYSTEMS, USER BEHAVIOR, ENGAGED USERS]

- Serving Maps: Users - Interests
- Production: serving maps (every 5 minutes)
- Science: categorization models (weekly)

- Identify user interests using categorization models
- Machine learning to build ever better categorization models
- Build customized home pages with latest data (thousands / second)
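The flow on this slide, where categorization models are applied to user behavior to refresh a serving map of users to interests, can be sketched in miniature. Everything below is an assumption for illustration: the keyword dictionary stands in for a learned categorization model, and `build_serving_map`, `top_n`, and the event format are hypothetical names, not Yahoo!'s implementation.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical stand-in for a learned categorization model:
# maps a keyword to an interest category.
CategorizationModel = Dict[str, str]


def build_serving_map(
    events: Iterable[Tuple[str, str]],
    model: CategorizationModel,
    top_n: int = 2,
) -> Dict[str, List[str]]:
    """Turn (user_id, clicked_title) events into a user -> interests map."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for user_id, title in events:
        for word in title.lower().split():
            category = model.get(word)
            if category is not None:
                counts[user_id][category] += 1
    # Keep each user's most frequent categories; ties break alphabetically.
    return {
        user: [
            category
            for category, _ in sorted(
                cnt.items(), key=lambda kv: (-kv[1], kv[0])
            )[:top_n]
        ]
        for user, cnt in counts.items()
    }


if __name__ == "__main__":
    model = {"nba": "sports", "playoffs": "sports", "stocks": "finance"}
    events = [("u1", "NBA playoffs tonight"), ("u1", "Stocks rally")]
    print(build_serving_map(events, model))
```

In the deck's terms, a job like this would be rerun over fresh user behavior every few minutes, while the model itself would be rebuilt on a slower, weekly cycle.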

CASE STUDY: YAHOO! MAIL

Enabling quick response in the spam arms race

- 450M mailboxes
- 5B+ deliveries/day
- Antispam models retrained every few hours on Hadoop
- 40% less spam than Hotmail and 55% less spam than Gmail

[Diagram labels: SCIENCE, PRODUCTION]


YAHOO! & APACHE HADOOP


Yahoo! has contributed 70+% of Apache Hadoop code to date

Hadoop is not our business, but Hadoop is key to our business

Yahoo! benefits from the open source ecosystem around Hadoop

Hadoop drives revenue at Yahoo! by making our core products better

We need Hadoop to be rock solid

We invest heavily in core Hadoop development

We focus on scalability, reliability, availability

We fix bugs before you see them

We run very large clusters

We have a large QA effort

We run a huge variety of workloads

We are good Apache Hadoop citizens

We contribute our work to Apache

We share the exact code we run


HADOOP MOMENTUM

HADOOP IS GOING MAINSTREAM

[Chart: 2007 to 2010. Source: The Datagraph Blog]

THE PLATFORM EFFECT: BIRTH OF AN ECOSYSTEM

[Diagram: a cycle around Apache Hadoop, labeled "Enhance Hadoop Ecosystem"]

- Yahoo! and other Early Adopters: Scale and productize Hadoop
- Orgs with Internet Scale Problems: Add tools / frameworks, enhance Hadoop
- Service Providers: Grow ecosystem (training, support, enhancements)
- Mainstream / Enterprise adoption: Drive further development, enhancements

Virtuous Circle!
- Investment -> Adoption
- Adoption -> Investment


THE FUTURE OF HADOOP

MAKING HADOOP ENTERPRISE-READY: WHAT'S NEXT

- Hadoop is far from done
  - Current implementation is showing its age
  - Need to address several deficiencies in scalability, flexibility, ease of use & performance
- Yahoo! is working on Next Generation of Hadoop
  - MapReduce: Rewrite to improve performance; pluggable support for new programming models
  - HDFS: Adding volumes to improve scalability; flush & sync support for applications that log to HDFS
- Apache should remain the hub of Hadoop ecosystem
  - Yahoo! contributes all Hadoop changes back to Apache Hadoop
  - Everyone benefits from shared neutral foundation


Questions?