Building BI App on Cloud
Rohit Chatter, Sr. Architect
• Yahoo is the most visited site on the Internet
  – 600M+ unique visitors per month
  – Billions of page views per day
  – Billions of searches per month
  – Billions of emails per month
  – Terabytes of data per day!
• And we crawl the Web
  – 100+ billion pages
  – 5+ trillion links
  – Petabytes of data
• Reading 100 terabytes could be overwhelming
Yahoo! BigData Scale
Types in a search query on Yahoo or affiliate site (aka the Publisher)
Passes search query to the ad platform for servable ad listings
Manages campaigns, creates ad listings, bids for keywords
Ad serving returns relevant & available ads matching the search query
Clicks on Ad
Shows ads returned by ad serving
Yahoo! Search Scale
Daily, Weekly, Monthly & Yearly
Daily, Weekly, Monthly & Yearly
Daily, Hourly, Weekly, Monthly & Yearly
Daily, Weekly, Monthly & Yearly
Daily, Hourly, Weekly, Monthly & Yearly
Performance, Credit Summary
Performance, Budget Headroom, AM performance, competitive analysis
Performance, Feature Adoption
Competitive analysis, cross sell, upsell, performance
Business Model
Business Performance monitoring
RDBMS Facts
Home Grown App Level 1 & 2 analysis
Granular aggregates
Home Grown App: what-if analysis and deep-dive data analysis
Most granular data- event level model
Tactical & Operational reporting
Improvement & Alignment
Excellence & Strategic
Hour Glass Model – A Perspective
Functional View
Data – 100+ Gigabytes/Day
Hadoop Grid + PIG, Cloud
Aggregates & Metadata layer
App Server – BI layer
Data Source: Dimension & Fact
Utility Computing: Build Aggregates
Oracle RDBMS: BI Aggregates (H, D, W, M)
BI Tool/Home Grown
What is computed where
Metrics: Impressions, Revenue, Clicks, Conversions, Quality Score, Top keywords
Rollups, Type 2 Dimension,
Alerts & Messaging
Load balanced web: Apache Web Server
Derived Metrics – CTR, Depth, RPM, Coverage
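The derived metrics named above are ratios of the additive base metrics. The following is a minimal sketch, assuming common search-advertising definitions that the slides do not spell out: CTR as clicks per impression, RPM as revenue per thousand searches, Coverage as the share of searches that return at least one ad, and Depth as the average number of ads shown on a search that has ads. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BaseMetrics:
    """Additive base metrics for one reporting grain (hypothetical field names)."""
    searches: int            # total search queries
    searches_with_ads: int   # searches that returned at least one ad
    impressions: int         # ad impressions shown
    clicks: int
    revenue: float


def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def derived_metrics(m: BaseMetrics) -> dict:
    # The definitions below are assumptions, not taken from the slides.
    return {
        "ctr": safe_div(m.clicks, m.impressions),
        "rpm": 1000.0 * safe_div(m.revenue, m.searches),
        "coverage": safe_div(m.searches_with_ads, m.searches),
        "depth": safe_div(m.impressions, m.searches_with_ads),
    }


example = BaseMetrics(searches=10_000, searches_with_ads=4_000,
                      impressions=12_000, clicks=600, revenue=250.0)
result = derived_metrics(example)
```

Ratios like these are not additive, so a sketch of this kind computes them from summed base metrics at query time instead of storing and averaging them per row.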
BI on Cloud [1000ft view]
BI on Cloud – Screen Shots
CUBE on Hadoop?
Oracle
ETL/Aggregation
I-CUBE
HADOOP
MicroStrategy
Home Grown Tools
ART
Tradition
APOLLO FEEDS
I-CUBE
HADOOP
BI Tool
Home Grown Tools
ART
HBASE
Aggregation in HIVE
Game Changer – HBase & Schema
Hiveserver
JDBC/ODBC
How do we do it?
RowKey        | Day Metrics  | Week Metrics   | MTD Metrics | SCD Info   | Offer Stats
OrderId-MMYY  | D1 D2 … Dn   | Wx Wx+1 … Wy   | Imp Clicks  | Name Email | …
• HTable – schema-less
• Use the HBase incrementor (incrementColumnValue) for Weekly & MTD
• Hive windowing UDF to generate a flattened daily row
• Carefully choose the rowkey
• SCD – comes free
• Performance – physical file (HFile) by table & column family
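The wide-row layout and the counter increments described above can be sketched without a cluster. This is an illustration only: `WideRowTable` is a hypothetical in-memory stand-in, not an HBase client, and the column names (`day:D5`, `week:W10`, `mtd:Clicks`) are assumed for the example. In a real deployment the `increment` call would correspond to the server-side `incrementColumnValue` mentioned in the slide.

```python
from collections import defaultdict
from datetime import date


class WideRowTable:
    """In-memory stand-in for a schema-less wide-row table (not a real HBase client)."""

    def __init__(self):
        # row key -> {"family:qualifier" -> counter value}
        self.rows = defaultdict(lambda: defaultdict(int))

    def increment(self, row_key, column, amount):
        """Add amount to a counter cell, creating the cell on first use."""
        self.rows[row_key][column] += amount
        return self.rows[row_key][column]


def row_key(order_id, day):
    """Build an OrderId-MMYY key so one row holds one order's month."""
    return f"{order_id}-{day:%m%y}"


def record_day(table, order_id, day, clicks):
    """Write the daily cell and bump the weekly and MTD counters."""
    key = row_key(order_id, day)
    week = day.isocalendar()[1]  # ISO week number
    table.increment(key, f"day:D{day.day}", clicks)
    table.increment(key, f"week:W{week}", clicks)
    table.increment(key, "mtd:Clicks", clicks)


table = WideRowTable()
record_day(table, "1001", date(2012, 3, 5), 40)
record_day(table, "1001", date(2012, 3, 6), 25)
record_day(table, "1001", date(2012, 3, 12), 10)
row = table.rows["1001-0312"]
```

Because columns are created on first write, new days and weeks need no schema change. One consequence of the month-based key in this sketch is that a week spanning two months is split across two rows.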
Number Game
• Size – 360GB
• Format – RCFile
• Rows – 14.7 Billion
• Mappers – 562
• Reducers – 436
• Elapsed Time <= 30 mins
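To put the job figures above in perspective, the arithmetic below converts them into average throughput. It is a rough sketch: the function name is hypothetical, the input is assumed to be spread evenly over the mappers, and since the slide gives 30 minutes as an upper bound on elapsed time, the rates are lower bounds.

```python
def job_throughput(size_gb, rows, elapsed_seconds, mappers):
    """Return average throughput figures for one batch job."""
    return {
        "gb_per_second": size_gb / elapsed_seconds,
        "rows_per_second": rows / elapsed_seconds,
        "gb_per_mapper": size_gb / mappers,
    }


# Figures from the slide; 30 minutes is the stated upper bound on elapsed time.
stats = job_throughput(size_gb=360, rows=14.7e9,
                       elapsed_seconds=30 * 60, mappers=562)
```

With these inputs the job sustains at least 0.2 GB/s and roughly 8.2 million rows per second, with about 0.64 GB of input per mapper.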
Hadoop/RDBMS
BIG DATA
SLA
Challenge@Hand
What users love? – Excel & Pivot
What if I need to pivot a few million records, or maybe billions of records?
But “Hang” on a minute? – BIG DATA?
Our Answer – Hadoop Pivot
Number Game
• Size – 360GB
• Format – RCFile
• Rows – 14.7 Billion
• Mappers – 670
• Reducers – 30
• Elapsed Time – 251 secs [< 5 mins]
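The slides do not describe how Hadoop Pivot is implemented. One plausible reading is a map/reduce aggregation keyed on the pivot's row and column fields, and the sketch below simulates that idea in a single process. The function names, field names, and sample records are all hypothetical.

```python
from collections import defaultdict


def map_phase(records, row_field, col_field, value_field):
    """Emit ((row, column), value) pairs, one per input record."""
    for rec in records:
        yield (rec[row_field], rec[col_field]), rec[value_field]


def reduce_phase(pairs):
    """Sum the values for each (row, column) key, as a reducer would."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return totals


def pivot(records, row_field, col_field, value_field):
    """Return {row: {column: total}}, the shape of a spreadsheet pivot table."""
    totals = reduce_phase(map_phase(records, row_field, col_field, value_field))
    table = defaultdict(dict)
    for (row, col), total in totals.items():
        table[row][col] = total
    return dict(table)


records = [
    {"advertiser": "A", "month": "Jan", "clicks": 10},
    {"advertiser": "A", "month": "Jan", "clicks": 5},
    {"advertiser": "A", "month": "Feb", "clicks": 7},
    {"advertiser": "B", "month": "Jan", "clicks": 3},
]
pivoted = pivot(records, "advertiser", "month", "clicks")
```

The point of the design is that the heavy aggregation runs where the data lives, and only the much smaller pivoted result has to reach a desktop tool.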
Voila – Back to Excel
Questions?
Hadoop HDFS – Hourly Feeds
Hadoop HDFS Grid – Daily Feeds & Aggregates
Oracle RAC 8 Node, 60TB
Oracle ETL Server
BI App Server
BI Web Server
App Server, Grid Launcher Box
GRID Based Report Web Server
Metadata
Unified Web BI Portal
Web Services Data Access Layer [ODBC/PL/SQL API]
Dimensions – HBase
Facts on HDFS [RCFile]
Other Tools
TRADITIONAL
GRID
Hive + PIG – Query Engine
Scheduler