© Hortonworks Inc. 2011
Trends and usage of Apache Hadoop
January 2012
Eric Baldeschwieler CEO Hortonworks Twitter: @jeric14, @hortonworks
Agenda
• Define terms – What is Hadoop? Why does Hadoop matter?
• What drives Hadoop adoption?
• Observed Trends
Architecting the Future of Big Data
Hortonworks Vision
We believe that by 2015, more than half the world's data will be processed by Apache Hadoop.
How do we achieve that vision? Enable an ecosystem around an enterprise-viable platform.
What is Apache Hadoop?
• Solution for big data
– Deals with complexities of high volume, velocity & variety of data
• Set of open source projects
• Transforms commodity hardware into a service that:
– Stores petabytes of data reliably
– Allows huge distributed computations
• Key attributes:
– Redundant and reliable (no data loss)
– Extremely powerful
– Batch processing centric
– Easy to program distributed apps
– Runs on commodity hardware
One of the best examples of open source driving innovation and creating a market
ZooKeeper (Coordination)
Core Apache Hadoop Related Hadoop Projects
HDFS (Hadoop Distributed File System)
MapReduce (Distributed Programming Framework)
Hive (SQL)
Pig (Data Flow)
HCatalog (Table & Schema Management)
Hortonworks Data Platform (HDP) Key Components of “Standard Hadoop” Open Source Stack
HBase (Columnar NoSQL Store)
Open APIs for:
• Data Integration
• Data Movement
• App Job Management
• System Management
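A minimal sketch of the MapReduce programming model named in the stack above, written in plain Python rather than against the Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are illustrative assumptions, and everything runs in one process; on a cluster the framework distributes the map and reduce calls and performs the grouping step itself.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit one (word, 1) pair for every word in a single input line.
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    # Shuffle: group all values by key, as the framework would do
    # between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Reduce: collapse all values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    # Wire the three phases together for a word-count job.
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

The programmer writes only the map and reduce functions; this separation is what the "easy to program distributed apps" attribute refers to.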
Big Data Trailblazers and Use Cases
advertising optimization
mail anti-spam
video & audio processing
ad selection
web search
user interest prediction
customer trend analysis
analyzing web logs
content optimization
data analytics
machine learning
data mining
text mining
social media
Yahoo!, Apache Hadoop & Hortonworks
http://www.wired.com/wiredenterprise/2011/10/how-yahoo-spawned-hadoop
Hadoop at Yahoo!
• 40K+ Servers
• 170PB Storage
• 5M+ Monthly Jobs
• 1000+ Active Users
2006: Yahoo! embraced Apache Hadoop, an open source platform, to crunch epic amounts of data using an army of dirt-cheap servers
2011: Yahoo! spun off 22+ engineers into Hortonworks, a company focused on advancing open source Apache Hadoop for the broader market
What drives Hadoop adoption?
Market Drivers for Apache Hadoop
Gartner predicts 800% data growth over next 5 years
80-90% of data produced today is unstructured
• Business drivers
– High-value projects that require use of more data
– Belief that there is great ROI in mastering big data
• Financial drivers
– Growing cost of data systems as percentage of IT spend
– Cost advantage of commodity hardware + open source
– Enables departmental-level big data strategies
• Technical drivers
– Existing solutions failing under growing requirements
– 3Vs: volume, velocity, variety
– Proliferation of unstructured data
Every Market has Big Data
Source: McKinsey & Company report. Big data: The next frontier for innovation, competition, and productivity. May 2011.
Digital data is personal, everywhere, increasingly accessible, and will continue to grow exponentially
Broader Use Case Opportunities
Financial Services
• Detect/prevent fraud
• Model and manage risk
• Personalize banking/insurance products
• Compliance, Archival, …
Healthcare
• Patient monitoring
• Predictive modeling
• Compliance, Archival, text search
• Data driven research
Retail
• Behavior analysis
• Cross selling, recommendation engines
• Optimize pricing, placement, design
• Optimize inventory and distribution
Web / Social / Mobile
• Sentiment analysis
• Web log, image, and video analysis
• Personalization
• Billing, Reporting, Network Analysis
Manufacturing
• Simulation, Analysis, Design
• Improve service via product sensor data
• “Digital factory” for lean manufacturing
Government
• Detect/prevent fraud
• Security & Intelligence
• Support open data initiatives
Observed Trends
Trend: Agile Data
• The old way
– Operational systems keep only current records, short history
– Analytics systems keep only conformed / cleaned / digested data
– Unstructured data locked away in operational silos
– Archives offline
– Inflexible, new questions require system redesigns
• The new trend
– Keep raw data in Hadoop for a long time
– Able to produce a new analytics view on-demand
– Keep a new copy of data that was previously only in silos
– Can directly do new reports, experiments at low incremental cost
– New products / services can be added very quickly
– Agile outcome justifies new infrastructure
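One way to picture "keep raw data, produce a new analytics view on-demand" is the sketch below. It is an illustration under stated assumptions, not a Hadoop job: the records in `RAW_EVENTS`, their field names, and the `build_view` helper are all hypothetical, and a real deployment would read the raw records from HDFS and run the aggregation as a batch job.

```python
import json
from collections import Counter

# Hypothetical raw event log, kept as-is rather than pre-digested.
RAW_EVENTS = [
    '{"user": "a", "page": "/home", "device": "mobile"}',
    '{"user": "b", "page": "/home", "device": "desktop"}',
    '{"user": "a", "page": "/mail", "device": "mobile"}',
]


def build_view(raw_lines, field):
    # Parse the raw records at read time and count them by whichever
    # field the new question needs; no upfront schema design required.
    counts = Counter()
    for line in raw_lines:
        record = json.loads(line)
        if field in record:
            counts[record[field]] += 1
    return dict(counts)


if __name__ == "__main__":
    print(build_view(RAW_EVENTS, "page"))
    print(build_view(RAW_EVENTS, "device"))
```

Because the raw records are retained, a question nobody anticipated (here, counting by `device` instead of `page`) needs only a new call, not a redesign of the stored data.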
Traditional Enterprise Data Architecture Data Silos
EDW Data Marts
BI / Analytics
Traditional Data Warehouses, BI & Analytics Serving Applications
Web Serving
NoSQL, RDBMS, …
Unstructured Systems
Serving Logs
Social Media
Sensor Data
Text Systems …
Traditional ETL & Message buses
Agile Data Architecture w/Hadoop Connecting All of Your Big Data
EDW Data Marts
BI / Analytics
Traditional Data Warehouses, BI & Analytics Serving Applications
Web Serving
NoSQL, RDBMS, …
Unstructured Systems
Serving Logs
Social Media
Sensor Data
Text Systems …
EsTsL (s = Store) Custom Analytics
Traditional ETL & Message buses
Trend: Data driven development
• Limited runtime logic driven by huge lookup tables
• Data computed offline on Hadoop
– Machine learning, other expensive computation offline
– Personalization, classification, fraud, value analysis…
• Application development requires data science
– Huge amounts of actually observed data key to modern services
– Hadoop used as the science platform
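The split between expensive offline computation and limited runtime logic can be sketched as follows. This is a toy illustration, not the method used by any system in this deck: the event format, the "most-viewed category" rule, and the names `build_interest_table` and `pick_module` are assumptions chosen for brevity.

```python
from collections import Counter, defaultdict


def build_interest_table(events):
    # Offline step (would run as a batch job on the cluster): count
    # category views per user and keep each user's most-viewed category.
    counts = defaultdict(Counter)
    for user_id, category in events:
        counts[user_id][category] += 1
    return {user: c.most_common(1)[0][0] for user, c in counts.items()}


def pick_module(table, user_id, default="top_news"):
    # Runtime step: a single table lookup, with a fallback for
    # users the offline job has not seen.
    return table.get(user_id, default)


if __name__ == "__main__":
    events = [("u1", "sports"), ("u1", "sports"), ("u1", "finance"), ("u2", "tech")]
    table = build_interest_table(events)
    print(pick_module(table, "u1"), pick_module(table, "u3"))
```

All of the modeling cost sits in `build_interest_table`; the serving path stays a constant-time lookup no matter how elaborate the offline computation becomes.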
CASE STUDY YAHOO! HOMEPAGE
• Serving Maps
• Users - Interests
• Five Minute Production
• Weekly Categorization models
[Diagram: USER BEHAVIOR → SCIENCE HADOOP CLUSTER → CATEGORIZATION MODELS (weekly) → PRODUCTION HADOOP CLUSTER → SERVING MAPS (every 5 minutes) → SERVING SYSTEMS → ENGAGED USERS; USER BEHAVIOR also feeds the PRODUCTION HADOOP CLUSTER]
» Identify user interests using Categorization models
» Machine learning to build ever better categorization models
Build customized home pages with latest data (thousands / second)
Copyright Yahoo 2011
CASE STUDY YAHOO! HOMEPAGE
Personalized for each visitor
Result: twice the engagement
+160% clicks vs. one size fits all
+79% clicks vs. randomly selected
+43% clicks vs. editor selected
Recommended links News Interests Top Searches
Trend: Specialization of Data Systems
• Hadoop does not replace existing systems
– It adds new capabilities to the enterprise
– It can offload things that are not done efficiently in current systems
– Especially in scale out situations
• Specialization of traditional data components
– Use OLTP systems just for transactions
– Use OLAP systems for interactive analysis
• Hadoop has LOTS of bandwidth to storage and CPU
– Pull reporting out of OLTP systems
– Pull ELT out of OLAP systems
Hadoop and OLTP Systems
Web Site
MPP Processing of Online Transactions
• Mission critical
• Manages transactions & serves reports
Transaction Processing
Systems
$$$
Reports
Transaction Logs
Hadoop used to Process Reports
• Free up 50+% processing power for transaction processing system
• Significant cost savings due to commodity nature of Hadoop
Web Site
Web Site
Hadoop and OLAP Systems
Mobile
Social
Other logs
Web
Hadoop EDW
Fast loading, raw data staging, ELT & long-term archival
(The Agile Data Zone)
Allow analysts to use tools they know
(Take advantage of huge ecosystem of BI and Analytics tooling)
Online Archival
TRENDS: Instrument Clouds of Things
Clouds of things logging to Hadoop
Websites, Mobile phones, Enterprise devices…
HDFS + Map-Reduce Or HBase
+ Analysis
Things Things
Things Things
Things Things
Trend: Many POCs, Few Production Systems
• The problem
– Hadoop is still a young technology
– Hard to find knowledgeable staff
– Integration with existing systems
• Hadoop market is maturing at speed
– Emerging ecosystem of Hadoop platform solutions providers
– Apache Hadoop continues to get better
– Hadoop training and support available from several vendors
Growth in Hadoop Ecosystem
• Hardware vendors, Public Cloud (IaaS, PaaS)
– Storage, Appliances, Preloaded commodity boxes, cloud
• Data Systems
– All the major vendors announced Hadoop plans / products in 2011
• BI, Analytics and ETL
– Hadoop integrations emerging
• Dedicated Hadoop Applications
– Datameer, Karmasphere, Platfora, …
• Systems Integrators
– Regional and Global providers available
Hadoop Continues to Improve
“Hadoop.Now” (Hadoop 1.0)
Most stable version ever HBase, security, WebHDFS
“Hadoop.Next” (Hadoop 0.23)
HA, Next-gen HDFS & MapReduce Extension & Integration APIs
“Hadoop.Beyond” Platform actively evolving
Apache community, including Hortonworks, investing to improve Hadoop:
• Make Hadoop an Open, Extensible, and Enterprise Viable Platform
• Enable More Applications to Run on Apache Hadoop
Hortonworks – Approachable Hadoop
• Apache Hadoop Leadership
– Delivered every major release since 0.1
– Driving innovation across entire stack
– Experience managing world’s largest deployment
– Access to Yahoo’s 1,000+ Hadoop users and 40k+ nodes for testing, QA, etc.
• Business Focus
– Provide 100% open source product: Hortonworks Data Platform
– Help customers and partners overcome Hadoop knowledge gaps
– Help organizations successfully develop and deploy solutions based on Hadoop
Evaluate Pilot Production
Expert Role-based Training
Full Lifecycle Support and Services
Trend: Finding More Value Over Time
• Hadoop is usually brought in to solve a specific problem
– Build search indexes for Yahoo
– Manage web site logs for Facebook
– Users using EC2 to do data processing at Amazon
– Simple reporting when existing tools don’t scale
• Once your data is in Hadoop more users find value
• Once you have Hadoop, folks add more data
Thank You! Questions? Eric Baldeschwieler @jeric14 @hortonworks