2 December 2011
Hadoop in Three Use Cases
Joey Echeverria | Solutions Architect | [email protected] | @fwiffo
©2011 Cloudera, Inc. All Rights Reserved.
About Joey
• Solutions Architect
• 6 months
• 3+ years
• Local
Cloudera’s Distribution including Apache Hadoop
• Coordination: Apache ZooKeeper
• Data Integration: Apache Flume*, Apache Sqoop*
• Fast Read/Write Access: Apache HBase
• Languages / Compilers: Apache Pig, Apache Hive
• Workflow: Apache Oozie*
• Scheduling: Apache Oozie*
• Metadata: Apache Hive
• File System Mount: FUSE-DFS
• UI Framework: HUE
• SDK: HUE SDK
*currently under incubation in the Apache Software Foundation
Extract, Transform, and Load
ETL before Hadoop
Difficult to maintain, not scalable
[Diagram: Logs, Files, Relational Databases, Custom ETL Scripts, Enterprise Data Warehouse]
ETL before Hadoop
May be scalable, expensive
[Diagram: Logs, Files, Relational Databases, Enterprise Data Warehouse (SQL: raw table → warehouse tables)]
ETL with Hadoop
Managed, flexible, scalable
[Diagram: Logs, Files, Relational Databases, Enterprise Data Warehouse]
Steps
1. In
2. Process
3. Out
Flume
ETL with Hadoop
Managed, flexible, scalable
[Diagram: Logs, Files, Relational Databases, Enterprise Data Warehouse, Flume]
HDFS
[Diagram: a Client calls open(“file.txt”) on the NameNode, which answers with DataNodes 02, 06, 10; the Client then reads data from those nodes. The cluster holds DataNode 01 through DataNode 12.]
HDFS
• Distributed
• Replication
• Bulk I/O
• Fault tolerant
• Scalable
• Append only
• Not POSIX
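The read path and the replication bullet can be sketched as a toy model. This is not the Hadoop API: the class names, the round-robin placement, and the tiny block size are all assumptions made for illustration (real HDFS placement is rack-aware and blocks are far larger).

```python
# Toy model of an HDFS read: the client asks the NameNode where a file's
# blocks live, then fetches the bytes directly from DataNodes.
# Every name here is hypothetical; none of this is the Hadoop client API.

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes


class NameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes      # name -> DataNode
        self.replication = replication  # copies kept of every block
        self.files = {}                 # path -> [(block id, [DataNode names])]

    def create(self, path, data, block_size=4):
        """Split data into blocks and store each block on several nodes."""
        names = sorted(self.datanodes)
        placements = []
        for offset in range(0, len(data), block_size):
            index = offset // block_size
            block_id = f"{path}#{index}"
            # Simple round-robin placement (an assumption, not HDFS's policy).
            start = index % len(names)
            targets = [names[(start + k) % len(names)]
                       for k in range(self.replication)]
            for target in targets:
                self.datanodes[target].blocks[block_id] = data[offset:offset + block_size]
            placements.append((block_id, targets))
        self.files[path] = placements

    def open(self, path):
        """Return block locations only; file bytes never pass through here."""
        return self.files[path]


def read(namenode, path, dead=()):
    """Read a file, skipping replicas on nodes listed in `dead`."""
    out = b""
    for block_id, locations in namenode.open(path):
        live = [name for name in locations if name not in dead]
        if not live:
            raise IOError(f"all replicas lost for {block_id}")
        out += namenode.datanodes[live[0]].blocks[block_id]
    return out
```

Because every block has three replicas in this sketch, a read still succeeds when the first node in a block's location list is unavailable, which is the fault-tolerance point from the list above.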
ETL with Hadoop
Managed, flexible, scalable
[Diagram: Logs, Files, Relational Databases, Enterprise Data Warehouse, Flume, HDFS]
FUSE-DFS
• FUSE
– User space
– File systems
• FUSE-DFS
– /hdfs
– Mostly transparent
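"Mostly transparent" means ordinary file APIs work against the mount. The sketch below uses only standard file calls; a temporary directory stands in for the mount point so the code runs anywhere, and the file name is hypothetical. With FUSE-DFS mounted at /hdfs, the same function would be given a path under /hdfs.

```python
import os
import tempfile


def count_lines(path):
    """Count lines using only the ordinary file API, no Hadoop client."""
    with open(path, "r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def demo():
    # Assumption: a temp directory plays the role of the /hdfs mount point.
    with tempfile.TemporaryDirectory() as mount_point:
        path = os.path.join(mount_point, "events.log")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("login\nclick\nlogout\n")
        return count_lines(path)
```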
ETL with Hadoop
Managed, flexible, scalable
[Diagram: Logs, Files, Relational Databases, Enterprise Data Warehouse, Flume, FUSE-DFS, HDFS]
Sqoop
• SQL to Hadoop
• Parallel import
• File formats
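One way to picture a parallel import is to split a numeric key range into one contiguous chunk per worker, so each worker issues its own bounded query. The sketch below shows that idea only; the function names and the `id` column are hypothetical, and Sqoop's own splitting logic differs in detail.

```python
def split_ranges(min_id, max_id, num_workers):
    """Divide the inclusive range [min_id, max_id] into contiguous chunks."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_workers)
    ranges = []
    low = min_id
    for i in range(num_workers):
        # The first `extra` workers take one additional row each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more workers than rows
        high = low + size - 1
        ranges.append((low, high))
        low = high + 1
    return ranges


def where_clauses(column, ranges):
    """Build one bounded predicate per worker."""
    return [f"{column} >= {low} AND {column} <= {high}" for low, high in ranges]
```

Each worker can then run its own query with its own predicate, so the chunks are read from the database in parallel and written to separate output files.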
ETL with Hadoop
Managed, flexible, scalable
[Diagram: Logs, Files, Relational Databases, Enterprise Data Warehouse, Flume, FUSE-DFS, Sqoop, HDFS]
Pig
• Scripting language
• Generates MapReduce jobs
• Perl for Hadoop
• Great for ETL
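-- Load three integer columns from 'data'
A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int);
-- Group the rows by the first column
B = GROUP A BY f1;
-- COUNT takes a bag: in B, $0 is the group key and A is the bag of grouped rows
C = FOREACH B GENERATE COUNT(A);
DUMP C;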
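For readers new to Pig, the script above loads three integer columns, groups rows by f1, and counts the rows in each group. A rough single-machine Python equivalent is sketched below; the function name is hypothetical, and unlike the Pig script this version also keeps the group key alongside each count.

```python
from collections import Counter


def count_by_f1(rows):
    """rows: iterable of (f1, f2, f3) tuples. Returns the group size per f1."""
    return Counter(f1 for f1, _f2, _f3 in rows)
```

Pig compiles the same logic into MapReduce jobs, so it scales past what fits in one process.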
ETL with Hadoop
Managed, flexible, scalable
[Diagram: Logs, Files, Relational Databases, Enterprise Data Warehouse, Flume, FUSE-DFS, Sqoop, HDFS, Pig]
Sqoop with connectors
• MySQL*
• PostgreSQL*
• Teradata*
• Netezza*
• Oracle*
• Couchbase*
• Microsoft SQL Server
• VoltDB
*Cloudera certified connector
ETL with Hadoop
Managed, flexible, scalable
[Diagram: Logs, Files, Relational Databases, Enterprise Data Warehouse, Flume, FUSE-DFS, Sqoop, HDFS, Sqoop, Pig]
Recommendations
Recommendations with Hadoop
[Diagram: Logs, Relational Databases, Web Application, CUSTOMERS]
Flume
Recommendations with Hadoop
[Diagram: Logs, Relational Databases, Flume, Web Application, CUSTOMERS]
HDFS
Recommendations with Hadoop
[Diagram: Logs, Relational Databases, Flume, HDFS, Web Application, CUSTOMERS]
Sqoop
Recommendations with Hadoop
[Diagram: Logs, Relational Databases, Flume, Sqoop, HDFS, Web Application, CUSTOMERS]
Pig
Recommendations with Hadoop
[Diagram: Logs, Relational Databases, Flume, Sqoop, HDFS, Pig, Web Application, CUSTOMERS]
Mahout
• Scalable machine learning algorithms
– Collaborative Filtering
– User and Item based recommenders
– K-Means, Fuzzy K-Means clustering
– Mean Shift clustering
– Singular value decomposition
– Complementary Naive Bayes classifier …
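To make "item based recommenders" concrete, here is a minimal co-occurrence sketch: items that often appear together in users' histories are recommended to users who have seen only one of them. It is plain Python, not Mahout's API, and the function names and scoring rule are assumptions chosen for brevity.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence(histories):
    """Count how often each ordered pair of items shares a user's history."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def recommend(user, histories, top_n=3):
    """Score unseen items by their co-occurrence with the user's items."""
    seen = set(histories[user])
    counts = cooccurrence(histories)
    scores = defaultdict(int)
    for item in seen:
        for other, count in counts[item].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties broken alphabetically so output is stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked][:top_n]
```

At real scale the co-occurrence counting is the expensive step, which is why it is a natural fit for MapReduce.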
Recommendations with Hadoop
[Diagram: Logs, Relational Databases, Flume, Sqoop, HDFS, Pig, Web Application, CUSTOMERS, Mahout]
MapReduce
[Diagram: in the map phase, three toOne() tasks each emit three key:1 pairs; the shuffle groups the pairs by key into [1,1,1,1], [1,1], [1,1], and [1]; in the reduce phase, count() tasks output 4, 2, 2, and 1.]
MapReduce
• Distributed
• Code to data
• Reliable
• Scalable
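The map, shuffle, and reduce phases can be imitated in a few lines of single-process Python. The names to_one and count mirror the toOne() and count() labels on the diagram slide; treating each input record as a line of whitespace-separated words is an assumption, and none of this is the Hadoop API.

```python
from collections import defaultdict


def to_one(record):
    """Map: emit a (key, 1) pair for every word in the record."""
    return [(word, 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: gather all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def count(key, values):
    """Reduce: collapse a key's list of ones into a total."""
    return key, sum(values)


def run(records):
    mapped = [pair for record in records for pair in to_one(record)]
    grouped = shuffle(mapped)
    return dict(count(key, values) for key, values in grouped.items())
```

In Hadoop the map and reduce calls run on many machines, close to the data, and the framework performs the shuffle and reruns failed tasks.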
Recommendations with Hadoop
[Diagram: Logs, Relational Databases, Flume, Sqoop, HDFS, Pig, Web Application, CUSTOMERS, Mahout, MapReduce, Pig]
Oozie
• Workflows
• Coordinator
– Triggers
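As a rough illustration of the two ideas, a workflow can be modeled as actions run in dependency order, and a coordinator as something that decides when to start a run. The sketch below uses Python's standard graphlib module (Python 3.9+); the action names and the minute-based schedule are hypothetical, and real Oozie workflows and coordinators are defined in XML rather than code like this.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(actions, depends_on):
    """Run each action after the actions it depends on.

    actions:    name -> zero-argument callable
    depends_on: name -> set of action names that must finish first
    """
    order = TopologicalSorter(depends_on).static_order()
    return [(name, actions[name]()) for name in order]


def trigger_times(start, end, every):
    """Time-based trigger: minutes in [start, end) at which a run begins."""
    return list(range(start, end, every))
```

A data-availability trigger would follow the same shape, with the coordinator checking for an input directory instead of the clock.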
Recommendations with Hadoop
[Diagram: Logs, Relational Databases, Flume, Sqoop, HDFS, Pig, Web Application, CUSTOMERS, Mahout, MapReduce, Pig, Oozie]
HBase
• Key/value store
• Data stored in HDFS
• Access model is get/put/del
– Plus range scans and versions
• Random reads and writes for Hadoop
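The access model can be illustrated with a small in-memory table that supports get, put, delete, range scans over sorted keys, and a bounded number of versions per row. This is a toy sketch under stated assumptions, not the HBase client API: the class and method names are hypothetical, and it ignores column families, persistence in HDFS, and distribution across servers.

```python
import bisect


class ToyTable:
    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.rows = {}   # row key -> [(timestamp, value)], newest first
        self.keys = []   # row keys kept sorted so range scans are cheap

    def put(self, key, value, ts):
        if key not in self.rows:
            bisect.insort(self.keys, key)
            self.rows[key] = []
        versions = self.rows[key]
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # drop the oldest versions

    def get(self, key, versions=1):
        """Return up to `versions` values for the key, newest first."""
        return [value for _, value in self.rows.get(key, [])[:versions]]

    def delete(self, key):
        if key in self.rows:
            del self.rows[key]
            self.keys.remove(key)

    def scan(self, start, stop):
        """Yield (key, newest value) for start <= key < stop, in key order."""
        low = bisect.bisect_left(self.keys, start)
        high = bisect.bisect_left(self.keys, stop)
        for key in self.keys[low:high]:
            yield key, self.rows[key][0][1]
```

Keeping keys sorted is what makes range scans efficient, so row key design matters for scan-heavy workloads.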
Recommendations with Hadoop
[Diagram: Logs, Relational Databases, Flume, Sqoop, HDFS, Pig, Web Application, CUSTOMERS, Mahout, MapReduce, Pig, Oozie, HBase]
Business Intelligence
Business Intelligence with Hadoop
[Diagram: Logs, Relational Databases, BI / Analytics, ANALYSTS]
Flume
Business Intelligence with Hadoop
[Diagram: Logs, Relational Databases, Flume, BI / Analytics, ANALYSTS]
HDFS
Business Intelligence with Hadoop
[Diagram: Logs, Relational Databases, Flume, HDFS, BI / Analytics, ANALYSTS]
Sqoop
Business Intelligence with Hadoop
[Diagram: Logs, Relational Databases, Flume, Sqoop, HDFS, BI / Analytics, ANALYSTS]
Hive
• Data warehouse
• Ad-hoc queries
– Not real-time (minutes)
• SQL
• Tables
• Joins
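The kind of ad-hoc query meant here is a join plus an aggregate over tables. The sketch below runs such a query against SQLite from Python's standard library, purely as a stand-in so the example is runnable; it is not Hive, the table and column names are hypothetical, and Hive's SQL dialect differs from SQLite's in places.

```python
import sqlite3


def build_warehouse():
    """Create two small tables in memory (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER, name TEXT);
        CREATE TABLE page_views (user_id INTEGER, url TEXT);
        INSERT INTO users VALUES (1, 'ann'), (2, 'bo');
        INSERT INTO page_views VALUES (1, '/home'), (1, '/cart'), (2, '/home');
    """)
    return conn


def views_per_user(conn):
    """Join the two tables and count page views per user."""
    query = """
        SELECT u.name, COUNT(*) AS views
        FROM page_views v
        JOIN users u ON u.id = v.user_id
        GROUP BY u.name
        ORDER BY views DESC, u.name
    """
    return conn.execute(query).fetchall()
```

In Hive a query of this shape is compiled into MapReduce jobs, which is why answers arrive in minutes rather than milliseconds.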
Business Intelligence with Hadoop
[Diagram: Logs, Relational Databases, Flume, Sqoop, HDFS, BI / Analytics, ANALYSTS, Hive]
MapReduce
Business Intelligence with Hadoop
[Diagram: Logs, Relational Databases, Flume, Sqoop, HDFS, BI / Analytics, ANALYSTS, Hive, MapReduce]
Oozie
Business Intelligence with Hadoop
[Diagram: Logs, Relational Databases, Flume, Sqoop, HDFS, BI / Analytics, ANALYSTS, Hive, Oozie, MapReduce]
HBase
Business Intelligence with Hadoop
[Diagram: Logs, Relational Databases, Flume, Sqoop, HDFS, HBase, BI / Analytics, ANALYSTS, Hive, Oozie, MapReduce]
Hive
Hive for Business Intelligence
• JDBC
– JasperReports*
– Pentaho*
• ODBC
– MicroStrategy*^
* Vendor certified connector
^ Cloudera certified connector
Business Intelligence with Hadoop
[Diagram: Logs, Relational Databases, Flume, Sqoop, HDFS, Hive, HBase, BI / Analytics, ANALYSTS, Hive, Oozie, MapReduce]
CDH
• Coordination: Apache ZooKeeper
• Data Integration: Apache Flume*, Apache Sqoop*
• Fast Read/Write Access: Apache HBase
• Languages / Compilers: Apache Pig, Apache Hive
• Workflow: Apache Oozie*
• Scheduling: Apache Oozie*
• Metadata: Apache Hive
• File System Mount: FUSE-DFS
• UI Framework: HUE
• SDK: HUE SDK
*currently under incubation in the Apache Software Foundation
What’s next?
• Cloudera Training Videos
• CDH Virtual Machines
• Hadoop: The Definitive Guide, 2nd Edition
• Cloudera University
– Developer Training in Columbia, MD: Dec 13-16, Feb 13-16
– Administrator Training in Herndon, VA: Jan 4-6
– Private Training
We’re Hiring!
• http://www.cloudera.com/company/careers/
• Customer Operations
– Customer Operations Engineer
– Customer Operations Tools Developer
• Customer Solutions
– Solutions Architect
• Engineering
– Senior Data Integration Developer
– Senior Distributed Systems Engineer
– Senior UI Engineer
– Software Quality Engineer
– Technical Writer
• IT/Operations
– Systems Administrator