Big Data for Relational Practitioners
Len Wyatt
Program Manager
Microsoft Corporation
DBI225
Agenda
Why all the fuss about Big Data?
What Hadoop is – structural and ecosystem overview
How Hadoop compares with RDBMS environments
  Architectural comparison
  Tooling comparison
NoSQL – what it means in practice
What Microsoft is doing
  No product announcements today!
Building the skills to work in this environment
Why all the fuss about Big Data?
Explosion of data – the data volume issue
New data types – the variety opportunity
In theory, Big Data > Hadoop
In practice, Big Data ≅ Hadoop
Hadoop is a rapidly evolving ecosystem
  Many components
  Many vendors
  High rate of change via open-source model
What Hadoop is – a structural overview
Hadoop core = HDFS + MapReduce
Hadoop Distributed File System
[Diagram: an HDFS client reads a giant file. The NameNode returns the locations of the blocks of the file; the DataNodes (five shown) return the blocks of the file.]
Distributed for parallelism
Replicated for reliability
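The read path in the diagram can be imitated with a toy in-memory model. This is only a sketch, not the HDFS client API: the names `HdfsReadSketch`, `ToyNameNode`, and `ToyDataNode`, the block size of 8 characters, the replication factor of 2, and the round-robin placement are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of an HDFS read. Every name here is hypothetical; this is not the Hadoop API.
class HdfsReadSketch {

    // A DataNode holds block contents keyed by block id.
    static class ToyDataNode {
        private final Map<Integer, String> blocks = new HashMap<>();
        void store(int blockId, String data) { blocks.put(blockId, data); }
        boolean has(int blockId) { return blocks.containsKey(blockId); }
        String read(int blockId) { return blocks.get(blockId); }
    }

    // The NameNode holds only metadata: for each block, which DataNodes have a replica.
    static class ToyNameNode {
        private final List<List<Integer>> blockLocations = new ArrayList<>();
        void addBlock(List<Integer> replicas) { blockLocations.add(replicas); }
        List<List<Integer>> locate() { return blockLocations; }
    }

    // Split the file into fixed-size blocks and copy each block to `replication` DataNodes.
    static ToyNameNode write(String file, int blockSize, int replication, List<ToyDataNode> nodes) {
        ToyNameNode nameNode = new ToyNameNode();
        int blockId = 0;
        for (int start = 0; start < file.length(); start += blockSize, blockId++) {
            String block = file.substring(start, Math.min(file.length(), start + blockSize));
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                int node = (blockId + r) % nodes.size(); // assumed round-robin placement
                nodes.get(node).store(blockId, block);
                replicas.add(node);
            }
            nameNode.addBlock(replicas);
        }
        return nameNode;
    }

    // Client read: ask the NameNode for block locations, then fetch each block
    // from the first replica that is not on the failed node (-1 means no failure).
    static String read(ToyNameNode nameNode, List<ToyDataNode> nodes, int failedNode) {
        StringBuilder out = new StringBuilder();
        List<List<Integer>> locations = nameNode.locate();
        for (int blockId = 0; blockId < locations.size(); blockId++) {
            String data = null;
            for (int node : locations.get(blockId)) {
                if (node != failedNode && nodes.get(node).has(blockId)) {
                    data = nodes.get(node).read(blockId);
                    break;
                }
            }
            if (data == null) {
                throw new IllegalStateException("no live replica for block " + blockId);
            }
            out.append(data);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<ToyDataNode> nodes = new ArrayList<>();
        for (int i = 0; i < 5; i++) nodes.add(new ToyDataNode());
        String file = "110010101001010100101010011001";
        ToyNameNode nameNode = write(file, 8, 2, nodes);
        // Lose DataNode 1: the second replica of each block still lets the client rebuild the file.
        System.out.println(read(nameNode, nodes, 1).equals(file)); // prints true
    }
}
```

The point of the sketch is the division of labor: the NameNode never serves file data, and replication is what lets a read survive the loss of a DataNode.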
What Hadoop is – a structural overview
Hadoop core = HDFS + MapReduce
MapReduce is a programming paradigm
  Divide big problem into small ones
  Run the small tasks
  Combine the results
Programmer writes Map() and Reduce() functions
MR framework distributes execution on the cluster
Output
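The divide, run, and combine steps can be imitated on a single machine with ordinary collections. The sketch below is not Hadoop code: `MiniMapReduce`, `run`, and `salesByRegion` are hypothetical names, the "region,amount" record format is made up, and it assumes Java 9 or later for `List.of` and `Map.entry`. A real cluster would run the map and reduce calls on many machines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// Single-process imitation of the MapReduce flow. Hypothetical names, not Hadoop classes.
class MiniMapReduce {

    static <I, K extends Comparable<K>, V, R> Map<K, R> run(
            List<I> inputs,
            Function<I, List<Map.Entry<K, V>>> mapFn,
            BiFunction<K, List<V>, R> reduceFn) {
        // Map phase: each input record is handled independently (the "small tasks").
        // Shuffle: emitted values are grouped by key; the TreeMap keeps keys sorted.
        Map<K, List<V>> grouped = new TreeMap<>();
        for (I input : inputs) {
            for (Map.Entry<K, V> pair : mapFn.apply(input)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        // Reduce phase: one call per key combines that key's values into one result.
        Map<K, R> output = new TreeMap<>();
        for (Map.Entry<K, List<V>> e : grouped.entrySet()) {
            output.put(e.getKey(), reduceFn.apply(e.getKey(), e.getValue()));
        }
        return output;
    }

    // Example job: total sales per region from "region,amount" records.
    static Map<String, Integer> salesByRegion(List<String> records) {
        Function<String, List<Map.Entry<String, Integer>>> mapFn = line -> {
            String[] parts = line.split(",");
            return List.of(Map.entry(parts[0], Integer.parseInt(parts[1])));
        };
        BiFunction<String, List<Integer>, Integer> reduceFn =
            (region, amounts) -> amounts.stream().mapToInt(Integer::intValue).sum();
        return run(records, mapFn, reduceFn);
    }

    public static void main(String[] args) {
        // prints {east=17, west=5}
        System.out.println(salesByRegion(List.of("east,10", "west,5", "east,7")));
    }
}
```

The programmer supplies only `mapFn` and `reduceFn`; everything inside `run` stands in for what the framework does.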
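A simple MR example

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: emit (word, 1) for every token in the input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // The job driver (main method) is not shown on the slide.
}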
Source: http://wiki.apache.org/hadoop/WordCount
What Hadoop is – an ecosystem overview
As IT folks, we don’t deal with MR jobs directly
We use the ecosystem built on that foundation
The functions should seem familiar:
  Data movement
  Data integration
  Query interface
  Database services
  Workflow management, metadata, monitoring…
What Hadoop is – an ecosystem overview
Key tools
Things we’ll introduce today:
  HDFS      The underlying file system
  Hive      Metadata and query layer for SQL-like queries
  Pig       Data manipulation: set-based and scripted
More parts of the ecosystem
  Sqoop     Data transfer to/from relational DBs
  HCatalog  Metadata services for Hive and Pig
  HBase     NoSQL database
  Oozie     Workflow management
  Flume     Data ingestion / manipulation
… the list goes on … this is a fast-changing area!
demo
A Quick-and-Dirty Data Warehouse in Hadoop
Some ways to think about Architecture
        OLTP                DW
ACID    SQL Server          SQL Server
BASE    HBase, Cassandra    Hive
Hype: No schema needed!
More accurately:
  Schema-on-write (SQL)
  Schema-on-read (Hive, Pig, …)

ACID database
Highly evolved tools
Enterprise grade hardware
Efficient execution

Read-only database (Hive)
Massively scalable storage
Rapidly evolving ecosystem
Schema flexibility
Schema-on-read in HiveQL
Define schema with Hive DDL (state the structure, map to file)

create external table CUSTOMER (
    C_CUSTKEY int,
    C_MKTSEGMENT string,
    C_NATIONKEY int,
    C_NAME string,
    C_ADDRESS string,
    C_PHONE string,
    C_ACCTBAL float,
    C_COMMENT string)
row format delimited fields terminated by '|'
stored as textfile
location 'asv://customer/';

Run queries using Hive DML

select l_orderkey, o_orderdate,
       sum(l_extendedprice*(1-l_discount)) as revenue
from customer
join orders on (customer.c_custkey = orders.o_custkey)
join lineitem on (lineitem.l_orderkey = orders.o_orderkey)
where customer.c_mktsegment = 'BUILDING'
  and orders.o_orderdate < '1995-03-05'
  and lineitem.l_shipdate > '1995-03-05'
group by l_orderkey, o_orderdate
order by revenue desc, o_orderdate
LIMIT 10;
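Schema-on-read means the delimited file is stored untouched and the column names and types are applied only when a reader scans it. The following is a plain-Java sketch of that idea, not Hive's implementation: `SchemaOnReadSketch`, `readRow`, `convert`, and the two sample rows are hypothetical. Turning a missing or unparseable field into null rather than rejecting the row is an assumption of the sketch, similar in spirit to the NULLs Hive returns for malformed fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Schema-on-read sketch: raw '|'-delimited lines stay as text; the schema is applied per read.
// All names and sample data are hypothetical.
class SchemaOnReadSketch {

    // Column name and type, mirroring the CUSTOMER DDL on the slide.
    static final String[][] CUSTOMER_SCHEMA = {
        {"C_CUSTKEY", "int"}, {"C_MKTSEGMENT", "string"}, {"C_NATIONKEY", "int"},
        {"C_NAME", "string"}, {"C_ADDRESS", "string"}, {"C_PHONE", "string"},
        {"C_ACCTBAL", "float"}, {"C_COMMENT", "string"}
    };

    // Apply the schema to one raw line. Missing trailing fields become null.
    static Map<String, Object> readRow(String rawLine) {
        String[] fields = rawLine.split("\\|", -1);
        Map<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < CUSTOMER_SCHEMA.length; i++) {
            String name = CUSTOMER_SCHEMA[i][0];
            String type = CUSTOMER_SCHEMA[i][1];
            row.put(name, i < fields.length ? convert(fields[i], type) : null);
        }
        return row;
    }

    // Convert text to the declared type; a value that does not parse becomes null.
    static Object convert(String text, String type) {
        try {
            switch (type) {
                case "int":   return Integer.valueOf(text.trim());
                case "float": return Float.valueOf(text.trim());
                default:      return text;
            }
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        List<String> rawFile = new ArrayList<>();
        rawFile.add("1|BUILDING|15|Customer#1|1 Main St|25-989-741-2988|711.5|regular customer");
        // Second row has a bad C_NATIONKEY and no C_COMMENT; it still loads.
        rawFile.add("2|AUTOMOBILE|n/a|Customer#2|2 Side St|23-768-687-3665|121.5");
        for (String line : rawFile) {
            System.out.println(readRow(line));
        }
    }
}
```

With schema-on-write, the second row would typically be rejected at load time; here the problem only surfaces, as nulls, when the data is read.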
Schema-on-read in Pig Latin

orders = load '/wh/orders/orders.tbl'
    using PigStorage ('|') as (
        ORDERDATE:chararray, ORDERKEY:long, CUSTKEY:int,
        ORDERSTATUS:chararray, TOTALPRICE:double, COMMENT:chararray);
custs = load '/wh/customer/customer.tbl'
    using PigStorage ('|') as (
        CUSTKEY:int, MKTSEGMENT:chararray, NATIONKEY:int,
        NAME:chararray, ADDRESS:chararray, PHONE:chararray);
nations = load '/wh/nation/nation.tbl'
    using PigStorage ('|') as (
        id:int, nation:chararray, region:int);

custnat = join custs by NATIONKEY, nations by id;
ordernat = join custnat by CUSTKEY, orders by CUSTKEY;
ordersbynat = group ordernat by NATIONKEY;
sums = foreach ordersbynat generate group,
    COUNT(ordernat.TOTALPRICE), SUM(ordernat.TOTALPRICE);
dump sums;
Logic here – the rest is schema
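For readers who think in joins and aggregates, the logic part of the Pig script (join, group, COUNT, SUM) can be restated over in-memory collections. This is a sketch under stated assumptions, not how Pig executes: `PigLogicSketch`, `Customer`, `Order`, and the sample rows are hypothetical, and the join to `nations` is left out, so customers whose nation key has no match are not filtered as they would be in the script.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory restatement of the Pig script's logic. Hypothetical names and data.
class PigLogicSketch {

    static class Customer {
        final int custKey;
        final int nationKey;
        Customer(int custKey, int nationKey) { this.custKey = custKey; this.nationKey = nationKey; }
    }

    static class Order {
        final int custKey;
        final double totalPrice;
        Order(int custKey, double totalPrice) { this.custKey = custKey; this.totalPrice = totalPrice; }
    }

    // Returns nationKey -> {order count, sum of TOTALPRICE}.
    static Map<Integer, double[]> totalsByNation(List<Customer> custs, List<Order> orders) {
        // "join ... by CUSTKEY": index customers by key, then probe with each order.
        Map<Integer, Customer> byCustKey = new HashMap<>();
        for (Customer c : custs) {
            byCustKey.put(c.custKey, c);
        }
        // "group ... by NATIONKEY" with COUNT and SUM, accumulated as orders stream past.
        Map<Integer, double[]> sums = new TreeMap<>();
        for (Order o : orders) {
            Customer c = byCustKey.get(o.custKey);
            if (c == null) {
                continue; // inner join: orders without a matching customer are dropped
            }
            double[] agg = sums.computeIfAbsent(c.nationKey, k -> new double[2]);
            agg[0] += 1;
            agg[1] += o.totalPrice;
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Customer> custs = List.of(new Customer(1, 15), new Customer(2, 15), new Customer(3, 7));
        List<Order> orders = List.of(
            new Order(1, 100.0), new Order(2, 50.0), new Order(3, 25.0), new Order(9, 1.0));
        for (Map.Entry<Integer, double[]> e : totalsByNation(custs, orders).entrySet()) {
            System.out.printf("nation %d: count=%d sum=%.2f%n",
                e.getKey(), (long) e.getValue()[0], e.getValue()[1]);
        }
    }
}
```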
Remember: Hive and Pig run M/R jobs

hive> select devicemake, devicemodel, sum(querydwelltime) as a
    > from hivesampletable
    > group by devicemake, devicemodel
    > order by a;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Starting Job = job_201206011857_0003, Tracking URL = http://10.114.202.178:50030/jobdetails.jsp?jobid=job_201206011857_0003
Kill Command = c:\Apps\dist\bin\hadoop.cmd job -Dmapred.job.tracker=10.114.202.178:9010 -kill job_201206011857_0003
2012-06-02 22:29:21,382 Stage-1 map = 0%, reduce = 0%
2012-06-02 22:29:33,601 Stage-1 map = 50%, reduce = 0%
2012-06-02 22:29:37,617 Stage-1 map = 100%, reduce = 0%
2012-06-02 22:29:48,648 Stage-1 map = 100%, reduce = 33%
2012-06-02 22:29:51,664 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201206011857_0003
Launching Job 2 out of 2
Starting Job = job_201206011857_0004, Tracking URL = http://10.114.202.178:50030/jobdetails.jsp?jobid=job_201206011857_0004
Kill Command = c:\Apps\dist\bin\hadoop.cmd job -Dmapred.job.tracker=10.114.202.178:9010 -kill job_201206011857_0004
2012-06-02 22:30:18,195 Stage-2 map = 0%, reduce = 0%
2012-06-02 22:30:30,210 Stage-2 map = 100%, reduce = 0%
2012-06-02 22:30:45,241 Stage-2 map = 100%, reduce = 33%
2012-06-02 22:30:48,257 Stage-2 map = 100%, reduce = 100%
Ended Job = job_201206011857_0004
OK
Samsung SGH-i987 0.4610394
LG LG-C900 6.315
HTC 7 Mozart 10.442
SAMSUNG SGH-i917R 15.5504033
HTC PD67100 15.590325499999999
Apple iPhone 3.1 18.7357592
Enterprise data flows (as seen by the DW team today)
[Diagram labels: OLTP DB, HR DB, Customer Mgmt. DB, External sources, Staging area, Data Warehouse, Datamart (two), OLAP cube (two), Reports, Interactive tools, Dashboards. The stages are connected by ETL steps; one element is marked "(Optional)".]
Enterprise data flows (Near-term Hadoop integration)
[Diagram labels:]
  External sources
  OLTP in RDBMS
  OLTP in HBase
  Sqoop data interchange with relational sources
  Flume for file acquisition
  Persistent storage in HDFS
  Pig transforms data in HDFS
  Oozie manages workflows
  Hive presents data as tables
  DW in Hive
  Sqoop data interchange with relational targets
  Presentation DB
  OLAP cube
  Reports
  Dashboards
  Interactive tools
What Microsoft is doing
No announcements in this session!
Some general ideas to consider…
Partnership with Hortonworks to bring Hadoop to Windows
Hadoop on Azure brings Hadoop ecosystem to the cloud
Hadoop on Windows for on-premises deployment
  Virtualized or bare-metal
Simple set-up and management
Connectivity to relational world – SQL Server, Analysis Services, Reporting Services, Excel, …
You benefit from the best of both worlds
demo
The Best of Both Worlds
Building your skills
“Big Data will earn its place as the next ‘must have’ competency in 2012.”
IDC, “Predictions for 2012: Competing for 2020”, December 2011
Operations
  Cluster Mgmt
    Deployment
    Monitoring
    Repair
    Upgrade
  Data stewardship
  Security
  Workflow (Oozie)
Development
  Basic tools
    HDFS, Hive, Pig
  Java
  AS, RS
  Machine learning
    Mahout
  Workflow
    Flume, Oozie
Analytics
  Basic tools
    HDFS, Hive, Pig
  Statistics
    R
  Presentation
    Excel, AS, RS
  Machine learning
    Mahout
Resources
Free papers, videos, webinars
  All over the web – it’s open source!
  Overview of Hadoop architecture: Dr. DeWitt talk
    http://pages.cs.wisc.edu/~dewitt/includes/passtalks/passtalks.html
  Note: Hive Language Manual, Pig Latin Reference Manual 2
Hadoop books
  Hadoop: The Definitive Guide by Tom White
  Programming Pig by Alan Gates
  Programming Hive (releases Sept. 21, 2012)
Commercial training
  Hortonworks University
Third-party analytical tools
  Karmasphere
  Datameer
It’s big!
It’s an opportunity!
It’s still evolving!
It’s got potential!
It’s confusing!
It’s exciting!
© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.