Big Data for Relational Practitioners
Len Wyatt, Program Manager, Microsoft Corporation
DBI225

Agenda

Why all the fuss about Big Data?
What Hadoop is – structural and ecosystem overview
How Hadoop compares with RDBMS environments
  Architectural comparison
  Tooling comparison
NoSQL – what it means in practice
What Microsoft is doing
  No product announcements today!
Building the skills to work in this environment

Why all the fuss about Big Data?

Explosion of data – the data volume issue
New data types – the variety opportunity

In theory, Big Data > Hadoop
In practice, Big Data ≈ Hadoop

Hadoop is a rapidly evolving ecosystem
  Many components
  Many vendors
  High rate of change via open-source model

What Hadoop is – a structural overview
Hadoop core = HDFS + MapReduce

[Diagram: an HDFS client reads a giant file. The NameNode returns the locations of the blocks of the file; five DataNodes return the blocks of the file.]

Hadoop Distributed File System
  Distributed for parallelism
  Replicated for reliability
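To make "distributed for parallelism, replicated for reliability" concrete, here is a minimal sketch of how a file might be cut into blocks and each block assigned to several DataNodes. It is a toy model, not the HDFS API: the class and method names, the node names, the 64 MB block size, and the round-robin placement are all assumptions chosen for illustration (real HDFS uses a rack-aware placement policy).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of HDFS block placement. Not the real HDFS API:
// every name and the round-robin policy are assumptions for illustration.
class HdfsPlacementSketch {

    // Number of blocks needed to hold a file of fileSize bytes (rounded up).
    static int blockCount(long fileSize, long blockSize) {
        return (int) ((fileSize + blockSize - 1) / blockSize);
    }

    // Plays the NameNode's role: maps each block index to the DataNodes
    // that hold a replica. Assumes replication <= dataNodes.size().
    static Map<Integer, List<String>> placeBlocks(int blocks, List<String> dataNodes, int replication) {
        Map<Integer, List<String>> locations = new LinkedHashMap<>();
        for (int b = 0; b < blocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                // Simple round-robin; real HDFS placement is rack-aware.
                replicas.add(dataNodes.get((b + r) % dataNodes.size()));
            }
            locations.put(b, replicas);
        }
        return locations;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("dn1", "dn2", "dn3", "dn4", "dn5");
        // A 300 MB file with an illustrative 64 MB block size needs 5 blocks.
        int blocks = blockCount(300L * 1024 * 1024, 64L * 1024 * 1024);
        placeBlocks(blocks, nodes, 3).forEach((block, replicas) ->
                System.out.println("block " + block + " -> " + replicas));
    }
}
```

Because every block lives on three nodes in this sketch, losing any single node still leaves two copies of each block, and different blocks can be read from different nodes at the same time.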

What Hadoop is – a structural overview
Hadoop core = HDFS + MapReduce

MapReduce is a programming paradigm
  Divide big problem into small ones
  Run the small tasks
  Combine the results

Programmer writes Map() and Reduce() functions
MR framework distributes execution on the cluster
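The divide / run / combine steps can be imitated in a few lines of plain Java. The sketch below uses no Hadoop classes; the class name and the run() helper are hypothetical, and everything happens in one process, whereas a real framework would spread the map and reduce calls across the cluster. It is only meant to show which part the programmer writes (map and reduce) and which part the framework supplies (grouping by key and scheduling).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of the MapReduce phases; no Hadoop classes are used.
// All names here are hypothetical.
class MapReduceSketch {

    // Map(): turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce(): combine all values seen for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // What the framework does around the two functions:
    // run the map tasks, group pairs by key (shuffle), run the reducers.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines) { // each line stands in for one small task
            for (Map.Entry<String, Integer> pair : map(line)) {
                shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffled.entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        // prints {big=2, cluster=1, data=2, node=1}
        System.out.println(run(Arrays.asList("big data big cluster", "data node")));
    }
}
```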

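A simple MR example

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Enclosing class added so the nested classes compile; the job driver (main) is not shown on the slide.
public class WordCount {

    // Emits (word, 1) for every token in the input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sums the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}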

Source: http://wiki.apache.org/hadoop/WordCount

What Hadoop is – an ecosystem overview

As IT folks, we don’t deal with MR jobs directly
We use the ecosystem built on that foundation
The functions should seem familiar:
  Data movement
  Data integration
  Query interface
  Database services
  Workflow management, metadata, monitoring…

What Hadoop is – an ecosystem overview

Key tools
Things we’ll introduce today:
  HDFS      The underlying file system
  Hive      Metadata and query layer for SQL-like queries
  Pig       Data manipulation: set-based and scripted

More parts of the ecosystem
  Sqoop     Data transfer to/from relational DBs
  HCatalog  Metadata services for Hive and Pig
  HBase     NoSQL database
  Oozie     Workflow management
  Flume     Data ingestion / manipulation
  … the list goes on … this is a fast-changing area!

Demo: A Quick-and-Dirty Data Warehouse in Hadoop

Some ways to think about architecture

          OLTP                DW
  ACID    SQL Server          SQL Server
  BASE    HBase, Cassandra    Hive

Hype: No schema needed!
More accurately:
  Schema-on-write (SQL)
  Schema-on-read (Hive, Pig, …)

ACID database
Highly evolved tools
Enterprise grade hardware
Efficient execution

Read-only database (Hive)
Massively scalable storage
Rapidly evolving ecosystem
Schema flexibility
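The schema-on-write versus schema-on-read distinction can be shown in miniature. In the sketch below the "file" is just a pipe-delimited string, and the schema is a list of name:type pairs supplied by the reader at query time. The class, the method names, and the name:type notation are hypothetical; turning unparseable or missing values into null is modeled loosely on how Hive returns NULL for fields that do not fit the declared type.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Schema-on-read in miniature: the data is raw delimited text, and a schema
// is applied only when a row is read. All names are hypothetical.
class SchemaOnReadSketch {

    // Apply a list of "name:type" columns to one pipe-delimited line.
    static Map<String, Object> read(String line, List<String> schema) {
        String[] fields = line.split("\\|", -1);
        Map<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < schema.size(); i++) {
            String[] column = schema.get(i).split(":");
            String raw = i < fields.length ? fields[i] : null;
            row.put(column[0], convert(raw, column[1]));
        }
        return row;
    }

    // A value that is missing or does not fit the declared type becomes null
    // instead of causing the row to be rejected.
    static Object convert(String raw, String type) {
        if (raw == null) {
            return null;
        }
        try {
            switch (type) {
                case "int":
                    return Integer.valueOf(raw.trim());
                case "double":
                    return Double.valueOf(raw.trim());
                default:
                    return raw;
            }
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String line = "42|BUILDING|7|Customer#42";
        // Two different schemas over the same bytes; nothing stored changes.
        System.out.println(read(line, Arrays.asList("custkey:int", "segment:string")));
        System.out.println(read(line, Arrays.asList(
                "custkey:string", "segment:string", "nationkey:int", "name:string")));
    }
}
```

A schema-on-write system would run the equivalent of convert() at load time and refuse the row on failure; here the stored text is never validated, which is where the flexibility (and the risk of silent nulls) comes from.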

Schema-on-read in HiveQL

Define schema with Hive DDL (state the structure, map to file)

create external table CUSTOMER (
    C_CUSTKEY int,
    C_MKTSEGMENT string,
    C_NATIONKEY int,
    C_NAME string,
    C_ADDRESS string,
    C_PHONE string,
    C_ACCTBAL float,
    C_COMMENT string)
row format delimited fields terminated by '|'
stored as textfile
location 'asv://customer/';

Run queries using Hive DML

select l_orderkey, o_orderdate, sum(l_extendedprice*(1-l_discount)) as revenue
from customer
  join orders on (customer.c_custkey = orders.o_custkey)
  join lineitem on (lineitem.l_orderkey = orders.o_orderkey)
where customer.c_mktsegment = 'BUILDING'
  and orders.o_orderdate < '1995-03-05'
  and lineitem.l_shipdate > '1995-03-05'
group by l_orderkey, o_orderdate
order by revenue desc, o_orderdate
LIMIT 10;

Schema-on-read in Pig Latin

orders = load '/wh/orders/orders.tbl'
    using PigStorage ('|') as (
        ORDERDATE:chararray, ORDERKEY:long, CUSTKEY:int,
        ORDERSTATUS:chararray, TOTALPRICE:double, COMMENT:chararray);

custs = load '/wh/customer/customer.tbl'
    using PigStorage ('|') as (
        CUSTKEY:int, MKTSEGMENT:chararray, NATIONKEY:int,
        NAME:chararray, ADDRESS:chararray, PHONE:chararray);

nations = load '/wh/nation/nation.tbl'
    using PigStorage ('|') as (
        id:int, nation:chararray, region:int);

custnat = join custs by NATIONKEY, nations by id;
ordernat = join custnat by CUSTKEY, orders by CUSTKEY;
ordersbynat = group ordernat by NATIONKEY;
sums = foreach ordersbynat generate group,
    COUNT(ordernat.TOTALPRICE), SUM(ordernat.TOTALPRICE);

dump sums;

(Callout on the join, group, and foreach statements) Logic here – the rest is schema

Remember: Hive and Pig run M/R jobs

hive> select devicemake, devicemodel, sum(querydwelltime) as a
    > from hivesampletable
    > group by devicemake, devicemodel
    > order by a;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Starting Job = job_201206011857_0003, Tracking URL = http://10.114.202.178:50030/jobdetails.jsp?jobid=job_201206011857_0003
Kill Command = c:\Apps\dist\bin\hadoop.cmd job -Dmapred.job.tracker=10.114.202.178:9010 -kill job_201206011857_0003
2012-06-02 22:29:21,382 Stage-1 map = 0%, reduce = 0%
2012-06-02 22:29:33,601 Stage-1 map = 50%, reduce = 0%
2012-06-02 22:29:37,617 Stage-1 map = 100%, reduce = 0%
2012-06-02 22:29:48,648 Stage-1 map = 100%, reduce = 33%
2012-06-02 22:29:51,664 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201206011857_0003
Launching Job 2 out of 2
Starting Job = job_201206011857_0004, Tracking URL = http://10.114.202.178:50030/jobdetails.jsp?jobid=job_201206011857_0004
Kill Command = c:\Apps\dist\bin\hadoop.cmd job -Dmapred.job.tracker=10.114.202.178:9010 -kill job_201206011857_0004
2012-06-02 22:30:18,195 Stage-2 map = 0%, reduce = 0%
2012-06-02 22:30:30,210 Stage-2 map = 100%, reduce = 0%
2012-06-02 22:30:45,241 Stage-2 map = 100%, reduce = 33%
2012-06-02 22:30:48,257 Stage-2 map = 100%, reduce = 100%
Ended Job = job_201206011857_0004
OK
Samsung SGH-i987 0.4610394
LG LG-C900 6.315
HTC 7 Mozart 10.442
SAMSUNG SGH-i917R 15.5504033
HTC PD67100 15.590325499999999
Apple iPhone 3.1 18.7357592
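The log reports two MapReduce jobs for a single query. One plausible reading, sketched below in plain Java, is that the first job handles the group by / sum and the second handles the order by over the aggregated rows. This is an illustration of the idea only, not Hive's actual execution plan or code; the class and method names are hypothetical and the input rows are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Two chained stages standing in for the two MapReduce jobs in the log.
// Plain Java with hypothetical names; not how Hive is implemented internally.
class TwoStageQuerySketch {

    // Stage 1 (group by + sum): key = "make model", value = summed dwell time.
    // Each row is {devicemake, devicemodel, querydwelltime}.
    static Map<String, Double> stage1Aggregate(List<String[]> rows) {
        Map<String, Double> sums = new LinkedHashMap<>();
        for (String[] row : rows) {
            String key = row[0] + " " + row[1];
            sums.merge(key, Double.parseDouble(row[2]), Double::sum);
        }
        return sums;
    }

    // Stage 2 (order by): sort the aggregated rows by the sum, ascending.
    static List<Map.Entry<String, Double>> stage2Sort(Map<String, Double> sums) {
        List<Map.Entry<String, Double>> sorted = new ArrayList<>(sums.entrySet());
        sorted.sort(Map.Entry.comparingByValue());
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up sample rows, not data from the session.
        List<String[]> rows = Arrays.asList(
                new String[] {"HTC", "ModelA", "1.5"},
                new String[] {"LG", "ModelB", "0.5"},
                new String[] {"HTC", "ModelA", "2.0"});
        for (Map.Entry<String, Double> e : stage2Sort(stage1Aggregate(rows))) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

The second stage cannot start until the first has produced every group's total, which matches the strictly sequential Job 1 / Job 2 timestamps in the log and helps explain why even a small query takes over a minute.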

Enterprise data flows (as seen by the DW team today)

[Diagram. Sources: OLTP DB, HR DB, Customer Mgmt. DB, External sources. Intermediate stores: Staging area, Data Warehouse, two Datamarts, two OLAP cubes. Outputs: Reports, Interactive tools, Dashboards. Four ETL steps connect the stages; one element is marked (Optional).]

Enterprise data flows (Near-term Hadoop integration)

[Diagram. Elements: External sources, OLTP in RDBMS, OLTP in HBase, DW in Hive, Presentation DB, OLAP cube, Reports, Dashboards, Interactive tools. Annotations:]
  Persistent storage in HDFS
  Flume for file acquisition
  Sqoop data interchange with relational sources
  Sqoop data interchange with relational targets
  Hive presents data as tables
  Pig transforms data in HDFS
  Oozie manages workflows

What Microsoft is doing

No announcements in this session!
Some general ideas to consider…

Partnership with Hortonworks to bring Hadoop to Windows
  Hadoop on Azure brings Hadoop ecosystem to the cloud
  Hadoop on Windows for on-premises deployment
    Virtualized or bare-metal
Simple set-up and management
Connectivity to relational world – SQL Server, Analysis Services, Reporting Services, Excel, …
You benefit from the best of both worlds

Demo: The Best of Both Worlds

Building your skills

Big Data will earn its place as the next “must have” competency in 2012.
IDC, “Predictions for 2012: Competing for 2020”, December 2011

Operations
  Cluster Mgmt: Deployment, Monitoring, Repair, Upgrade
  Data stewardship
  Security
  Workflow (Oozie)

Development
  Basic tools: HDFS, Hive, Pig
  Java
  AS, RS
  Machine learning: Mahout
  Workflow: Flume, Oozie

Analytics
  Basic tools: HDFS, Hive, Pig
  Statistics: R
  Presentation: Excel, AS, RS
  Machine learning: Mahout

Resources

Free papers, videos, webinars
  All over the web – it’s open source!
  Overview of Hadoop architecture: Dr. DeWitt talk
    http://pages.cs.wisc.edu/~dewitt/includes/passtalks/passtalks.html
  Note: Hive Language Manual, Pig Latin Reference Manual 2

Hadoop books
  Hadoop: The Definitive Guide by Tom White
  Programming Pig by Alan Gates
  Programming Hive (releases Sept. 21, 2012)

Commercial training
  Hortonworks University

Third-party analytical tools
  Karmasphere
  Datameer

It’s big!

It’s an opportunity!

It’s still evolving!

It’s got potential!

It’s confusing!

It’s exciting!

© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
