Introduction to HDInsight Stéphane Fréchette Saturday February 7, 2015

Introduction to Azure HDInsight


Page 1: Introduction to Azure HDInsight

Introduction to HDInsight

Stéphane Fréchette

Saturday February 7, 2015

Page 2: Introduction to Azure HDInsight

Who am I?

My name is Stéphane Fréchette

SQL Server MVP | Consultant | Speaker | Data & BI Architect | Big Data | NoSQL | Data Science. Drums, good food and fine wine. Founder @TEDxGatineau

I have a passion for architecting, designing and building solutions that matter.

Twitter: @sfrechette

Blog: stephanefrechette.com

Email: [email protected]

Page 3: Introduction to Azure HDInsight

Topics

• What is Big Data?

• Apache Hadoop

• Hadoop Ecosystem

• Microsoft Azure HDInsight

• Demos

• Summary

• Resources

• Q&A

Page 4: Introduction to Azure HDInsight

“Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time…”

- Wikipedia

Page 5: Introduction to Azure HDInsight

What is Big Data?

Many Options

Variability

Page 6: Introduction to Azure HDInsight

[Figure: data sources plotted by Volume (vertical axis) against Velocity - Variety (horizontal axis)]

• ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking

• WEB 2.0: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations

• Internet of things: Audio / Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates

• Volume scale: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)

• Storage/GB: 1980 $190,000; 1990 $9,000; 2000 $15; 2010 $0.07

What is Big Data?
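To make the Storage/GB figures concrete, here is a minimal sketch that multiplies them out for one terabyte. The function name `cost_to_store` and the choice of a decimal terabyte (1,000 GB) are assumptions made for illustration; the per-gigabyte prices are the ones listed on the slide.

```python
# Cost per gigabyte by year, as listed on the slide (US dollars).
COST_PER_GB = {1980: 190_000, 1990: 9_000, 2000: 15, 2010: 0.07}


def cost_to_store(gigabytes, year):
    """Return the approximate cost in dollars of storing `gigabytes` in `year`."""
    return gigabytes * COST_PER_GB[year]


if __name__ == "__main__":
    one_terabyte = 1_000  # gigabytes, assuming decimal units (10^12 bytes)
    for year in sorted(COST_PER_GB):
        print(f"{year}: ${cost_to_store(one_terabyte, year):,.2f}")
```

At these prices a terabyte goes from roughly $190 million in 1980 to about $70 in 2010, which is a large part of why keeping raw, unfiltered data became practical.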

Page 7: Introduction to Azure HDInsight

Common Scenarios

What is Big Data?

Page 8: Introduction to Azure HDInsight

Hadoop

• Apache Hadoop is for big data

• Open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models

• Designed to scale up from single servers to thousands of machines, each offering local computation and storage

Page 9: Introduction to Azure HDInsight

[Table: Traditional RDBMS vs. Hadoop, compared by Data Size, Access, Updates, Structure, Integrity, Scaling and DBA Ratio]

Hadoop

Page 10: Introduction to Azure HDInsight

HDFS

• Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of commodity servers.

HDFS ≠ Database
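One way to picture how HDFS spans a cluster is to plan where the pieces of a single file would go. The sketch below is an illustration only: `plan_blocks` and the node names are hypothetical, the 128 MB block size and replication factor of 3 are the usual Hadoop 2.x defaults, and the round-robin placement is a simplification (real HDFS uses a rack-aware policy).

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the usual default block size in Hadoop 2.x
REPLICATION = 3                 # the usual default HDFS replication factor


def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, [node, ...]) placements for one file.

    Placement is simple round-robin; this is an assumption for illustration,
    not the rack-aware policy that HDFS actually applies.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size / block_size)
    plan = []
    for block in range(n_blocks):
        # Pick `replication` distinct nodes, starting at a rotating offset.
        replicas = [nodes[(block + r) % len(nodes)] for r in range(replication)]
        plan.append((block, replicas))
    return plan


if __name__ == "__main__":
    for block, replicas in plan_blocks(300 * 1024 * 1024, ["n1", "n2", "n3", "n4"]):
        print(block, replicas)
```

A 300 MB file becomes three blocks, each stored on three different nodes, so losing one server does not lose data. Files are stored as opaque blocks with no rows, indexes or transactions, which is why HDFS is not a database.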

Page 11: Introduction to Azure HDInsight

MapReduce

• MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

Processing functions:

- Mapping

- Reducing
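The two functions can be sketched with the classic word count, run here in a single Python process rather than on a cluster. This is a toy model under stated assumptions: the names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for this example, and the grouping step stands in for the shuffle that the Hadoop framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop data"]))
```

On a real cluster each map call runs on the node that holds the input block and each reduce call receives all values for its keys, so neither function needs to know how many machines are involved.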

Page 12: Introduction to Azure HDInsight

How does it work?

Page 13: Introduction to Azure HDInsight

[Figure: a runtime coordinating multiple servers]

How does it work?

Page 14: Introduction to Azure HDInsight

• Distributed Storage (HDFS)

• Distributed Processing (MapReduce)

• Query (Hive)

• Scripting (Pig)

• NoSQL Database (HBase)

• Metadata (HCatalog)

• Data Integration (ODBC / SQOOP / REST)

• Relational (SQL Server)

• Machine Learning (Mahout)

• Graph (Pegasus)

• Stats processing (RHadoop)

• Event Pipeline (Flume)

• Pipeline / workflow (Oozie)

• Active Directory (Security)

• Monitoring & Deployment (System Center)

• C#, F#, .NET

• PowerShell

• Azure Storage Vault (ASV)

• Business Intelligence (Excel, Power View, SSAS)

• World's Data (Azure Data Marketplace)

• Event Driven Processing

Legend: Red = Core Hadoop | Blue = Data processing | Purple = Microsoft integration points and value adds | Orange = Data Movement | Green = Packages

Hadoop Ecosystem

Page 15: Introduction to Azure HDInsight

HDInsight

• HDInsight is a Hadoop-based service that brings a 100% Apache Hadoop solution to the Microsoft Azure platform

• Based on the Hortonworks Data Platform (HDP)

• Scalable, on-demand service

Page 16: Introduction to Azure HDInsight

Storage

Two choices

• Azure Storage (Blob)

• File System

Page 17: Introduction to Azure HDInsight

Demo [Spinning up an HDInsight Cluster ;-)]

Page 18: Introduction to Azure HDInsight

Now what?

Working with your HDInsight cluster - running jobs, import/export data, viewing and consuming data…

• .NET

• Java

• Pig

• Hive

• Sqoop

• Excel

• Others

Page 19: Introduction to Azure HDInsight

What is Hive?

• A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis

• Provides a SQL-like language called HiveQL to query data

• Integration between Hadoop and BI and visualization tools

http://hive.apache.org
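To show what "data summarization" means in practice, the sketch below pairs a hypothetical HiveQL query with a plain-Python equivalent of the same grouping. The table name `weblogs`, the column `level` and the helper `count_by` are all made up for illustration; Hive would run the query as distributed jobs over files in the cluster, whereas the Python version only works on rows already in memory.

```python
from collections import Counter

# Hypothetical HiveQL, shown only for comparison; table and column names are made up.
HIVEQL = """
SELECT level, COUNT(*) AS events
FROM weblogs
GROUP BY level
"""


def count_by(rows, column):
    """Return {value: row_count} for one column, like COUNT(*) ... GROUP BY column."""
    return dict(Counter(row[column] for row in rows))


if __name__ == "__main__":
    sample = [{"level": "ERROR"}, {"level": "INFO"}, {"level": "ERROR"}]
    print(count_by(sample, "level"))
```

The point of HiveQL is that an analyst writes only the query; the map and reduce steps behind it are generated by Hive.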

Page 20: Introduction to Azure HDInsight

What is Pig?

• Write complex MapReduce jobs using a simple script language (Pig Latin)

• A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs

• Pig translates and compiles Pig Latin scripts into MapReduce jobs on the fly

http://pig.apache.org

Page 21: Introduction to Azure HDInsight

What is Sqoop?

• Command-line interface application to transfer bulk data between Hadoop and relational datastores

http://sqoop.apache.org
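As a rough idea of what a bulk transfer looks like, the sketch below assembles the arguments for a `sqoop import` from SQL Server into Hadoop. The helper `sqoop_import_args` and every host, database, user, table and path value are hypothetical; the flags used (`--connect`, `--username`, `-P`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop import options, and port 1433 is assumed as the SQL Server default.

```python
def sqoop_import_args(host, database, user, table, target_dir, mappers=1):
    """Build the argument list for a `sqoop import` from SQL Server into Hadoop."""
    # SQL Server JDBC URL; port 1433 is an assumption (the usual default).
    jdbc_url = f"jdbc:sqlserver://{host}:1433;databaseName={database}"
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", user,
        "-P",  # prompt for the password instead of passing it inline
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(" ".join(sqoop_import_args("sqlhost", "sales", "etl", "Orders", "/data/orders")))
```

On a machine with Sqoop installed, the list could be handed to `subprocess.run(args, check=True)`. A `sqoop export` runs the transfer in the opposite direction.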

Page 22: Introduction to Azure HDInsight

Demo [Query, Analyze, Transfer + Visual Studio Tools for HDInsight]

Page 23: Introduction to Azure HDInsight

Hadoop

Data Analytics

Data Flow

Page 24: Introduction to Azure HDInsight

Demo [Self-Service BI with Hive and Excel…]

Page 25: Introduction to Azure HDInsight

Machine Learning

Graph Processing

Distributed Compute

Extract, Load, Transform

Predictive Analysis

Capabilities

Page 26: Introduction to Azure HDInsight

Data Knowledge Action

Summary

Page 27: Introduction to Azure HDInsight

Resources

• Apache Projects (list with links) http://bit.ly/MfpLtE

• Microsoft Azure HDInsight http://bit.ly/1dnlAX1

• HDInsight Documentation & Tutorials http://bit.ly/LWRYol

• Hortonworks Sandbox 2.2 & Tutorials http://bit.ly/1gkkCte

• Cloudera VMs CDH 5.3.x http://bit.ly/1ENWgHH

• Microsoft JDBC Driver 4.1 | 4.0 for SQL Server http://bit.ly/1kEgJ7O

• Microsoft Hive ODBC Driver http://bit.ly/NFkhcH

• Getting Started with Big Data (MVA) http://bit.ly/1wU90Xd

• Big Data and Business Analytics Immersion v3.1 (MVA) http://bit.ly/1unvvX1

• Introducing Microsoft Azure HDInsight (free e-book) http://bit.ly/1JOPe5F

Page 28: Introduction to Azure HDInsight

What Questions Do You Have?

Page 29: Introduction to Azure HDInsight

Thank You

For attending this session