2013 © Trivadis
BASEL BERN BRUGG LAUSANNE ZUERICH DUESSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA
HDInsight in Windows Azure
Marc Schöni
Meinrad Weiss
April 2014
04.03.2014 HDInsight in Windows Azure R 1.00
Introduction
HDInsight on Windows Azure
Big data solutions deal with complexities of:
• VOLUME (Size)
• VARIETY (Structure)
• VELOCITY (Speed)
• VALUE
Hadoop/HDInsight
Focus
HDInsight Versions on Azure
Component                V 1.6    V 2.1              V 3 (Preview)
Apache Hadoop            1.0.3    1.2.0              2.2.0
Apache Hive              0.9.0    0.11.0             0.12.0
Apache Pig               0.9.3    0.11               0.12
Apache Sqoop             1.4.2    1.4.3              1.4.4
Apache Oozie             3.2.0    3.2.2              4.0.0
Apache HCatalog          0.4.1    Merged with Hive   Merged with Hive
Apache Templeton         0.1.4    Merged with Hive   Merged with Hive
Ambari                   -        API v1.0           API v1.0
SQL Server JDBC Driver   3        No Information     No Information
Source: http://www.windowsazure.com/en-us/documentation/articles/hdinsight-component-versioning/
Hadoop Zoo
Windows Azure HDInsight Service
[Diagram: components grouped into Query & Metadata, Data Movement, Workflow, and Monitoring, on top of the Hadoop Filesystem Interface, which is backed by HDFS or Windows Azure Blob Storage]
Windows Azure Blob Storage
Windows Azure HDInsight Service
Focus
Windows Azure Storage
• Scalable, durable, and available
• Access from anywhere, at any time
• Pay only for what the service uses
• Use from Windows Azure Compute
• Use from anywhere on the internet
Datacenters and Regions
• Northern Europe
• Western Europe
• South Central US
• West US
• East US
Geo-replication turned off (local replicas only):
• Higher durability
  • 3 local replicas in primary location
  • Local replicas – synchronously replicated
  • Common failures (disk, node, rack) – use local copies to recover
  • Major disasters – contact customer about potential data loss
• Reduced price – 23-34% based on how much you store
• Turn off Geo for your storage account in the portal
  • Non-critical data that can be recreated after a major disaster
  • Application manages its own replica
  • Companies have limitations on geo locations

Geo-replication enabled:
• Highest level of durability
  • 3 local replicas each in primary and secondary locations
  • Local replicas – synchronously replicated
  • Geo replica – asynchronously replicated
  • Common failures (disk, node, rack) – use local copies to recover
  • Major disasters – use geo-replicated copy (400+ miles apart)
• Price remains the same as before
• Enabled by default
Blob Storage Concepts
• Store large amounts of unstructured text or binary data with the fastest read performance
• Highly scalable, durable, and available file system
• Blobs can be exposed publicly over HTTP
• Securely lock down permissions to blobs
Azure Blob storage
Setting up the Windows Azure storage account
Azure Portal
Set Up New Storage
Move Data to Azure Blob Storage
Two ways to upload data to Azure Blob storage:
• PowerShell:
  Set-AzureStorageBlobContent `
      -File "C:...\2011\Weather2011_H1_JustData.csv" `
      -Container $containername `
      -Blob "FlightDelay/.../2011/Weather2011_H1_JustData.csv" `
      -Context $context
• A tool such as CloudBerry (drag and drop)
Windows Azure HDInsight Service
Windows Azure HDInsight Service
Setting up the Windows Azure HDInsight cluster
Windows Azure HDInsight
Azure Blob storage
HDInsight Console
Set Up New Cluster
Provision Cluster via PowerShell
# Create a new HDInsight cluster: build the cluster configuration, attach the
# default storage account and the Hive metastore, then provision the cluster.
# Note: $config receives the output of the last cmdlet in the pipeline.
$config = New-AzureHDInsightClusterConfig -ClusterSizeInNodes $clusterNodes `
    | Set-AzureHDInsightDefaultStorage `
        -StorageAccountName "$storageAccountName_Default.blob.core.windows.net" `
        -StorageAccountKey $storageAccountKey_Default `
        -StorageContainerName $containerName_Default `
    | Add-AzureHDInsightMetastore `
        -SqlAzureServerName "$hiveSQLDatabaseServerName.database.windows.net" `
        -DatabaseName $hiveSQLDatabaseName `
        -Credential $hiveCreds `
        -MetastoreType HiveMetastore `
    | New-AzureHDInsightCluster `
        -Version "3.0" `
        -Name $clusterName `
        -Location $location `
        -Credential $clusterCreds
MapReduce
Hadoop MapReduce
• Programming framework (library and runtime) for analyzing datasets stored in HDFS
• Composed of user-supplied Map and Reduce functions:
  • Map() - subdivide and conquer
  • Reduce() - combine and reduce cardinality
1. Divide a large problem into sub-problems.
2. Perform the same function on all sub-problems: Do work() on each.
3. Combine the output from all sub-functions.
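The three steps can be sketched in plain Python, without Hadoop. This is only an illustration of the divide / do work / combine pattern. The function names `divide`, `do_work`, and `combine` and the summing task are assumptions made for this sketch, not part of the Hadoop API.

```python
from functools import reduce


def divide(data, n_chunks):
    """Step 1: split the input into roughly equal sub-problems."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def do_work(chunk):
    """Step 2: the same function runs on every sub-problem (here: a partial sum)."""
    return sum(chunk)


def combine(partials):
    """Step 3: merge the partial results into one answer."""
    return reduce(lambda a, b: a + b, partials, 0)


numbers = list(range(1, 11))             # hypothetical input: 1..10
chunks = divide(numbers, 3)              # sub-problems
partials = [do_work(c) for c in chunks]  # one result per sub-problem
total = combine(partials)
print(chunks, partials, total)
```

In a real cluster each `do_work` call would run on a different node. Here they run one after another in a single process.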
MapReduce
• Rapidly process vast amounts of data in parallel, on a large cluster of compute nodes
• Framework schedules and monitors tasks, and re-executes failed tasks
• Typically, both input and output are stored in a file system
[Diagram: Map Phase (a Mapper on each of DataNode 1, 2, and 3) → Shuffle/Sort (data is shuffled across the network and sorted) → Reduce Phase (Reducers on the DataNodes)]
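The three phases can be imitated in a single Python process to show what the shuffle/sort step does between Map and Reduce. This is a minimal sketch, not how Hadoop is implemented. The records, the per-station maximum, and the names `mapper`, `shuffle_and_sort`, and `reducer` are illustrative assumptions.

```python
from collections import defaultdict


def mapper(record):
    """Map phase: emit a (key, value) pair for one input record."""
    station_id, _date, windspeed = record
    yield station_id, windspeed


def shuffle_and_sort(pairs):
    """Shuffle/sort: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Reduce phase: collapse all values of one key into a single result."""
    return key, max(values)


# Hypothetical input splits, one per data node: (StationID, date, windspeed).
splits = [
    [(123, "22.01.2012", 31), (124, "22.01.2012", 34)],
    [(123, "23.01.2012", 26), (124, "23.01.2012", 29)],
]
mapped = [pair for split in splits for record in split for pair in mapper(record)]
grouped = shuffle_and_sort(mapped)
results = [reducer(key, values) for key, values in grouped]
print(results)
```

The grouping step is what moves data across the network in a real cluster: every value for a given key has to reach the same reducer.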
Layout Windspeed Calculation
StationID  Date        Windspeed
123        22.01.2012  31
124        22.01.2012  34
125        22.01.2012  22
126        22.01.2012  12
123        23.01.2012  26
124        23.01.2012  29
125        23.01.2012  46
126        23.01.2012  12

Compute Node 1
StationID  Date        Windspeed
123        22.01.2012  31
124        22.01.2012  34
125        22.01.2012  22
126        22.01.2012  12

Compute Node 2
StationID  Date        Windspeed
123        23.01.2012  26
124        23.01.2012  29
125        23.01.2012  46
126        23.01.2012  12
Layout Windspeed Calculation
Data Node 1
StationID  Date        Windspeed
123        22.01.2012  31
124        22.01.2012  34
125        22.01.2012  22
126        22.01.2012  12
Map output: Key = Max, Value = 34

Data Node 2
StationID  Date        Windspeed
123        23.01.2012  26
124        23.01.2012  29
125        23.01.2012  46
126        23.01.2012  12
Map output: Key = Max, Value = 46

Reduce output: Key = Max, Value = 46
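The windspeed calculation can be reproduced with a few lines of Python. The sketch uses the rows from the slide. Which data node holds which day is an assumption, and the function names `map_max` and `reduce_max` are made up for this example.

```python
# Rows from the slide: (StationID, date, windspeed).
# Assumption: node 1 holds 22.01.2012 and node 2 holds 23.01.2012.
node_1 = [(123, "22.01.2012", 31), (124, "22.01.2012", 34),
          (125, "22.01.2012", 22), (126, "22.01.2012", 12)]
node_2 = [(123, "23.01.2012", 26), (124, "23.01.2012", 29),
          (125, "23.01.2012", 46), (126, "23.01.2012", 12)]


def map_max(records):
    """Map: each node emits one ("Max", local maximum windspeed) pair."""
    return ("Max", max(windspeed for _station, _date, windspeed in records))


def reduce_max(pairs):
    """Reduce: combine the per-node maxima into the overall maximum."""
    return ("Max", max(value for _key, value in pairs))


map_output = [map_max(node_1), map_max(node_2)]
result = reduce_max(map_output)
print(map_output)  # [('Max', 34), ('Max', 46)]
print(result)      # ('Max', 46)
```

Only one small pair per node crosses the network, which is why the Map step reduces data volume before the Reduce step runs.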
Hadoop Streaming Process