Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop Summit Dublin, April 2016
Survey
Session Objectives and Takeaways
Session Objectives
• Enterprise customer case studies
• Understand the advantages of using Hadoop on cloud
• Discuss the common challenges of using Hadoop on the cloud
Key Takeaways
• Most Hadoop vendors and cloud providers have solution templates to help you tackle cloud migration challenges
• Pick a Hadoop distribution and cloud provider on overall strength of analytics portfolio
Outline
• Hadoop on Azure Offerings
• Enterprise Customer Case Studies
• Why Hadoop on Cloud?
• Challenges that customers face with Hadoop on cloud
• Q&A
Hadoop on Azure Offerings
Microsoft Hadoop Stack
• Azure HDInsight
• Hadoop distributions running in Azure VMs
Analytics
• Script: Pig
• SQL: Hive
• NoSQL: HBase
• Real-time: Storm
• Batch: MapReduce
• In-memory: Spark
• Machine learning: R Server
Storage
• Local (HDFS) or cloud (Azure Blob/Azure Data Lake Store)
Azure HDInsight: Hadoop Meets the Cloud
• Microsoft’s managed Hadoop as a Service
• 100% open source Apache Hadoop
• Built on the latest releases across Hadoop (2.7)
• Up and running in minutes with no hardware to deploy
• Run on Windows or Linux
• Supported by Microsoft
Customer Case Studies
Rockwell Automation is partnered with one of the six oil and gas super majors to build unmanned internet-connected gas dispensers. Each dispenser emits real-time management metrics allowing them to detect anomalies and predict when proactive maintenance needs to occur.
• Store sensor data every 5 minutes
• Temperature, pressure, vibration, etc.
• Tens of thousands of data points per second
Architecture diagram components: Data Factory; Azure Blobs; Azure HDInsight (Hive, Pig); Azure SQL DB; Power BI for O365; Mobile Notification Hub; Mobile Device (real-time notification).
JustGiving wanted to harness the power of their data by using network science to map people’s connections and relationships so that they could connect people with the causes they care about. Based on 15 years of data, the JustGiving GiveGraph is the world’s largest ecosystem of giving behavior. It contains more than 81 million person nodes, thousands of causes and 285 million connections and is the engine that drives JustGiving’s social platform, enabling levels of personalization and engagement that a traditional infrastructure would be unable to deliver.
Architecture diagram components: SQL Server (on-premises); Agent; Azure Blobs; Azure HDInsight; GiveGraph; Azure Tables; Web API; Website + Event store; Service Bus; Azure Cache; Activity Feeds. Diagram labels: Real-time Event, Serves results.
One of the leaders in the development and management of renewable energy infrastructure and services needed to understand data coming from their wind turbines/wind farms in an Internet of Things (IoT) scenario.
• 100s of wind farms across the globe
• Each wind farm has 100+ turbines
• Each turbine generates 10 data points every 25 milliseconds
Initial goal: Provide consumption-related analytics to their customers (power companies)
What else could they do with all that data? Predictive maintenance
How? Event Hub, Azure Storage, HDInsight, Azure SQL DB, Excel reporting
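To get a feel for the ingest volume these figures imply, the arithmetic can be sketched as below. This is only a lower-bound estimate: it assumes exactly 100 wind farms and exactly 100 turbines per farm, whereas the slide says "100s" and "100+". All variable names are illustrative.

```python
# Lower-bound ingest estimate from the slide's figures.
# Assumptions: exactly 100 farms and 100 turbines per farm (the slide gives
# "100s" and "100+", so real volumes would be higher).
POINTS_PER_INTERVAL = 10      # data points emitted per interval (from the slide)
INTERVAL_MS = 25              # interval length in milliseconds (from the slide)
TURBINES_PER_FARM = 100       # assumed lower bound
FARMS = 100                   # assumed lower bound

# 1000 ms / 25 ms = 40 intervals per second, 10 points each
points_per_turbine_per_sec = POINTS_PER_INTERVAL * 1000 // INTERVAL_MS
points_per_farm_per_sec = points_per_turbine_per_sec * TURBINES_PER_FARM
points_per_sec = points_per_farm_per_sec * FARMS
points_per_day = points_per_sec * 86_400  # seconds in a day

print(points_per_turbine_per_sec)  # 400
print(points_per_farm_per_sec)     # 40000
print(points_per_sec)              # 4000000
print(points_per_day)              # 345600000000
```

Even under these conservative assumptions the fleet produces millions of data points per second, which is the scale that motivates a streaming ingest service in front of the Hadoop cluster.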
Why Hadoop on Cloud?
• Cost savings
• Agility
• Elasticity
• Integration with other Cloud Services
• Choice of Deployment Models
Cost Savings
• No hardware, licenses or service-specific support agreements
• Pay only for what you use, when you need it, not more than you need
• Independently scale storage and compute
• No need to hire a specialized operations team to do big data
• 63% lower total cost of ownership than on-premises*
*Pending IDC study found that, on a per-TB basis, Microsoft customers using cloud-based Hadoop in Data Lake have a 63% lower TCO than on-premises
Agility
• Up and running in minutes: a Hadoop cluster on the cloud can be up and running in minutes
• No cluster management needed: all bits and services automatically deployed by Azure HDInsight
• Enterprise-level support: fully supported by Microsoft and Hortonworks
Elasticity
• Scale up: HDInsight offers 11 VM instance types. Better VM instance = more parallelism and/or more CPU/memory
• Scale out: choose a custom number of instances. More worker nodes = more parallelism
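The difference between scaling up and scaling out can be illustrated with a rough estimate of how many YARN containers a cluster can run at once. This is a simplified sketch: the node sizes and container sizes are hypothetical, not actual HDInsight instance types, and it ignores memory reserved for the OS and Hadoop daemons.

```python
def concurrent_containers(worker_nodes, cores_per_node, mem_gb_per_node,
                          container_vcores=1, container_mem_gb=4):
    """Rough count of containers that fit on the cluster at the same time.

    Each node is limited by whichever resource runs out first (cores or memory).
    """
    per_node = min(cores_per_node // container_vcores,
                   mem_gb_per_node // container_mem_gb)
    return worker_nodes * per_node

# Hypothetical baseline: 4 worker nodes with 8 cores and 28 GB each
base = concurrent_containers(4, 8, 28)
# Scale up: same node count, bigger nodes (16 cores, 56 GB)
scale_up = concurrent_containers(4, 16, 56)
# Scale out: same node size, twice as many nodes
scale_out = concurrent_containers(8, 8, 28)

print(base, scale_up, scale_out)  # 28 56 56
```

In this toy model both options double the parallelism; in practice the choice also depends on per-task memory needs (which favor bigger nodes) and on how well the job splits across nodes.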
Integration with other cloud services
Cortana Analytics Suite: use the rich analytical services in Azure to build your entire pipeline
Cloud Deployment Models
Why use Cluster as a Service?
• Pay only for time the cluster was actually used
• Since both data and metadata are persisted, the experience is as if the cluster was never deleted
Deployment model | Always-on cluster | Cluster as a service
Storage choice | Local HDFS, Azure Blob, Azure Data Lake Store | Azure Blob, Azure Data Lake Store
Job scheduling | Oozie | Azure Data Factory
Data persistence after cluster deletion | N/A | Azure Blob, Azure Data Lake Store
Metadata persistence after cluster deletion | N/A | Azure SQL
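The "pay only for time the cluster was actually used" argument can be made concrete with a small compute-cost comparison. This is a sketch under stated assumptions: the rate of 1.0 per node-hour is a placeholder unit and not a real Azure price, the node count and daily usage are made up, and storage cost is left out because external storage is paid for in both models.

```python
HOURS_IN_MONTH = 24 * 30  # simplified 30-day month

def monthly_compute_cost(nodes, rate_per_node_hour, hours_running):
    """Compute-only cost: nodes are billed for every hour they are running."""
    return nodes * rate_per_node_hour * hours_running

NODES = 10                 # hypothetical cluster size
RATE = 1.0                 # placeholder cost unit per node-hour, not a real price
HOURS_USED_PER_DAY = 4     # hypothetical daily batch window

always_on = monthly_compute_cost(NODES, RATE, HOURS_IN_MONTH)
on_demand = monthly_compute_cost(NODES, RATE, HOURS_USED_PER_DAY * 30)
savings = 1 - on_demand / always_on

print(always_on)          # 7200.0
print(on_demand)          # 1200.0
print(round(savings, 3))  # 0.833
```

The saving depends entirely on how many hours the cluster would otherwise sit idle; a cluster that is busy around the clock gains nothing from being deleted and recreated.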
Common Challenges and Solutions for Hadoop on Cloud
Common challenges with Hadoop on Cloud
• Scaling cloud storage for big workloads
• Data and Metadata Migration from On-prem to Cloud
• Extending Hadoop to third party apps
• Security and Compliance
• Integration with Enterprise tools
Scaling cloud storage for big workloads
Partitioning
• Partitioned data on Year, Month, Day
Problem
• Simultaneous Read/Write caused I/O bottleneck
Diagram: Partition 1 (2014-10), Partition 2 (2014-11) and Partition 3 (2014-12), each with files part0 to part3, all stored in a single Traditional Cloud Store.
Scaling cloud storage for big workloads
Partitioning per Account
• Put each partition in its own account
Problem
• Due to partition pruning, each query will still go to the same account, still causing throughput bottlenecks
Diagram: Partition 1 (2014-10), Partition 2 (2014-11) and Partition 3 (2014-12), each with files part0 to part3, stored in Traditional Cloud Store 1, 2 and 3 respectively.
Scaling cloud storage for big workloads
Solution
• Keep files of each partition across multiple storage accounts
• Encode knowledge of physical location into logical partitioning key
Diagram: the part0 to part3 files of 2014-10, 2014-11 and 2014-12 are spread across Traditional Cloud Stores 1 to 4 (diagram labels: Partition 1 to Partition 4).
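The two placement strategies can be compared with a short sketch that counts how many storage accounts a query touches after partition pruning narrows it to one month. This is an illustration under assumptions, not the customer's implementation: the account names `store1` to `store4`, the container name `data`, the `month=` path layout and the modulo placement rule are all hypothetical.

```python
ACCOUNTS = ["store1", "store2", "store3", "store4"]  # hypothetical storage accounts
MONTHS = ["2014-10", "2014-11", "2014-12"]           # logical partitions
PARTS_PER_MONTH = 4                                   # part0 .. part3

def account_per_partition(month, part):
    """One account per logical partition (the approach that still bottlenecks)."""
    return ACCOUNTS[MONTHS.index(month) % len(ACCOUNTS)]

def account_striped(month, part):
    """Stripe the files of every partition across all accounts."""
    return ACCOUNTS[part % len(ACCOUNTS)]

def file_uri(month, part, placement, container="data"):
    """Build a wasb:// URI so the physical account is part of the file's key."""
    account = placement(month, part)
    return (f"wasb://{container}@{account}.blob.core.windows.net/"
            f"month={month}/{month}.part{part}")

def accounts_hit(month, placement):
    """Accounts a query touches once pruning has narrowed it to a single month."""
    return sorted({placement(month, p) for p in range(PARTS_PER_MONTH)})

print(accounts_hit("2014-10", account_per_partition))
# ['store1']
print(accounts_hit("2014-10", account_striped))
# ['store1', 'store2', 'store3', 'store4']
print(file_uri("2014-11", 2, account_striped))
# wasb://data@store3.blob.core.windows.net/month=2014-11/2014-11.part2
```

With one account per partition, a single-month query reads from only one account and is capped by that account's throughput; with striping, the same query fans out over all four accounts.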
Azure Data Lake Store: Improving cloud store limits
• No limits on file sizes
• Analytics scale on demand
• No code rewrites as you increase size of data stored
• Optimized for massive throughput
• Optimized for IoT with high volume of small writes
Hybrid model: Data and Metadata synchronized
Goals
• How to have minimal downtime while migrating a cluster to the cloud?
• How to move both data and metadata?
• How to set up mirroring, i.e. constant replication?
Hybrid model: Data and Metadata synchronized
Data Synchronization
• Hortonworks and Microsoft together released Falcon with Azure Data Factory connector
• Allows constant replication of data between on-prem and cloud
Metadata Synchronization
• For true cluster replication, metadata also needs to be replicated in addition to data
• You can configure the on-prem cluster to use SQL Server and use the AlwaysOn Availability Groups feature to replicate metadata between on-prem and cloud
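As a rough illustration of the metadata side, the sketch below generates the Hive metastore connection properties that would point a cluster at a SQL Server database reached through an availability group listener. The `javax.jdo.option.*` property names and the SQL Server JDBC driver class are standard; the listener host name, database name and user are hypothetical, the password property is deliberately left out, and the helper functions are illustrative rather than part of any Hadoop tooling.

```python
from xml.etree import ElementTree as ET

def metastore_properties(listener_host, database, user, port=1433):
    """Hive metastore settings for a SQL Server database behind an AG listener."""
    url = (f"jdbc:sqlserver://{listener_host}:{port};"
           f"databaseName={database};multiSubnetFailover=true")
    return {
        "javax.jdo.option.ConnectionURL": url,
        "javax.jdo.option.ConnectionDriverName":
            "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "javax.jdo.option.ConnectionUserName": user,
    }

def to_hive_site_xml(props):
    """Render the properties in hive-site.xml's <configuration> layout."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

# Hypothetical listener, database and user names
props = metastore_properties("ag-listener.example.com", "hivemetastore", "hive")
xml = to_hive_site_xml(props)
print(xml)
```

Pointing the connection URL at the listener rather than at a single server is what lets the metastore follow the availability group when the primary replica changes.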
Extending Hadoop to your on-prem resources
Use Azure VNet feature to extend HDInsight to your on-prem network
Hadoop Extensibility: Installing own applications
Linux Workloads
• HDInsight traditionally ran on Windows, but with Linux, customers can run more open source applications
ScriptAction
• You can create custom Bash scripts and supply them during cluster creation, or run them on an already running cluster, to install other applications
VNet and Edge Nodes
• An edge node can be created in an HDInsight cluster within a VNet to run more applications
Using ISV solutions
Scenario
• Hadoop has a rich ecosystem of apps
• Customers want to use apps beyond those provided out of the box
Why use ISV applications?
• Provide more features than those available in Hadoop
• WYSIWYG Query Designer Tools
• OLAP BI Capabilities over Hadoop cluster
• Fine-grained access control
• Drag and Drop data pipeline design and orchestration
ISV apps: Datameer
• WYSIWYG Query Designer in an Excel-like Interface
• Schedule recurring jobs
• Easily share projects with other analysts/data engineers in your company
ISV apps: AtScale
AtScale is an OLAP engine purpose-built for Hadoop. It leverages the latest advancements in the Hadoop ecosystem to support existing BI workloads.
• Multiple SQL-on-Hadoop Engine Support
• Access Data Where it Lays
• Built-in Support for Complex Data Types
• Single Drop-in Gateway Node Deployment
ISV apps: Cask
• Build pipeline using Drag & Drop
• Source connections from on-prem relational databases, or cloud stores for big data, into HDInsight/Data Lake Storage
• Common data pipeline task library
• Free, open source license to get started, enterprise option for dedicated use
Azure Security: Encryption At Rest
Azure Blob Storage (In Preview)
• Encryption at rest using Microsoft managed keys
• Customers can use Azure Storage configuration to manage encryption. No HDInsight changes required.
RBAC: Securing HDInsight with BlueTalon (ISV)
• Multi-user access and fine-grained authorization policies for Hive Tables
• Row & column level security, data masking etc.
Integration with Enterprise Tools
• Customers want a variety of tools for their end users
• HDInsight provides query authoring with Hue, Ambari Views
• Supports Jupyter out of the box and Zeppelin with ScriptAction
• Query authoring support using Visual Studio
• First class Scala/Java support for Spark apps using IntelliJ
Be productive with a robust development environment
• Deep integration to Visual Studio
• Easy for novices to write simple queries
• Robust environment for experts to also be productive
• Integrated with Pig, Hive, and Storm
• Playback that visualizes performance to identify bottlenecks and areas for optimization
Productive for novices and experts
Microsoft Makes Hadoop Easier
Deep Visual Studio Integration
• Debug Hive jobs through YARN logs or troubleshoot Storm topologies
• Visualize Hadoop clusters, tables, and storage
• Submit Hive queries, Storm topologies (C# or Java spouts/bolts)
• IntelliSense for authoring Hive jobs and Storm business logic
• Great authoring experience: full IntelliSense support (this tool can also fetch remote metadata for suggestions so users don’t need to remember a lot of DB/table names)
• Integrated with the Visual Studio project system so users can do version control easily there
• Show the DAG graphs for Hive on Tez jobs (with more details in the tooltip)
• Show associated query
Session Objectives and Takeaways
Session Objectives
• Understand the advantages of using Hadoop on cloud
• Discuss the common problems of using Hadoop on the cloud
Key Takeaways
• Most Hadoop vendors and cloud providers have solution templates to help you tackle cloud migration challenges
• Pick a Hadoop distribution and cloud provider on overall strength of analytics portfolio
Q&A