VIRT1400BU Real-World Customer Architecture for Big … · Real-World Customer Architecture for Big Data on VMware vSphere ... landscape for the SAP migration from HPUX super domes

Joe Bruneau, General MillsJustin Murray, Technical Marketing, VMware

VIRT1400BU

#VMworld #VIRT1400BU

Real-World Customer Architecture for Big Data on VMware vSphere

VMworld 2017 Content: Not fo

r publication or distri

bution

• This presentation may contain product features that are currently under development.

• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.

• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

• Technical feasibility and market demand will affect final delivery.

• Pricing and packaging for any new technologies or features discussed or presented have not been determined.

Disclaimer

2#VIRT1400BU CONFIDENTIAL



bution

Agenda

#VIRT1400BU CONFIDENTIAL 3

1 Introductions

2 Why Enterprises are Deploying Big Data on vSphere, and how

3 Reference Architectures – an Overview

4 Introduction to General Mills

5 Architecture Details from the General Mills Hadoop Deployment

6 Experiences in Deployment – hurdles and how to overcome them

7 Best Practices - discovered along the way

8 Lessons Learned

9 Future Work

10 Conclusions



bution

Our Roles

• Joe is a Systems Administrator at General Mills and has worked with VMware for over a decade. He started out deploying VDI and Lab Manager and then on to virtualizing servers with the Windows team. After that he joined the enterprise infrastructure team to build the VMware landscape for the SAP migration from HPUX super domes to RHEL, implementing SRM and introducing Fault Tolerance. Joe also supports the fibre channel infrastructure and works on backup and DR.

• Justin is a senior Technical Marketing architect at VMware. He works closely with the company’s customers and partners to help them deploy big data and analytics applications systems on VMware vSphere. He writes best practice documents and other material on these subjects and has spoken in public events on these technical topics.




bution

Why are Enterprises Deploying Big Data?

• They want to get off existing costly data platforms (for OLAP rather than OLTP)

• Older data warehouse technology is not serving their needs

• Want to do queries and analytics against many different forms of data (structured, unstructured, streaming, images, clickstream)

• Provide data access to their own customers – with analytic tools (e.g. VMware telemetry data)

• Integrate systems that have been islands till now

– Single source of truth for the enterprise

• Exploit new application architectures for developer productivity

• Want to do data science, machine learning, deep learning to predict their customer behaviors or detect fraud at time of occurrence (as examples)




bution

Use Cases: Virtualization of Big Data

• Enterprises have development, test, pre-prod staging and production clusters that are required to be separated from each other and provisioned independently

• Organizations need different versions of Hadoop/Spark/machine learning platforms to be available to different teams - with possibly different services available (e.g. HBase, MapReduce, Spark)

• Enterprises do not wish to dedicate a specific set of hardware to each different requirement above, and want to reduce overall costs

• IT wants to provide Hadoop clusters as a service on-demand for its end users




bution

Worker Node 1 Worker Node 2 Worker Node 3

The Existing Hadoop Architecture

ResourceManager

Client

Datanode

Nodemanager

AppMaster - 1

Nodemanager Nodemanager

Datanode Datanode

HDFS Block 1 HDFS Block 2 HDFS Block 3

Container - 2 Container - 3

Master File System Index

NameNode

Submit job

Workers

Master Scheduler




bution

Hadoop – in Virtual Machines

Worker Node 1 Worker Node 2 Worker Node 3

Input File

ResourceManagerJob

Datanode

Nodemanager

AppMaster - 1

Nodemanager Nodemanager

Datanode Datanode

HDFS Block 1 HDFS Block 2 HDFS Block 3

Container - 2 Container - 3

Namenode

Master Scheduler

Master File System Index

A virtual machineKey: #VIRT1400BU CONFIDENTIAL 8



bution

High Level View of Apache Spark




bution

Deployment on vSphere –Proven Architectures



bution

VirtualizationHost Server

VMDK

HadoopNode 1Virtual Machine

Datanode

Ext4

Nodemanager

Ext4 Ext4 Ext4

Six Local DAS disks per Virtual Machine

VMDK VMDK VMDK VMDK VMDK VMDK VMDK

HadoopNode 2VirtualMachine

Datanode

Ext4

Nodemanager

Ext4 Ext4 Ext4Ext4

VMDKVMDK VMDKVMDK

Ext4Ext4Ext4

Combined Model: Two Virtual Machines on a Host




bution

#1 Reference Architecture from Cloudera




bution

Data/Compute Separation (with External Access to HDFS)

HadoopVirtualNode 2

NN

NN

NN

NN

NN

NN

data

no

de

Isilon

VirtualizationHost

VMDKOS Image –

VMDKOS Image –

VMDK VMDK

VMDK

HadoopVirtualNode 1

Ext4

ResourceManager

Ext4

Temp

OS Image –

VMDK

Ext4

NodeManager

Ext4

HadoopVirtualNode 3

Ext4

NodeManager

Ext4

Temp

HDFS requests




bution

Key Requirements for Big Data Architecture

• Performance

• Scaling

– to dozens or hundreds of nodes (VMs)

• Robustness – distributed file system, no one process is a single point of failure

• High Availability

• Fault Tolerance

• Capable of handling new workloads with new compute demands




bution

A Customer Journey: General MillsJoe Bruneau



bution

General Mills

• 1886 started as a flour mill on the banks of the Mississippi river as the Washburn Crosby Company

• 1928 General Mills is formed and starts trading on NYSE

• Acquired Pillsbury 2001

• Over 100 locations internationally, 39,000 employees, 165 on Fortune 500

• Brands

– Betty Crocker, Bisquick, Cheerios, Chex, Fiber One, Hamburger Helper, Nature Valley, Pillsbury, Yoplait, Gardettos, Gold Medal Flour, Haagen-Dazs, Old El Paso, Progresso, Cascadian Farms, Larabar, Muir Glen, Epic Provisions, Totinos,




bution

Big Data / Connected Data at General Mills

• Started looking at big data 2½ years ago

• Marketing and Global Consumer Insights

• Understanding Consumer Trends

• Understanding the impact of marketing and promotions on retail sales

• Attended Virtualizing Big Data sessions VMworld 2015

• Worked with our TAM to set up a meeting with Justin




bution

Server Landscape

Physical Landscape

• 30 node production cluster

• 700TB

Virtual Landscape

• 15 physical ( 30 VM's)

• 60 TB




bution

Storage Benchmarks Physical vs Virtual

• Internal storage benchmarking tool




bution

TeraGen Benchmarking




bution

TeraSort Benchmarking




bution

Hadoop on vSphere




bution

CPU, Memory & Disk Configuration




bution




bution

Datastores and Devices




bution

Applications

• Impala for interactive analysis using tools like Tableau

• Hive & Spark for batch processing

• Limited Machine Learning using python and R




bution

Business Applications That Hadoop is Used For

• Data sources are across the enterprise, from master data to transactional data, web activity analytics, supply chain, IoT, manufacturing analytics, consumer sentiment.

• We see the data lake as an enabler for enterprise wide analytics




bution

General Mills Big Data

• Cloudera Hadoop

• Tableau

• SAP BW on HANA

• SAP Data Services

• Data catalog / Data quality tools




bution

Cloudera

• Superior management tools

• Cloudera Navigator




bution

Initial Build

• Once the engineering technical specs were worked out and the hardware configured it took less than a day to build the VM infrastructure because we leveraged our existing VM deployment process

• It took about a year to configure and implement everything as we were working with a 3rd party vendor as part of the solution




bution

Key Factors

• Cost

• Performance

• Using a hardware raid controller instead of a software raid controller

• Leverage existing virtual machine provisioning and deployment process

• Improve drive failure detection

• Ability to implement hot spare hard drives




bution

Best Practices

• Use a hardware raid controller

• Stick to NUMA boundaries

• Proper memory size of VM's

• Don't over allocate

• Good documentation for the Hadoop admins (where VM's are located)




bution

How Would You Do it Differently, If at All, Today?

• Develop automation to deploy and map VM's to specific Hadoop hardware configurations

• Had time permitted, I would have spent more time researching hardware options to simplify deployment and create a Hadoop as a Service infrastructure platform.




bution

What Does the Business Gain or Hope to Gain from the Big Data Systems? Who Are the End Users of This System?

• The business has a broader “connected data” strategy and Hadoop is the foundation for the strategy

• Business Analysts

• Automated Systems




bution

Future Plans

What plans are there to expand the cluster(s) in the future?

• Build as needed

How are the business needs/requests handled to increase the functionality of the system?

• Through the connected data steering team

Any new technologies that you are interested in using?

• nVME storage with virtual hadoop

• Investigate HP Synergy platform




bution

How Does the Company Think about Big Data in the Private and Public Clouds as Cooperating with or Replacing Each Other?

• Long term we see the two working together in a hybrid way

• Public cloud will help us

– Google advanced analytics capabilities

– Amazon cloud only analytics capabilities




bution

Introducing vSphere Scale-Out for Big Data and HPC Workloads

37

• Hypervisor, vMotion, vShield Endpoint, Storage vMotion, Storage APIs, Distributed Switch, I/O Controls & SR-IOV, Host Profiles / Auto Deploy and more

Features

• Sold in Packs of 8 CPU at a cost-effective price pointPackaging

• EULA enforced for use w/ Big Data/HPC workloads onlyLicensing

New package that provides all the core features required for scale-out workloads at an attractive price point



bution

Conclusions

• Virtualized Big Data is in production today at customers sites such as those of General Mills

• Big Data requires some best practices at both the business level and the technical level

• VMware can help you on this journey

[email protected]

[email protected]

http://www.vmware.com/big-data




bution

mailto:[email protected]

mailto:[email protected]

http://www.vmware.com/big-data



bution



bution

Documents

VIRT1400BU Real-World Customer Architecture for Big … · Real-World Customer Architecture for Big Data on VMware vSphere ... landscape for the SAP migration from HPUX super domes