
Optimizing Performance for Big Data Analysis


Unlocking insights from vast data volumes requires a scalable system that quickly processes both unstructured and structured data. The Intel® Distribution for Apache Hadoop provides enhancements that boost performance while streamlining deployment.

By Armando Acosta and Maggie Smith


42 2013 Issue 03 | Dell.com/powersolutions

Business intelligence

Reprinted from Dell Power Solutions, 2013 Issue 3. Copyright © 2013 Dell Inc. All rights reserved.

Once a little-known technology, the Apache™ Hadoop® software framework was developed to support offline analytics. It has since evolved into a powerful platform for managing and processing the vast amounts of data deluging enterprise systems.

From megabytes to yottabytes, information repositories are growing by the nanosecond through an influx of unstructured data from social networking sites, video, images, mobile devices, sensors and other sources. To gain insights from massive amounts of diverse data types, many organizations are looking beyond the restricted capacity and capabilities of standard relational database management systems (RDBMSs).

The same architectural design of RDBMSs that helps ensure consistency and availability often results in scalability limitations. Also, the use of proprietary RDBMS extensions optimizes database performance but subjects organizations to vendor lock-in. In addition, commercial RDBMSs often carry costly per-processor licenses.

As a result, the demand for innovative, cost-effective big data offerings is intensifying. The global big data technology and services market is projected to expand at a 31.7 percent compound annual growth rate through 2016, about seven times greater than that of the information and communication technology market.1
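To make the compounding behind that projection concrete, here is a minimal sketch in Python. The 31.7 percent rate comes from the forecast cited above; the four-year span (2012 to 2016) is an assumption read from the forecast's title, and the function names are hypothetical. No market sizes are assumed.

```python
def growth_multiple(cagr: float, years: int) -> float:
    """Total growth factor after compounding `cagr` (0.317 = 31.7%) for `years` years."""
    return (1.0 + cagr) ** years


def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by a start value, an end value and a span."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # Assumed span: four compounding periods, 2012 through 2016.
    print(f"Growth multiple over 4 years: {growth_multiple(0.317, 4):.2f}")
```

Under those assumptions, a market growing at 31.7 percent a year roughly triples over four years.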

Deriving actionable insights from huge data volumes calls for a system that can process multistructured data rapidly and scale easily to accommodate growth in a stable and secure IT environment. The open-source Hadoop platform shows enormous promise for big data management and processing in a number of scenarios, ranging from mining social media profiles and flagging credit card fraud to identifying top job candidates and predicting weather patterns.

Yet for all Hadoop’s data-crunching prowess, an absence of integrated support for strong data security has slowed deployment efforts. Consider, for example, a financial institution that combines multiple data warehouses into a large Hadoop cluster. Securing the data requires extensive use of embedded encryption tools. However, many Hadoop implementations are not optimized to handle the processing load incurred by encryption and decryption, which typically add considerable latency and consume substantial compute resources.

To address organizational needs to run high-performance analytics on a secure platform, Dell has teamed up with Intel to optimize the Intel Distribution for Apache Hadoop software for deployment on Dell™ hardware. The Intel Distribution is designed to provide secure enterprise-quality distributed-processing and data-management software, as well as deployment support and consulting services.

Finding the right fit

Because of the wide variety of big data challenges, organizations require broadened flexibility and choice in a platform that helps them gain valuable insights based on their specific use cases. (For more information, see the sidebar, “Distributed processing in action.”) When it comes to big data management and analytics, one size does not fit all.

To that end, Dell has expanded its Hadoop offerings to include the Intel Distribution for Apache Hadoop. The Intel Distribution joins the field-tested Dell | Cloudera Hadoop Solution, which combines Cloudera’s Distribution Including Apache Hadoop (CDH) with Dell servers, Dell-developed Crowbar deployment software and networking components, as well as management tools, training, technology support and professional services. (For more information, see the sidebar, “Insight acceleration.”)

Enhancing performance, security and manageability

The Intel Distribution is packaged with the Hadoop platform and other software components (see figure). Hadoop comprises the Hadoop Distributed File System (HDFS™) framework, designed for high-throughput data storage and access on commodity hardware, and the MapReduce framework, which enables developers to write applications that execute jobs in parallel on large clusters. Other key components of the Hadoop ecosystem are the Apache Hive™ data warehousing software and the Apache HBase™ database, a distributed, column-oriented big data store.
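The MapReduce programming model described above can be illustrated with a small word-count sketch in Python. This is a single-process illustration of the map, shuffle and reduce steps, not the Hadoop API: the function names are hypothetical, and a real cluster would distribute the map and reduce work across many nodes and read its input from HDFS.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on an in-memory set of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big insights"]))
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both steps can in principle run in parallel, which is the property the framework exploits on large clusters.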

With the power of Hadoop at its foundation, the Intel Distribution features a number of additional capabilities and optimizations designed to streamline deployment and improve performance. Intel® Manager for Apache Hadoop is a web-based management console that facilitates the installation, configuration and administration of the Hadoop cluster. Intel Manager also supports resource monitoring and alerting through the open-source Nagios® and Ganglia monitoring systems, which are included in the Intel Distribution. By taking advantage of this powerful, easy-to-use tool, IT can focus critical resources and expertise on deriving business value from the Hadoop environment rather than managing the cluster.

[Figure: Taxonomy of the Intel Distribution for Apache Hadoop. Components shown: Intel Manager for Apache Hadoop software (deployment, configuration, monitoring, alerting and security); Apache Sqoop™ (data exchange); Apache Flume™ (log collector); Apache ZooKeeper (coordination); Apache HBase (columnar storage); Apache Pig™ (scripting); Apache Hive (SQL-like query); Apache Oozie™ (workflow); MapReduce (distributed processing framework); Apache HDFS (Hadoop Distributed File System).]

1 IDC Worldwide Big Data Technology and Services 2012-2016 Forecast, doc #238746, December 2012.

The Intel Distribution includes extensions to HBase and Hive that help improve real-time transactional performance and the end-user experience. Its encryption and decryption capabilities heighten security and access control. The Intel Distribution is optimized to work with Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) technology, which is built into Intel® Xeon® processors. Intel AES-NI is designed to accelerate compute-intensive encryption and decryption, helping reduce latency and processor load.
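Before relying on hardware-accelerated encryption, it can help to confirm that a node's processor actually exposes AES-NI. The following is a minimal sketch, assuming a Linux host with an x86 processor, where the kernel lists the `aes` feature flag in `/proc/cpuinfo`. The function names are hypothetical, and the check is not part of the Intel Distribution.

```python
from pathlib import Path


def has_aes_flag(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists 'aes'."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "aes" in value.split():
            return True
    return False


def host_has_aes_ni(path: str = "/proc/cpuinfo") -> bool:
    """Check the local host; returns False if the file cannot be read."""
    try:
        return has_aes_flag(Path(path).read_text())
    except OSError:
        return False


if __name__ == "__main__":
    print("AES-NI flag present:", host_has_aes_ni())
```

The parsing is kept separate from the file read so that the check can be exercised with sample text on machines that have no `/proc/cpuinfo`.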

In addition to leveraging the capabilities of its processors, Intel can build and optimize hardware features of the company’s solid-state drives (SSDs) and 10 Gigabit Ethernet (10GbE) adapters to boost Hadoop performance, security and manageability.

Also critical to accelerating Hadoop performance is server optimization. The Intel Distribution is designed to efficiently integrate Hadoop with Dell servers to deliver optimal solutions for a variety of use cases. The Dell PowerEdge™ R720xd server is well suited for Hadoop deployments because these environments often require a 1:1 spindle-to-core ratio for optimized performance. The PowerEdge R720xd offers a high spindle-to-core ratio and includes options to avoid I/O bottlenecks.
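The 1:1 spindle-to-core guideline mentioned above is simple arithmetic, sketched here in Python. The disk and core counts in the example are hypothetical illustration values, not PowerEdge R720xd specifications, and counting physical cores rather than hardware threads is an assumption.

```python
from fractions import Fraction


def spindle_to_core_ratio(data_disks: int, sockets: int, cores_per_socket: int) -> Fraction:
    """Ratio of data disks (spindles) to physical cores for one node."""
    cores = sockets * cores_per_socket
    if data_disks <= 0 or cores <= 0:
        raise ValueError("disk and core counts must be positive")
    return Fraction(data_disks, cores)


def meets_one_to_one(data_disks: int, sockets: int, cores_per_socket: int) -> bool:
    """True when the node has at least one spindle per physical core."""
    return spindle_to_core_ratio(data_disks, sockets, cores_per_socket) >= 1


if __name__ == "__main__":
    # Hypothetical node: 12 data disks, 2 sockets, 6 cores per socket.
    print(spindle_to_core_ratio(12, 2, 6), meets_one_to_one(12, 2, 6))
```

A node that falls below the ratio has more cores than spindles, so under this guideline disk I/O is more likely to become the limiting factor.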

Insight acceleration

Organizations worldwide are turning to the open-source Apache Hadoop software platform to support enterprise applications that analyze extremely large amounts of diverse data. However, the inherent nature of Hadoop, with its distributed architecture, adds layers of complexity, especially when it comes to deployment, management and security. As a result, many organizations may have delayed Hadoop deployments because they lack the necessary expertise in planning, design, implementation and maintenance.

By providing the expert assistance, tools and technology resources needed, Dell Services helps organizations move their Hadoop activities from the sandbox to production environments to achieve business value. These services are tailored to an organization’s short- and/or long-term objectives and help optimize the use of emerging technologies, advance efficiencies and maximize the value of IT investments.

Experts at Dell Solution Centers located in key sites around the globe are available to bolster the technical skills of those new to Hadoop and open-source technologies. They can help participants gain hands-on experience with a variety of topics, ranging from obtaining maximum performance from an application deployed on Dell servers and storage to exploring cloud computing and big data using Hadoop.

At a Dell Solution Center, participants can attend a technical briefing with a Dell expert, take part in an architectural design session or run a proof-of-concept engagement to comprehensively validate a big data solution and streamline deployment. Using an organization’s specific configurations and test data, participants can discover how a big data solution from Dell meets their business needs.

A recent addition to Dell’s global network of solution centers is the Big Data Innovation Center in Singapore, where organizations can test big data initiatives and proofs of concept. The facility provides a big data stack that includes Dell infrastructure using Intel Xeon E5 processor–based servers, Intel® 10 Gigabit Ethernet networking, Intel® Solid-State Drives, the Intel Distribution for Apache Hadoop and Revolution R Enterprise predictive analytics software. Organizations that need to test-run their big data workloads can use the center to determine the impact of big data initiatives on their business. The center also offers training to help equip participants with the skills necessary for improving the quality of data mining across a wide range of platforms and data sources.

For more information on Dell Solution Centers, visit dell.com/solutioncenters.


Putting big data to work

Originally a tool for offline analytics of web-scale data, Hadoop is fast on its way to becoming a business-critical platform for gathering intelligence and actionable insights from vast amounts of unstructured data. Helping to drive this transformation is the Intel Distribution for Apache Hadoop, an open-source offering that unites the power of Hadoop and other software elements with important performance enhancements and hardware optimizations from Intel. This combination of capabilities not only enhances security, performance and manageability, but also provides a robust foundation for advancing innovation in analytics by the open-source community.

Learn more

Intel Distribution for Apache Hadoop on Dell PowerEdge Servers: qrs.ly/br3gyd4

Authors

Armando Acosta is a senior product line consultant at Dell and has more than 15 years of experience in the IT industry.

Maggie Smith is a senior marketing manager at Dell. She is focused on big data solutions for enterprises and has over 30 years of experience marketing technology products.

Distributed processing in action

As big data becomes big business, organizations are discovering innovative ways to harness the value of their data. The Intel Distribution for Apache Hadoop helps these organizations get the most out of hardware performance, strengthen data security and improve data management and processing capabilities.

One company, for example, used the Intel Distribution to support its powerful search-engine technology for life-science researchers. Dedicated to furthering genomics research, the company was having trouble managing its large data sets. To scale in a cost-effective manner, the company deployed the Intel Distribution and used Apache Hive and Apache Hadoop for query and search. The company also turned to Intel to optimize its hardware and software for increased performance. As a result, the company achieved an exceptional increase in throughput using fewer than half the nodes previously deployed.

Another example is a large telecommunications company that was faced with eroding profits thanks in part to the high cost of maintaining a complex billing system. Poor-quality customer service stemming from the beleaguered billing system was prompting customer churn. Unfortunately, the company’s existing relational database management system (RDBMS) could not deliver storage scalability or real-time query access. So the telecommunications company selected the Intel Distribution for real-time analytics and decision support, as well as solid disaster recovery and failover. The result: exceptional support for a new business intelligence initiative that provided a lower total cost of ownership compared to its traditional RDBMS.
