E-guide
Hadoop Big Data Platforms Buyer’s Guide – part 1 Your expert guide to Hadoop big data platforms
Page 1 of 16
In this e-guide
Exploring Hadoop distributions
for managing big data
How a Hadoop distribution can
help you manage big data
Exploring Hadoop distributions for managing big data
David Loshin, Knowledge Integrity Inc.
Companies of all sizes can use Hadoop, as vendors sell packages
that bundle Hadoop distributions with different levels of support, as
well as enhanced commercial distributions.
Hadoop is an open source technology that today is the data management
platform most commonly associated with big data applications. The distributed
processing framework was created in 2006, primarily at Yahoo and based partly
on ideas outlined by Google in a pair of technical papers; soon, other Internet
companies such as Facebook, LinkedIn and Twitter adopted the technology and
began contributing to its development. In the past few years, Hadoop has
evolved into a complex ecosystem of infrastructure components and related
tools, which are packaged together by various vendors in commercial Hadoop
distributions.
Running on clusters of commodity servers, Hadoop offers a high-performance,
low-cost approach to establishing a big data management architecture for
supporting advanced analytics initiatives. As awareness of its capabilities has
increased, Hadoop's use has spread to other industries, for both reporting and
analytical applications involving a mix of traditional structured data and newer
forms of unstructured and semi-structured data. This includes Web clickstream
data, online ad information, social media data, healthcare claims records, and
sensor data from manufacturing equipment and other devices on the Internet of
Things.
What is Hadoop?
The Hadoop framework encompasses a large number of open source software
components: a set of core modules for capturing, processing, managing and
analyzing massive volumes of data, surrounded by a variety of supporting
technologies. The core components include:
The Hadoop Distributed File System (HDFS), which supports a
conventional hierarchical directory and file system that distributes files
across the storage nodes (i.e., DataNodes) in a Hadoop cluster.
MapReduce, a programming model and execution framework for parallel
processing of batch applications.
YARN (short for the good-humored Yet Another Resource Negotiator),
which manages job scheduling and allocates cluster resources to running
applications, arbitrating among them when there's contention for the
available resources. It also tracks and monitors the progress of processing
jobs.
Hadoop Common, a set of libraries and utilities used by the different
components.
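To make the MapReduce component above concrete, here is a minimal, single-process sketch of the map, shuffle and reduce phases, using the classic word-count example. This is an illustration of the programming model only; real Hadoop MapReduce runs these phases in parallel across cluster nodes over data stored in HDFS, and the function names here are this sketch's own.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the list of values for each key."""
    return {key: sum(values) for key, values in grouped.items()}

lines = ["Hadoop stores big data", "Hadoop processes big data in batches"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)
```

In a real cluster, many mappers and reducers execute concurrently and the shuffle moves intermediate data across the network, but the logical flow is the same.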
In Hadoop clusters, those core pieces and other software modules are layered
on top of a collection of computing and data storage hardware nodes. The
nodes are connected via a high-speed internal network to form a high-
performance parallel and distributed processing system.
As a collection of open source technologies, Hadoop isn't controlled by any
single vendor; rather, its development is managed by the Apache Software
Foundation. Apache offers Hadoop under a license that basically grants users a
no-charge, royalty-free right to use the software. Developers can download it
directly from the Apache website and build a Hadoop environment on their own.
However, Hadoop vendors provide prebuilt "community" versions with basic
functionality that can also be downloaded at no charge and installed on a variety
of hardware platforms. They also market commercial -- or enterprise -- Hadoop
distributions that bundle the software with different levels of maintenance and
support services.
In some cases, vendors also offer performance and functionality enhancements
over the base Apache technology -- for example, by providing additional
software tools to ease cluster configuration and management, or data
integration with external platforms. These commercial offerings make Hadoop
increasingly attainable for companies of all sizes. This is especially
valuable when the vendor's support services team can jump-start a
company's design and development of its Hadoop infrastructure, as well as
guide the selection of tools and integration of advanced capabilities to quickly
deploy high-performance analytical solutions to meet emerging business needs.
The components of a typical Hadoop software stack
What do you actually get when you obtain a commercial version of Hadoop? In
addition to the core components, typical Hadoop distributions will include -- but
aren't limited to -- the following:
Alternative data processing and application execution managers such as
Tez or Spark, which can run on top of or alongside YARN to provide cluster
management; cached data management; and other means of improving
processing performance.
Apache HBase, a column-oriented database management system modeled
after Google's BigTable project that runs on top of HDFS.
SQL-on-Hadoop tools such as Hive, Impala, Stinger, Drill and Spark SQL,
which provide varying degrees of compliance with the SQL standard for
direct querying of data stored in HDFS.
Development tools such as Pig that help developers build MapReduce
programs.
Configuration and management tools such as ZooKeeper or Ambari, which
can be used for monitoring and administration.
Analytics environments such as Mahout that supply analytical models for
machine learning, data mining and predictive analytics.
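To give a feel for what the SQL-on-Hadoop tools in the list above provide, here is a stand-in example using Python's built-in sqlite3 module in place of a real Hive or Impala deployment. The table and data are invented for illustration; the point is the shape of the query, which an engine like Hive would compile into MapReduce or Tez jobs over files in HDFS rather than run against a local database.

```python
import sqlite3

# In-memory database as a stand-in for distributed data in HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (page TEXT, visits INTEGER)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("home", 120), ("pricing", 45), ("home", 80)],
)

# The same aggregate-style SQL that SQL-on-Hadoop engines accept for
# direct querying of data stored in HDFS.
rows = conn.execute(
    "SELECT page, SUM(visits) FROM clicks GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('home', 200), ('pricing', 45)]
```

The "varying degrees of compliance with the SQL standard" mentioned above matter here: not every engine supports every join type, window function or data type, so queries that run on one tool may need rewriting for another.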
Because the software is open source, you don't purchase a Hadoop distribution
as a product, per se. Instead, the vendors sell annual support subscriptions with
varying service-level agreements (SLAs). All of the vendors are active
participants in the Apache Hadoop community, although each may promote its
own add-on components that it has contributed to the community as part of its
Hadoop distribution.
Who manages the Hadoop big data management environment?
It's important to recognize that getting the desired performance out of a Hadoop
system requires a coordinated team of skilled IT professionals who collaborate
on architecture planning, design, development, testing, deployment, and
ongoing operations and maintenance to ensure peak performance. Those IT
teams will typically include:
Requirements analysts to assess the system performance requirements
based on the types of applications that will be run in the Hadoop
environment.
System architects to evaluate performance requirements and design
hardware configurations.
System engineers to install, configure and tune the Hadoop software stack.
Application developers to design and implement applications.
Data management professionals to do data integration, create data layouts
and perform other management tasks.
System managers to do operational management and maintenance.
Project managers to oversee the implementation of the various levels of the
stack and application development work.
A program manager to oversee the implementation of the Hadoop
environment and prioritization, development and deployment of
applications.
The Hadoop software platform market
In essence, the evolution of Hadoop as a viable large-scale data management
ecosystem has also created a new software market that's transforming the
business intelligence and analytics industry. This has expanded both the kinds
of analytics applications that user organizations can run and the types of data
that can be collected and analyzed as part of those applications. The market
includes three independent vendors that specialize in Hadoop -- Cloudera Inc.,
Hortonworks Inc. and MapR Technologies Inc. Other companies that offer
Hadoop distributions or capabilities include Pivotal Software Inc., IBM, Amazon
Web Services and Microsoft.
Evaluating vendors that provide Hadoop distributions requires understanding
the similarities and differences between two aspects of the product offerings.
First is the technology itself: What's included in the different distributions; what
platforms are they supported on; and, most important, what specific
components are championed by the individual vendors? Second is the service
and support model: What types of support and SLAs are provided within each
subscription level, and how much do different subscriptions cost?
Understanding how these aspects relate to your specific business requirements
will highlight the characteristics that are important for a vendor relationship. The
next article in this series will examine several business use cases for a Hadoop
big data management platform so you can determine your organization's needs
and requirements.
How a Hadoop distribution can help you manage big data
David Loshin, Knowledge Integrity Inc.
To help you determine if a commercial Hadoop distribution could
benefit your organization, consultant David Loshin examines big
data use cases and applications that Hadoop can support.
Many companies are struggling to manage the massive amounts of data they
collect. In the past, they may have relied on a data warehouse platform, but
such conventional architectures can fall short in dealing with data that
originates from numerous internal and external sources and often varies in
structure and type of content. New technologies have emerged to offer help -- most
prominently, Hadoop, a distributed processing framework designed to address
the volume and complexity of big data environments involving a mix of
structured, unstructured and semi-structured data.
Part of Hadoop's allure is that it consists of a variety of open source software
components and associated tools for capturing, processing, managing and
analyzing data. But, as addressed in a previous article in this series, in order to
help users take advantage of the framework, many vendors offer commercial
Hadoop distributions that provide performance and functionality enhancements
over the base Apache open source technology and bundle the software with
maintenance and support services. As the next step, let's take a look at how a
Hadoop distribution could benefit your organization.
Making a case for a Hadoop distribution
Hadoop runs in clusters of commodity servers and typically is used to support
data analysis and not for online transaction processing applications. Several
increasingly common analytics use cases map nicely to its distributed data
processing and parallel computation model. The list includes:
Operational intelligence applications for capturing streaming data from
transaction processing systems and organizational assets, monitoring
performance levels, and applying predictive analytics for pre-emptive
maintenance or process changes.
Web analytics, which are intended to help companies understand the
demographics and online activities of website visitors, review Web server
logs to detect system performance problems, and identify ways to enhance
digital marketing efforts.
Security and risk management, such as running analytical models that
compare transactional data to a knowledge base of fraudulent activity
patterns, as well as continuous cybersecurity analysis for identifying
emerging patterns of suspicious behavior.
Marketing optimization, including recommendation engines that absorb
huge amounts of Internet clickstream and online sales data and blend that
information with customer profiles to provide real-time suggestions for
product bundling and upselling.
Internet of Things applications, such as analyzing data from things -- like
manufacturing devices, pipelines and so-called smart buildings -- via
sensors that continuously generate and broadcast information about their
status and performance.
Sentiment analysis and brand protection, which might involve capturing
streaming social media data and analyzing the text to identify unsatisfied
customers whose issues can be addressed quickly.
Massive data ingestion for data collection, processing and integration
scenarios such as capturing satellite images and geospatial data.
Data staging, in which Hadoop is used as an initial landing spot for data
that is then integrated, cleansed and transformed into more structured
formats in preparation for loading into a data warehouse or analytical
database for analysis.
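The data staging use case above follows a simple pattern: raw records land first, then are validated, cleansed and transformed into a structured shape before loading into a warehouse. The sketch below shows that pattern in miniature; the field names and validation rules are invented for illustration, and in a real Hadoop deployment this logic would run as a distributed job over files in HDFS.

```python
raw_records = [
    {"id": "1", "amount": " 19.99 ", "country": "us"},
    {"id": "2", "amount": "bad",     "country": "DE"},
    {"id": "3", "amount": "5.00",    "country": "fr"},
]

def cleanse(record):
    """Return a typed, normalized record, or None if it fails validation."""
    try:
        amount = float(record["amount"].strip())
    except ValueError:
        return None  # quarantine unparseable rows instead of loading them
    return {
        "id": int(record["id"]),
        "amount": amount,
        "country": record["country"].upper(),  # normalize to one convention
    }

# Only records that survive cleansing are staged for the warehouse load.
staged = [r for r in (cleanse(rec) for rec in raw_records) if r is not None]
print(staged)
```

Keeping the raw landing copy alongside the cleansed output is part of the appeal: if the transformation rules change, the original data is still there to reprocess.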
Capabilities supporting the use cases
Applications supporting these usage scenarios can be built on top of Hadoop
using some prototypical implementation methodologies, such as:
Data lakes. Because Hadoop delivers linear scalability for processing and
storage as new data nodes are incorporated into a cluster architecture, it
provides a natural platform for capturing and managing voluminous files of raw
data. This has motivated many users to implement Hadoop systems as a catch-
all platform for their data, creating a conceptual data lake.
Data warehouse augmentation platform. Hadoop's distributed storage can
also be used to expand the data that's accessible for analysis in a data
warehouse environment. For example, a temperature-based scheme can be
used for allocating data to different levels of the storage hierarchy, depending
on its frequency of use. The most frequently accessed "hot" data is kept in the
data warehouse, while less-frequently used "cool" data is relegated to higher-
latency storage such as the Hadoop Distributed File System. This approach
relies on tightly coupled data warehouse integration with Hadoop.
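The temperature-based allocation scheme described above can be reduced to a routing decision per data set. The toy version below routes by access frequency; the threshold, tier names and data sets are all illustrative assumptions, and production implementations typically track access statistics over time rather than using a single fixed cutoff.

```python
# Arbitrary cutoff for this sketch: data touched at least this often per
# day is "hot" and stays in the warehouse; everything else is "cool".
HOT_THRESHOLD = 100

def assign_tier(accesses_per_day):
    """Route a data set to warehouse (hot) or HDFS (cool) storage."""
    return "warehouse" if accesses_per_day >= HOT_THRESHOLD else "hdfs"

datasets = {"orders": 5000, "clickstream_2014": 12, "customers": 850}
placement = {name: assign_tier(freq) for name, freq in datasets.items()}
print(placement)
```

The tight coupling mentioned above matters because queries must still be able to reach the cool data transparently, typically via federation or SQL-on-Hadoop access to the HDFS tier.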
Large-scale batch computation engine. When configured with a combination
of data and compute nodes, Hadoop becomes a massively parallel processing
platform that's suited to batch processing applications for manipulating and
analyzing data. One example would be data standardization and transformation
jobs applied to data sets to prepare them for analysis. Algorithm-driven analytics
applications such as data mining, machine learning, pattern analysis and
predictive modeling are also good matches for Hadoop's batch capabilities, as
they can be executed in parallel over massive distributed data files with
iterations of partial results accumulated until the program completes with a final
set of results.
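The batch pattern described above -- independent work over partitions, with partial results accumulated into a final answer -- can be sketched on a single machine with a thread pool. This is a scaled-down illustration only: Hadoop distributes the partitions across data nodes and moves computation to the data, while this sketch just parallelizes across local threads. The transform function is a stand-in.

```python
from concurrent.futures import ThreadPoolExecutor

def standardize(partition):
    """Per-partition work: a stand-in data transformation (doubling values)."""
    return sum(value * 2 for value in partition)

# Three partitions of a larger data set, processed independently.
partitions = [range(0, 100), range(100, 200), range(200, 300)]

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(standardize, partitions))

# Accumulate partial results into the final answer, as the text describes.
total = sum(partials)
print(partials, total)
```

Because each partition is processed without reference to the others, adding workers (or, in Hadoop's case, nodes) increases throughput roughly linearly -- the property the article returns to below.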
Event stream analytics processing engine. A Hadoop environment can also
be configured to process incoming data streams in real or near real time. As an
example, a customer sentiment analysis application can have multiple
communication agents running in parallel on a Hadoop cluster, each applying a
set of stream processing rules to data feeds from social networks such as
Twitter and Facebook.
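The stream-processing rules mentioned in the sentiment example can be as simple as keyword matching against each incoming message. The sketch below shows one such rule set; the keywords and messages are invented examples, and a real deployment would run many such agents in parallel against live social feeds, typically via a streaming layer such as Spark Streaming or Storm rather than plain Python.

```python
# One hypothetical rule set: flag messages containing negative keywords.
NEGATIVE_KEYWORDS = {"broken", "refund", "terrible"}

def apply_rules(message):
    """Return True if the message matches any negative-sentiment rule."""
    words = {w.strip(".,!?") for w in message.lower().split()}
    return bool(words & NEGATIVE_KEYWORDS)

stream = [
    "Love the new release!",
    "My order arrived broken, I want a refund",
    "terrible support experience",
]

# Flag unhappy customers so their issues can be addressed quickly.
flagged = [msg for msg in stream if apply_rules(msg)]
print(len(flagged))
```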
Advantages of adopting Hadoop: Is it right for you?
A low-cost, high-performance computing framework like Hadoop can address
different IT and business motivations for scaling up processing power or
expanding data management capabilities in an organization. Let's examine
some characteristics of application requirements that suggest the need for a
data management platform based on a Hadoop distribution:
Ingestion and processing of large data sets, massive data volumes
and streaming data. Examples include capturing Web server logs that
contain information about billions of online events; indexing hundreds of
millions of documents across different data sets; and continuously pulling in
data streams such as social media channels, stock market data, news
feeds and content published at expert communities.
A need to eliminate performance impediments. Application performance
is often throttled on traditional data warehouse systems as a result of data
accessibility, latency and availability issues or bandwidth limits in relation to
the amount of data that needs to be processed.
The desire for linear scalability on performance. As data volumes grow
and the number of users increases, having an environment in which
performance will scale linearly as more computing and storage resources
are added can be crucial, especially when applications can benefit from
parallel computing.
A mixture of structured and unstructured data. The applications need to
use data from different sources that vary in structure, and some -- or much
-- of it is unstructured or semi-structured, for example, text or server log
data.
IT cost efficiencies. Rather than paying premium prices for high-end
servers or specialty hardware appliances, the system architects believe that
acceptable performance can be achieved using commodity components.
Considerations for integrating Hadoop into the enterprise
A positive value proposition for using Hadoop still must be balanced, though,
with the feasibility of integrating the platform into the enterprise. Because many
organizations have made significant investments in traditional data warehouse
platforms, there may be some resistance to introducing a newer technology.
Before engaging a Hadoop distribution vendor, work to resolve any potential
barriers to adoption and assess requirements for cluster sizing and
configuration.
For example, determine where a Hadoop cluster fits in your organization's data
warehousing and analytics strategy -- whether it's intended to augment existing
data warehouses or replace them. Also, identify integration and interoperability
issues that need to be addressed, and review configuration alternatives,
including whether it's better to implement the Hadoop ecosystem on premises or
in a cloud-based or hosted environment. In addition, ensure that you have
funding to hire people with the right skills or retrain existing employees. Hadoop
application development differs greatly from conventional database
development.
Answering these types of questions will help in determining the feasibility of a
Hadoop deployment. The next step, which will be examined in the third article in
this series, is to evaluate the features and functions you need in a commercial
Hadoop distribution.
About the author
David Loshin, managing director at Decisionworx, is a recognized thought
leader, speaker and expert consultant. He has written numerous books,
including Big Data Analytics: From Strategic Planning to Enterprise Integration
with Tools, Techniques, NoSQL and Graph. He can be reached through his
website at www.decisionworx.com.
Email us at [email protected] and follow us on Twitter:
@BizAnalyticsTT.