E-guide
Hadoop Big Data Platforms Buyer’s Guide – part 1 Your expert guide to Hadoop big data platforms
Page 1 of 16
In this e-guide
Exploring Hadoop distributions
for managing big data
How a Hadoop distribution can
help you manage big data
Exploring Hadoop distributions for managing big data
David Loshin, Knowledge Integrity Inc.
Companies of all sizes can use Hadoop, as vendors sell packages
that bundle Hadoop distributions with different levels of support, as
well as enhanced commercial distributions.
Hadoop is an open source technology that today is the data management
platform most commonly associated with big data applications. The distributed
processing framework was created in 2006, primarily at Yahoo and based partly
on ideas outlined by Google in a pair of technical papers; soon, other Internet
companies such as Facebook, LinkedIn and Twitter adopted the technology and
began contributing to its development. In the past few years, Hadoop has
evolved into a complex ecosystem of infrastructure components and related
tools, which are packaged together by various vendors in commercial Hadoop
distributions.
Running on clusters of commodity servers, Hadoop offers a high-performance,
low-cost approach to establishing a big data management architecture for
supporting advanced analytics initiatives. As awareness of its capabilities has
increased, Hadoop's use has spread to other industries, for both reporting and
analytical applications involving a mix of traditional structured data and newer
forms of unstructured and semi-structured data. This includes Web clickstream
data, online ad information, social media data, healthcare claims records, and
sensor data from manufacturing equipment and other devices on the Internet of
Things.
What is Hadoop?
The Hadoop framework encompasses a large number of open source software
components: a set of core modules for capturing, processing, managing and
analyzing massive volumes of data, surrounded by a variety of supporting
technologies. The core components include:
The Hadoop Distributed File System (HDFS), which supports a
conventional hierarchical directory and file system that distributes files
across the storage nodes (i.e., DataNodes) in a Hadoop cluster.
MapReduce, a programming model and execution framework for parallel
processing of batch applications.
YARN (short for the good-humored Yet Another Resource Negotiator),
which manages job scheduling and allocates cluster resources to running
applications, arbitrating among them when there's contention for the
available resources. It also tracks and monitors the progress of processing
jobs.
Hadoop Common, a set of libraries and utilities used by the different
components.
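To make the MapReduce component above concrete, here is a minimal, single-process sketch of the map, shuffle and reduce phases, using the classic word-count example. This is an illustration of the programming model only; real Hadoop MapReduce runs these phases in parallel across cluster nodes over data stored in HDFS, and the function names here are this sketch's own.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the list of values for each key."""
    return {key: sum(values) for key, values in grouped.items()}

lines = ["Hadoop stores big data", "Hadoop processes big data in batches"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)
```

In a real cluster, many mappers and reducers execute concurrently and the shuffle moves intermediate data across the network, but the logical flow is the same.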
In Hadoop clusters, those core pieces and other software modules are layered
on top of a collection of computing and data storage hardware nodes. The
nodes are connected via a high-speed internal network to form a high-
performance parallel and distributed processing system.
As a collection of open source technologies, Hadoop isn't controlled by any
single vendor; rather, its development is managed by the Apache Software
Foundation. Apache offers Hadoop under a license that basically grants users a
no-charge, royalty-free right to use the software. Developers can download it
directly from the Apache website and build a Hadoop environment on their own.
However, Hadoop vendors provide prebuilt "community" versions with basic
functionality that can also be downloaded at no charge and installed on a variety
of hardware platforms. They also market commercial -- or enterprise -- Hadoop
distributions that bundle the software with different levels of maintenance and
support services.
In some cases, vendors also offer performance and functionality enhancements
over the base Apache technology -- for example, by providing additional
software tools to ease cluster configuration and management, or data
integration with external platforms. These commercial offerings make Hadoop
increasingly attainable for companies of all sizes. This is especially
valuable when the vendor's support services team can jump-start a
company's design and development of its Hadoop infrastructure, as well as
guide the selection of tools and integration of advanced capabilities to quickly
deploy high-performance analytical solutions to meet emerging business needs.
The components of a typical Hadoop software stack
What do you actually get when you obtain a commercial version of Hadoop? In
addition to the core components, typical Hadoop distributions will include -- but
aren't limited to -- the following:
Alternative data processing and application execution managers such as
Tez or Spark, which can run on top of or alongside YARN to provide cluster
management; cached data management; and other means of improving
processing performance.
Apache HBase, a column-oriented database management system modeled
after Google's BigTable project that runs on top of HDFS.
SQL-on-Hadoop tools such as Hive, Impala, Stinger, Drill and Spark SQL,
which provide varying degrees of compliance with the SQL standard for
direct querying of data stored in HDFS.
Development tools such as Pig that help developers build MapReduce
programs.
Configuration and management tools such as ZooKeeper or Ambari, which
can be used for monitoring and administration.
Analytics environments such as Mahout that supply analytical models for
machine learning, data mining and predictive analytics.
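To give a feel for what the SQL-on-Hadoop tools in the list above provide, here is a stand-in example using Python's built-in sqlite3 module in place of a real Hive or Impala deployment. The table and data are invented for illustration; the point is the shape of the query, which an engine like Hive would compile into MapReduce or Tez jobs over files in HDFS rather than run against a local database.

```python
import sqlite3

# In-memory database as a stand-in for distributed data in HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (page TEXT, visits INTEGER)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("home", 120), ("pricing", 45), ("home", 80)],
)

# The same aggregate-style SQL that SQL-on-Hadoop engines accept for
# direct querying of data stored in HDFS.
rows = conn.execute(
    "SELECT page, SUM(visits) FROM clicks GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('home', 200), ('pricing', 45)]
```

The "varying degrees of compliance with the SQL standard" mentioned above matter here: not every engine supports every join type, window function or data type, so queries that run on one tool may need rewriting for another.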
Because the software is open source, you don't purchase a Hadoop distribution
as a product, per se. Instead, the vendors sell annual support subscriptions with
varying service-level agreements (SLAs). All of the vendors are active
participants in the Apache Hadoop community, although each may promote its
own add-on components that it has contributed to the community as part of its
Hadoop distribution.
Who manages the Hadoop big data management environment?
It's important to recognize that getting the desired performance out of a Hadoop
system requires a coordinated team of skilled IT professionals who collaborate
on architecture planning, design, development, testing, deployment, and
ongoing operations and maintenance to ensure peak performance. Those IT
teams will typically include:
Requirements analysts to assess the system performance requirements
based on the types of applications that will be run in the Hadoop
environment.
System architects to evaluate performance requirements and design
hardware configurations.
System engineers to install, configure and tune the Hadoop software stack.
Application developers to design and implement applications.
Data management professionals to do data integration, create data layouts
and perform other management tasks.
System managers to do operational management and maintenance.
Project managers to oversee the implementation of the various levels of the
stack and application development work.
A program manager to oversee the implementation of the Hadoop
environment and prioritization, development and deployment of
applications.
The Hadoop software platform market
In essence, the evolution of Hadoop as a viable large-scale data management
ecosystem has also created a new software market that's transforming the
business intelligence and analytics industry. This has expanded both the kinds
of analytics applications that user organizations can run and the types of data
that can be collected and analyzed as part of those applications. The market
includes three independent vendors that specialize in Hadoop -- Cloudera Inc.,
Hortonworks Inc. and MapR Technologies Inc. Other companies that offer
Hadoop distributions or capabilities include Pivotal Software Inc., IBM, Amazon
Web Services and Microsoft.
Evaluating vendors that provide Hadoop distributions requires understanding
the similarities and differences between two aspects of the product offerings.
First is the technology itself: What's included in the different distributions; what
platforms are they supported on; and, most important, what specific
components are championed by the individual vendors? Second is the service
and support model: What types of support and SLAs are provided within each
subscription level, and how much do different subscriptions cost?
Understanding how these aspects relate to your specific business requirements
will highlight the characteristics that are important for a vendor relationship. The
next article in this series will examine several business use cases for a Hadoop
big data management platform so you can determine your organization's needs
and requirements.
How a Hadoop distribution can help you manage big data
David Loshin, Knowledge Integrity Inc.
To help you determine if a commercial Hadoop distribution could
benefit your organization, consultant David Loshin examines big
data use cases and applications that Hadoop can support.
Many companies are struggling to manage the massive amounts of data they
collect. In the past, they may have relied on a data warehouse platform, but
such conventional architectures can fall short in dealing with data that
originates from numerous internal and external sources and often varies in
structure and type of content. New technologies have emerged to offer help -- most
prominently, Hadoop, a distributed processing framework designed to address
the volume and complexity of big data environments involving a mix of
structured, unstructured and semi-structured data.
Part of Hadoop's allure is that it consists of a variety of open source software
components and associated tools for capturing, processing, managing and
analyzing data. But, as addressed in a previous article in this series, in order to
help users take advantage of the framework, many vendors offer commercial
Hadoop distributions that provide performance and functionality enhancements
over the base Apache open source technology and bundle the software with
maintenance and support services. As the next step, let's take a look at how a
Hadoop distribution could benefit your organization.
Making a case for a Hadoop distribution
Hadoop runs in clusters of commodity servers and typically is used to support
data analysis and not for online transaction processing applications. Several
increasingly common analytics use cases map nicely to its distributed data
processing and parallel computation model. The list includes:
Operational intelligence applications for capturing streaming data from
transaction processing systems and organizational assets, monitoring
performance levels, and applying predictive analytics for pre-emptive
maintenance or process changes.
Web analytics, which are intended to help companies understand the
demographics and online activities of website visitors, review Web server
logs to detect system performance problems, and identify ways to enhance
digital marketing efforts.
Security and risk management, such as running analytical models that
compare transactional data to a knowledge base of fraudulent activity
patterns, as well as continuous cybersecurity analysis for identifying
emerging patterns of suspicious behavior.
Marketing optimization, including recommendation engines that absorb
huge amounts of Internet clickstream and online sales data and blend that
information with customer profiles to provide real-time suggestions for
product bundling and upselling.
Internet of Things applications, such as analyzing data from things -- like
manufacturing devices, pipelines and so-called smart buildings -- via
sensors that continuously generate and broadcast information about their
status and performance.
Sentiment analysis and brand protection, which might involve capturing
streaming social media data and analyzing the text to identify unsatisfied
customers whose issues can be addressed quickly.
Massive data ingestion for data collection, processing and integration
scenarios such as capturing satellite images and geospatial data.
Data staging, in which Hadoop is used as an initial landing spot for data
that is then integrated, cleansed and transformed into more structured
formats in preparation for loading into a data warehouse or analytical
database for analysis.
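The data staging use case above follows a simple pattern: raw records land first, then are validated, cleansed and transformed into a structured shape before loading into a warehouse. The sketch below shows that pattern in miniature; the field names and validation rules are invented for illustration, and in a real Hadoop deployment this logic would run as a distributed job over files in HDFS.

```python
raw_records = [
    {"id": "1", "amount": " 19.99 ", "country": "us"},
    {"id": "2", "amount": "bad",     "country": "DE"},
    {"id": "3", "amount": "5.00",    "country": "fr"},
]

def cleanse(record):
    """Return a typed, normalized record, or None if it fails validation."""
    try:
        amount = float(record["amount"].strip())
    except ValueError:
        return None  # quarantine unparseable rows instead of loading them
    return {
        "id": int(record["id"]),
        "amount": amount,
        "country": record["country"].upper(),  # normalize to one convention
    }

# Only records that survive cleansing are staged for the warehouse load.
staged = [r for r in (cleanse(rec) for rec in raw_records) if r is not None]
print(staged)
```

Keeping the raw landing copy alongside the cleansed output is part of the appeal: if the transformation rules change, the original data is still there to reprocess.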
Capabilities supporting the use cases
Applications supporting these usage scenarios can be built on top of Hadoop
using some prototypical implementation methodologies, such as:
Data lakes. Because Hadoop delivers linear scalability for processing and
storage as new data nodes are incorporated into a cluster architecture, it
provides a natural platform for capturing and managing voluminous files of raw
data. This has motivated many users to implement Hadoop systems as a catch-
all platform for their data, creating a conceptual data lake.
Data warehouse augmentation platform. Hadoop's distributed storage can
also be used to expand the data that's accessible for analysis in a data
warehouse environment. For example, a temperature-based scheme can be
used for allocating data to different levels of the storage hierarchy, depending
on its frequency of use. The most frequently accessed "hot" data is kept in the
data warehouse, while less-frequently used "cool" data is relegated to higher-
latency storage such as the Hadoop Distributed File System. This approach
relies on tightly coupled data warehouse integration with Hadoop.
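The temperature-based allocation scheme described above can be reduced to a routing decision per data set. The toy version below routes by access frequency; the threshold, tier names and data sets are all illustrative assumptions, and production implementations typically track access statistics over time rather than using a single fixed cutoff.

```python
# Arbitrary cutoff for this sketch: data touched at least this often per
# day is "hot" and stays in the warehouse; everything else is "cool".
HOT_THRESHOLD = 100

def assign_tier(accesses_per_day):
    """Route a data set to warehouse (hot) or HDFS (cool) storage."""
    return "warehouse" if accesses_per_day >= HOT_THRESHOLD else "hdfs"

datasets = {"orders": 5000, "clickstream_2014": 12, "customers": 850}
placement = {name: assign_tier(freq) for name, freq in datasets.items()}
print(placement)
```

The tight coupling mentioned above matters because queries must still be able to reach the cool data transparently, typically via federation or SQL-on-Hadoop access to the HDFS tier.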
Large-scale batch computation engine. When configured with a combination
of data and compute nodes, Hadoop becomes a massively parallel processing
platform that's suited to batch processing applications for manipulating and
analyzing data. One example would be data standardization and transformation
jobs applied to data sets to prepare them for analysis. Algorithm-driven analytics
applications such as data mining, machine learning, pattern analysis and
predictive modeling are also good matches for Hadoop's batch capabilities, as
they can be executed in parallel over massive distributed data files with
iterations of partial results accumulated until the program completes with a final
set of results.
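The batch pattern described above -- independent work over partitions, with partial results accumulated into a final answer -- can be sketched on a single machine with a thread pool. This is a scaled-down illustration only: Hadoop distributes the partitions across data nodes and moves computation to the data, while this sketch just parallelizes across local threads. The transform function is a stand-in.

```python
from concurrent.futures import ThreadPoolExecutor

def standardize(partition):
    """Per-partition work: a stand-in data transformation (doubling values)."""
    return sum(value * 2 for value in partition)

# Three partitions of a larger data set, processed independently.
partitions = [range(0, 100), range(100, 200), range(200, 300)]

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(standardize, partitions))

# Accumulate partial results into the final answer, as the text describes.
total = sum(partials)
print(partials, total)
```

Because each partition is processed without reference to the others, adding workers (or, in Hadoop's case, nodes) increases throughput roughly linearly -- the property the article returns to below.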
Event stream analytics processing engine. A Hadoop environment can also
be configured to process incoming data streams in real or near real time. As an
example, a customer sentiment analysis application can have multiple
communication agents running in parallel on a Hadoop cluster, each applying a
set of stream processing rules to data feeds from social networks such as
Twitter and Facebook.
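The stream-processing rules mentioned in the sentiment example can be as simple as keyword matching against each incoming message. The sketch below shows one such rule set; the keywords and messages are invented examples, and a real deployment would run many such agents in parallel against live social feeds, typically via a streaming layer such as Spark Streaming or Storm rather than plain Python.

```python
# One hypothetical rule set: flag messages containing negative keywords.
NEGATIVE_KEYWORDS = {"broken", "refund", "terrible"}

def apply_rules(message):
    """Return True if the message matches any negative-sentiment rule."""
    words = {w.strip(".,!?") for w in message.lower().split()}
    return bool(words & NEGATIVE_KEYWORDS)

stream = [
    "Love the new release!",
    "My order arrived broken, I want a refund",
    "terrible support experience",
]

# Flag unhappy customers so their issues can be addressed quickly.
flagged = [msg for msg in stream if apply_rules(msg)]
print(len(flagged))
```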
Advantages of adopting Hadoop: Is it right for you?
A low-cost, high-performance computing framework like Hadoop can address
different IT and business motivations for scaling up processing power or
expanding data management capabilities in an organization. Let's examine
some characteristics of application requirements that suggest the need for a
data management platform based on a Hadoop distribution:
Ingestion and processing of large data sets, massive data volumes
and streaming data. Examples include capturing Web server logs that
contain information about billions of online events; indexing hundreds of
millions of documents across different data sets; and continuously pulling in
data streams such as social media channels, stock market data, news
feeds and content published at expert communities.
A need to eliminate performance impediments. Application performance
is often throttled on traditional data warehouse systems as a result of data
accessibility, latency and availability issues or bandwidth limits in relation to
the amount of data that needs to be processed.
The desire for linear scalability on performance. As data volumes grow
and the number of users increases, having an environment in which
performance will scale linearly as more computing and storage resources
are added can be crucial, especially when applications can benefit from
parallel computing.
A mixture of structured and unstructured data. The applications need to
use data from different sources that vary in structure, and some -- or much
-- of it is unstructured or semi-structured, for example, text or server log
data.
IT cost efficiencies. Rather than paying premium prices for high-end
servers or specialty hardware appliances, the system architects believe that
acceptable performance can be achieved using commodity components.
Considerations for integrating Hadoop into the enterprise
A positive value proposition for using Hadoop still must be balanced, though,
with the feasibility of integrating the platform into the enterprise. Because many
organizations have made significant investments in traditional data warehouse
platforms, there may be some resistance to introducing a newer technology.
Before engaging a Hadoop distribution vendor, work to resolve any potential
barriers to adoption and assess requirements for cluster sizing and
configuration.
For example, determine where a Hadoop cluster fits in your organization's data
warehousing and analytics strategy -- whether it's intended to augment existing
data warehouses or replace them. Also, identify integration and interoperability
issues that need to be addressed, and review configuration alternatives,
including whether it's better to implement the Hadoop ecosystem on premises or
in a cloud-based or hosted environment. In addition, ensure that you have
funding to hire people with the right skills or retrain existing employees. Hadoop
application development differs greatly from conventional database
development.
Answering these types of questions will help in determining the feasibility of a
Hadoop deployment. The next step, which will be examined in the third article in
this series, is to evaluate the features and functions you need in a commercial
Hadoop distribution.
About the author
David Loshin, managing director at Decisionworx, is a recognized thought
leader, speaker and expert consultant. He has written numerous books,
including Big Data Analytics: From Strategic Planning to Enterprise Integration
with Tools, Techniques, NoSQL and Graph. He can be reached through his
website at www.decisionworx.com.
Email us at [email protected] and follow us on Twitter:
@BizAnalyticsTT.