Hadoop in the Enterprise
With Maturity Come Manageability Challenges

Benchmark Report

By Wayne Eckerson, Principal Consultant
Eckerson Group, April 2014
Table of Contents

Executive Summary
Methodology
Introduction
Survey Findings
Conclusion
Executive Summary

Hadoop is maturing and becoming a key component of enterprise data
management infrastructures at many companies. Until about two years ago,
Hadoop was relatively new on the scene. Many data professionals and analysts
were downloading Hadoop and setting up experimental clusters to run various
test cases. More recently, a significant number of enterprises from a variety of
industries are moving Hadoop from test beds into production to support a
range of data processing and analytical applications.
More than 1,500 business intelligence (BI) and data management professionals
responded to our 2014 survey on enterprise Hadoop. Some of the key findings:
Hadoop deployments double. The percentage of organizations that
have implemented or are deploying Hadoop has more than doubled in
the past two years, from 9% to 25%.
Hadoop will replace some or all existing data environments in a
minority of organizations. Although Hadoop currently supplements
existing data warehousing and data processing environments, within
three years it will replace various components of those systems in 27%
of organizations and entire data management environments in 14%.
Companies will primarily use Hadoop for data processing and
analytical sandboxes. Almost half of companies that implement
Hadoop will use it to parse and transform big data (46%) as well as
support analytical sandboxes for ad hoc analysis (44%) and data mining
(40%). Less than a third (30%) of companies will use it as an enterprise
repository to hold all their data.
Almost half of organizations with Hadoop will use real-time SQL in 18
months. Use of real-time SQL queries will grow from 35% today to 48%
in 18 months, becoming the second most popular feature of Hadoop,
behind MapReduce (53%). Existing Hadoop users also plan to significantly
increase their usage of data streaming, in-memory computing, machine
learning, graphing and search in the next 18 months.
Data mining surpasses dashboards. Data mining will surpass
dashboards and scorecards as the predominant analytical applications
running on Hadoop, growing from 34% of organizations today to 57% in
three years. But the number of Hadoop-based dashboards and
scorecards will also increase, growing from 43% to 52% in three years.
Hadoop functionality meets customer needs. More than half of
organizations that have deployed Hadoop rated it as “excellent” or
“good” across more than two dozen attributes, from scalability,
performance and availability to reporting, analysis and data
governance.
Staffing leads Hadoop expenditures. Staffing costs are the largest
segment of Hadoop investments, comprising 28% of Hadoop-based
project budgets. About 20% of organizations that have deployed
Hadoop will spend more than $5 million, while 20% will spend less than
$100,000 or nothing.
BI teams lead the way. BI teams drive Hadoop projects in almost half
of organizations (46%) that have deployed Hadoop.
Traditional database and software vendors dominate the Hadoop
market. With some exceptions, the traditional IT market leaders are
the primary providers of Hadoop-related products and services. The
main exception is Cloudera, which is the top data management
provider. Oracle is the leading data integration vendor, while IBM is the
leading supplier of analytics software and consulting services.
Methodology

This report is based primarily on a survey, conducted in June and July of 2014,
of 1,515 business and IT professionals. The survey results are based on answers
by 906 respondents who completed the survey and 609 respondents who
partially completed it. The report is also based on interviews with users and
briefings with vendors as well as a review of industry literature on big data.
Among survey respondents, slightly more than half (53%) are IT professionals,
while consultants (23%) and business professionals (12%) make up most of the rest. Almost a
third (32%) work in large companies with more than $1 billion in revenue, while
about a quarter (24%) work in medium-size companies with between $100
million and $1 billion in revenues. The rest (44%) work in small companies with
under $100 million in revenues. Almost a third (32%) of respondents are
from North America, followed closely by 30% from India, 23% from Europe, 8%
from Southeast Asia and the rest from other countries.
Introduction

Is Hadoop ready for prime time? Can organizations run Hadoop in production
with confidence? Can Hadoop meet relevant service-level agreements for
scalability, performance and availability while ensuring adequate security and
manageability? Can Hadoop today be considered enterprise software?
Many companies today are asking these questions. They want to exploit the
capabilities of Hadoop, a series of inter-related open source projects managed
by The Apache Software Foundation that provide a novel and cost-effective way
to process and analyze large volumes of data. But many are not yet certain of
the answers. Most recognize that Hadoop holds great promise to help them
deliver information and insights to business users.
Hadoop benefits. A big advantage of Hadoop is that it enables organizations to
cost-effectively store all their data: both structured data from enterprise
resource planning and customer relationship management systems as well as
multi-structured data, such as Web server logs, sensor data, email and
extensible markup language, or XML, data. Because multi-structured data
constitutes about 80% of all data by most estimates, Hadoop heralds a new age of
data processing—one in which organizations can find needles of insight in a
haystack containing terabytes or petabytes of information.
Challenges. Although Hadoop is infinitely scalable from a technical standpoint,
it has lacked the management utilities required by most organizations and their
IT managers. These include security, manageability, backup and recovery,
disaster recovery and data governance. Moreover, as a batch processing
environment, Hadoop until recently has lacked the ability to support real-time
queries (i.e., SQL queries)—and subsequently true ad hoc reporting and visual
exploration.
Evolution
Undaunted by the challenges, early adopters have embraced Hadoop for both
tactical and strategic reasons. Tactically, they use Hadoop to reduce the costs
of running a data warehouse by offloading transformations to Hadoop to avoid
an expensive data warehouse upgrade. In some cases, they also offload data
that no longer fits in the data warehouse.
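A transformation offloaded from a data warehouse to Hadoop is often written as a pair of Hadoop Streaming scripts. The sketch below is illustrative only (the log format and field position are assumptions, not anything the report prescribes): a mapper emits one (url, 1) pair per web log line, and a reducer sums the counts per URL.

```python
# Hedged sketch of an offloaded ETL step as Hadoop Streaming map/reduce
# functions. The request URL is assumed to be the seventh whitespace-
# separated field, as in the common Apache log format.
from itertools import groupby

def mapper(lines):
    """Emit 'url<TAB>1' for each log line."""
    for line in lines:
        fields = line.split()
        if len(fields) > 6:
            yield "%s\t1" % fields[6]

def reducer(pairs):
    """Sum the counts for each URL. Input must be sorted by key,
    which Hadoop guarantees between the map and reduce phases."""
    keyed = (p.split("\t") for p in pairs)
    for url, group in groupby(keyed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (url, sum(int(count) for _, count in group))

# In a real job, these functions would read stdin and write stdout, and
# would be submitted with the hadoop-streaming jar as -mapper and -reducer.
```

Because the heavy lifting happens on the cluster, the warehouse only receives the small aggregated result, which is the cost-saving the tactical adopters are after.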
Strategically, early adopters use Hadoop to build new applications with new
types of data that previously were too costly to manage. Many use Hadoop to
process and analyze large volumes of clickstream data and social media feeds
to gain a better understanding of customer needs, behavior and sentiment.
Going mainstream. Encouraged by the success of early Hadoop adopters,
mainstream companies have been testing the Hadoop waters, intrigued by its
promise but wary of its newness. “The greatest impact from Hadoop is the
realization that bigger questions can be answered with affordable and
manageable technologies,” said one survey respondent. Another wrote,
“Hadoop is a manifestation of possibilities. The world is about to change.”
But others are leery of another new technology that carries hidden risks:
“[Hadoop] is still immature. Although evolving rapidly, it still lacks advanced
features.” Another says, “[Hadoop] offers too much change now for [anyone]
other than the big boys to invest.” A third says, “IT hype often sells
vaporware. I’ve seen this before.”
Nonetheless, mainstream companies are starting to put Hadoop into
production to support key business initiatives, from decision support to data
mining. While most Hadoop projects today either support or serve as adjuncts
to existing data warehouses, many data management professionals expect this
to change. In the future, Hadoop may supplant large portions of the analytical
ecosystems currently running in many companies if not replace them
altogether.
In fact, some data management professionals see Hadoop as the foundation of
a new analytical ecosystem in which the data is processed and analyzed in place
rather than moved downstream to specialized data processing and analytical
systems. In the world of Hadoop, the data never moves; it stays in place. This
eliminates time-consuming, costly and error-prone data replication or
movement jobs. Hadoop “brings compute to the data,” becoming the staging
area, archive, transformation engine, query engine, data mining engine and so
on.
Before this vision of Hadoop becomes reality, it needs to pass the enterprise
sniff test. That is, it needs to run, act, interface with and behave like other
enterprise software in an organization’s data center. It needs to be reliable,
available, scalable, secure, manageable and highly performing, among other
things. Today, millions of dollars are being invested in Hadoop software and
services to ensure that Hadoop passes.
Six Vehicles for Deploying Hadoop
There are a number of approaches vendors are taking to turn Hadoop into an
enterprise-caliber platform for data analytics.
1. Do it yourself. Anyone can download Hadoop software from Apache’s
website, install it on a server and start using it. This is the essence of the do-it-
yourself, or DIY, approach that all vendor-driven Hadoop systems compete
against. Of course, going solo requires considerable expertise. And while a fairly
experienced person can download and install Hadoop on a small cluster, it
takes a great deal more expertise to install and manage Hadoop on a big cluster
in a production environment.
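For a sense of what the DIY route looks like on a single node, the sequence is roughly the following. This is a sketch, not a recipe: the release number and mirror URL are illustrative, and a multi-node production cluster involves far more configuration.

```shell
# Illustrative single-node setup (version and URL are assumptions;
# check apache.org for current releases).
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
tar -xzf hadoop-2.4.1.tar.gz && cd hadoop-2.4.1

export JAVA_HOME=/usr/lib/jvm/default-java   # Hadoop requires a JDK
bin/hadoop version                           # sanity check the install

# Format HDFS and start the single-node daemons
bin/hdfs namenode -format
sbin/start-dfs.sh
bin/hdfs dfs -mkdir -p /user/$USER           # create a home directory in HDFS
```

Getting this far on one machine is straightforward; the expertise the text refers to is needed for everything after it: tuning, securing, monitoring and scaling the cluster.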
“Managing data and clusters at scale presents different challenges to those
associated with running test data through a couple of machines,” observes Paul
Miller in a report from GigaOm Research. “Again and again, organizational
deployments of Hadoop fail as they simplistically try to replicate processes and
procedures tested on one or two machines across more-complex clusters.”¹

¹ Understanding the Power of Hadoop as a Service, Paul Miller, GigaOm Research, June 12, 2014.
Complicating the challenge is a dearth of automated tools to support Hadoop
administration, Miller explains. This leaves “system administrators with a
second problem: dependence on manual procedures.” Thus, “scaling Hadoop,
either as part of a meaningful pilot project or to deliver production enterprise
workloads, is a challenging undertaking, and managing the complex
interactions among parallel nodes remains a complex and often largely manual
process. Each node must be actively monitored throughout its commitment to
a particular workload, and Hadoop’s often arcane errors and failure modes can
rarely be resolved automatically.”
2. Open source Hadoop distributions. To simplify the process of standing up an
Apache Hadoop cluster, several vendors, including Cloudera, Hortonworks,
MapR, Pivotal and others, package up Apache software and deliver a single,
integrated release that is tested to make sure all the components work
together. They also provide bug patches and updates, service, support and
training. The software is free of charge; vendors make money by charging for
packaging, service and training.
Lately, these Hadoop distribution vendors have been forging strategic alliances
with established IT vendors to deliver Hadoop capabilities through traditional
enterprise IT channels. For example, Hewlett-Packard recently announced it
was investing $50 million for a stake in Hortonworks, and Hortonworks has
partnered with EMC’s Pivotal to support its enterprise data management
software. Cloudera, in the meantime, has a strategic partnership with Oracle.
Today, Oracle Big Data SQL is built on the Cloudera distribution.
3. Customized Hadoop distributions. Some vendors have extended the Apache
Hadoop framework with proprietary software that is either embedded
transparently in the distribution (MapR) or offered as a commercial add-on
(Cloudera). For instance, MapR adds software to Hadoop that ensures
enterprise-caliber reliability, availability and security along with backup and
recovery and disaster recovery services. MapR says its platform is “100% binary
compatible” with Hadoop Distributed File System, ensuring “plug-and-play
compatibility” from a single data platform. Cloudera, on the other hand,
charges for certain “premium” features that go beyond what Apache Hadoop
offers, including real-time query, in-memory computing, machine learning,
search and stream processing.
4. Relational database vendors. Oracle, Teradata, Pivotal, Actian and IBM
incorporate Hadoop distributions inside analytical platforms that usually consist
of a massively parallel processing database and an analytical or object database
as well as other tools such as extract, transform and load (ETL) and data
connectors. The vendors deliver these platforms as software-only systems,
appliances or in the cloud. These traditional relational database vendors help
companies bridge the new and old worlds, minimizing the risk of implementing
new technology that is relatively immature. Plus, by bundling all the software
(and in some cases hardware), they offer one-stop shopping and a single throat
to choke. This platform approach reduces customer risks and enables
established vendors to keep charging premium prices despite using open
source software.
5. Platform-as-a-service vendors. Cloud providers such as Amazon and
Microsoft provide the infrastructure and software in the cloud so customers
can build their own Hadoop environments without having to buy, install or
manage hardware and software. And with subscription pricing, customers only
pay for what they need and can dynamically scale up and down their
environment as requirements dictate. For example, Amazon Web Services
customers can use the Amazon Elastic MapReduce service to build Hadoop
clusters in the cloud. Likewise, Microsoft customers can implement Hadoop on
Microsoft’s Azure cloud platform using Microsoft HDInsights.
6. Hadoop as a service. Other vendors, such as Altiscale, Mortar and Qubole, go
one step further and offer services built on top of a cloud-based Hadoop
infrastructure that the vendors run and manage themselves. Customers get all
the benefits of Hadoop without having to install, manage and run Hadoop—or
for that matter, know anything about it. These services ingest your data and
make it available for querying via Hive, Pig or other Hadoop-based toolsets.
They offer service-level agreements for data loading and availability, and many
of them support auto-scaling that dynamically increases Hadoop processing
power to handle peak query traffic.
Survey Findings

Usage and Adoption
Hadoop is no longer just for technology enthusiasts and bleeding-edge Internet
startups. Our research shows that it’s slowly making its way into corporate data
centers and becoming an integral part of enterprise data strategies at many
Fortune 1000 companies.
Until recently, most large organizations implemented Hadoop on an
experimental basis. They wanted to test its functionality and enterprise
software credentials, especially its security, reliability, manageability and ability
to interface with existing systems, applications and analytical tools. The allure
of free software to manage a big data analytics platform was too good to resist.
Today, an increasing number of companies are turning their Hadoop test beds
into production. They believe Hadoop is ready for prime time, at least for some
data processing and analytical workloads, such as offloading ETL processing
from data warehouses and parsing voluminous Web server logs. In the past
nine months, many Hadoop software providers—including MapR, Hortonworks
and Cloudera—have noticed a sizable uptick in the number of large companies
that are putting Hadoop into production.
Implementation status. In fact, our latest survey shows that a quarter of
organizations (25%) have either already deployed Hadoop or have projects
under development. That’s a far cry from just two years ago, when only 9% of
organizations had either fully or partially deployed Hadoop. (The 2012
TechTarget report, Exploiting Big Data: Strategies for Integrating With Hadoop
to Deliver Business Insights, was based on a survey with 1,158 respondents.)
Despite Hadoop’s rapid uptake, there are still many companies that have yet to
make the leap with the technology. Our 2014 survey shows 39% have no plans
for Hadoop, and 36% are considering Hadoop but have yet to install or develop
anything with it (see Figure 1). Clearly, Hadoop has accelerated within the
early-adopter market but still has some distance to go before it becomes a
mainstream technology.
Figure 1. Status of Hadoop in Organizations—2014
Based on answers by 1,495 survey respondents, 2014
Surprisingly, Hadoop is more common in large enterprises than smaller ones.
Even though it’s free, Hadoop can be expensive to deploy at scale, largely
because of the skilled experts currently required to implement and use it. At
the same time, large companies can avoid millions of dollars in costly upgrades
to existing data platforms by implementing Hadoop.
Our research shows that more than one-third (34%) of large companies (more
than $1 billion in annual revenues) have implemented Hadoop compared with
27% of midsize companies ($100 million to $1 billion in revenues) and 18% of
smaller organizations (less than $100 million in revenues). (See Figure 2.)
Figure 2. Hadoop Adoption by Company Size
Based on answers by 893 survey respondents who have fully or partially
deployed Hadoop, 2014
The early adopters of Hadoop largely came from the high-tech, Internet,
e-commerce and media industries. As Hadoop moves deeper into the corporate
landscape, it is finding a home in a range of industries, especially those with
large volumes of Web, customer or transaction data. The “nontraditional”
industries that have fully deployed Hadoop are financial services and retail,
each indicated by 8% of respondents, and education, government, computer
resellers and telecommunications, each cited by 6% of respondents.
And there are many fast followers in these industries that want to keep up with
the leading technology adopters. Companies that are developing or have
partially deployed Hadoop come from business services and consulting (42%),
computer manufacturing (30%), financial services (21%), education (20%),
telecommunications (20%), computer resellers (16%) and other (23%)
industries (see Figure 3).
Figure 3. Hadoop Adoption by Industry
Based on answers by 893 survey respondents who have fully or partially
deployed Hadoop, 2014
Architecture and Use Cases
Business impact. Industry observers have been predicting that as Hadoop
grows, it will begin to supplant or replace data warehouses. While this is not
happening in full force now, managers and professionals actively working with
Hadoop agree that it will soon be encroaching on traditional data
environments.
Today, 41% of companies say that Hadoop supplements their existing data
environments, while 16% say it’s replacing some parts of their environments. A
handful, 6%, say it has replaced their entire existing data processing
environments. But, when we asked the same respondents to project out three
years, one out of seven (14%) said they expect Hadoop to replace most of their
current data environments, and another 27% expect it to replace parts of their
infrastructures. That’s a tectonic change, especially when you consider the
large amount of capital companies have invested in their existing data
processing environments (see Figure 4).
Figure 4. Hadoop’s Impact: Today and in Three Years
Based on answers by 368 survey respondents who have fully or partially
deployed Hadoop, 2014
Architecture. Among organizations that currently use Hadoop, almost half
(46%) use it for data processing functions, such as parsing or transforming data,
while 44% use Hadoop as an “analysis sandbox” to support ad hoc queries. In
addition, 40% use it as a “data mining sandbox” to support machine learning
and complex analytical functions (see Figure 5).
Figure 5. Role of Hadoop in Analytical Architecture Today
Based on answers by 195 respondents who have fully or partially deployed
Hadoop, 2014
The percentages behind Figure 4:

                                    Today    In Three Years
Supplements what we have             41%          29%
Replaces some of what we have        16%          27%
Replaces most of what we have         6%          14%
No impact                            37%          30%
The data lake. Today, 32% of companies use Hadoop as an “enterprise
operating system” and 30% as an enterprise repository to hold all corporate
data (see Figure 5). These embody the notion of the so-called data lake, which
is quickly gaining visibility as a prime use case for Hadoop. Many IT
professionals love the idea of loading all corporate data into a single, low-cost
enterprise data repository and doing all the processing there instead of moving
data into specialized downstream systems.
A data lake eliminates the need to replicate and move data, which is error-
prone and expensive. A data lake not only stores all enterprise data, it
processes it in place using a variety of engines, such as SQL, in-memory
computing, machine learning, graphing and streaming—any engine that
conforms to the Hadoop 2 resource management interface. The percentage of
companies using Hadoop as a data lake should rise from 30% today to closer to
50% in the next several years.
Analytics functions. Until Hadoop 2, which was released in the fall of 2013,
Hadoop was a batch data processing platform. Many early users likened it to
the early days of mainframe computing, when you would submit your request
in writing to a mainframe guru and then come back the next day to get the
answer.
Today, a majority of companies (57%) still use Hadoop as a batch-oriented data
processing pump, although this is changing with the advent of real-time query
engines for Hadoop, which first became available in late 2013. Today, slightly
more than a third of companies (35%) use real-time queries, while almost half
(48%) plan to use them within 18 months. SQL on Hadoop is the most rapidly
growing feature in Hadoop, driven by the large number of data analysts and
data scientists who prefer to use SQL to query data rather than a programming
language, such as Java, Perl or Python. And vendors are happy to oblige.
Cloudera, Hortonworks, Pivotal, Actian, IBM and Microsoft have all released
SQL on Hadoop products in the past several months. Similarly, data streaming,
in-memory computing, search and graphing will also experience a big spike in
usage in the next 18 months (see Figure 6).
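The appeal of SQL on Hadoop is that an analyst can state an ad hoc question declaratively instead of writing a program. The query below is the kind of statement a Hive or Impala user might issue; here it runs against SQLite purely as a local stand-in, and the table and column names are hypothetical.

```python
# Stand-in demonstration of the ad hoc SQL pattern that SQL-on-Hadoop
# engines enable. SQLite substitutes for the cluster; the SQL itself is
# the point.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (url TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("/home", 1), ("/home", 2), ("/pricing", 1), ("/home", 1)],
)

# Ad hoc question: distinct visitors per page, busiest pages first.
rows = conn.execute("""
    SELECT url, COUNT(DISTINCT user_id) AS visitors
    FROM page_views
    GROUP BY url
    ORDER BY visitors DESC
""").fetchall()
```

A Java or Perl MapReduce program answering the same question would run to dozens of lines, which is why the survey shows analysts gravitating to SQL so quickly.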
Figure 6. Analytics Functions Run in Hadoop Today and in 18 Months
Based on answers by 309 respondents who have fully or partially deployed
Hadoop, 2014
Analytical applications. Dashboards and scorecards are the primary analytical
applications run in enterprise Hadoop environments today—cited by 43% of
respondents, with 52% anticipating such uses in the near future. But data
mining and machine learning will become the predominant application in the
future, growing from 34% of companies today to 57% of companies in three
years. The use of visual discovery and exploration on Hadoop will also rise
significantly, growing from 21% today to 41% in three years. In addition, the
percentage of companies implementing streaming analytics will rise from 16%
today to 39% in the next three years. In short, organizations expect to run a lot
more analytical workloads on Hadoop in the near future (see Figure 7).
Figure 7. Analytical Applications on Hadoop Today and in Three Years
Based on answers by 309 respondents who have fully or partially deployed
Hadoop, 2014
Hadoop Functionality
Hadoop is a relatively young technology. Version 2.0 just shipped last fall. The
general consensus is that Hadoop holds a lot of promise but lacks the critical
functionality necessary to run in production in large enterprise computing
environments. In other words, Hadoop has to mature before companies feel
comfortable deploying it in a production environment.
But our survey shows that a majority of organizations that have either deployed
Hadoop or are in the process of doing so give the technology high marks. A
majority of these respondents give Hadoop a rating of either “excellent” or
“good” in the following areas: scalability (83%), availability (74%), backup and
recovery (54%), manageability (53%) and mixed workload management (52%).
Only security (49%) fell below the 50% threshold (see Figure 8).
The percentages behind Figure 7:

                                             Today    In 3 Years
Dashboards and/or scorecards                  43%        52%
Batch reporting                               40%        45%
Ad hoc reporting and dashboarding             37%        44%
Data mining or machine learning               34%        57%
Ad hoc queries                                30%        38%
Report bursting (sends tailored…)             24%        37%
Location analytics (e.g., mapping)            22%        38%
Visual discovery and exploration              21%        41%
Ad hoc mashups and analyses                   17%        33%
Streaming analytics                           16%        39%
None                                          12%         5%
Figure 8. Current Users Rate Hadoop Capabilities
Based on answers by 288 respondents who have fully or partially deployed
Hadoop, 2014
These strong marks attest to the rapid evolution of Hadoop as an enterprise
data management platform as well as the work of long-established commercial
vendors and service providers that supplement Hadoop with product and
service capabilities to shore up its enterprise features.
Data management. Current users give Hadoop even higher marks in its ability
to manage various types of data workloads. These customers rate Hadoop as
“excellent” or “good” in data loading (79%), query performance (69%), data
streaming (65%), data exporting (65%) and data transformation (64%). Clearly,
data management is Hadoop’s sweet spot, especially for large volumes of
multi-structured data.
Figure 9. Hadoop’s Ability to Handle Data Management Workloads, Rated
Based on answers by 288 respondents who have fully or partially deployed
Hadoop, 2014
Data governance. Though Hadoop has yet to fully address data governance,
our respondents gave it high ratings. They gave “excellent” or “good” ratings to
Hadoop for its ability to support data consistency (63%), data profiling (60%),
data quality (58%), metadata management (57%), conformed dimensions
(52%), and data lineage and impact analysis (52%). (See Figure 10.)
Figure 10. Hadoop Support for Data Governance Tasks, Rated
Based on answers by 267 respondents who have fully or partially deployed
Hadoop, 2014
One explanation for these higher-than-expected results is that data governance
in Hadoop today is largely manual—conducted by experienced data scientists,
whose main job is to explore or “profile” big data in Hadoop to investigate its
value and build logical views on top of that data (i.e., Hive tables) for them and
others to query. HCatalog is an emerging metadata repository that catalogs
Hive, Pig and HBase data elements and can be accessed via a standard,
representational state transfer, or REST, interface.
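HCatalog's REST interface is served by WebHCat (the component formerly called Templeton). As a sketch of what that access looks like, the helper below only assembles the URL for the "list a database's tables" DDL call; the host name and user are assumptions, and actually issuing the HTTP request is left out.

```python
# Build a WebHCat (HCatalog REST) URL for listing the tables in a
# database. Host and user below are hypothetical; 50111 is WebHCat's
# conventional default port.
from urllib.parse import urlencode

def webhcat_tables_url(host, database, user, port=50111):
    """URL for WebHCat's 'list tables' DDL endpoint."""
    query = urlencode({"user.name": user})
    return "http://%s:%d/templeton/v1/ddl/database/%s/table?%s" % (
        host, port, database, query)
```

A metadata tool would GET this URL and receive a JSON list of table names, which is how third-party catalogs can stay in sync with Hive, Pig and HBase definitions.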
Another factor ameliorating data governance issues in Hadoop is that most big
data consists of large volumes of data from a single source, such as Web server
logs or point-of-sale data. Much of this data is machine generated, consisting of
uniformly repeating data structures. Once the structure is understood, data can
be quickly extracted for query processing.
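Because machine-generated data repeats one structure, extraction reduces to a single pattern match once that structure is understood. The sketch below parses a web server log line in the common Apache format; the field names are my own choice, not anything standardized.

```python
# Once the repeating structure of machine-generated data is known,
# extraction is a pattern match. This parses one common web server
# log layout; field names here are illustrative.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of named fields, or None if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None
```

Run over terabytes of such logs, a matcher like this turns raw text into queryable records with very little per-record logic, which is exactly the property that softens Hadoop's governance gaps.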
Analytics. Hadoop also scores well among existing users for analytics. Given its
batch data processing heritage, it’s not surprising that Hadoop scores highest
on batch reporting, with 71% of existing users giving Hadoop a rating of
“excellent” or “good.” Following close behind are data mining (60%), ad hoc
queries (60%) and report bursting (58%). Other analytic functions are also in
the mix: dashboards and scorecards (55%), ad hoc reporting (56%), streaming
analytics (55%), ad hoc mashups (54%), location analytics (52%) and visual
exploration (51%). (See Figure 11.)
Figure 11. Hadoop’s Capability to Support Analytic Functions, Rated
Based on answers by 266 respondents who have fully or partially deployed
Hadoop, 2014
Many of the features listed above, such as report bursting, dashboards, ad hoc
reporting and location analytics, are not native to Hadoop. Rather they are
intrinsic to third-party reporting and analysis tools that query Hadoop via Hive
or some other means. The high scores reflect that reporting and analysis
vendors are doing a reasonable job of making Hadoop data accessible to their
customers. Of course, data scientists today are the primary analytical users of
Hadoop, and they primarily rely on programming languages, such as Java,
Perl or Python, to access Hadoop data. But this is changing as the industry is
intent on making Hadoop data accessible to business analysts and users who
know how to use SQL or reporting and analysis tools.
One oddity in the results is that visual exploration tools trail the pack of
analytical techniques. Given the popularity of Tableau and other visual
exploration tools and the close association of visualization in general with big
data, it’s hard to explain this result, other than the fact that most ad hoc
queries and reports conducted with Hadoop data are done by data scientists
using Java, Perl or Python. But I expect the use of visual exploration to query
Hadoop data will grow rapidly in the next year or two.
Investment and spending plans. As with many new information technologies,
Hadoop is very resource-intensive. Staff costs are the largest component of
Hadoop investments, comprising 28% of expenditures in Hadoop-based
projects. The remaining expenditures are divided among hardware (21%),
software (18%), services (18%) and other (16%). (See Figure 12.)
Figure 12. Average Hadoop Expenditures
Based on answers by 209 respondents who have fully or partially deployed
Hadoop, 2014
Current Hadoop users have sizable budgets for the immediate future. More
than one-third of respondents (36%) report they intend to spend more than $1
million on Hadoop products, services and other related requirements over the
next 18 months. Another 13% plan to spend between $500,000 and $1 million,
while 21% will spend between $100,000 and $500,000, and 29% will spend less
than $100,000 or nothing. (See Figure 13.)
Figure 13. Hadoop Spending Plans Over the Next 18 Months
Based on answers by 237 respondents who have fully or partially deployed
Hadoop, 2014
The wide range of expenditures shows the value of open source software: if
companies desire, they can download the software and run it on existing
computers without spending a dime. But organizations that need enterprise-
caliber computing will usually spend significant sums with product and services
vendors to minimize the risks of an outage or simply to share the blame if
something goes wrong.
Expenditures by company size. Interestingly, there is not a strong correlation
between spending and company size. For example, 19% of the smallest
organizations in the survey (with less than $100 million in annual revenues)
plan to spend between $1 million and $5 million on their Hadoop deployments,
just shy of the 21% of the largest enterprises that plan to spend the same
amount. Conversely, 21% of large companies plan to spend less than $100,000 or
nothing on Hadoop.
Leadership
Managers and professionals involved in BI and analytics are leading the charge
to implement Hadoop. Close to half of respondents (46%) that have deployed
Hadoop say the BI team is driving the development of Hadoop, while more than
a third (38%) say the data warehouse (DW) team is taking the initiative.
Following close behind are the IT department (35%) and application
development (33%). (See Figure 14.)
Figure 14. Who’s Driving Hadoop Development
Based on answers by 368 respondents who have fully or partially deployed
Hadoop, 2014
The leading role of BI and DW managers makes sense, since many organizations
plan to supplement or replace some or all of their current DW environments
with Hadoop, as discussed above. Obviously, BI and DW managers either want
to lead their organizations into the future or preserve their job security as
Hadoop rolls over the traditional data warehousing world. Another explanation
for the preponderance of BI and DW managers running Hadoop deployments is
that our survey was directed at BI and DW managers more than at other
categories of IT management.
When we dissect these results by size of organization, we see that the IT
department plays a somewhat more prominent role in both small and large
organizations, while BI and DW managers, along with data scientists, take the
lead in medium-size organizations. Among large organizations, 44% of
respondents say their Hadoop deployments are driven by the IT department,
while 41% are driven by the BI team, 38% by the data warehousing team, 32% by
the application development team and 29% by a data scientist (see Figure 15).
Figure 15. Who’s Driving Development of Hadoop—by Organization Size (in
annual revenues)
Based on answers by 368 respondents who have fully or partially deployed
Hadoop, 2014
Vendors
As in any new large market, many vendors are jockeying for position, both
startups and established commercial players. Today, Cloudera is the front-
runner among data management players supplying Hadoop software. A third of
respondents that have deployed Hadoop (33%) are using Cloudera. Trailing
Cloudera are traditional database management vendors IBM (25%), Oracle
(21%) and Microsoft (20%). Amazon, which provides Hadoop services in the
cloud, registers 20%, a strong showing for a nontraditional data management
vendor. Next on the list are Hadoop “pure plays”—MapR (17%) and
Hortonworks (16%), followed by Teradata and HP (14% each), with Pivotal,
Intel, and Actian trailing further at 7%, 6%, and 3% respectively (see Figure 16).
Figure 16. Data Management Vendors Supplying Hadoop Environments
Based on answers by 288 respondents who have fully or partially deployed
Hadoop, 2014
Data integration. One of the key strengths of Hadoop is its ability to parse and
transform any kind of data. Until recently, data scientists hand-coded
transformation and parsing programs in MapReduce or Java, which is not the most
efficient method. Now, traditional data integration vendors have integrated
their products with Hadoop, enabling organizations to use visual tools to create
transformation programs. More important, these tools let the large population
of existing data integration developers build programs for Hadoop data without
having to learn a new tool.
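Those hand-written programs follow the classic map/shuffle/reduce pattern. The sketch below is a hypothetical, single-process Python simulation of that pattern (the log records are invented for illustration), counting records by severity the way a hand-coded MapReduce job would; in a real cluster, Hadoop distributes the map and reduce steps and performs the shuffle itself:

```python
from itertools import groupby
from operator import itemgetter

records = [
    "ERROR disk full",
    "INFO startup complete",
    "ERROR timeout",
]

# Map: emit a (key, 1) pair for each record, keyed by severity
mapped = [(line.split()[0], 1) for line in records]

# Shuffle: group intermediate pairs by key (the Hadoop framework does
# this between the map and reduce phases)
mapped.sort(key=itemgetter(0))

# Reduce: sum the counts for each key
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'ERROR': 2, 'INFO': 1}
```

Visual data integration tools generate this kind of boilerplate for developers, which is why they lower the barrier to building Hadoop transformations.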
Among vendors providing data integration services for Hadoop, Oracle leads
the pack with 25% of respondents, followed closely by IBM (24%), Microsoft
(22%) and Informatica (21%). A second tier of data integration vendors consists
of Talend (13%) and Pentaho (12%). Pivotal, SnapLogic and Syncsort also are
contenders in the data integration space (see Figure 17).
Figure 17. Data Integration Vendors Supplying Products for Hadoop
Based on answers by 269 respondents who have fully or partially deployed
Hadoop, 2014
Analytics vendors. The leading database and software providers are also the
leading analytics vendors. IBM is in the lead, with 27% of companies that have
deployed Hadoop using IBM software or services to analyze big data. Microsoft
and Oracle follow with 26% and 25%. Visual discovery tool Tableau is used by
22%, while SAP (20%), SAS (18%), Teradata (13%), MicroStrategy (9%), Splunk
(8%), Pivotal (7%) and smaller vendors round out the list (see Figure 18).
Figure 18. Analytics Vendors Supplying Products for Hadoop
Based on answers by 269 respondents who have fully or partially deployed
Hadoop, 2014
Hardware. The traditional hardware vendors are the go-to choices for
companies that have implemented Hadoop on-premises. IBM leads the way
with 33% of respondents using IBM hardware to deploy Hadoop. IBM is
followed closely by Dell (31%), HP (25%), Cisco (21%) and Oracle (19%). (See
Figure 19.)
Figure 19. Hadoop Hardware Vendors
Based on answers by 242 respondents who have fully or partially deployed
Hadoop, 2014
Consulting companies. More than half of organizations that have deployed
Hadoop (58%) said they currently retain outside consulting assistance, while
42% do not. Among consultancies providing Hadoop services, IBM Global
Business Services is used by 14% of respondents, followed closely by Accenture
(13%). Further down, Cognizant is used by 8% of respondents, while Capgemini,
Tata and Wipro are each used by 7% of respondent companies, followed by PWC,
KPMG, Dell, Deloitte, Ernst & Young, Booz Allen Hamilton and a host of other
boutique consultancies (see Figure 20).
Figure 20. Hadoop Consultants
Based on answers by 242 respondents who have fully or partially deployed
Hadoop, 2014
Conclusion
Hadoop is establishing a foothold in enterprises. It is maturing and gradually
becoming a key piece of the enterprise data infrastructure at many
organizations. Until recently, mainstream companies experimented with
Hadoop to learn its functionality and capabilities and better understand the
role it could play in their existing analytical ecosystems.
Today, many of these organizations are moving Hadoop into production to
support data processing and analytical requirements or to offload workloads
from existing data warehouses to save money. IT and data management
professionals who have implemented Hadoop give it high ratings. The software
is evolving fast, thanks to the hard work of many vendors in the space, as well
as practitioners who contribute code to the Apache Software Foundation.
Although it remains to be seen whether Hadoop can fulfill its promise as the
enterprise operating system for data analytics, it’s moving in the right direction.
WAYNE ECKERSON is principal consultant of Eckerson
Group LLC (www.eckerson.com), a business-technology
consulting firm that helps business leaders use data and
technology to drive better insights and actions. His team
of consultants provides information and advice on
business intelligence, analytics, performance
management, data governance, data warehousing and
big data. They work closely with organizations that want to assess their current
capabilities and develop a strategy that optimizes their investments in business
intelligence and analytics.
Eckerson has conducted many groundbreaking research studies, chaired
numerous conferences and written two widely read books: The Secrets of
Analytical Leaders: Insights from Information Insiders (2012) and Performance
Dashboards: Measuring, Monitoring, and Managing Your Business (2005/2010).
He is currently working on a book about data governance.
Write him at [email protected].
Hadoop in the Enterprise: With Maturity Come Manageability Challenges
is a SearchBusinessAnalytics e-publication.
Wayne Eckerson
Principal Consultant, Eckerson Group
Doug Olender
Publisher
TechTarget
275 Grove Street, Newton, MA 02466
www.techtarget.com
© 2014 TechTarget Inc. No part of this publication may be transmitted or reproduced in any form
or by any means without written permission from the publisher. TechTarget reprints are available
through The YGS Group.
ABOUT TECHTARGET:
TechTarget publishes media for information technology professionals. More
than 100 focused websites enable quick access to a deep store of news, advice
and analysis about the technologies, products and processes crucial to your
job. Our live and virtual events give you direct access to independent expert
commentary and advice. At IT Knowledge Exchange, our social community, you
can get advice and share solutions with peers and experts.