Hadoop in the Enterprise
With Maturity Come Manageability Challenges

Benchmark Report

By Wayne Eckerson, Principal Consultant
Eckerson Group, April 2014
Table of Contents

Executive Summary
Methodology
Introduction
Survey Findings
Conclusion
Executive Summary

Hadoop is maturing and becoming a key component of enterprise data
management infrastructures at many companies. Until about two years ago,
Hadoop was relatively new on the scene. Many data professionals and analysts
were downloading Hadoop and setting up experimental clusters to run various
test cases. More recently, a significant number of enterprises from a variety of
industries are moving Hadoop from test beds into production to support a
range of data processing and analytical applications.
More than 1,500 business intelligence (BI) and data management professionals
responded to our 2014 survey on enterprise Hadoop. Some of the key findings:
Hadoop deployments double. The percentage of organizations that
have implemented or are deploying Hadoop has more than doubled in
the past two years, from 9% to 25%.
Hadoop will replace some or all existing data environments in a
minority of organizations. Although Hadoop currently supplements
existing data warehousing and data processing environments, within
three years it will replace various components of those systems in 27%
of organizations and entire data management environments in 14%.
Companies will primarily use Hadoop for data processing and
analytical sandboxes. Almost half of companies that implement
Hadoop will use it to parse and transform big data (46%) as well as
support analytical sandboxes for ad hoc analysis (44%) and data mining
(40%). Less than a third (30%) of companies will use it as an enterprise
repository to hold all their data.
Almost half of organizations with Hadoop will use real-time SQL in 18
months. Use of real-time SQL queries will grow from 35% today to 48%
in 18 months, becoming the second most popular feature of Hadoop,
behind MapReduce (53%). Existing Hadoop users also plan to significantly
increase their usage of data streaming, in-memory computing, machine
learning, graphing and search in the next 18 months.
Data mining surpasses dashboards. Data mining will surpass
dashboards and scorecards as the predominant analytical applications
running on Hadoop, growing from 34% of organizations today to 57% in
three years. But the number of Hadoop-based dashboards and
scorecards will also increase, growing from 43% to 52% in three years.
Hadoop functionality meets customer needs. More than half of
organizations that have deployed Hadoop rated it as “excellent” or
“good” across more than two dozen attributes, from scalability,
performance and availability to reporting, analysis and data
governance.
Staffing leads Hadoop expenditures. Staffing costs are the largest
segment of Hadoop investments, comprising 28% of Hadoop-based
project budgets. About 20% of organizations that have deployed
Hadoop will spend more than $5 million, while 20% will spend less than
$100,000 or nothing.
BI teams lead the way. BI teams drive Hadoop projects in almost half
of organizations (46%) that have deployed Hadoop.
Traditional database and software vendors dominate the Hadoop
market. With some exceptions, the traditional IT market leaders are
the primary providers of Hadoop-related products and services. The
main exception is Cloudera, which is the top data management
provider. Oracle is the leading data integration vendor, while IBM is the
leading supplier of analytics software and consulting services.
Methodology

This report is based primarily on a survey, conducted in June and July of 2014,
of 1,515 business and IT professionals. The survey results are based on answers
by 906 respondents who completed the survey and 609 respondents who
partially completed it. The report is also based on interviews with users and
briefings with vendors as well as a review of industry literature on big data.
Among survey respondents, slightly more than half (53%) are IT professionals,
while consultants (23%) and business professionals (12%) make up most of the rest. Almost a
third (32%) work in large companies with more than $1 billion in revenue, while
about a quarter (24%) work in medium-size companies with between $100
million and $1 billion in revenues. The rest (44%) work in small companies with
under $100 million in revenues. Almost a third (32%) of respondents are
from North America, followed closely by 30% from India, 23% from Europe, 8%
from Southeast Asia and the rest from other countries.
Introduction

Is Hadoop ready for prime time? Can organizations run Hadoop in production
with confidence? Can Hadoop meet relevant service-level agreements for
scalability, performance and availability while ensuring adequate security and
manageability? Can Hadoop today be considered enterprise software?
Many companies today are asking these questions. They want to exploit the
capabilities of Hadoop, a series of inter-related open source projects managed
by The Apache Software Foundation that provide a novel and cost-effective way
to process and analyze large volumes of data. But many are not yet certain of
the answers. Most recognize that Hadoop holds great promise to help them
deliver information and insights to business users.
Hadoop benefits. A big advantage of Hadoop is that it enables organizations to
cost-effectively store all their data: both structured data from enterprise
resource planning and customer relationship management systems as well as
multi-structured data, such as Web server logs, sensor data, email and
extensible markup language, or XML, data. Because multi-structured data
constitutes about 80% of all data by most estimates, Hadoop heralds a new age of
data processing—one in which organizations can find needles of insight in a
haystack containing terabytes or petabytes of information.
Challenges. Although Hadoop is infinitely scalable from a technical standpoint,
it has lacked the management utilities required by most organizations and their
IT managers. These include security, manageability, backup and recovery,
disaster recovery and data governance. Moreover, as a batch processing
environment, Hadoop until recently has lacked the ability to support real-time
queries (i.e., SQL queries)—and subsequently true ad hoc reporting and visual
exploration.
Evolution
Undaunted by the challenges, early adopters have embraced Hadoop for both
tactical and strategic reasons. Tactically, they use Hadoop to reduce the costs
of running a data warehouse by offloading transformations to Hadoop to avoid
an expensive data warehouse upgrade. In some cases, they also offload data
that no longer fits in the data warehouse.
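A transformation offloaded from a data warehouse to Hadoop is often written as a pair of Hadoop Streaming scripts. The sketch below is illustrative only (the log format and field position are assumptions, not anything the report prescribes): a mapper emits one (url, 1) pair per web log line, and a reducer sums the counts per URL.

```python
# Hedged sketch of an offloaded ETL step as Hadoop Streaming map/reduce
# functions. The request URL is assumed to be the seventh whitespace-
# separated field, as in the common Apache log format.
from itertools import groupby

def mapper(lines):
    """Emit 'url<TAB>1' for each log line."""
    for line in lines:
        fields = line.split()
        if len(fields) > 6:
            yield "%s\t1" % fields[6]

def reducer(pairs):
    """Sum the counts for each URL. Input must be sorted by key,
    which Hadoop guarantees between the map and reduce phases."""
    keyed = (p.split("\t") for p in pairs)
    for url, group in groupby(keyed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (url, sum(int(count) for _, count in group))

# In a real job, these functions would read stdin and write stdout, and
# would be submitted with the hadoop-streaming jar as -mapper and -reducer.
```

Because the heavy lifting happens on the cluster, the warehouse only receives the small aggregated result, which is the cost-saving the tactical adopters are after.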
Strategically, early adopters use Hadoop to build new applications with new
types of data that previously were too costly to manage. Many use Hadoop to
process and analyze large volumes of clickstream data and social media feeds
to gain a better understanding of customer needs, behavior and sentiment.
Going mainstream. Encouraged by the success of early Hadoop adopters,
mainstream companies have been testing the Hadoop waters, intrigued by its
promise but wary of its newness. “The greatest impact from Hadoop is the
realization that bigger questions can be answered with affordable and
manageable technologies,” said one survey respondent. Another wrote,
“Hadoop is a manifestation of possibilities. The world is about to change.”
But others are leery of another new technology that carries hidden risks:
“[Hadoop] is still immature. Although evolving rapidly, it still lacks advanced
features.” Another says, “[Hadoop] offers too much change now for [anyone]
other than the big boys to invest.” A third says, “IT hype often sells
vaporware. I’ve seen this before.”
Nonetheless, mainstream companies are starting to put Hadoop into
production to support key business initiatives, from decision support to data
mining. While most Hadoop projects today either support or serve as adjuncts
to existing data warehouses, many data management professionals expect this
to change. In the future, Hadoop may supplant large portions of the analytical
ecosystems currently running in many companies if not replace them
altogether.
In fact, some data management professionals see Hadoop as the foundation of
a new analytical ecosystem in which the data is processed and analyzed in place
rather than moved downstream to specialized data processing and analytical
systems. In the world of Hadoop, the data never moves; it stays in place. This
eliminates time-consuming, costly and error-prone data replication or
movement jobs. Hadoop “brings compute to the data,” becoming the staging
area, archive, transformation engine, query engine, data mining engine and so
on.
Before this vision of Hadoop becomes reality, it needs to pass the enterprise
sniff test. That is, it needs to run, act, interface with and behave like other
enterprise software in an organization’s data center. It needs to be reliable,
available, scalable, secure, manageable and highly performing, among other
things. Today, millions of dollars are being invested in Hadoop software and
services to ensure that Hadoop passes.
Six Vehicles for Deploying Hadoop
There are a number of approaches vendors are taking to turn Hadoop into an
enterprise-caliber platform for data analytics.
1. Do it yourself. Anyone can download Hadoop software from Apache’s
website, install it on a server and start using it. This is the essence of the do-it-
yourself, or DIY, approach that all vendor-driven Hadoop systems compete
against. Of course, going solo requires considerable expertise. And while a fairly
experienced person can download and install Hadoop on a small cluster, it
takes a great deal more expertise to install and manage Hadoop on a big cluster
in a production environment.
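For a sense of what the DIY route looks like on a single node, the sequence is roughly the following. This is a sketch, not a recipe: the release number and mirror URL are illustrative, and a multi-node production cluster involves far more configuration.

```shell
# Illustrative single-node setup (version and URL are assumptions;
# check apache.org for current releases).
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
tar -xzf hadoop-2.4.1.tar.gz && cd hadoop-2.4.1

export JAVA_HOME=/usr/lib/jvm/default-java   # Hadoop requires a JDK
bin/hadoop version                           # sanity check the install

# Format HDFS and start the single-node daemons
bin/hdfs namenode -format
sbin/start-dfs.sh
bin/hdfs dfs -mkdir -p /user/$USER           # create a home directory in HDFS
```

Getting this far on one machine is straightforward; the expertise the text refers to is needed for everything after it: tuning, securing, monitoring and scaling the cluster.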
“Managing data and clusters at scale presents different challenges to those
associated with running test data through a couple of machines,” observes Paul
Miller in a report from GigaOm Research. “Again and again, organizational
deployments of Hadoop fail as they simplistically try to replicate processes and
procedures tested on one or two machines across more-complex clusters.”¹

¹ Understanding the Power of Hadoop as a Service, Paul Miller, GigaOm Research, June 12, 2014.
Complicating the challenge is a dearth of automated tools to support Hadoop
administration, Miller explains. This leaves “system administrators with a
second problem: dependence on manual procedures.” Thus, “scaling Hadoop,
either as part of a meaningful pilot project or to deliver production enterprise
workloads, is a challenging undertaking, and managing the complex
interactions among parallel nodes remains a complex and often largely manual
process. Each node must be actively monitored throughout its commitment to
a particular workload, and Hadoop’s often arcane errors and failure modes can
rarely be resolved automatically.”
2. Open source Hadoop distributions. To simplify the process of standing up an
Apache Hadoop cluster, several vendors, including Cloudera, Hortonworks,
MapR, Pivotal and others, package up Apache software and deliver a single,
integrated release that is tested to make sure all the components work
together. They also provide bug patches and updates, service, support and
training. The software is free of charge; vendors make money by charging for
packaging, service and training.
Lately, these Hadoop distribution vendors have been forging strategic alliances
with established IT vendors to deliver Hadoop capabilities through traditional
enterprise IT channels. For example, Hewlett-Packard recently announced it
was investing $50 million for a stake in Hortonworks, and Hortonworks has
partnered with EMC’s Pivotal to support its enterprise data management
software. Cloudera, in the meantime, has a strategic partnership with Oracle.
Today, Oracle Big Data SQL is built on the Cloudera distribution.
3. Customized Hadoop distributions. Some vendors have extended the Apache
Hadoop framework with proprietary software that is either embedded
transparently in the distribution (MapR) or offered as a commercial add-on
(Cloudera). For instance, MapR adds software to Hadoop that ensures
enterprise-caliber reliability, availability and security along with backup and
recovery and disaster recovery services. MapR says its platform is “100% binary
compatible” with Hadoop Distributed File System, ensuring “plug-and-play
compatibility” from a single data platform. Cloudera, on the other hand,
charges for certain “premium” features that go beyond what Apache Hadoop
offers, including real-time query, in-memory computing, machine learning,
search and stream processing.
4. Relational database vendors. Oracle, Teradata, Pivotal, Actian and IBM
incorporate Hadoop distributions inside analytical platforms that usually consist
of a massively parallel processing database and an analytical or object database
as well as other tools such as extract, transform and load (ETL) and data
connectors. The vendors deliver these platforms as software-only systems,
appliances or in the cloud. These traditional relational database vendors help
companies bridge the new and old worlds, minimizing the risk of implementing
new technology that is relatively immature. Plus, by bundling all the software
(and in some cases hardware), they offer one-stop shopping and a single throat
to choke. This platform approach reduces customer risks and enables
established vendors to keep charging premium prices despite using open
source software.
5. Platform-as-a-service vendors. Cloud providers such as Amazon and
Microsoft provide the infrastructure and software in the cloud so customers
can build their own Hadoop environments without having to buy, install or
manage hardware and software. And with subscription pricing, customers only
pay for what they need and can dynamically scale up and down their
environment as requirements dictate. For example, Amazon Web Services
customers can use the Amazon Elastic MapReduce service to build Hadoop
clusters in the cloud. Likewise, Microsoft customers can implement Hadoop on
Microsoft’s Azure cloud platform using Microsoft HDInsights.
6. Hadoop as a service. Other vendors, such as Altiscale, Mortar and Qubole, go
one step further and offer services built on top of a cloud-based Hadoop
infrastructure that the vendors run and manage themselves. Customers get all
the benefits of Hadoop without having to install, manage and run Hadoop—or
for that matter, know anything about it. These services ingest your data and
make it available for querying via Hive, Pig or other Hadoop-based toolsets.
They offer service-level agreements for data loading and availability, and many
of them support auto-scaling that dynamically increases Hadoop processing
power to handle peak query traffic.
Survey Findings

Usage and Adoption
Hadoop is no longer just for technology enthusiasts and bleeding-edge Internet
startups. Our research shows that it’s slowly making its way into corporate data
centers and becoming an integral part of enterprise data strategies at many
Fortune 1000 companies.
Until recently, most large organizations implemented Hadoop on an
experimental basis. They wanted to test its functionality and enterprise
software credentials, especially its security, reliability, manageability and ability
to interface with existing systems, applications and analytical tools. The allure
of free software to manage a big data analytics platform was too good to resist.
Today, an increasing number of companies are turning their Hadoop test beds
into production. They believe Hadoop is ready for prime time, at least for some
data processing and analytical workloads, such as offloading ETL processing
from data warehouses and parsing voluminous Web server logs. In the past
nine months, many Hadoop software providers—including MapR, Hortonworks
and Cloudera—have noticed a sizable uptick in the number of large companies
that are putting Hadoop into production.
Implementation status. In fact, our latest survey shows that a quarter of
organizations (25%) have either already deployed Hadoop or have projects
under development. That’s a far cry from just two years ago, when only 9% of
organizations had either fully or partially deployed Hadoop. (The 2012
TechTarget report, Exploiting Big Data: Strategies for Integrating With Hadoop
to Deliver Business Insights, was based on a survey with 1,158 respondents.)
Despite Hadoop’s rapid uptake, there are still many companies that have yet to
make the leap with the technology. Our 2014 survey shows 39% have no plans
for Hadoop, and 36% are considering Hadoop but have yet to install or develop
anything with it (see Figure 1). Clearly, Hadoop has accelerated within the
early-adopter market but still has some distance to go before it becomes a
mainstream technology.
Figure 1. Status of Hadoop in Organizations—2014
Based on answers by 1,495 survey respondents, 2014
Surprisingly, Hadoop is more common in large enterprises than smaller ones.
Even though it’s free, Hadoop can be expensive to deploy at scale, largely
because of the skilled experts currently required to implement and use it. At
the same time, large companies can avoid millions of dollars in costly upgrades
to existing data platforms by implementing Hadoop.
Our research shows that more than one-third (34%) of large companies (more
than $1 billion in annual revenues) have implemented Hadoop compared with
27% of midsize companies ($100 million to $1 billion in revenues) and 18% of
smaller organizations (less than $100 million in revenues). (See Figure 2.)
Figure 2. Hadoop Adoption by Company Size
Based on answers by 893 survey respondents who have fully or partially
deployed Hadoop, 2014
The early adopters of Hadoop largely came from the high-tech, Internet,
e-commerce and media industries. As Hadoop moves deeper into the corporate
landscape, it is finding a home in a range of industries, especially those with
large volumes of Web, customer or transaction data. The “nontraditional”
industries that have fully deployed Hadoop are financial services and retail,
each indicated by 8% of respondents, and education, government, computer
resellers and telecommunications, each cited by 6% of respondents.
And there are many fast followers in these industries that want to keep up with
the leading technology adopters. Companies that are developing or have
partially deployed Hadoop come from business services and consulting (42%),
computer manufacturing (30%), financial services (21%), education (20%),
telecommunications (20%), computer resellers (16%) and other (23%)
industries (see Figure 3).
Figure 3. Hadoop Adoption by Industry
Based on answers by 893 survey respondents who have fully or partially
deployed Hadoop, 2014
Architecture and Use Cases
Business impact. Industry observers have been predicting that as Hadoop
grows, it will begin to supplant or replace data warehouses. While this is not
happening in full force now, managers and professionals actively working with
Hadoop agree that it will soon be encroaching on traditional data
environments.
Today, 41% of companies say that Hadoop supplements their existing data
environments, while 16% say it’s replacing some parts of their environments. A
handful, 6%, say it has replaced their entire existing data processing
environments. But, when we asked the same respondents to project out three
years, one out of seven (14%) said they expect Hadoop to replace most of their
current data environments, and another 27% expect it to replace parts of their
infrastructures. That’s a tectonic change, especially when you consider the
large amount of capital companies have invested in their existing data
processing environments (see Figure 4).
Figure 4. Hadoop’s Impact: Today and in Three Years
Based on answers by 368 survey respondents who have fully or partially
deployed Hadoop, 2014
Architecture. Among organizations that currently use Hadoop, almost half
(46%) use it for data processing functions, such as parsing or transforming data,
while 44% use Hadoop as an “analysis sandbox” to support ad hoc queries. In
addition, 40% use it as a “data mining sandbox” to support machine learning
and complex analytical functions (see Figure 5).
Figure 5. Role of Hadoop in Analytical Architecture Today
Based on answers by 195 respondents who have fully or partially deployed
Hadoop, 2014
The percentages behind Figure 4:

                                    Today    In Three Years
Supplements what we have             41%          29%
Replaces some of what we have        16%          27%
Replaces most of what we have         6%          14%
No impact                            37%          30%
The data lake. Today, 32% of companies use Hadoop as an “enterprise
operating system” and 30% as an enterprise repository to hold all corporate
data (see Figure 5). These embody the notion of the so-called data lake, which
is quickly gaining visibility as a prime use case for Hadoop. Many IT
professionals love the idea of loading all corporate data into a single, low-cost
enterprise data repository and doing all the processing there instead of moving
data into specialized downstream systems.
A data lake eliminates the need to replicate and move data, which is error-
prone and expensive. A data lake not only stores all enterprise data, it
processes it in place using a variety of engines, such as SQL, in-memory
computing, machine learning, graphing and streaming—any engine that
conforms to the Hadoop 2 resource management interface. The percentage of
companies using Hadoop as a data lake should rise from 30% today to closer to
50% in the next several years.
Analytics functions. Until Hadoop 2, which was released in the fall of 2013,
Hadoop was a batch data processing platform. Many early users likened it to
the early days of mainframe computing, when you would submit your request
in writing to a mainframe guru and then come back the next day to get the
answer.
Today, a majority of companies (57%) still use Hadoop as a batch-oriented data
processing pump, although this is changing with the advent of real-time query
engines for Hadoop, which first became available in late 2013. Today, slightly
more than a third of companies (35%) use real-time queries, while almost half
(48%) plan to use them within 18 months. SQL on Hadoop is the most rapidly
growing feature in Hadoop, driven by the large number of data analysts and
data scientists who prefer to use SQL to query data rather than a programming
language, such as Java, Perl or Python. And vendors are happy to oblige.
Cloudera, Hortonworks, Pivotal, Actian, IBM and Microsoft have all released
SQL on Hadoop products in the past several months. Similarly, data streaming,
in-memory computing, search and graphing will also experience a big spike in
usage in the next 18 months (see Figure 6).
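The appeal of SQL on Hadoop is that an analyst can state an ad hoc question declaratively instead of writing a program. The query below is the kind of statement a Hive or Impala user might issue; here it runs against SQLite purely as a local stand-in, and the table and column names are hypothetical.

```python
# Stand-in demonstration of the ad hoc SQL pattern that SQL-on-Hadoop
# engines enable. SQLite substitutes for the cluster; the SQL itself is
# the point.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (url TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("/home", 1), ("/home", 2), ("/pricing", 1), ("/home", 1)],
)

# Ad hoc question: distinct visitors per page, busiest pages first.
rows = conn.execute("""
    SELECT url, COUNT(DISTINCT user_id) AS visitors
    FROM page_views
    GROUP BY url
    ORDER BY visitors DESC
""").fetchall()
```

A Java or Perl MapReduce program answering the same question would run to dozens of lines, which is why the survey shows analysts gravitating to SQL so quickly.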
Figure 6. Analytics Functions Run in Hadoop Today and in 18 Months
Based on answers by 309 respondents who have fully or partially deployed
Hadoop, 2014
Analytical applications. Dashboards and scorecards are the primary analytical
applications run in enterprise Hadoop environments today—cited by 43% of
respondents, with 52% anticipating such uses in the near future. But data
mining and machine learning will become the predominant application in the
future, growing from 34% of companies today to 57% of companies in three
years. The use of visual discovery and exploration on Hadoop will also rise
significantly, growing from 21% today to 41% in three years. In addition, the
percentage of companies implementing streaming analytics will rise from 16%
today to 39% in the next three years. In short, organizations expect to run a lot
more analytical workloads on Hadoop in the near future (see Figure 7).
Figure 7. Analytical Applications on Hadoop Today and in Three Years
Based on answers by 309 respondents who have fully or partially deployed
Hadoop, 2014
Hadoop Functionality
Hadoop is a relatively young technology. Version 2.0 just shipped last fall. The
general consensus is that Hadoop holds a lot of promise but lacks the critical
functionality necessary to run in production in large enterprise computing
environments. In other words, Hadoop has to mature before companies feel
comfortable deploying it in a production environment.
But our survey shows that a majority of organizations that have either deployed
Hadoop or are in the process of doing so give the technology high marks. A
majority of these respondents give Hadoop a rating of either “excellent” or
“good” in the following areas: scalability (83%), availability (74%), backup and
recovery (54%), manageability (53%) and mixed workload management (52%).
Only security (49%) fell below the 50% threshold (see Figure 8).
The percentages behind Figure 7:

                                             Today    In 3 Years
Dashboards and/or scorecards                  43%        52%
Batch reporting                               40%        45%
Ad hoc reporting and dashboarding             37%        44%
Data mining or machine learning               34%        57%
Ad hoc queries                                30%        38%
Report bursting (sends tailored…)             24%        37%
Location analytics (e.g., mapping)            22%        38%
Visual discovery and exploration              21%        41%
Ad hoc mashups and analyses                   17%        33%
Streaming analytics                           16%        39%
None                                          12%         5%
Figure 8. Current Users Rate Hadoop Capabilities
Based on answers by 288 respondents who have fully or partially deployed
Hadoop, 2014
These strong marks attest to the rapid evolution of Hadoop as an enterprise
data management platform as well as the work of long-established commercial
vendors and service providers that supplement Hadoop with product and
service capabilities to shore up its enterprise features.
Data management. Current users give Hadoop even higher marks in its ability
to manage various types of data workloads. These customers rate Hadoop as
“excellent” or “good” in data loading (79%), query performance (69%), data
streaming (65%), data exporting (65%) and data transformation (64%). Clearly,
data management is Hadoop’s sweet spot, especially for large volumes of
multi-structured data.
Figure 9. Hadoop’s Ability to Handle Data Management Workloads, Rated
Based on answers by 288 respondents who have fully or partially deployed
Hadoop, 2014
Data governance. Though Hadoop has yet to fully address data governance,
our respondents gave it high ratings. They gave “excellent” or “good” ratings to
Hadoop for its ability to support data consistency (63%), data profiling (60%),
data quality (58%), metadata management (57%), conformed dimensions
(52%), and data lineage and impact analysis (52%). (See Figure 10.)
Figure 10. Hadoop Support for Data Governance Tasks, Rated
Based on answers by 267 respondents who have fully or partially deployed
Hadoop, 2014
One explanation for these higher-than-expected results is that data governance
in Hadoop today is largely manual—conducted by experienced data scientists,
whose main job is to explore or “profile” big data in Hadoop to investigate its
value and build logical views on top of that data (i.e., Hive tables) for them and
others to query. HCatalog is an emerging metadata repository that catalogs
Hive, Pig and HBase data elements and can be accessed via a standard,
representational state transfer, or REST, interface.
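HCatalog's REST interface is served by WebHCat (the component formerly called Templeton). As a sketch of what that access looks like, the helper below only assembles the URL for the "list a database's tables" DDL call; the host name and user are assumptions, and actually issuing the HTTP request is left out.

```python
# Build a WebHCat (HCatalog REST) URL for listing the tables in a
# database. Host and user below are hypothetical; 50111 is WebHCat's
# conventional default port.
from urllib.parse import urlencode

def webhcat_tables_url(host, database, user, port=50111):
    """URL for WebHCat's 'list tables' DDL endpoint."""
    query = urlencode({"user.name": user})
    return "http://%s:%d/templeton/v1/ddl/database/%s/table?%s" % (
        host, port, database, query)
```

A metadata tool would GET this URL and receive a JSON list of table names, which is how third-party catalogs can stay in sync with Hive, Pig and HBase definitions.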
Another factor ameliorating data governance issues in Hadoop is that most big
data consists of large volumes of data from a single source, such as Web server
logs or point-of-sale data. Much of this data is machine generated, consisting of
uniformly repeating data structures. Once the structure is understood, data can
be quickly extracted for query processing.
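Because machine-generated data repeats one structure, extraction reduces to a single pattern match once that structure is understood. The sketch below parses a web server log line in the common Apache format; the field names are my own choice, not anything standardized.

```python
# Once the repeating structure of machine-generated data is known,
# extraction is a pattern match. This parses one common web server
# log layout; field names here are illustrative.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of named fields, or None if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None
```

Run over terabytes of such logs, a matcher like this turns raw text into queryable records with very little per-record logic, which is exactly the property that softens Hadoop's governance gaps.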
Analytics. Hadoop also scores well among existing users for analytics. Given its
batch data processing heritage, it’s not surprising that Hadoop scores highest
on batch reporting, with 71% of existing users giving Hadoop a rating of
“excellent” or “good.” Following close behind are data mining (60%), ad hoc
queries (60%) and report bursting (58%). Other analytic functions are also in
the mix: dashboards and scorecards (55%), ad hoc reporting (56%), streaming
analytics (55%), ad hoc mashups (54%), location analytics (52%) and visual
exploration (51%). (See Figure 11.)
Figure 11. Hadoop’s Capability to Support Analytic Functions, Rated
Based on answers by 266 respondents who have fully or partially deployed
Hadoop, 2014
Many of the features listed above, such as report bursting, dashboards, ad hoc
reporting and location analytics, are not native to Hadoop. Rather they are
intrinsic to third-party reporting and analysis tools that query Hadoop via Hive
or some other means. The high scores reflect that reporting and analysis
vendors are doing a reasonable job of making Hadoop data accessible to their
customers. Of course, data scientists today are the primary analytical users of
Hadoop, and they primarily rely on programming languages, such as Java,
Perl or Python, to access Hadoop data. But this is changing as the industry is
intent on making Hadoop data accessible to business analysts and users who
know how to use SQL or reporting and analysis tools.
One oddity in the results is that visual exploration tools trail the pack of
analytical techniques. Given the popularity of Tableau and other visual
exploration tools and the close association of visualization in general with big
data, it’s hard to explain this result, other than the fact that most ad hoc
queries and reports conducted with Hadoop data are done by data scientists
using Java, Perl or Python. But I expect the use of visual exploration to query
Hadoop data will grow rapidly in the next year or two.
Investment and spending plans. As with many new information technologies,
Hadoop is very resource-intensive. Staff costs are the largest component of
Hadoop investments, comprising 28% of expenditures in Hadoop-based
projects. The remaining expenditures are divided among hardware (21%),
software (18%), services (18%) and other (16%). (See Figure 12.)
Figure 12. Average Hadoop Expenditures
Based on answers by 209 respondents who have fully or partially deployed
Hadoop, 2014
Current Hadoop users have sizable budgets for the immediate future. More
than one-third of respondents (36%) report they intend to spend more than $1
million on Hadoop products, services and other related requirements over the
next 18 months. Another 13% plan to spend between $500,000 and $1 million,
while 21% will spend between $100,000 and $500,000, and 29% will spend less
than $100,000 or nothing. (See Figure 13.)
Figure 13. Hadoop Spending Plans Over the Next 18 Months
Based on answers by 237 respondents who have fully or partially deployed
Hadoop, 2014
The wide range of expenditures shows the value of open source software: if
companies desire, they can download the software and run it on existing
computers without spending a dime. But organizations that need enterprise-
caliber computing will usually spend significant sums with product and services
vendors to minimize the risks of an outage or simply to share the blame if
something goes wrong.
Expenditures by company size. Interestingly, there is not a strong correlation
between spending and company size. For example, 19% of the smallest
organizations in the survey (with less than $100 million in annual revenues)
plan to spend between $1 million and $5 million on their Hadoop deployments,
just shy of the 21% of the largest enterprises that plan to spend the same
amount. Conversely, 21% of large companies plan to spend less than $100,000 or
nothing on Hadoop.
Leadership
Managers and professionals involved in BI and analytics are leading the charge
to implement Hadoop. Close to half of respondents (46%) that have deployed
Hadoop say the BI team is driving the development of Hadoop, while more than
a third (38%) say the data warehouse (DW) team is taking the initiative.
Following close behind are the IT department (35%) and application
development (33%). (See Figure 14.)
Figure 14. Who’s Driving Hadoop Development
Based on answers by 368 respondents who have fully or partially deployed
Hadoop, 2014
The leading role of BI and DW managers makes sense, since many organizations
plan to supplement or replace some or all of their current DW environments
with Hadoop, as discussed above. Obviously, BI and DW managers either want
to lead their organizations into the future or preserve their job security as
Hadoop rolls over the traditional data warehousing world. Another explanation
for the preponderance of BI and DW managers running Hadoop deployments is
that our survey was directed at BI and DW managers more than at other
categories of IT management.
When we dissect these results by size of organization, we see that the IT
department plays a somewhat more prominent role in both small and large
organizations, while BI and DW managers, along with data scientists, take the
lead in medium-size organizations. Among large organizations, 44% of
respondents say their Hadoop deployments are driven by the IT department,
while 41% are driven by the BI team, 38% by the data warehousing team, 32% by
the application development team and 29% by a data scientist (see Figure 15).
Figure 15. Who’s Driving Development of Hadoop—by Organization Size (in
annual revenues)
Based on answers by 368 respondents who have fully or partially deployed
Hadoop, 2014
Vendors
As in any new large market, many vendors are jockeying for position, both
startups and established commercial players. Today, Cloudera is the front-
runner among data management players supplying Hadoop software. A third of
respondents that have deployed Hadoop (33%) are using Cloudera. Trailing
Cloudera are traditional database management vendors IBM (25%), Oracle
(21%) and Microsoft (20%). Amazon, which provides Hadoop services in the
cloud, registers 20%, a strong showing for a nontraditional data management
vendor. Next on the list are Hadoop “pure plays”—MapR (17%) and
Hortonworks (16%), followed by Teradata and HP (14% each), with Pivotal,
Intel, and Actian trailing further at 7%, 6%, and 3% respectively (see Figure 16).
Figure 16. Data Management Vendors Supplying Hadoop Environments
Based on answers by 288 respondents who have fully or partially deployed
Hadoop, 2014
Data integration. One of the key strengths of Hadoop is its ability to parse and
transform any kind of data. Until recently, data scientists hand-coded
transformation and parsing programs in MapReduce or Java, which is not the most
efficient method. Now, traditional data integration vendors have integrated
their products with Hadoop, enabling organizations to use visual tools to create
transformation programs. More important, these tools let the large population
of existing data integration developers build programs for Hadoop data without
having to learn a new tool.
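Those hand-written programs follow the classic map/shuffle/reduce pattern. The sketch below is a hypothetical, single-process Python simulation of that pattern (the log records are invented for illustration), counting records by severity the way a hand-coded MapReduce job would; in a real cluster, Hadoop distributes the map and reduce steps and performs the shuffle itself:

```python
from itertools import groupby
from operator import itemgetter

records = [
    "ERROR disk full",
    "INFO startup complete",
    "ERROR timeout",
]

# Map: emit a (key, 1) pair for each record, keyed by severity
mapped = [(line.split()[0], 1) for line in records]

# Shuffle: group intermediate pairs by key (the Hadoop framework does
# this between the map and reduce phases)
mapped.sort(key=itemgetter(0))

# Reduce: sum the counts for each key
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'ERROR': 2, 'INFO': 1}
```

Visual data integration tools generate this kind of boilerplate for developers, which is why they lower the barrier to building Hadoop transformations.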
Among vendors providing data integration services for Hadoop, Oracle leads
the pack with 25% of respondents, followed closely by IBM (24%), Microsoft
(22%) and Informatica (21%). A second tier of data integration vendors consists
of Talend (13%) and Pentaho (12%). Pivotal, SnapLogic and Syncsort also are
contenders in the data integration space (see Figure 17).
Figure 17. Data Integration Vendors Supplying Products for Hadoop
Based on answers by 269 respondents who have fully or partially deployed
Hadoop, 2014
Analytics vendors. The leading database and software providers are also the
leading analytics vendors. IBM is in the lead, with 27% of companies that have
deployed Hadoop using IBM software or services to analyze big data. Microsoft
and Oracle follow with 26% and 25%. Visual discovery tool Tableau is used by
22%, while SAP (20%), SAS (18%), Teradata (13%), MicroStrategy (9%), Splunk
(8%), Pivotal (7%) and smaller vendors round out the list (see Figure 18).
Figure 18. Analytics Vendors Supplying Products for Hadoop
Based on answers by 269 respondents who have fully or partially deployed
Hadoop, 2014
Hardware. The traditional hardware vendors are the go-to choices for
companies that have implemented Hadoop on-premises. IBM leads the way
with 33% of respondents using IBM hardware to deploy Hadoop. IBM is
followed closely by Dell (31%), HP (25%), Cisco (21%) and Oracle (19%). (See
Figure 19.)
Figure 19. Hadoop Hardware Vendors
Based on answers by 242 respondents who have fully or partially deployed
Hadoop, 2014
Consulting companies. More than half of organizations that have deployed
Hadoop (58%) said they currently retain outside consulting assistance, while
42% do not. Among consultancies providing Hadoop services, IBM Global
Business Services is used by 14% of respondents, followed closely by Accenture
(13%). Further down, Cognizant is used by 8% of respondents, while Capgemini,
Tata and Wipro are each used by 7% of respondent companies, followed by PWC,
KPMG, Dell, Deloitte, Ernst & Young, Booz Allen Hamilton and a host of other
boutique consultancies (see Figure 20).
Figure 20. Hadoop Consultants
Based on answers by 242 respondents who have fully or partially deployed
Hadoop, 2014
Conclusion
Hadoop is establishing a foothold in enterprises. It is maturing and gradually
becoming a key piece of the enterprise data infrastructure at many
organizations. Until recently, mainstream companies experimented with
Hadoop to learn its functionality and capabilities and better understand the
role it could play in their existing analytical ecosystems.
Today, many of these organizations are moving Hadoop into production to
support data processing and analytical requirements or to offload workloads
from existing data warehouses to save money. IT and data management
professionals who have implemented Hadoop give it high ratings. The software
is evolving fast, thanks to the hard work of many vendors in the space, as well
as practitioners who contribute code to the Apache Software Foundation.
Although it remains to be seen whether Hadoop can fulfill its promise as the
enterprise operating system for data analytics, it’s moving in the right direction.
WAYNE ECKERSON is principal consultant of Eckerson
Group LLC (www.eckerson.com), a business-technology
consulting firm that helps business leaders use data and
technology to drive better insights and actions. His team
of consultants provides information and advice on
business intelligence, analytics, performance
management, data governance, data warehousing and
big data. They work closely with organizations that want to assess their current
capabilities and develop a strategy that optimizes their investments in business
intelligence and analytics.
Eckerson has conducted many groundbreaking research studies, chaired
numerous conferences and written two widely read books: The Secrets of
Analytical Leaders: Insights from Information Insiders (2012) and Performance
Dashboards: Measuring, Monitoring, and Managing Your Business (2005/2010).
He is currently working on a book about data governance.
Write him at [email protected].
Hadoop in the Enterprise: With Maturity Come Manageability Challenges
is a SearchBusinessAnalytics e-publication.
Wayne Eckerson
Principal Consultant, Eckerson Group
Doug Olender
Publisher
TechTarget
275 Grove Street, Newton, MA 02466
www.techtarget.com
© 2014 TechTarget Inc. No part of this publication may be transmitted or reproduced in any form
or by any means without written permission from the publisher. TechTarget reprints are available
through The YGS Group.
ABOUT TECHTARGET:
TechTarget publishes media for information technology professionals. More
than 100 focused websites enable quick access to a deep store of news, advice
and analysis about the technologies, products and processes crucial to your
job. Our live and virtual events give you direct access to independent expert
commentary and advice. At IT Knowledge Exchange, our social community, you
can get advice and share solutions with peers and experts.