BigDataReport Final

Embed Size (px)

Citation preview

  • 7/29/2019 BigDataReport Final

    1/35

    ExploitingBig DataSeies Iei ih Hd

    Deie Bsiess Isihs

    By Wayne eckerson

    Dirctr f Rarc, Bin Alicatin and Arcitctr Mdia Gr, TcTargt, Jn 2012

  • 7/29/2019 BigDataReport Final

    2/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 2

    e s

    The big daTa movemenT is transorming the business intelligence (BI) indus-

    try. The term not only reers to large volumes o multi-structured data but

    also new technologies, specically Hadoop and analytical databases. The

    combination expands the horizon or analytical inquiries, providing exception-

    ally high price-perormance and new, fexible data processing rameworks.Analytical databases oer exceptionally high price-perormance or ana-

    lytical workloads. Analytical databases come as appliances, sotware-only

    products or cloud-based services. The ormer are easy to deploy and man-

    age, while the latter integrate seamlessly into an existing data center, giv-

    ing administrators fexible deployment

    options. Companies are implementing

    analytical databases to provide new ana-

    lytical capabilities or augment or replace

    aging data warehousing platorms. Theyare also using them to serve as analytical

    sandboxes that ofoad new or existing

    analytical workloads rom overburdened

    data warehouses.

    Hadoop adds large-scale storage o

    multi-structured dataWeb server logs,

    video, audio, systems logs and sensor

    datato the mix o content that organiza-

    tions can capture and rene or business analysis. It also increases agility byletting data proessionals land data without rst having to transorm it into

    an ocially sanctioned schema. And because o its open source heritage,

    Hadoop reduces the cost o storing and transorming large volumes o data.

    About 11% o surveyed organizations have implemented Hadoop, and around

    hal o these early adopters are just experimenting with the system. Nonethe-

    less, the hype over Hadoop has grown to a deaening crescendo, and many BI

    Hadoop adds large-scale

    storage o multi-structureddata to the mix o content

    that organizations can

    capture and refne or

    business analysis.

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    3/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 3

    executive summary

    proessionals are bullish about its uture. Hadoop can provide the leverage

    we need to take our BI environment to the next level, said one BI pro.

    The growing popularity o Hadoop presents a curious challenge to estab-

    lished BI vendors. On one hand, Hadoop represents a new source o data andopportunity or established vendors to drive revenue rom new and existing

    products in an increasingly mature market. On the other, Hadoop, in its ull-

    grown incarnation, could possibly supplant established commercial database

    and data integration products with comparable open source products. But

    today, both Hadoop and SQL database vendors are working together to build

    a robust analytical ecosystem. There are our types o technical integration

    vendors are pursuing, each with its advantages and disadvantages.

    Whatever the uture holds, its clear that SQL tools and Hadoop will be

    joined at the hip, giving users more options or storing, processing and analyz-ing data. n

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    4/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 4

    rh Bgd

    The purpose of this report is to describe the emerging big data ecosystem in

    which Hadoop and analytical databases play an increasingly important role.

    It will then examine ways that traditional BI, extract, transorm and load (ETL)

    and database vendors integrate with Hadoop. Finally, it will speculate on

    how Hadoop and analytical ecosystems will evolve and aect traditional BIapproaches, technologies and vendors.

    The research is based on interviews with BI practitioners, briengs with BI

    providers, including sponsors o this report, and a survey o BI proessionals.

    The ve-minute survey was promoted to the BI Leadership Forum, an online

    group o about 1,000 BI directors and managers, and my 2,000-plus Twitter

    ollowers in April 2012. The survey was started by 170 people and completed

    by 152 people or an 89.4% completion rate. Survey results are based on 158

    respondents who completed the survey and indicated their positions as BI

    or IT proessional, BI sponsor or user or BI consultant. Responses rompeople who selected the position o BI vendor or Other or who didnt com-

    plete the survey were excluded rom the results. Among this set o qualied

    BaSED on 158 rESponDEntS, BI LeADeRshIp FoRuM, January 2012.

    75%BIprofessional

    18%BIconsultant

    7%BIsponsororuser

    fgure 1.

    Dgph:

    rpd P

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

    http://www.bileadership.com/http://www.bileadership.com/http://www.bileadership.com/
  • 7/29/2019 BigDataReport Final

    5/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 5

    researcH BackGrounD

    respondents, most are BI or IT proessionals (75%) rom large companies

    with more than $1 billion in annual revenues (53%). The industries with the

    highest percentage o respondents are consulting (14%) and banking (11%)

    (see fgures 1, 2 and 3). n

    fgure 2.

    Dgph: cp sz

    fgure 3.

    Dgph: id

    53%Large($1B+)

    22%Medium($100Mto$1B)

    25%Small(ware

    HealthcareRetail

    Transpora0on

    Telecommunica0ons

    Internet

    Government/Nonprofit

    Educa0on

    Media

    Other

    http://www.bileadership.com/http://www.bileadership.com/http://www.bileadership.com/http://www.bileadership.com/http://www.bileadership.com/http://www.bileadership.com/
  • 7/29/2019 BigDataReport Final

    6/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 6

    th nw a e

    Theres a revoluTion aoot in the way organizations architect the fow o data

    to support reporting and analysis applications. The ringleader o the revolu-

    tion is big data. There are two types o big data products in the market today.

    There is open source sotware, centered largely on on Hadoop, which elimi-

    nates up-ront licensing costs or managing and processing large volumes odata. And there are new analytical engines, including appliances and column

    stores, which provide signicantly higher price-perormance than general-

    purpose relational databases.

    Both sets o big data sotware deliver higher returns on investment than

    previous generations o data management technology, but in vastly dierent

    ways. Organizations are gradually imple-

    menting both types o systems, among oth-

    ers, to create a new analytical ecosystem.

    undersTanding hadoop

    For many, big data has become syn-

    onymous with Hadoopan open source

    ramework or parallel computing that runs

    on a distributed le system (such as the

    Hadoop Distributed File System, or HDFS.)

    For the rst time, Hadoop enables orga-

    nizations to cost-eectively store all theirdata as well as multi-structured data, such as Web server logs, sensor data,

    email and extensible markup language (XML) data. Because multi-structured

    data consists o 80% o all data by most estimates, the advent o Hadoop

    heralds the dawn o a new age o data processing in which organizations can

    pluck the needle out o the data haystack, which consists o terabytes, i not

    petabytes, o inormation.

    Because multi-structureddata consists o 80% o all

    data by most estimates,

    the advent o Hadoop

    heralds the dawn o a new

    age o data processing.

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    7/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 7

    tHe neW analytical ecosystem

    B. Hadoop, an open source project hosted by the Apache Sotware

    Foundation, brings three major benets to the world o analytical processing:

    1.lw . Because Hadoop is open source sotware that runs on com-modity servers, it radically alters the nancial equation or storing and

    processing large volumes o dataat least in terms o up-ront licensing

    costs. With Hadoop, organizations can nally store all the data they gen-

    erate in its raw orm without having to justiy its business value up ront.

    This creates a low-cost staging and rening area and osters greater data

    exploration and reuse.

    2.ld d g. Compared with relational databases, Hadoop does not

    require developers to convert data to a specic ormat and schema (suchas elds with xed data types, lengths, labels and relationships) to load

    and store it. Rather, Hadoop is a load-and-go environment that handles

    any data ormat and speeds load cycles, which is critical when ingesting

    terabytes o data.

    3. Pd d. Finally, Hadoop MapReduce enables developers to use

    Java, Python or other programming languages to process the data. This

    means developers dont have to learn SQL, which is not as expressive as

    procedural code or certain types o analytical work, such as data miningoperations or inter-row calculations. MapReduce is also implemented in a

    number o analytical databases to augment the amiliarity and enterprise

    adoption o SQL.

    Given these benets, the data management industry is understandably

    abuzz about big data. Organizations are now exploiting creative ways to use

    Hadoop. For example, Vestas Wind Systems, a wind turbine maker, uses

    Hadoop to model larger volumes o weather data so it can pinpoint the opti-

    mal placement o wind turbines. And a nancial services customer usesHadoop to improve the accuracy o its raud models by addressing much larg-

    er volumes o transaction data.

    Dwb. But Hadoop is not a data management panacea. Its essentially

    a 1.0 product that is missing many critical ingredients o an industrial-proo

    data processing environmentrobust security, a universal metadata catalog,

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    8/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 8

    tHe neW analytical ecosystem

    backup sotware, rich management utilities and in-memory parallel pipelining.

    Moreover, Hadoop is a batch-processing environment that doesnt support

    speed-o-thought, iterative querying. And although higher-level languages

    exist, Hadoop today requires data scientists who know how to write Javaor other programs to run data processing jobs, and these olks are in scarce

    supply.

    To run a production-caliber Hadoop environment, you need to get sot-

    ware rom a mishmash o Apache projects, which have colorul names like

    Flume, Sqoop, Oozie, Pig, Hive and ZooKeeper. These independent projects

    oten have competing unctionality and separate release schedules, and they

    arent always tightly integrated. And each project evolves rapidly. Thats why

    there is a healthy market or Hadoop distributions that package these compo-

    nents into a reasonable set o imple-mentable sotware. Its also why the

    jury is still out on the total cost o

    ownership or Hadoop. Today the

    cost o putting Hadoop into produc-

    tion is high, although this will improve

    as the environment stabilizes and

    administrative and management utili-

    ties are introduced and mature.

    Many BI vendorsincluding data-base, data integration and reporting

    and analysis vendorsare working to

    integrate with Hadoop, making it eas-

    ier to access, manipulate and query

    Hadoop data than is currently possible with native Apache sotware. The

    rapid merging o traditional BI and Hadoop environments creates a new ana-

    lytical ecosystem that expands the ways organizations can eectively exploit

    data or business gain.

    adp . Leading-edge companies rom a variety o industries have

    already implemented Hadoop, and many more are considering it. A survey o

    the BI Leadership Forum in April 2012 shows that 10% o organizations are

    implementing Hadoop or have already put it into production, while 20% are

    experimenting with it in house and 32% are considering it. On the downside,

    38% have no plans to use Hadoop (see fgure 4, page 9).

    Today the cost o putting

    Hadoop into production is

    high, although this will im-

    prove as the environment

    stabilizes and administrative

    and management utilitiesare introduced and mature.

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    9/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 9

    tHe neW analytical ecosystem

    Many BI directors have heard the buzz around Hadoop and are starting to

    explore its value. Most are ocused on use cases that Hadoop can support,

    ranging rom technical workloads such as data processing, reporting and data

    mining to business applications such as raud detection, risk modeling, churn

    analysis, sentiment analysis and cross-selling. But most havent sucientlyexamined how Hadoop ts into and extends their current BI architecture.

    undersTanding analyTical daTabases

    The other type o big data predates Hadoop by several years. It is less a

    movement than an extension o existing relational database technology

    optimized or query processing or advanced analytics. These analytical plat-

    orms span a range o technology, rom appliances and columnar databases

    to shared nothing, massively parallel processing (MPP) databases. And theyhave unique extensions that in some cases include a MapReduce processing

    ramework or advanced analytics. The common thread among them is that

    most are read-only environments that deliver exceptional price-perormance

    compared with general-purpose relational databases originally designed to

    run transaction processing applications.

    Teradata laid the groundwork or the analytical platorm market when it

    fgure 4.

    adp Hdp

    38%Noplans

    20%Experimenting

    4%

    Inproduction

    BaSED on 158 rESponDEntS,BI LeADeRshIp FoRuM, aprIl 2012.

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

    http://www.bileadership.com/http://www.bileadership.com/http://www.bileadership.com/
  • 7/29/2019 BigDataReport Final

    10/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 10

    tHe neW analytical ecosystem

    launched the rst analytical appliance in the early 1980s. Sybase was also an

    early orerunner, shipping the rst columnar database in the mid-1990s. IBM

    Netezza kicked the current market into high gear in 2003 when it unveiled

    a popular analytical appliance and was soon ollowed by dozens o startups.Oracle launched the Exadata appliance in 2008, and it has been one o the

    companys most successul products ever. Aster Data (now part o Tera-

    data) tightly integrated MapReduce with SQL processing. Startup ParAccel,

    which oers a sotware-only analytical database, has also achieved consider-

    able traction, especially among nancial services and other companies that

    require the highest level o perormance or complex queries and analytical

    workloads. And several BI vendors, notably

    MicroStrategy, Oracle and Tableau, have

    been architecturally designed to exploitthese big data sources.

    B. Although the price tag o ana-

    lytical databases can oten exceed $1 mil-

    lion including hardware, customers nd that

    they are well worth the cost. Some use the

    technology to ofoad complex analytical

    workloads rom existing data warehouses

    and avoid costly upgrades and time-con-suming rewrites o existing applications.

    Others use SQL-MapReduce capabilities

    embedded in some analytical databases to analyze multi-structured data at

    scale and increase the accuracy o customer behavior models. But most use

    analytical databases to replatorm aging data warehouses in an eort to elimi-

    nate perormance bottlenecks and run a backlog o critical applications.

    For example, Virginia-based XO Communications recovered $3 million in

    lost revenue rom a new revenue assurance application it built on an analytical

    appliance, even beore it had paid or the system. Gilt Groupe, an online luxuryretailer, uses SQL-MapReduce processing to tie together clickstream inorma-

    tion with email logs, ad viewing inormation, Twitter and operational inorma-

    tion to discover consumer preerences and improve the shopping experience.

    Australian Finance Group, the largest provider o mortgage services in Austra-

    lia, implemented an analytical appliance to accelerate query and log-in peror-

    mance o a critical application used by its mortgage brokers, achieving a 42%

    Although the price

    tag o analytical data-

    bases can oten exceed

    $1 million including

    hardware, customers

    fnd that they are well

    worth the cost.

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    11/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 11

    tHe neW analytical ecosystem

    return on investment in three years.

    chg. Given the up-ront costs o analytical databases, organizations

    usually do a thorough evaluation o these systems beore jumping on board.First, companies determine whether an analytical database outperorms their

    existing data warehouse databases to a degree that warrants migration and

    retraining costs. This requires a proo o concept in which a customer tests

    the systems in its own data center using its own data across a range o que-

    ries. The good news is that the new analyti-

    cal databases usually deliver jaw-dropping

    perormance or most queries tested. In

    act, many customers dont believe the ini-

    tial results and rerun the queries to makesure that the results are valid.

    Second, companies must choose rom

    more than two dozen analytical databases

    on the market today. For instance, they

    must decide whether to purchase an appli-

    ance or a sotware-only system, a columnar

    database or an MPP database, or an on-premises system or a Web-based

    cloud service through vendors, such as Amazon Web Services. Evaluating

    these options takes time, and many companies create a short list that doesntalways contain comparable products.

    Finally, companies must decide what role an analytical platorm will play in

    their data warehousing architectures. Should it serve as the data warehous-

    ing platorm? I so, does it handle multiple workloads easily and scale linearly

    to handle the growth in numbers o concurrent users? Or is it best used as

    an analytical sandbox? I so, how well does it handle complex queries? Does

    it have a built-in library o analytical algorithms and an extensibility rame-

    work to add new ones? What kinds o data make sense to ofoad to the new

    system? How do you rationalize having two data warehousing environmentsinstead o one?

    Today, small and medium-sized companies which have tapped out their

    Microsot SQL Server or MySQL data warehouses usually replace them with

    analytical appliances to get better perormance. But large companies that have

    invested millions o dollars in a data warehouse typically augment the data

    warehouse with outboard systems designed to handle unique workloads or

    Companies must decide

    what role an analyticalplatorm will play

    in their data warehous-

    ing architectures.

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    12/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 12

    tHe neW analytical ecosystem

    types o data. Here the data warehouse becomes the hub that eeds multiple

    downstream systems with clean, integrated data. This hub-and-spoke archi-

    tecture helps organizations avoid a costly upgrade to their legacy data ware-

    houses, which might exceed the cost o a new analytical database.

    The new analyTical ecosysTem

    fgure 5 depicts an architecture or a new analytical ecosystem that has the

    ngerprints o big data all over it. The objects in blue represent the traditional

    data warehousing environment, while those in pink represent new architec-

    tural elements made possible by new big data and analytical technologies.

    By extending the traditional BI architecture with Hadoop, analytical data-

    bases and other analytical components, organizations gain the best o top-down reporting and bottom-up analytics in a single ecosystem. Traditionally,

    pera ona sys ems(structureddata)

    CasualuserOperationalExtract,transform and load

    (Batch,nearreal timeorreal time)

    BI

    sys em

    Operational

    system

    CEPengine

    Machinedata

    Hadoopcluster

    Dept.datamart

    Datawarehouse

    Topdownarchitecture

    Web data

    Virtualsandboxes Bottomuparchitecture

    In-

    memorysandbox

    Free

    standing

    sandbox

    Audio/video

    data

    Analyticalplatformornonrelationaldatabase

    PoweruserDocumentsandtext

    Externaldata

    fgure 5.

    th nw a e

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    13/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 13

    tHe neW analytical ecosystem

    reports and analyses run in separate environmentsreports using data ware-

    houses and analyses using Excel, Microsot Access and renegade data marts.

    But the new analytical ecosystem brings business analysts into the main-

    stream, enabling them to conduct reeorm analyses inside the corporate datainrastructure using a variety o analytical sandboxes.

    tp-dw pg. The top-down world o reporting and dashboard

    (depicted in the top hal o fgure 5) delivers answers to questions that users

    dene in advance. BI developers gather requirements, stamp questions into

    a multidimensional data model, nd relevant data, map it to the target model

    and build interactive reports that answer the predened questions with

    some options or urther exploration. Although the top-down world generally

    processes data in large batches at night, it can also handle continuous datausing an operationalized data warehousing environment or a complex event

    processing (CEP) engine when ultra low-

    latency analytics is required. CEP systems

    examine patterns within high-volume

    event streams and trigger alerts and

    workfows when predened thresholds

    are exceeded.

    B-p pg. The bottom-upworld o analysis (depicted in the lower

    hal o fgure 5) answers new and unan-

    ticipated questions using ad hoc queries,

    visual exploration and predictive analyti-

    cal tools. In the new analytical ecosystem,

    business analysts have many options or

    exploring and analyzing data. While they

    used to be let to their own devices to collect, standardize and integrate data,

    now they can access raw data in Hadoop or standardized, corporate data inthe data warehouse and mix it with their own data. They can then model and

    explore data however they want in ocially sanctioned sandboxes without

    inciting the ire o corporate IT managers. When theyre ready to share insights

    with others, they publish their analyses to a corporate server and turn over

    responsibility to the enterprise BI team to produce the output.

    The new analytical ecosys-

    tem brings business ana-

    lysts into the mainstream,

    enabling them to conduct

    reeorm analyses inside

    the corporate data inra-

    structure using a variety

    o analytical sandboxes.

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    14/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 14

    a db. New analytical sandboxes let power users explore a

    combination o enterprise and local data in a sae haven that wont interere

    with top-down workloads. Dierent types o analysts use dierent sand-

    boxes. Data scientists query raw, multi-structured data in Hadoop using Java,Python, Perl, Hive or Pig. Business analysts query virtual sandboxes in the data

    warehouse or ree-standing analytical appliances designed or complex query

    processing using SQL, built-in analytical algorithms or a combination o SQL

    and MapReduce capabilities. Business analysts and superusers explore sub-

    sets o data using in-memory visualization tools and immediately publish their

    ndings as dashboards or departmental colleagues to consume.

    D fw. In the new ecosystem, a lot o the source data fows through

    Hadoop, which acts as a staging area and online archive or all o an organiza-tions data. This is especially true or multi-structured data, such as log les

    and machine-generated data, but also or some structured data that com-

    panies cant cost-eectively store and process in SQL engines (such as call

    detail records in a telecommunications company). From Hadoop, much o the

    data is ed into a data warehousing hub, which oten distributes data to down-

    stream systems, such as data marts, operational data stores, analytical appli-

    ances and in-memory visualization applications, where users can query the

    data using amiliar SQL-based reporting and analysis tools. Some data is also

    ed directly into ree-standing analytical databases and appliances. To builddata fows, organizations use graphical data integration and data quality tools

    that hide the technical complexity o Hadoop and MapReduce code genera-

    tion. The big data revolution brings major enhancements to the BI landscape.

    First and oremost, it introduces new technologies, such as Hadoop, that

    make it possible or organizations to cost-eectively consume and analyze

    large volumes o multi-structured data. Second, it complements traditional,

    top-down data-delivery methods with more fexible, bottom-up approaches

    that promote ad hoc exploration and rapid application development.

    But combining the top-down and bottom-up worlds is not easy. BI proes-sionals need to assiduously guard data semantics while opening access to data.

    For their part, business analysts and data scientists need to commit to adhering

    to corporate data standards in exchange or getting the keys to the kingdom. n

    tHe neW analytical ecosystem

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    15/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 15

    Bg D td

    according To surveys o the BI Leadership Forum, user organizations have

    invested signicantly in analytical databases while early adopters are taking

    the plunge with Hadoop.

    Almost hal o respondents to a 2011 BI Leadership Forum survey said they

    have either implemented an analytical database (i.e., a sotware-only system)or an analytical appliance (i.e., a hardware-sotware combination). In contrast,

    only 11% have implemented Hadoop (see fgure 6).

    Companies have deployed all three types o platorms in a variety o ways.

    One signicant standout is that 79% o companies that have deployed ana-

    lytical appliances have implemented them to run data warehouses. In most

    cases, this is to replace a legacy relational database or a SQL Server or MySQL

    implementation that ran out o gas. For instance, many Oracle customers

    have migrated their Oracle database licenses to the Oracle Exadata Database

    46%Analyticaldatabase

    49%Analyticalappliance

    11%Hadoop

    fgure 6.

    Dp td

    BaSED on 302 rESponDEntS,BI LeADeRshIp FoRuM, aprIl 2011.

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

    http://www.bileadership.com/http://www.bileadership.com/
  • 7/29/2019 BigDataReport Final

    16/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 16

    BiG Data trenDs

    Machine. Hadoop is also more likely to be used as a prototyping area than

    anything else, which refects the act that most companies today are experi-

    menting with Hadoop rather than putting it in production. Analytical databas-

    es tend to be used as data marts, both dependent and independent, more thananalytical appliances. This refects their use as standalone systems to handle

    unique workloads outside o the data warehouse (see fgure 7).

    hadoop aTTracTing inTeresT

    Although the analytical ecosystem depicted in fgure 5 (see page 12) portrays

    Hadoop as a staging area or enterprise data warehouses and other down-

    stream analytical systems, some big data and BI proessionals think Hadoop

    may attract a growing percentage o analytical queries in the uture.Netfix, or example, uses Hadoop to stage all its operational data beore

    AnalyticalDatabase AnalyticalAppliance Hadoop

    57%

    46%

    79%

    38%

    38%

    Stagingarea

    Datawarehouse47%

    45%

    42%

    16%

    6%Dependentdatamart

    Inde endent data mart

    30%39%

    31%

    19%

    Development/test

    22%36%

    44%Prototyping

    fgure 7.

    u c

    BaSED on 302 rESponDEntS,BI LeADeRshIp FoRuM, aprIl 2011.

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

    http://www.bileadership.com/http://www.bileadership.com/http://www.bileadership.com/
  • 7/29/2019 BigDataReport Final

    17/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 17

    BiG Data trenDs

    aggregating and loading it into its Teradata data warehouse, where users can

    access it using MicroStrategy and other SQL-based tools, according to Kurt

    Brown, director o business intelligence platorms at the company. Business

    analysts can also query the raw data in Hadoop using Java, MicroStrategy(through ree-orm HiveQL) or other languages i they need access to the

    atomic-level data or have an urgent request or inormation that cant wait or

    the data to be loaded into the data warehouse.

    Almost all organizations today that have implemented Hadoop are using it

    as a staging area (92%) to land and transorm data, presumably or loading

    into a downstream data warehouse, where users can analyze it using standard

    BI tools. They are also using it as an online archive (92%) so they never have

    to delete data or summarize it and move it ofine where it is less accessible

    (see fgure 8). In some cases, organizations move data rom their data ware-house into Hadoop or saekeeping instead o into an archival system.

    Today In18Months92%

    92%

    92%

    92%

    StagingareaOnlinearchive

    83%

    58%

    92%

    67%

    TransformationEngineAdhocqueries

    42%

    25%

    58%

    67%

    67%

    ScheduledreportsVisualexploration

    83%

    fgure 8.

    Wh a F D Hdp spp y ogz?

    BaSED on rESponDEntS wHo HavE ImplEmEntED HaDoop ( BI LeADeRshIp FoRuM, aprIl 2012).

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

    http://www.bileadership.com/http://www.bileadership.com/
  • 7/29/2019 BigDataReport Final

    18/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 18

    BiG Data trenDs

    Far ewer organizations are using Hadoop or running queries or reports or

    exploring or mining data. But this will change in the next 18 months, accord-

    ing to the survey. The percentage o companies that plan to use Hadoop or

    ad hoc queries will jump rom 58% to 67%; reporting will climb rom 42% to67% and data mining rom 58% to 83%. But the biggest growth will come in

    visual exploration, rom 25% to 67%, as companies use visualization tools

    such as Tableau to explore data in Hadoop.

    Hdp d. Organizations are using

    Hadoop to store and process all kinds o data,

    not just multi-structured data. In act, almost

    all respondents are using Hadoop today (92%)

    to store transaction data, and that gure willclimb to 100% in 18 months. In act, Netfix

    lands all its operational data rom its consumer

    websites and networking streaming devices into Hadoop. Very little opera-

    tional data gets into Netfixs data warehouse beore rst passing through

    Hadoop.

    Ater transaction data, Hadoop is most oten used to store Web and sys-

    tems logs (67% each), ollowed by social media and semi-structured data

    (58% each). Some companies are using Hadoop to store documents (33%),

    email (25%) and sensor data (17%), but ew are using it to store audio orvideo data yet.

    Whats interesting is that these same companies plan to expand the types

    o data they store in Hadoop over the next 18 months. The astest growing

    data types stored in Hadoop among early adopters will be audio and video

    data, sensor data and email (42% each) and documents (50%). More com-

    panies also plan on capturing social media data in Hadoop (75%) as well as

    Web logs (75%), transaction data (100%) and semi-structured data (67%)

    (see fgure 9, page 19).

    Despite some o the rhetoric coming out o the Hadoop community, earlyHadoop adopters see the technology as serving a complementary role to the

    data warehouse. Most are using Hadoop to handle new workloads (66.7%)

    that dont run in their data warehouse or they are ofoading existing work-

    loads that are straining the current data warehouse (50%). Another third

    (33%) use Hadoop to share existing data warehousing workloads, while

    another quarter (25%) use it to share new workloads (see fgure 10, page 19).

    Organizations are using

    Hadoop to store and

    process all kinds o data.

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    19/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 19

    BiG Data trenDs

    67%

    58%

    75%

    67%

    75%

    e

    ogs

    Systemlogs

    Socialmedia

    92%

    58%

    100%

    67%

    Transactiondata

    Semistructureddata

    0%

    25%

    42%

    42%

    42%

    Sensordata

    Audioorvideo

    Email

    33%50%

    Documents

    fgure 9.

    Wh D D Hdp spp?

    BaSED on rESponDEntS wHo HavE ImplEmEntED HaDoop,The BI LeADeRshIp FoRuM, aprIl 2012.

    0%

    50%

    ReplacesitOffloadsexistingworkloads

    67%

    33%

    HandlesnewworkloadsSharesexistingworkloads

    25%

    8%

    SharesnewworkloadsDon'tknow

    fgure 10.

    ip Hdp y D Wh

    BaSED on rESponDEntS wHo HavE ImplEmEntED HaDoop,The BI LeADeRshIp FoRuM, aprIl 2012.

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

    http://www.bileadership.com/http://www.bileadership.com/http://www.bileadership.com/http://www.bileadership.com/
  • 7/29/2019 BigDataReport Final

    20/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 20

    BiG Data trenDs

    non-implemenTers

    Companies that have yet to implement Hadoop are airly bullish about their

    plans or the technology. For us, Hadoop is a given, said Mike King, technical

    ellow at FedEx Services, who develops database architectures or the ship-ping and logistics company. Were looking at big data with an eye towards

    [implementing] a ull analytics stack. We see Hadoop as part o the puzzle

    and believe its a good place to land raw data sources or analysis.

    While FedEx has a robust data warehouse, it is used primarily to run report-

    ing and dashboarding applications on structured data sets, according to King.

    Hadoop will support new data sources, such as server log data, system peror-

    mance data, Web clickstream data and reerence dataincluding addresses,

    shipments, invoices and claims. And King expects people will analyze much o

    this data in place, using MapReduce jobs to query the data.Like FedEx, 40% o companies plan to implement Hadoop within 12 months,

    and another 22% will do so within two years, according to the survey. About a

    third (30%) are not sure o their plans, and surprisingly, no one said they will

    never implement Hadoop (see fgure 11).

    Clearly, Hadoop has xed itsel in the current imaginations o BI proes-

    sionals who dont want to get let out in the coldits one o the biggest new

    fgure 11.

    epd adp r Hdp b n-ip

    40%

    22%

    Within12monthsWithin24months

    5%

    3%

    Within36monthsIn3+years

    30%

    0%

    NotsureNever

    BaSED on 76 rESponDEntS wHo HavE not yEt ImplEmEntED HaDoop,The BI LeADeRshIp FoRuM, aprIl 2012.

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

    http://www.bileadership.com/http://www.bileadership.com/http://www.bileadership.com/http://www.bileadership.com/
  • 7/29/2019 BigDataReport Final

    21/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 21

    BiG Data trenDs

    developments in data management. (The last big trend was the advent o ana-

    lytical appliances and platorms, which are still in the early adoption phase.)

    Hdp . Companies that have yet to implement Hadoop havea dierent idea about how theyll use the new technology. The largest per-

    centage o potential Hadoop users plans to use it or data mining (57%), ol-

    lowed by ad hoc queries (45%). A smaller percentage sees Hadoop as a stag-

    ing area (37%), online archive (23%) or transormation engine (39%), which

    diers sharply rom organizations that have already implemented Hadoop

    (see fgure 12).

    Much o the disconnect between current and uture adopters o Hadoop

    comes rom the hype emanating out o the big data community, which gener-

    ally portrays Hadoop as an analytical system or multi-structured data insteado a staging and pre-processing area. Analytics is the promise o Hadoop, but

    this wont happen until there is a viable market o easy-to-use graphical tools

    or accessing and analyzing Hadoop data. The BI industry has spent the last

    20 years developing such tools or SQL databases, and its only now getting

    the ormula right. Fortunately, Hadoop vendors will be able to piggyback on

    the lessons learned by BI vendors, i not the actual tools and techniques them-

    fgure 12.

    epd u Hdp b n-ip

    37%

    23%

    39%

    StagingareaOnlinearchive

    TransformationEngine45%

    5%

    27%

    AdhocqueriesScheduledreportsVisualexploration

    57%

    23%

    5%

    DataminingNotsure

    OtherBaSED on 76 rESponDEntS wHo HavE not yEt ImplEmEntED HaDoop (tHE BI LeADeRshIp FoRuM, aprIl 2012,www.BIlEaDErSHIp.com).

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

    http://www.bileadership.com/http://www.bileadership.com/http://www.bileadership.com/
  • 7/29/2019 BigDataReport Final

    22/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 22

    selves (see Integrating with Hadoop, page 25).

    Hdp d. Non-implementers also share a similar prole as Hadoop

    implementers when it comes to the kind o data they expect to store inHadoop with one exception: Non-implementers are less bullish about storing

    transaction data in Hadoop (see fgure 13). In act, only 47% said they plan to

    store transaction data in Hadoop; thats compared with 100% o implement-

    ers. For non-implementers, Hadoop is synonymous with non-structured data.

    ip d wh. Non-implementers share much the same pro-

    le in the way Hadoop will aect their existing data warehouses. First, no one

    plans to replace a data warehouse with Hadoop (see fgure 11, page 20). The

    largest percentage o respondents plans to use it to handle new workloads(58%), ollowed by sharing new workloads (37%), ofoad existing workloads

    (27%) and share existing workloads (24%) (see fgure 14, page 23).

    A majority o non-implementers view Hadoop as a means to analyze new

    multi-structured data in its native orm or process that data and load it into

    their data warehouses or analysis. One respondent wrote, I need more gory

    processing o semi-structured and unstructured data in an MPP environ-

    ment prior to getting that content linked up to our relational data [in the data

    BiG Data trenDs

    fgure 13.

    D th n-ip W s Hdp

    53%

    33%

    47%

    Weblogs

    Systemlogs

    Socialmedia

    44%50%

    24%

    8%

    Transactiondata

    Semistructureddata

    Sensordata

    Audioorvideo

    18%

    18%

    11%

    Email

    Documents

    Notsure

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    23/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 23

    warehouse.] Another said, Hadoop will augment our current business intel-

    ligence and data warehousing environment with large unstructured data sets,

    mostly social networking logs. It will in no way replace our existing data ware-

    house and data marts or ETL environments.

    hadoop hype

    s p. For some, Hadoop is a curiosity that they eel obligated

    to explore because o all the hype. Were looking at Hadoop to see what all

    the excitement is about and whether it oers any benets or us, said John

    Rome, deputy chie inormation ocer and BI strategist at Arizona State Uni-

    versity. All we hear about is big data, big data, big data.

    Rome said Hadoop might be an appropriate place to house the universitysvoluminous Web log, swipe card and possibly social media data as well as

    data rom its learning management system, which generates our million

    records daily. But he wonders whether Hadoop would be overkill or these

    applications since its a good possibility IT could fow this data directly into

    the universitys data warehouse. We look orward to applying these emerg-

    ing tools to administrative tasks and gaining a better understanding o how we

    BiG Data trenDs

    fgure 14.

    th ip Hdp D Wh n-p

    0%

    27%

    ReplacesitOffloadsexistingworkloads

    58%

    24%

    37%

    HandlesnewworkloadsSharesexistingworkloadsSharesnewworkloads

    12%

    3%

    Don'tknowDoesn'tapply

    BaSED on 76 rESponDEntS wHo HavE not yEt ImplEmEntED HaDoop (tHE BI LeADeRshIp FoRuM, aprIl 2012,www.BIlEaDErSHIp.com).executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

    http://www.bileadership.com/http://www.bileadership.com/http://www.bileadership.com/
  • 7/29/2019 BigDataReport Final

    24/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 24

    can help those at ASU who are working on big data challenges and research

    problems, he said.

    A senior vice president o analytics at a major online publishing company

    concurs with Rome. I am not convinced Hadoop is the right solution or theproblem they are trying to solve; however, our team is taking a look. My per-

    sonal belie is that the team partially just wants to play with Hadoop because

    it is a hot technology.

    Nonetheless, both managers plan to evaluate Hadoop in the near uture.

    Rome plans to carve out time this summer or all to evaluate Hadoop, while

    the analytics manager is awaiting the results o his teams experimental usage

    o Hadoop. n

    BiG Data trenDs

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    25/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 25

    igg wh Hdp

    vendors of all sTripes are working to integrate their products with Hadoop so

    users can do any kind o workload with any tools they want. In many respects,

    there seems to be a race between members o the Apache Sotware Founda-

    tion and established BI and data warehousing (DW) vendors to see who can

    wrap Hadoop with the enterprise and analytical unctionality it currently lacks.Both sides are reaching out to the other to improve interoperability and inte-

    gration between Hadoop and established BI and DW environments.

    QuesT for inTeroperabiliTy

    aph pj. The Apache Sotware Foundation has multiple

    projects to address Hadoop shortcomings in the arena o reporting and ana-

    lytics. One promising project is Hive, a higher-level data access language that

    generates MapReduce jobs and is geared to BI developers and analysts. Hiveprovides SQL-like access to Hadoop, although it does nothing to overcome

    its batch processing paradigm. Another is HBase, which overcomes Hadoops

    latency issues but is designed or ast row-based reads and writes to support

    high-perormance transactional applications. Both create table-like structures

    on top o Hadoop les. Another high-level scripting language is Pig Latin,

    which like Hive runs on MapReduce, but is better suited to building complex,

    parallelized data fows, so its something to which ETL programmers might

    gravitate.

    Another emerging Apache project that addresses the lack o built-in meta-data in Hadoop is HCatalog, which takes Hives metadata and makes it more

    broadly available across the Hadoop ecosystem, including Pig, MapReduce

    and HBase. Soon SQL vendors will be able to query HCatalog with MapRe-

    duce, Web and Java Database Connectivity (JDBC) interaces to determine

    the structure o Hadoop data beore launching a query. Armed with this meta-

    data, SQL products will be able to issue ederated queries against SQL and

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    26/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 26

    inteGratinG WitH HaDooP

    Hadoop data sources and return a unied result set without users having to

    know where data lies or how to access it.

    On the data-mining ront, Apache Mahout is a set o data-mining algo-

    rithms or clustering, classication and batch-based collaborative ltering thatrun on MapReduce. Finally, Apache Sqoop is a project to accelerate bulk loads

    between Hadoop and relational databases data ingestion, and Apache Flume

    streams large volumes o log data rom multiple sources into Hadoop.

    These projects are still in the early stages, although many leading-edge

    adopters have implemented them and are providing ample eedback, i not

    code, to the Apache community to make these sotware environments enter-

    prise-ready. They are assisted by many commercial open source vendors that

    contribute most o the code to Apache Hadoop projects and want to create a

    rich, enterprise-ready ecosystem.

    Bi d DW h. In addition, many BI and DW vendors have either

    extended their tool sets to connect to or interoperate with Hadoop. For exam-

    ple, many vendors now connect to Hadoop and can move data back and orth

    seamlessly. Some even enable BI and ETL

    developers to create and run MapReduce

    jobs in Hadoop using amiliar graphical

    development environments. Some startup

    companies go a step urther and buildBI and ETL products that run natively on

    Hadoop. Many database vendors are ship-

    ping appliances that bundle relational data-

    bases, Hadoop and other relevant sotware

    in an attempt to oer customers the best o

    both worlds.

    Most o these Apache projects are led by

    commercial open source vendors, typically

    the Hadoop distribution vendors, whichstrive to enrich their distribution and make

    it more enterprise-ready than the next one. By working under the Apache

    Sotware Foundation umbrella, they accelerate the time to market o these

    eatures and ensure that compatibility will be maintained throughout the

    process.

    Many BI and DW vendors

    have either extended theirtool sets to connect to or

    interoperate with Hadoop.

    For example, many ven-

    dors now connect to Ha-

    doop and can move data

    back and orth seamlessly.

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    27/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 27

    inteGratinG WitH HaDooP

    tp g. Since many organizations plan to add Hadoop to their

    existing BI environments, its important to understand the degree to which BI

    and DW products can interoperate with Hadoop. There are our types o inte-

    gration, each with its own advantages and disadvantages:

    c. Prebuilt data connectors enable developers to view and

    move data in bulk between the two environments.

    Hbd . These products blend SQL and MapReduce unctionality

    into a single data processing environment.

    n Hdp. These run natively on Hadoop but are accessed via a

    graphical user interace that hides the complexities o the underlying pro-cessing environment.

    ipb. Application programming interaces and metadata cata-

    logs enable developers in one environment to design and dynamically

    execute native jobs in another environment.

    Type 1. connecTiviTy

    A data connector enables developers to view and download data in a remotesystem. Since Hadoop stores data in a distributed le system, the data con-

    nector can simply be a le system viewer that connects to Hadoops name

    node, which contains a list o all les in the cluster. The connector also usually

    executes Hadoop le system commands to retrieve specied les distributed

    across the cluster and move them to the requesting system.

    Many data connectors connect via an application programming interace

    (API), like Open Database Connectivity (ODBC), and JDBC drivers, which link

    client applications to relational databases. With an API approach, the con-

    nector enables users to invoke unctions on the remote system, which is thesecond level o integration (see Type 4. Interoperablity, page 30). Most

    vendors dont charge or connectors.

    For example, ParAccel oers a bidirectional parallel connector or Hadoop

    that enables developers to issue SQL calls that invoke appropriate prewritten

    MapReduce unctions in Hadoop and transer les to ParAccel or urther pro-

    cessing. Similarly, Oracle oers its Direct Connector or Hadoop Distributed

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    28/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 28

    inteGratinG WitH HaDooP

    File System, which queries HDFS data using SQL and loads it into the Oracle

    Database or joins it with data already stored there.

    P d . The benet o data connectors is that they are relativelyeasy to build and use. The downsides are that they require moving entire

    data sets, which in a big data environment is not ideal. I the data volumes are

    sizeable, the download may take a long time and create network bottlenecks.

    Administrators who want to move data between Hadoop and relational data-

    base management systems will use Sqoop or Flume or some proprietary bulk

    load utility instead.

    Type 2. hybrid sysTemsHybrid systems create a single environment that supports both Hadoop and

    SQL data and unctions. There are no remote unction calls because all pro-

    cessing happens locally. The single system runs both a relational database

    and MapReduce and supports SQL and le system storage.

    Hybrid systems vary considerably in architecture. Some embed relational

    databases inside Hadoop, while others embed Hadoop and a relational data-

    base inside a single data warehousing appliance.

    For instance, the Teradata Aster MapReduce platorm embeds a MapRe-

    duce engine inside a relational database. Administrators load structuredand multi-structured data into the platorm or data discovery. Teradatas

    patented Aster Data SQL-MapReduce ramework enable users to invoke

    and execute MapReduce jobs that run inside the Aster platorm using stan-

    dard SQL and BI tools. As a result, SQL-MapReduce enables organizations to

    store all their databoth structured and multi-structuredin a single place

    and query it using standard SQL or a variety o programming languages with

    MapReduce. This strategy eectively puts Hadoop and unstructured data into

    the amiliar land o SQL-based processing.

    Hadapt takes the opposite approach in its cloud-based product, which is setto launch this summer. Hadapt installs a Postgres relational database on every

    node in a Hadoop cluster and then converts SQL calls to MapReduce, where

    needed, to process both sets o data at once. Like Aster SQL-MapReduce,

    Hadapt lets users store all their data in one place and use a single query para-

    digm to access all data. These specialized environments are ideal or support-

    ing applications, such as customer analytics, that require querying both struc-

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    29/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 29

    inteGratinG WitH HaDooP

    tured and unstructured data to get a complete picture o customer behavior.

    P d . The benets o a hybrid system are straightorward: (1) It

    creates a unied environment or all types o data, (2) you buy and supportone system instead o two, (3) you can issue a single query that runs against

    any data type and (4) all processing runs locally, providing ast perormance.

    The major downside o such systems is they are somewhat redundant i you

    already manage an enterprise data warehouse or Hadoop environment or

    both. As such, hybrid systems make good analytical sandboxes or discovery

    platorms to handle new applications that require a synthesis o data types

    and analytical techniques.

    Type 3. naTive hadoop

    Some vendors, including a bevy o startups and some established BI and DW

    players, are aggressively courting the Hadoop market by building applications

    that use MapReduce as the underlying execution engine.

    As a new platorm, Hadoop needs to encourage developers to write appli-

    cations that run on top o it. Some stalwart BI and data integration veterans

    have taken up the challenge and have built new applications rom scratch or

    are porting existing applications to run on Hadoop. In most cases, the new

    applications provide a graphical interace that hides the underlying process-ing rom end users, who dont know where the application runs or the nature

    o its execution engine. All they know is that they can query Hadoop data

    using a point-and-click interace.

    For example, startup Datameer oers a native Hadoop application or

    creating dashboards. The tool builds a meta catalog o Hadoop data that it

    imports using 20-plus connectors to a variety o sources, including most rela-

    tional databases, NoSQL databases and social media eeds, such as Twitter.

    Developers use the catalog to populate a Web-based spreadsheet with a sub-

    set o data that they use to design the dashboard. Once theyre satised withthe design, they schedule a job to create the dashboard by running it against

    the entire data set in Hadoop.

    Some established BI and DW vendors are contemplating shipping native

    Hadoop versions o their products. For instance, one leading data integration

    vendor plans this year to ship a native Hadoop version o its ETL tool that will

    convert existing mappings to run as MapReduce jobs on Hadoop. This capa-

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    30/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 30

    inteGratinG WitH HaDooP

    bility is part o a wider set o unctionality that pushes down processing to the

    most ecient platorm in the BI and DW ecosystem.

    P d . The benet o this native approach is that it provides seam-less access to Hadoop data and optimizes MapReduce unctions. Presumably,

    the new tool would better stay abreast o new developments in the Hadoop

    world and optimize MapReduce processing better than SQL vendors. The

    downside is that Hadoop MapReduce is a batch environment that does not

    support iterative queries, at least today. Users need to design the entire dash-

    board beore they hit the execute button, since there is sizable latency when

    running MapReduce jobs. Also, the application would need to make ODBC or

    JDBC calls via MapReduce to access SQL data in relational databases when

    joining this data with Hadoop data or use data services to retrieve this datadynamically.

    Type 4. inTeroperabiliTy

    As mentioned above, interoperability takes a data connector one step urther

    and lets users execute native unctions on the remote system, oten using

    a graphical interace that doesnt require users to write code. This is appli-

    cation-level interoperability using custom or standard APIs and a metadata

    catalog rather than data-level connectivity that links directly to a database orle system. Obviously, writing code to an API requires more time and testing

    than does creating a data connector.

    For example, Tableau uses Clouderas ODBC driver or Hive to query

    Hadoop data directly. The driver shields users rom having to write HiveQL

    statements. Since Hive does not support iterative queries, Tableau users can

    view, connect and download the data they want to an in-memory cache that

    users can query in an ad hoc manner. MicroStrategy also uses Clouderas

    ODBC driver but goes urther by embedding Hive metadata into its semantic

    layer (such as Projects) and Query Builder, dynamically generating HiveQLunder the covers. It also lets developers execute predened HiveQL and Pig

    scripts and HiveQL using the ree-orm query capability.

    Hortonworks, a startup that provides a sotware distribution o Apache

    Hadoop along with training, service and support, establishes deep levels

    o integration with established BI and DW products. Its developers, most

    o whom helped build and manage Yahoos 42,000-node Hadoop cluster,

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    31/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 31

    contribute the majority o code to Apache Hadoop. One o the key pieces o

    Apache Hadoop that it is building is HCatalog, a SQL-like metadata catalog

    accessible via Hive, MapReduce and various REST-based APIs. Talend, an

    open source ETL vendor, already uses HCatalog to design MapReduce work-fows and transormations. HCatalog is likely to become Hadoops de acto

    metadata repository and key piece in any interoperability ramework.

    P d . In the case o Hadoop, the benet o an application-level

    interace is signicant. Users o SQL-based tools can issue a query against

    Hadoop and retrieve just the data they want without having to download a

    much larger data set. In other words, they can use SQL to locate Hadoop data

    and invoke a MapReduce job to process or query that data. In addition, rather

    than execute an import-export unction, users query Hadoop on demand aspart o a single SQL query, which may also join the Hadoop data with other

    data to ulll the SQL request. One downside: these APIs are not bidirection-

    alin other words, a Hadoop developer would need to work with a SQL API

    to query data rom a relational database as part o a MapReduce job. n

    inteGratinG WitH HaDooP

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    32/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 32

    F Hdp

    cooperaTion or compeTiTion? Vendors have been quick to jump on the big

    data bandwagon largely because it opens up a new market or their database,

    data integration and BI products but also because Hadoop represents a poten-

    tial threat. Established sotware vendors stand to lose signicant revenue i

    Hadoop evolves without them or gains robust data management and analyti-cal unctionality that reduces the appeal o their existing products. Most are

    hedging their bets. They are working to interoperate with Hadoop, in case it

    becomes a major new data source, and ensure it remains subservient to their

    existing products.

    Both sides are eager to partner and work together. Hadoop vendors benet

    as more applications run on Hadoop, including BI, ETL and relational database

    management system (RDBMS) products. And commercial vendors benet i

    their existing tools have a new source o data to connect to and plumb. Its a

    big new market whose honey attracts a hive ull o bees.Some customers are already asking whether data warehouses and BI tools

    will eventually be olded into Hadoop environments or the reverse. As one

    survey respondent put it, We are still evaluating between two schools o

    thought: pull rom all sources including Hadoop clusters into the RDBMS or

    pull rom all, including the RDBMS, into the Hadoop cluster.

    Its no one wonder BI proessionals are conused. In the past year, they

    have been bombarded with big data hype, some o which makes it sounds like

    the days o the data warehouse and relational database are numbered. They

    think, Why spend millions o dollars on a new analytical RDBMS i we cando that processing without paying a dime in license costs using Hadoop? Or

    Why spend hundreds o thousands o dollars on data integration tools i our

    data scientists can turn Hadoop into a huge data staging and transormation

    layer? or Why invest in traditional BI and reporting tools i our power users

    can exploit Hadoop using reely available programs, such as Java, Python, Pig,

    Hive or Hbase?

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    33/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 33

    Given the hype it shouldnt be surprising that some user organizations are

    opting or Hadoop when a NoSQL database or more traditional BI solution

    would suce. For example, several customers o Splunk, a NoSQL vendor that

    applies search technology to log data, planned to implement Hadoop untilthey came across Splunk. Even though Splunk is commercial sotware, the

    customers ound the sotware much easier to use than Hadoop, making its

    total cost o ownership much lower. Also, a customer that has large volumes

    o XML data was considering Hadoop but recently discovered MarkLogic,

    another NoSQL database vendor that uses an XML schema to process data. It

    is now investigating whether MarkLogic is a better t than Hadoop.

    Right now, its too early to divine the uture o the big data movement and

    name winners and losers. Its possible that in the uture all data management

    and analysis will run entirely on open source platorms and tools. But its justas likely that commercial vendors will co-opt (or outright buy) open source

    products and unctionality and use them as pipelines to magniy sales o their

    commercial products. Since Hadoop vendors are venture-backed, its likely

    well discover the answer soon enough as investors get restless and push or

    a sale. In this case, a commercial vendor will gain access to the major Hadoop

    distributions and more than likely pursue a hybrid strategy.

    More than likely, well get a mlange o open source and commercial capa-

    bilities that interoperate to create the big data ecosystem as described in

    section one o this report. Ater all, 30 years ater the mainrame revolution,mainrames are still an integral component o many corporate data process-

    ing architectures. And multidimensional databases didnt replace relational

    databases as the building blocks o data warehousing environments; rather,

    multidimensional databases became an external data mart or cubing technol-

    ogy geared to intensive, multidimensional analytical workloads. Our corporate

    computing environments have an amazing ability to ingest new technologies

    and nd an appropriate niche or each within the whole. In IT, nothing ever

    dies; it just nds its niche in an evolutionary ecosystem.

    Accordingly, Hadoop will most likely serve as a staging area and pre-pro-cessing environment or multi-structured data beore it is loaded into ana-

    lytical databases. Like any staging area, some users with requisite skills and

    permission will query and report o this layer, but the majority o users will

    wait until the data is saely integrated, aggregated, cleansed and loaded into

    high-perormance analytical databases, where it can be queried and analyzed

    with SQL-based tools.

    Future oF HaDooP

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    34/35

    ExploItIng BIg Data: StratEgIES or IntEgratIng wItH HaDoop to DElIvEr BuSInESS InSIgHtS 34

    rd

    1.ip db. I you havent already, its time to

    implement an analytical database, either to augment or replace your data

    warehousing engine or create an analytical sandbox or discovery platorm or

    business analysts or both. As part o the process, identiy a critical applica-

    tion or the new system that can kick-start adoption and build support amongexecutives to expand the analytics program.

    2.r -d d. Identiy multi-structured data in yourorganization and how harvesting it might provide business value. Talkto business-side people and ask what they would do i they could query this

    multi-structured data and what it would be worth to them.

    3.ig Hdp. Understand the pros and cons o using Hadoop to

    manage your multi-structured data versus other approaches. Identiya critical application or Hadoop, such as improving the accuracy o raud or

    cross-sell models, to justiy the commitment o time and money and keep it

    rom being labeled a science experiment.

    4.ig hw g wh Hdp. Understand the our types ointegration with Hadoop rom SQL environments. Identiy which levelsyou need or the various Hadoop applications you have identied.

    5.udd d bw Hdp d db.When you should process data in Hadoop versus new analytical data-

    bases is not always clear. Obviously, Hadoop is geared more toward unstruc-

    tured data because its based on a le system, while analytical databases are

    geared more toward structured data. But that doesnt mean you cant do any

    kind o processing in either environment. It simply depends on data volumes,

    internal expertise and the nature o the consuming application. n

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

  • 7/29/2019 BigDataReport Final

    35/35

    aBout tHe autHor

    w ek hs bee hh

    ede i he d ehsi, bsi-

    ess ieiee (BI) d ee

    ee feds sie 1995. He hs

    ded es i-deh esehsdies d is he h he bes-

    sei bk prfrmanc Dabard:

    Maring, Mnitring, and Managing Yr Bin. He is

    ed kee seke d be d he ss d

    ds kshs bsiess is, ee

    dshbds d BI, he is. es,

    Ekes seed s die edi d eseh

    the D wehsi Isie, hee he es he

    s e d ii s d hied is

    BI Eeie Si.

    Ekes is die eseh tehte, hee

    he ies eek b ed wes wd,hih ses ids eds d eies bes -

    ies i he ii BI (see www.b-eye-network.

    com/blogs/eckerson). Ekes is s eside BI

    lede csi (www.bileader.com) d de BI

    ledeshi (www.bileadership.com), ek

    BI dies h ee e ehe ides b

    bes ies i BI d ede he e BI i.

    He be ehed [email protected].

    executive

    summary

    research

    background

    he new analytical

    ecosystem

    big data trends

    integrating

    with hadoop

    Future oF hadoop

    recommendations

    exliting Big Data: stratgi fr

    Intgrating wit had t Dlivr

    Bin Inigt is

    BeyeNETWORK

    e-bii.

    h st

    Edii Die

    J smi Edi, E-pbiis

    J s

    Edi i chie

    l K

    Die oie Desi

    mk b

    pbishe

    [email protected]

    e lt

    Die Ses

    [email protected]

    2012 tehte I. n his b-ii be sied eded

    i b es ih ieeissi he bishe. tehte e-is e ibe hh The YGS Group.

    at TTt: tehte bishesedi ii eh es-sis. me h 100 sed ebsies

    ebe qik ess dee se es,die d sis b he ehies,ds d esses i jb.o ie d i ees ie die

    ess ideede ee ed die. a It Kede Ehe, si i, e die d

    she sis ih ees d ees.

    http://www.b-eye-network.com/blogs/eckersonhttp://www.b-eye-network.com/blogs/eckersonhttp://www.bileader.com/http://www.bileadership.com/mailto:[email protected]://www.b-eye-network.com/http://searchmanufacturingerp.techtarget.com/http://searchbusinessanalytics.techtarget.com/mailto:mbolduc%40techtarget.com%20?subject=http://reprints.ygsgroup.com/m/techtargethttp://reprints.ygsgroup.com/m/techtargetmailto:mbolduc%40techtarget.com%20?subject=http://searchbusinessanalytics.techtarget.com/http://searchmanufacturingerp.techtarget.com/http://www.b-eye-network.com/mailto:[email protected]://www.bileadership.com/http://www.bileader.com/http://www.b-eye-network.com/blogs/eckersonhttp://www.b-eye-network.com/blogs/eckerson