What does big data mean in 2013?

Entire content © 2013 Gleanster, LLC. All rights reserved. Unauthorized use or reproduction prohibited.

Note: This document is intended for individual use. Electronic distribution via email or by posting on a personal website is in violation of the terms of use.

November 2013

About the Pie Chart

The data presented in the pie chart is derived from the Q2 2013 Customer Engagement Survey for this Deep Dive, which provides analyst commentary related to a particular aspect of the topic. The objective is to provide additional perspective and illuminate certain key considerations regarding the implementation of the related technology-enabled business initiative.

To learn more about Gleanster’s research methodology, please visit http://www.gleanster.com/about-us/research-methodology or email research@gleanster.com.

Deep Dive

What does ’Big Data’ Mean in 2013?

While the term “Big Data” is well established, it still means many things to many different audiences. Some view the term with cynical skepticism, while others embrace it as a term that encompasses a series of dramatic shifts in both data storage and analysis. If you attend conferences such as O’Reilly Strata, you’ll see that Big Data at the highest end of the market often refers to an ecosystem firmly anchored in Hadoop. This is the general consensus around the term, but you’ll also see various vendors and organizations attempting to expand “Big Data” to encompass any non-relational database or any practice that employs parallel or distributed processing of large amounts of data.

For the purposes of this Deep Dive, Gleanster is aligned with the school of thought that the truly transformational aspects of “Big Data” lie with the continued adoption of Hadoop. As more and more companies reach critical limits of existing technologies, the rate of adoption of Hadoop and its related ecosystem of tools continues to accelerate. This Deep Dive outlines several tiers of Big Data and provides a set of recommendations and tips for organizations looking to take advantage of novel approaches to both storage and analysis at scale.

83%: The percentage of Top Performers that regard “Generate Customer Insights” as a top reason to monitor social media.

Pie chart (values shown: 68%, 23%): Percentage of Top Performers ranking customer data proliferation as the number one back-office challenge impacting marketing and sales effectiveness.

Big Data Realities in 2013

While Big Data certainly has its skeptics, it would be a mistake not to admit that there has, in fact, been a sea change in the industry. Analytics and storage in 2013 is a very different beast from analytics and storage in 1997. Here are some of the differences:

• In 2013, analytics at scale requires heavy technology investment. In the 1990s a company’s analytics initiative might have been connecting Crystal Reports to their relational database. In 2013, if you have data at scale, you are hiring data scientists and engineers with very specialized skills to manage thousands of servers storing data in HDFS.

• In 2013, there is no best practice. The term “best practice” implies that there is one approach to solving a particular set of problems that could be considered best of all options. This is not the case. While it is true that some of the biggest players in the market have standardized on Cloudera for Hadoop, if you ask 100 companies that have petabyte-scale initiatives how they query and organize this data, you will get 100 different answers.

• In 2013, fortune favors the bold. Companies at the top of this market are not sitting back and waiting for innovation; they are often driving it directly. If you are selling to some of these companies, prepare to walk into a room of engineers who could give your own engineering staff a run for its money. Disruption is a constant.

What’s problematic at this stage is the definition of analytics at scale. What does that really mean? Let’s assume that you still use the term “Big Data.” What, exactly, does that mean? The following sections capture Big Data at several scales, starting with the biggest instances of Big Data. (See Figure 1.)

Tip of the Pyramid: 100s of Petabytes to Single-Digit Exabytes

For an organization running a global social network, this may mean hundreds of petabytes of information, with several organizations poised to exceed exabytes of information in 2013 and beyond. In 2013 it was reported that the NSA was building a storage facility in Utah with a capacity of between 3 and 12 exabytes. Other companies in the social networking and search space have publicly announced that they are starting to operate in the exabyte range.

Figure 1: Top Three Tiers of Big Data (by size of database)

• 1st Tier Big Data (as of 2013, only a handful of companies & government): 100s of Petabytes – Single-Digit Exabytes

• 2nd Tier Big Data (Large Enterprise): 10s – 100s of Petabytes

• 3rd Tier Big Data (Just Enough to be called Big Data): 10 TBs – 9 Petabytes

What’s established is that if you are one of the biggest players in the market, you are starting to amass exabytes of data, and you are likely collecting tens or hundreds of terabytes a day. This is the tip of the Big Data pyramid – companies like Amazon and Google and organizations like the NSA define the state of the art for processing data. Many of these organizations spend so much on the technology that they end up shaping the tools and approaches the rest of the market uses.

Companies like Cloudera and Hortonworks are focused on gaining and retaining several customers at this level. Both companies likely employ entire teams of developers devoted to these A-list clients. No fully proprietary solutions exist at this level, which, at the moment, is driven by a core set of technologies such as Hadoop.

Tip: If you work at one of these organizations, there’s no tip to be given. You likely already understand the challenge and ongoing cost of supporting data storage and analysis at scale. Your engineering team likely understands that the only way forward at this stage is the creative use of vendors when necessary coupled with a heavy investment in open source participation. At the tip of the pyramid, you will often find organizations working together to share common problems.

Second Tier: Tens of Petabytes to Hundreds of Petabytes

Let’s take a step back from the Googles, NSAs, and GCHQs of the world and think about large corporations in general. Here we have companies like Siemens, Merck, Exxon-Mobil – any company with an international presence. These companies very likely have one or two petabyte-scale data centers, but this data is likely isolated to single business units. More often than not, these companies are assembled from a collection of separate business units, and one business unit’s Big Data initiative may be another business unit’s trivial relational database. In this second tier of Big Data, companies often have multiple, competing Big Data initiatives that fail to coalesce into a single initiative.

Vendors often see these companies as many smaller companies that happen to share a name. Selling software for analytics to a large corporation with multiple initiatives often involves several, often competing, sales and marketing efforts. Because these companies tend to have multiple, competing initiatives, there are often several approaches to big data.

Tip: If you work at a company in this tier of Big Data, it is time to start advocating for organization-wide approaches to data storage and analysis. As the burden of Big Data increases exponentially, so does the cost of maintaining the technology and talent required to efficiently store and analyze this data. Starting to move toward a single, unified approach to Big Data will make it easier to enforce data quality and data security standards across multiple business units. Your businesses are also going to benefit from sharing both the cost and experience of working with data at scale.

Third Tier: (Just Big Enough to Be Called) Big Data – 10 TBs to PBs

In 2013, a majority of SMBs should fall into this category (depending on how much data they retain). There’s too much data to load into a single instance of a database. Analytics tools start to break down at this level, and you need to start thinking about technology in a novel way. Once your organization reaches this level of scale, you will need to start hiring advanced talent that is keeping its eye on current trends. In this tier, you have no option but to start making the transition toward new approaches to data storage and analytics. There is no pretending that you can avoid using Hadoop; surrender to it.

In Big Data, it isn’t so much top performers that set trends; it is the top spenders that set these trends.

A 10 TB database is still somewhat small compared with the massive data warehouses at the highest level of this spectrum, but 10 TB is the point at which more traditional approaches to data analysis will begin to falter. Organizations at this level often fall into the trap of believing that a Big Data inflection point can be avoided by dividing larger data sets into smaller, more manageable systems. To avoid a transition to tools such as Hadoop, which are perceived as more “difficult” to use, architects and administrators may start to selectively cull data or delete archived data, throwing valuable data away to keep data at a manageable size.

Tip: In this tier you still have many choices as to the technologies you can use to create scalable systems to both store and analyze data. Open source projects such as Cassandra and Accumulo and vendor solutions such as Vertica and Netezza provide solutions that can scale from this tier to the next. If you are planning an initiative at this stage you’ll be presented with several competing choices. At this stage no one solution provides a best practice, but if you expect to grow to the second tier and beyond, adopt technologies that have a proven track record of being able to scale.

Fourth Tier: (Not Really that Big) Big Data – 750 GB to TBs

Relational databases like PostgreSQL can handle data warehouses in the TB scale, and there are many commercial tools available that have been designed to allow for quick analysis at this level. Companies at this level can reasonably expect to rely on a single vendor to support a single technology. Problems at this scale have been largely solved, and it is more a question of following a very well-established best practice rather than engaging in research and development.

This is the first experience with Big Data many corporations have. At this point in an organization’s development, several members of the technology staff will be tempted to start thinking big. This increases the risk of over-adoption and creates situations where an organization spends multiple millions of dollars on a Big Data initiative that could have been solved by the wise application of hundreds of thousands of dollars to scale a simpler approach. The key point about this phase is that many organizations still have a choice when it comes to implementation strategy.

Tip: If your internal staff is advocating for Hadoop, you should push for justification at this level. Hundreds of gigabytes or even a few terabytes does not Big Data make, and several existing relational databases such as PostgreSQL can easily be configured to handle single or sharded databases that can both store and analyze data in acceptable timeframes. To technical staff, Big Data can be a tantalizing oasis of interesting technology, but many businesses have found premature adoption of Big Data to be a costly mistake, especially for an organization that cannot afford to adequately support the resources and personnel necessary to ensure the success of a Big Data initiative.

Fifth Tier: (Don’t Bother with) Big Data – Less than 500 GB

While you can certainly load data at this scale into technologies like Hadoop, doing so can be overkill. At this scale organizations could reasonably implement systems that hold a majority of the data they need in physical memory. Unless there is a compelling reason to do so because of expectations of future growth, many of these organizations are best served sticking with an existing relational database and investing in tools to run reporting queries.

Tip: At this stage your data problems are approximately one million times smaller than those of the largest companies in the Big Data space, and the data in your database is small enough to fit on a hefty server. Bringing data together in memory is an unbeatable recipe for performance, and you should do this above all else. Adopting “Big Data” technologies at this scale would needlessly increase the cost of technology support and force you to compete with top-tier Big Data organizations over scarce talent. Buy a bigger server; you’d be surprised how long that strategy will succeed.
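The arithmetic behind this tip can be checked with a short sketch. It is illustrative only: the 500 PB figure stands in for a first-tier organization, the `usable_fraction` headroom parameter and the function names are hypothetical, and decimal units (1 TB = 1,000 GB) are assumed.

```python
# Illustrative sketch only. Sizes are expressed in gigabytes using decimal
# units; the helper names and the headroom parameter are assumptions made
# for this example, not figures from the survey.
GB = 1
TB = 1000 * GB
PB = 1000 * TB


def scale_ratio(large_gb: float, small_gb: float) -> float:
    """Return how many times larger one data set is than another."""
    if small_gb <= 0:
        raise ValueError("small_gb must be positive")
    return large_gb / small_gb


def fits_in_memory(dataset_gb: float, server_ram_gb: float,
                   usable_fraction: float = 0.75) -> bool:
    """Return True if the data set fits in the share of RAM set aside for data.

    usable_fraction is a hypothetical allowance for the operating system,
    the database engine, and working memory for queries.
    """
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return dataset_gb <= server_ram_gb * usable_fraction
```

Under these assumptions, `scale_ratio(500 * PB, 500 * GB)` returns 1,000,000, which matches the "one million times smaller" comparison, and `fits_in_memory(400 * GB, 1 * TB)` is true for a hypothetical server with 1 TB of RAM.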

Conclusion

Big Data is more than just a marketing moniker. It is a sea change in the way companies handle and process data at scales once thought impossible. While Big Data has captured the attention of an entire industry, the audience for tools and technology at the highest level of the industry is often very small. While very few organizations need to think about storing and analyzing hundreds of petabytes of data, the hype cycle has created an industry of over-adopters. By assessing your place on this spectrum you can help guide your organization toward a more rational decision-making process when it comes to adopting technologies such as Hadoop.
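As a rough aid for assessing where an organization sits on this spectrum, the five tiers can be expressed as a simple lookup. This is a minimal sketch under stated assumptions: the tier names come from this Deep Dive, but the exact cutoffs are hypothetical where the tier descriptions overlap or leave gaps (the 500 GB to 750 GB range is folded into Tier 5, and the boundary between Tier 2 and Tier 1 is placed at 100 PB). Decimal units are assumed.

```python
# Illustrative sketch only. Tier names follow the Deep Dive; the numeric
# cutoffs are assumptions where the text leaves a gap or an overlap.
from bisect import bisect_right

GB = 1
TB = 1000 * GB
PB = 1000 * TB

# Lower bound (in GB) at which each successive tier begins.
# Assumption: data below 750 GB is treated as Tier 5, and 100 PB is used
# as the (hypothetical) dividing line between Tier 2 and Tier 1.
TIER_FLOORS_GB = [750 * GB, 10 * TB, 10 * PB, 100 * PB]

TIER_LABELS = [
    "Tier 5: (Don't Bother with) Big Data",
    "Tier 4: (Not Really that Big) Big Data",
    "Tier 3: (Just Big Enough to Be Called) Big Data",
    "Tier 2: Tens of Petabytes to Hundreds of Petabytes",
    "Tier 1: 100s of Petabytes to Single-Digit Exabytes",
]


def classify_tier(size_gb: float) -> str:
    """Map a total data size in gigabytes to one of the five tiers."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    # bisect_right counts how many tier floors the size has reached.
    return TIER_LABELS[bisect_right(TIER_FLOORS_GB, size_gb)]
```

For example, `classify_tier(2 * TB)` lands in Tier 4, where the guidance above is to push for justification before adopting Hadoop, while `classify_tier(50 * TB)` lands in Tier 3.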

Deep Dive Talking Points

• 5 Tiers of Big Data

» Tier 1: 100s of Petabytes to Single-Digit Exabytes

» Tier 2: Tens of Petabytes to Hundreds of Petabytes

» Tier 3: (Just Big Enough to Be Called) Big Data – 10 TBs to PBs

» Tier 4: (Not Really that Big) Big Data – 750 GB to TBs

» Tier 5: (Don’t Bother with) Big Data – Less than 500 GB

• In 2013, analytics at scale requires heavy technology investment.

• In 2013, there is no best practice.

• In 2013, fortune favors the bold.




Headquarters

Gleanster, LLC
825 Chicago Avenue - Suite C
Evanston, Illinois 60202

For customer support, please contact [email protected] or +1 877.762.9727

For sales information, please contact [email protected] or +1 877.762.9726

Related Research

Recently published research that may be of interest to senior industry practitioners includes:

An Intro to Big Data for Marketers

The Right Data at the Right Time: Bringing Agility to Data Management

A Stack, Portfolio or Toolbox? Three Approaches to BI System Deployment

Making Business Intelligence More Agile By Focusing on Business Requirements and Internal Alignment

The Gleanster website also features carefully vetted white papers on these and other topics as well as Success Stories that bring the research to life with real-world case studies. To download Gleanster content, or to view the future research agenda, please visit www.gleanster.com.

About Gleanster

Gleanster benchmarks best practices in technology-enabled business initiatives, delivering actionable insights that allow companies to make smart business decisions and match their needs with vendor solutions.

Gleanster research can be downloaded for free. All of it.

For more information, please visit www.gleanster.com.

Lead Author

Tim O’Brien, Principal Analyst & CTO

http://www.gleanster.com/reports/an-intro-to-big-data-for-marketers

http://www.gleanster.com/reports/reports/the-right-data-at-the-right-time-bringing-agility-to-data-management

http://www.gleanster.com/reports/reports/a-stack-portfolio-or-toolbox-three-approaches-to-bi-system-deployment

http://www.gleanster.com/reports/reports/making-business-intelligence-more-agile-by-focusing-on-business-requirements-and-internal-alignment


Data & Analytics
