DOES BIG DATA MEAN BIG STORAGE?

Mikhail Gloukhovtsev, Sr. Cloud Solutions Architect, Orange Business Services

Table of Contents

1. Introduction
2. Types of Storage Architecture for Big Data
2.1 Storage Requirements for Big Data: Batch and Real-Time Processing
2.2 Integration of Big Data Ecosystem with Traditional Enterprise Data Warehouse
2.3 Data Lake
2.4 SMAQ Stack
2.5 Big Data Storage Access Patterns
2.6 Taxonomy of Storage Architectures for Big Data
2.7 Selection of Storage Solutions for Big Data
3. Hadoop Framework
3.1 Hadoop Architecture and Storage Options
3.2 Enterprise-class Hadoop Distributions
3.3 Big Data Storage and Security
3.4 EMC Isilon Storage for Big Data
3.5 EMC Greenplum Distributed Computing Appliance (DCA)
3.6 NetApp Storage for Hadoop
3.7 Object-Based Storage for Big Data
3.7.1 Why Is Object-based Storage for Big Data Gaining Popularity?
3.7.2 EMC Atmos
3.8 Fabric Storage for Big Data: SAN Functionality at DAS Pricing
3.9 Virtualization of Hadoop
4. Cloud Computing and Big Data
5. Big Data Backups
5.1 Challenges of Big Data Backups and How They Can Be Addressed
5.2 EMC Data Domain as a Solution for Big Data Backups
6. Big Data Retention

6.1 General Considerations for Big Data Archiving
6.1.1 Backup vs. Archive
6.1.2 Why Is Archiving Needed for Big Data?
6.1.3 Pre-requisites for Implementing Big Data Archiving
6.1.4 Specifics of Big Data Archiving
6.1.5 Archiving Solution Components
6.1.6 Checklist for Selecting Big Data Archiving Solution
6.2 Big Data Archiving with EMC Isilon
6.3 RainStor and Dell Archive Solution for Big Data
7. Conclusions
8. References

Disclaimer: The views, processes or methodologies published in this article are those of the

author. They do not necessarily reflect the views, processes or methodologies of EMC

Corporation or Orange Business Services (my employer).

1. Introduction

Big Data has become a buzzword today, and we hear about it from early morning –

reading the newspaper that tells us “How Big Data Is Changing the Whole Equation for

Business”1 – through our entire day. A search for “big data” on Google returned about

2,030,000,000 results in December 2013. So what is Big Data?

According to Krish Krishnan,2 the so-called three V’s definition of Big Data that became popular

in the industry was first suggested by Doug Laney in a research report published by META

Group (now Gartner) in 2001. In a more recent report,3 Doug Laney and Mark Beyer define Big

Data as follows: "'Big Data' is high-volume, -velocity, and -variety information assets that

demand cost-effective, innovative forms of information processing for enhanced insight and

decision making.”

Let us briefly review these characteristics of Big Data in more detail.

1. Volume of data is huge (for instance, billions of rows and millions of columns). People

create digital data every day by using mobile devices and social media. Data defined as

Big Data includes machine-generated data from sensor networks, nuclear plants, X-ray

and scanning devices, and consumer-driven data from social media. According to IBM,

as of 2012, every day 2.5 exabytes of data were created and 90% of the data in the

world today was created in the last 2 years alone.4 This data growth is being accelerated

by the Internet of Things (IoT), which is defined as the network of physical objects that

contain embedded technology to communicate and interact with their internal states or

the external environment (IoT excludes PCs, tablets, and smartphones). IoT will grow to

26 billion units installed in 2020, representing an almost 30-fold increase from 0.9 billion

in 2009, according to Gartner.5

2. Velocity of new data creation and processing. Velocity means both how fast data is

being produced and how fast the data must be processed to meet demand. In the case

of Big Data, the data streams in a continuous fashion and time-to-value can be achieved

when data capture, data preparation, and processing are fast. This requirement is more

challenging if we take into account that the data generation speed changes and data

size varies.

3. Variety of data. In addition to traditional structured data, the data types include semi-

structured (for example, XML files), quasi-structured (for example, clickstream string),

and unstructured data.

A misconception that big volume is the key characteristic defining Big Data can result in the failure of a Big Data–related project unless the project also focuses on the variety, velocity, and complexity of the Big Data, which are becoming its leading features. What is seen as a large data

volume today can become a new normal data size in a year or two.

A fourth V – Veracity – is frequently added to this definition of Big Data. Data Veracity deals

with uncertain or imprecise data. How accurate is that data in predicting business value? Do Big Data analytics give meaningful results that are valuable to the business? Data accuracy must

be verifiable.

Just retaining more and more data of various types does not create any business advantage

unless the company has developed a Big Data strategy to get business information from Big

Data sets. Business benefits are frequently higher when addressing the variety of the data

rather than addressing just the data volume. Business value can also be created by combining

the new Big Data types with the existing information assets, which results in even larger data type

diversity. According to research done by MIT and the IBM Institute for Business Value6,

organizations applying analytics to create a competitive advantage within their markets or

industries are more than twice as likely to substantially outperform their peers.

The requirement of time-to-value warrants innovations in data processing that are challenged by

Big Data complexity. Indeed, in addition to the great variety of Big Data types, the combination of different data types – each presenting different challenges and requiring different analytical methods to generate business value – makes data management more complex.

Complexity with an increasing volume of unstructured data (80%–90% of the data in existence

is unstructured) means that different standards, data processing methods, and storage formats

can exist with each asset type and structure.

The level of complexity and/or data size of Big Data has led to another definition: data that cannot be efficiently managed using only traditional data-capture technology and processes

or methods. Therefore, new applications and new infrastructure as well as new processes and

procedures are required to use Big Data. The storage infrastructure for Big Data applications

should be capable of managing large data sets and providing required performance.

Development of new storage solutions should address the 3V+V characteristics of Big Data.

Big Data creates great potential for business development but at the same time it can mean “Big

Mistakes” if a lot of money and time are spent on poorly defined business goals or

opportunities. The goal of this article is to help readers anticipate how Big Data will affect

storage infrastructure design and data lifecycle management so that they can work with storage

vendors to develop Big Data road maps for their companies and spend the budget for Big Data

storage solutions wisely.

While this article considers the storage technologies for Big Data, I want readers to keep in mind

that Big Data is about more than just technology. To gain business advantage from Big Data,

companies have to make changes in the way they do business and develop enterprise

information management strategies to address the Big Data lifecycle, including hardware,

software, services, and policies for capturing, storing, and analyzing Big Data. For more detail,

refer to the excellent EMC course, Data Science and Big Data Analytics8.

2. Types of Storage Architecture for Big Data

2.1 Storage Requirements for Big Data: Batch and Real-Time Processing

Big Data architecture is based on two different technology types: for real-time, interactive

workloads and for batch processing requirements. These classes of technology are

complementary and frequently deployed together – for example, the Pivotal One Platform, which

includes Pivotal Data Fabric9,10 (Pivotal is partly owned by EMC, VMware, and General Electric).

Big Data frameworks such as Hadoop are batch process-oriented. They address the problems

of the cost and speed of Big Data processing by using open source software and massively

parallel processing. Server and storage costs are reduced by implementing scale-out solutions

based on commodity hardware.

NoSQL databases that are highly optimized key–value data stores, such as HBase, are used for

high-performance, index-based retrieval in real-time processing. A NoSQL database can process large amounts of data from various sources in a flexible data structure with low latency. It can also provide real-time data integration with a Complex Event Processing (CEP) engine to enable actionable real-time Big Data Analytics. High-speed processing of Big Data in-flight – so-called

Fast Big Data – is typically done using in-memory computing (IMC). IMC relies on in-memory

data management software to deliver high speed, low-latency access to terabytes of data

across a distributed application. Some Fast Big Data solutions such as Terracotta BigMemory11

maintain all the data in-memory with the motto “Ditch the Disk” and use disk-based storage only

for storing data copies and redo logs for DB startup and fault recovery as SAP HANA, an in-

memory database, does.12 Other solutions for Fast Big Data such as Oracle Exalytics In-

Memory Machine,13 Teradata Active EDW platform,14 and DataDirect Networks' SFA12KX

series appliances15 use hybrid storage architecture (see Section 2.6).
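As a concrete illustration of the key–value access pattern mentioned above, the hedged sketch below uses the HBase Java client (0.94-era API) to write and read a single row; the table name, column family, and row key are hypothetical, and the snippet assumes a reachable HBase cluster configured via hbase-site.xml on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseKeyValueSketch {
    public static void main(String[] args) throws Exception {
        // ZooKeeper quorum and other cluster settings come from hbase-site.xml.
        Configuration conf = HBaseConfiguration.create();

        // The "events" table and "d" column family are assumed to exist already.
        HTable table = new HTable(conf, "events");

        // Index-based write: the row key is the lookup index.
        Put put = new Put(Bytes.toBytes("sensor-42#2014-01-15T10:00:00"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("temperature"), Bytes.toBytes("21.5"));
        table.put(put);

        // Low-latency, index-based read by row key.
        Get get = new Get(Bytes.toBytes("sensor-42#2014-01-15T10:00:00"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("temperature"));
        System.out.println("temperature = " + Bytes.toString(value));

        table.close();
    }
}
```

The point of the sketch is the access style: single-row reads and writes addressed by key, which is what makes this class of store suitable for real-time retrieval rather than batch scans.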

As data of various types are captured, they are stored and processed in traditional DBMS (for

structured data), simple files, or distributed-clustered systems such as NoSQL data stores and

Hadoop Distributed File System (HDFS). Due to the size of the Big Data sets, the raw data is

not moved directly to a data warehouse. The raw data undergoes transformation using

MapReduce processing, and the resulting reduced data sets are loaded into the data warehouse

environment, where they are used for further analysis – conventional BI reporting/dashboards,

statistical, semantic, correlation capabilities, and advanced data visualization.

Figure 1: Enterprise Big Data Architecture (Ref. 10)

Storage requirements for batch processing and for real-time analytics are very different. Whereas the capability to store and manage hundreds of terabytes in a cost-effective way is an important requirement for batch processing of Big Data, low I/O latency is the key for large-capacity, performance-intensive Fast Big Data analytics applications. Storage architectures for such I/O-intensive applications, which include all-flash storage appliances and/or in-memory databases, are outside the scope of this article.

2.2 Integration of Big Data Ecosystem with Traditional Enterprise Data Warehouse

The integration of a Big Data platform with traditional BI enterprise data warehouse (EDW)

architecture is important, and this requirement, which many enterprises have, results in

the development of so-called consolidated storage systems. Instead of a “rip & replace” of the

existing storage ecosystems, organizations can leverage the existing storage systems and

adapt their data integration strategy using Hadoop as a form of preprocessor for Big Data

integration in the data warehouse. The consolidated storage includes storage tiering and is used

for very different data management processes: for primary workloads, real-time online analytics

queries, and offline batch-processing analytics. These different data processing types result in

heterogeneous or hybrid storage environments discussed later. Readers can find more

information about consolidated storage in Ref. 16.

The new integrated EDW should have three main capabilities:

1. Hadoop-based analytics to process and analyze any data type across commodity server

clusters

2. Real-time stream processing with Complex Event Processing (CEP) engine with sub-

millisecond response times

3. Data Warehousing providing insight with advanced in-database analytics

The integration of Big Data and EDW environments is reflected in the definition of “Data Lake”.

2.3 Data Lake

Booz Allen Hamilton introduced the concept of Data Lake.17 Instead of storing information in

discrete data structures, the Data Lake consolidates an organization’s complete repository of

data in a single, large table. The Data Lake includes data from all data sources – unstructured,

semi-structured, streaming, and batch data. To be able to store and process terabytes to

petabytes of data, a Data Lake should scale in both storage and processing capacity in an

affordable manner. An enterprise Data Lake should provide highly available, protected storage;

support existing data management processes and tools as well as real-time data ingestion and

extraction; and be capable of data archiving.

Support for Data Lake architecture is included in Pivotal's Hadoop products that are designed to

work within existing SQL environments and can co-exist alongside in-memory databases for

simultaneous batch and real-time analytic queries.18 Customers implementing Data Lakes can

use the Pivotal HD and HAWQ platform for storing and analyzing all types of data – structured

and unstructured.

Pentaho has created an optimized system for organizing the data that is stored in the Data

Lake, allowing customers to use Hadoop to sift through the data and extract the chunks that

answer the questions at hand.19

2.4 SMAQ Stack

The term SMAQ stack, coined by Edd Dumbill in a blog post on O’Reilly Radar,20 refers to a

processing stack for Big Data that consists of layers of Storage, MapReduce technologies, and

Query technologies. SMAQ systems are typically open source and distributed and run on

commodity hardware. Similar to the commodity LAMP stack of Linux, Apache, MySQL and

PHP, which has played a critical role in the development of Web 2.0, SMAQ systems are expected

to be a framework for development of Big Data-driven products and services. While Hadoop-

based architectures dominate in SMAQ, SMAQ systems also include a variety of NoSQL

databases.

Figure 2: The SMAQ Stack for Big Data (Ref. 8)

As expected, storage is a foundational layer of the SMAQ stack and is characterized by

distributed and unstructured content. At the intermediate layer, MapReduce technologies enable

the distribution of computation across many servers and support a batch-oriented processing model of data retrieval and computation. Finally, at the top of the stack are the Query functions. Characteristic of this layer is the ability to find efficient ways of defining computation and to provide a platform for “user-friendly” analytics.

2.5 Big Data Storage Access Patterns

Typical Big Data storage access patterns are write-once, read-many-times workloads with

metadata lookups and large block–sized reads (64 MB to 128 MB, e.g. Hadoop HDFS) as well

as small-sized accesses for HBase. Therefore, the Big Data processing design should provide

efficient data reads.
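To illustrate the large-block layout mentioned above, here is a hedged sketch that uses the HDFS Java API to write a file with an explicit 128 MB block size and read it back sequentially; the NameNode address and the paths are placeholders, and in practice fs.defaultFS would come from core-site.xml.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/bigdata/raw/clickstream.log");

        // Write once with a 128 MB block size (4 KB buffer, replication factor 3).
        long blockSize = 128L * 1024 * 1024;
        FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, blockSize);
        out.writeBytes("timestamp,userId,url\n");
        out.close();

        // Read many times: large sequential reads are the typical access pattern.
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(file)));
        System.out.println(reader.readLine());
        reader.close();
        fs.close();
    }
}
```

Large blocks keep the metadata footprint on the NameNode small and favor the long sequential reads that MapReduce jobs issue.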

Both scale-out file systems and object-based systems can meet the Big Data storage

requirements. Scale-out file systems provide a global namespace file system, whereas use of

metadata in object storage (see Section 3.7 below) allows for high scalability for large data sets.

2.6 Taxonomy of Storage Architectures for Big Data

Big Data storage architectures can be categorized into shared-nothing, shared primary, or

shared secondary storage. Implementation of Hadoop using Direct Attached Storage (DAS) is

common, as many Big Data architects see shared storage architectures as relatively slow,

complex, and, above all, expensive. However, DAS has its own limitations (first of all,

inefficiency in storage use) and is an extreme in the broad spectrum of storage architectures. As

a result, in addition to DAS-based HDFS systems, enterprise-class storage solutions using

shared storage (scale-out NAS such as EMC Isilon®, or SAN), alternative distributed file systems, cloud object-based storage for Hadoop (using REST APIs such as CDMI, S3, or Swift; see Ref. 7 for detail), and decoupled storage and compute nodes, such as solutions using vSphere BDE (see

Section 3.9) are gaining popularity (see Figure 3).

For Hadoop workloads, the storage resource to compute resource ratios vary by application and

it is often difficult to determine them in advance. This challenge makes it imperative that a

Hadoop cluster be designed for flexibility, with the ability to scale storage and compute independently.

Decoupling storage and compute resource is a way to scale storage independent of compute.

Examples of such architectures are SeaMicro Fabric Storage and Hadoop virtualization using

VMware BDE, which are discussed in Sections 3.8 and 3.9, respectively.

Figure 3: Technologies Evaluated or Being Deployed to Meet Big Data Requirements (Ref. 21)

Figure 3 presents technologies in the priority order in which companies are evaluating them or

have deployed them to meet Big Data requirements.21 EMC Isilon is an example of shared

storage (scaled-out NAS) as primary storage for Big Data Analytics. A Big Data “Stack” like the

EMC Big Data Stack presented below needs to be able to operate on a multi-petabyte scale to

handle structured and unstructured data.

Technology Layer                 EMC Product
Collaborative – Act              Documentum® xCP, Greenplum® Chorus
Real Time – Analyze              Greenplum + Hadoop, Pivotal
Structured/unstructured data     Pivotal HD, Isilon
Storage, Petabyte Scale          Isilon, Atmos®

Table 1. EMC Big Data Stack

A challenge for IMC-based Big Data Analytics is that the volume of data that companies want to

analyze grows faster than memory becomes affordable. The 80/20 rule applies to many

enterprise analytic environments – only 20% of the data generate 80% of I/Os for a given period

of time. As the data ages, it is more rational to implement dynamic storage tiering in a hybrid

storage architecture rather than placing all the data in-memory. The goal of hybrid storage

architecture is to address the great variety of storage performance requirements that various

types of Big Data have by implementing dynamic storage tiering which moves data chunks

between different storage pools of SSD, SAS, and SATA drives. The hybrid storage in the

Teradata Active EDW platform14 and DataDirect Networks’ Storage Fusion Architecture15

exemplifies this storage architecture type.
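As a rough, illustrative translation of that 80/20 rule of thumb into tier sizing (the figures below simply restate the rule above; actual hot-data ratios are workload dependent), for a 1 PB analytic data set:

$$\text{hot tier (SSD/SAS)} \approx 0.2 \times 1\,\text{PB} = 200\,\text{TB}, \qquad \text{capacity tier (SATA)} \approx 0.8 \times 1\,\text{PB} = 800\,\text{TB},$$

with the hot tier expected to absorb roughly 80% of the I/O for the period in question.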

DataDirect Networks' SFA12KX series appliances are built on the company's integrated Big

Data-oriented Storage Fusion Architecture. A single SFA12KX appliance, which can accommodate a mix of up to 1,680 SSD, SATA, or SAS drives, delivers up to 1.4 million input/output operations

per second (IOPS) and pushes data through at a rate of 48 GB per second.15

Nutanix offers Converged Storage Architecture. Nutanix Complete Cluster using the Nutanix

Distributed File System (NDFS) enables MapReduce to be run without HDFS and its

NameNode. The Nutanix Complete Cluster consolidates DAS with compute resources in four-

node Intel-based appliances called “Compute + Storage Together.”22 The internal storage – a combination of PCIe-SSD (Fusion-io) and SATA hard disks from all nodes – is virtualized into a

unified pool by Nutanix Scale-out Converged Storage and can be dynamically allocated to any

virtual machine or guest operating system. A Nutanix Controller Virtual Machine (VM) on each

host manages storage for VMs on the host. Controller VMs work together to manage storage

across the cluster as a pool using the Nutanix Distributed File System (NDFS).

2.7 Selection of Storage Solutions for Big Data

Many Big Data solutions emphasize low cost; however, there are also high-cost solutions such

as those using enterprise-class storage that remain cost effective because they yield significant

benefits. Choosing a storage architecture for Big Data is a mix of science and art to find the right

balance between TCO and value for the business.

3. Hadoop Framework

3.1 Hadoop Architecture and Storage Options

The Apache Hadoop platform,23 an open-source software framework supporting data-intensive

distributed applications, has two core components: the Hadoop Distributed File System (HDFS),

which manages massive unstructured data storage on commodity hardware, and MapReduce,

which provides various functions to access the data on HDFS. HDFS architecture evolved from

the Google File System (GFS) architecture. MapReduce consists of a Java API as well as software to

implement the services that Hadoop needs to function. Hadoop integrates the storage and

analytics in a framework that provides reliability, scalability, and management of the data.
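To make the division of labor between HDFS and MapReduce concrete, below is a minimal word-count job written against the Hadoop 2.x Java MapReduce API (essentially the canonical Apache tutorial example, not code taken from this article's references); the input and output paths are supplied on the command line and are assumed to already live in HDFS.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs on the DataNodes holding the input blocks and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sums the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Map tasks are scheduled, where possible, on the DataNodes that hold the input blocks, so only the much smaller intermediate counts travel over the network to the reducers.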

The three principal goals of the HDFS architecture are:

1. Process extremely large data volumes (large number of files) ranging from

gigabytes to petabytes

2. Streaming data processing to read data at high-throughput rates and process

data on read

3. Capability to execute on commodity hardware with no special hardware

requirements

Hadoop supports several different node types:

• Multiple Data Nodes

• The NameNode, which manages the HDFS name space by determining which DataNode contains the data requested by the client and redirects the client to that particular DataNode

• The Checkpoint node, which is a secondary NameNode that manages the on-disk representation of the NameNode metadata

• The JobTracker node, which manages all jobs submitted to the Hadoop cluster and facilitates job and task scheduling

Subordinate nodes provide both TaskTracker and DataNode functionality. These nodes perform

all of the real work done by the cluster. DataNodes store the data and they serve I/O requests

under the control of the NameNode. The NameNode houses and manages the metadata, and when

a TaskTracker gets a read or write request for an HDFS block, the NameNode informs the

TaskTracker where a block exists or where one should be written. TaskTrackers execute the job

tasks assigned to them by the JobTracker.

Hadoop is “rack aware” – the NameNode utilizes a data structure that determines which

DataNode is preferred based on the “network distance” between them. Nodes that are “closer”

are preferred (same rack, different rack, same data center). The HDFS uses this when

replicating data to try to keep different copies of the data on different racks. The goal is to

reduce the impact of a rack power outage or switch failure so that even if these events occur,

the data may still be readable.
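Rack awareness is usually enabled by pointing Hadoop at an external topology script that maps each host name or IP address to a rack path. A hedged sketch of the relevant setting is below: the property name is the Hadoop 2.x one (Hadoop 1.x used topology.script.file.name), the script path is a placeholder, and in practice the entry lives in core-site.xml rather than in code.

```java
import org.apache.hadoop.conf.Configuration;

public class RackAwarenessConfigSketch {
    public static void main(String[] args) {
        // Shown programmatically only for illustration; normally set in core-site.xml.
        Configuration conf = new Configuration();

        // The script receives host names or IP addresses and prints a rack path
        // such as /dc1/rack07 for each one; the NameNode uses these paths to
        // compute "network distance" and place replicas on different racks.
        conf.set("net.topology.script.file.name", "/etc/hadoop/conf/rack-topology.sh");

        System.out.println("Topology script: " + conf.get("net.topology.script.file.name"));
    }
}
```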

HDFS uses “shared nothing” architecture for primary storage – all the nodes have direct

attached SAS or SATA disks. Direct Attached Storage (DAS) means that a server attaches

directly to storage system ports without a switch. Internal drives in the server enclosure fall into

this category. Since DAS actually uses a point-to-point connection, it provides high bandwidth

between the server and storage system. No DAS-type storage is shared as the disks are locally

attached and there are no disks attached to two or more nodes. The default way to store data

for Hadoop is HDFS on local direct attached disks. However, it can be seen as HDFS on HA-

DAS because the data is replicated across nodes for HA purposes. Compute nodes are

distributed file system clients if scale-out NAS servers are used.

The import stage (putting data into HDFS for processing) and export stage (extracting data from

the system after processing) can be significantly accelerated by replacing conventional hard

disk drives (HDDs) with solid state disks (SSDs). Random read times especially benefit from

using SSDs. For example, Intel has shown that replacement of conventional HDDs with the Intel

SSD 520 Series reduced the time to complete the workload by approximately 80 percent —

from about 125 minutes to about 23 minutes.24 Even though the cost of SSDs continues to

plummet, it is still prohibitively expensive to use all-SSD storage for Hadoop clusters except in some use cases where time-to-value justifies the cost of the SSD-based storage. Therefore, a

tiered storage model combining conventional HDDs and SSDs in the same server can provide

the right balance between the performance and storage cost.

Pros and cons of various storage options25,26 for HDFS are presented in the table below. As

seen from the table, HDFS has a few issues the Apache Hadoop community is working to

address. One of the top issues with Hadoop 1.0 is that the NameNode represents a single point of

failure (SPOF). When it goes offline, the cluster shuts down and has to be restarted at the

beginning of the process that was running at the time of the failure. Version 2.0 of Hadoop

(2.2.0 is the first stable release in the 2.x line; the GA date of Hadoop 2.0 was October 16,

2013) introduces both manual and automated failover to a standby NameNode without needing

to restart the cluster.27 Automatic failover adds two new components to an HDFS deployment: a

ZooKeeper quorum and the ZKFailoverController process.
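A hedged sketch of the HDFS properties involved in automatic NameNode failover is shown below; in a real deployment these entries live in hdfs-site.xml and core-site.xml rather than in code, and the nameservice ID, host names, and ZooKeeper quorum are placeholders.

```java
import org.apache.hadoop.conf.Configuration;

public class NameNodeHaConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Logical nameservice with two NameNodes (placeholder IDs and hosts).
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");

        // Client-side proxy that fails over between nn1 and nn2.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // Automatic failover: ZKFailoverController processes coordinate
        // through the ZooKeeper quorum defined here.
        conf.set("dfs.ha.automatic-failover.enabled", "true");
        conf.set("ha.zookeeper.quorum",
                 "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");

        // Clients then address the nameservice, not an individual NameNode.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        System.out.println("HDFS URI: " + conf.get("fs.defaultFS"));
    }
}
```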

Vendors are also coming to market with fixes such as a NameNode failover mode in HDFS, as

well as file system alternatives that do not use a NameNode function (that means no

NameNode to fail).

There are some disadvantages of triple-mirror server-based replication used by HDFS for

ingestion and redistribution of data:

• Inefficient storage utilization; for example, three TBs of raw capacity are required to store just one usable terabyte (TB) of data.

• Server-based triple replication creates a significant load on the servers themselves.

• Server-based triple replication creates a significant load on the network.

Vendors’ solutions (EMC Isilon, NetApp E-series, and FAS) that address this are described

later.

DAS
  Pros: Writes are highly parallel and tuned for Hadoop jobs; the JobTracker tries to make local reads.
  Cons: High replication cost compared with that in shared storage. The NameNode keeping track of data location is still a SPOF (failover, introduced in Hadoop 2.0, can also be addressed in dispersed storage solutions such as Cleversafe).

SAN
  Pros: Array capabilities (redundancy, replication, dynamic tiering, virtual provisioning) can be leveraged. As the storage is shared, a new node can easily be assigned to a failed node's data. Centralized management. Using shared storage eliminates or reduces the need for three-way data replication between data nodes.
  Cons: Cost; limited scalability of scale-up storage arrays.

Distributed File System / Scale-out NAS
  Pros: Shared data access; POSIX-compatible and works for non-Hadoop applications just as a local file system; centralized management and administration.
  Cons: While HDFS is highly optimized for Hadoop, a general Distributed File System (DFS) is not likely to get the same level of optimization. Strict POSIX compliance leads to unnecessary serialization. Scaling limitations, as some DFSs are not designed for thousands of nodes.

Table 2: Hadoop Storage Options

A tightly coupled Distributed File System (DFS) for Hadoop is a general purpose shared file

system implemented in the kernel with a single namespace.26 Locality awareness is part of the DFS, so there is no need for a NameNode. Compute nodes may or may not have local storage.

Remote storage is accessed using a file system-specific internode protocol. If DFS uses local

disks, compute nodes are part of DFS with data spread across nodes.

3.2 Enterprise-class Hadoop Distributions

Several vendors now offer Hadoop distributions as enterprise-class products packaged with

maintenance and technical support options. The goals of commercial distributions of Hadoop

are to address Hadoop challenges such as inefficient data staging and loading processes and

lack of multi-tenancy, backup, and DR capabilities. Vendors providing Hadoop distributions are

evaluated in a recent Forrester Review.28 Some of the vendors are EDW vendors, such as EMC

Greenplum, IBM, Microsoft, and Oracle, which are modifying their products to support Hadoop.

For example, the Hadoop distribution called Pivotal HD is based on Hadoop 2.0 and integrates

the Greenplum database with Apache Hadoop.29 This integration reduces the need of data

movement for processing and analysis. Pivotal value-add components include advanced

database services (HAWQ) – a high-performance, “True SQL” query interface running within the Hadoop cluster and an Extensions Framework providing support for HAWQ interfaces to external data providers (HBase, Avro, etc.) – and advanced analytics functions (MADlib). Pivotal HD is

available as a software-only or appliance-based solution.

Pivotal HD provides Unified Storage Service (USS), enabling user access to data residing on

multiple platforms without data copying. USS is a "pseudo" Hadoop File System (HDFS) that

delegates file system operations directed at it to other file systems in an "HDFS-like" way. Using

USS, users do not need to copy data from the underlying storage system to HDFS to process

the data using Hadoop framework, significantly reducing time and operational costs. Large

organizations typically have multiple data sets residing on various storage systems. As moving

this data to a central Data Lake environment would be time consuming and costly, USS can be

used to provide a unified view of underlying storage systems for Big Data analytics.

Figure 4: Pivotal HD Architecture (Ref. 29)

A growing number of vendors (both systems and storage vendors) are incorporating Hadoop into preconfigured products and offering them as appliances – EMC Greenplum HD (a bundle that

combines MapR’s version of Hadoop, the Greenplum database and a standard x86 based

server), Pivotal HD (discussed above), the Dell/Cloudera Solution (which combines Dell

PowerEdge C2100 servers and PowerConnect switches with Cloudera’s Hadoop distribution

and its Cloudera Enterprise management tools), and Pentaho Data Integration 4.2.

EMC Greenplum is the first EDW vendor to provide a full-featured enterprise-grade Hadoop

appliance and offer an appliance platform that integrates its Hadoop, EDW, and data integration

in a single rack. These solutions provide an easier way for users to benefit from Hadoop-based

analytics without the in-house integration development that was required in early Hadoop

implementations. For example, Cisco offers a comprehensive solution stack: the Cisco UCS

Common Platform Architecture (CPA) for Big Data includes compute, storage, connectivity, and

unified management.30

3.3 Big Data Storage and Security

Regulation and compliance are also important considerations for Big Data. The first versions of

Hadoop offered limited (if any) ways to respond to corporate security and data governance

policies. The security model of Hadoop has been improving through development of Apache

projects such as Apache Accumulo and the release of “security-enhanced” distributions of

Hadoop by vendors (for example, Cloudera Sentry and the Intel secure Hadoop distribution, which uses Intel Expressway API Manager as a security gateway enforcement point for all

REST Hadoop APIs). The release of the 2.x distributions of Hadoop addresses many security

issues including security enhancements for HDFS (enforcement of HDFS file permissions). This

article considers the storage-related aspects of the Hadoop security model, namely encryption

of data at rest. Readers interested in other security aspects of Hadoop can find many reviews,

for example Refs. 31 and 32.

Hadoop 2.0 does not include encryption for data at rest on HDFS. If encryption for data on

Hadoop clusters is required, there are two options: using third-party tools for implementing

HDFS disk-level encryption or security-enhanced Hadoop distributions, such as Intel

distribution. The Intel Distribution for Apache Hadoop software33 is optimized for Intel Advanced

Encryption Standard New Instructions (Intel AES-NI), a technology that is built into Intel Xeon

processors. The encryption can apply transparently to users at a file-level granularity and be

integrated with external standards-based key management applications.

Following best practices for data security, sensitive files must be encrypted by external security

applications before they arrive at the Apache Hadoop cluster and are loaded into HDFS. Each

file must arrive with the corresponding encryption key. If files were encrypted only after arrival,

they would reside on the cluster in their unencrypted form, which would create vulnerabilities.

When an encrypted file enters the Apache Hadoop environment, it remains encrypted in HDFS.

It is then decrypted as needed for processing and re-encrypted before it is moved back into

storage. The results of the analysis are also encrypted, including intermediate results. Data and

analysis results are neither stored nor transmitted in unencrypted form.33
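A hedged sketch of that "encrypt before loading" flow is shown below, using the standard Java crypto API for AES and the HDFS client to stage only the ciphertext; the file names, key handling, and target directory are placeholders, and a real deployment would use an external key management system and an authenticated cipher mode as described above.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EncryptThenLoadSketch {
    public static void main(String[] args) throws Exception {
        // 1. Encrypt the sensitive file locally, before it reaches the cluster.
        //    In practice the key would come from an external key manager, not be
        //    generated ad hoc, and a mode such as AES/GCM would be preferred.
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        Cipher cipher = Cipher.getInstance("AES");
        cipher.init(Cipher.ENCRYPT_MODE, key);

        try (FileInputStream in = new FileInputStream("/data/staging/patients.csv");
             CipherOutputStream out = new CipherOutputStream(
                     new FileOutputStream("/data/staging/patients.csv.aes"), cipher)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) > 0) {
                out.write(buffer, 0, n);
            }
        }

        // 2. Load only the ciphertext into HDFS; the plaintext never lands there.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/data/staging/patients.csv.aes"),
                             new Path("/secure/ingest/patients.csv.aes"));
        fs.close();
    }
}
```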

In 2013, Intel launched an open source effort called Project Rhino to improve the security

capabilities of Hadoop and the Hadoop ecosystem and contributed code to Apache.34 The

objective of Project Rhino is to take a holistic hardening approach to the Hadoop framework,

with consistent concepts and security capabilities across projects. In order to achieve this, a

splittable AES codec implementation is introduced to Hadoop, allowing distributed data to be

encrypted and decrypted from disk. The key distribution and management framework will make

it possible for MapReduce jobs to perform encryption and decryption.

3.4 EMC Isilon Storage for Big Data

Isilon is an enterprise storage system that can natively integrate with HDFS.35 The solution

uses shared scaled-out NAS storage as a large repository of Hadoop data for data protection,

archive, security, and data governance purposes. EMC Isilon storage is managed by intelligent

software to scale data across vast quantities of commodity hardware, enabling explosive growth

in performance and capacity. While HDFS creates a 3x replica for redundancy, Isilon OneFS®

dramatically reduces the need for a three-way copy. Every node in the cluster is connected to

the internal InfiniBand network. Clients connect using standard protocols like NFS, CIFS, FTP,

and HTTP over the front-end network, which is either 1 or 10 Gb/s Ethernet. OneFS uses the

internal InfiniBand network to allocate and stripe data across all nodes in the cluster

automatically. As OneFS distributes the Hadoop NameNode to provide high-availability and load

balancing, it eliminates the single point of failure (see Table 3). Isilon storage provides a single

file system/single volume scalable up to 15 PB.35 Data can also be staged from other protocols

to HDFS by using OneFS as a staging gateway. Integration with EMC ViPR® Software-Defined

Storage, offering access to object storage APIs from Amazon S3, EMC Atmos and others,

enables leveraging cloud-based applications and workflows.
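From the Hadoop client side, using Isilon as the HDFS layer is largely a matter of pointing the cluster's default file system at the Isilon SmartConnect zone instead of a dedicated NameNode. A hedged sketch of that setting follows; the zone name and port are placeholders, and the exact procedure should be taken from the EMC Isilon Hadoop documentation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IsilonHdfsClientSketch {
    public static void main(String[] args) throws Exception {
        // Normally set in core-site.xml; shown programmatically for illustration.
        Configuration conf = new Configuration();

        // The SmartConnect zone name load-balances HDFS protocol requests
        // across the Isilon nodes, so there is no single NameNode to address.
        conf.set("fs.defaultFS", "hdfs://isilon-smartconnect.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}
```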

The benefits of using Isilon for Big Data are presented in the table below.

Hadoop/DAS Challenge: Dedicated storage infrastructure for Hadoop only.
Isilon Solution: Scale-out storage platform; multiple applications and workflows.

Hadoop/DAS Challenge: Single point of failure (NameNode failover added only in Hadoop 2.0).
Isilon Solution: No single point of failure – distributed namespace.

Hadoop/DAS Challenge: Lack of enterprise-class data protection – no snapshots, backup, or replication.
Isilon Solution: End-to-end data protection – SnapshotIQ, SyncIQ, NDMP backup.

Hadoop/DAS Challenge: Poor storage efficiency – three-way mirroring.
Isilon Solution: Storage efficiency, >80% storage utilization.

Hadoop/DAS Challenge: Manual import/export.
Isilon Solution: Multi-protocol support; industry-standard protocols: NFS, CIFS, FTP, HTTP, HDFS.

Hadoop/DAS Challenge: Fixed scalability – rigid compute-to-storage ratio.
Isilon Solution: Independent scalability – decoupled compute and storage; add compute and storage independently.

Table 3: Benefits of Using Isilon for Hadoop

Isilon provides multi-tenancy in the Hadoop environment:

• One directory within OneFS per tenant, one subdirectory per data scientist

• Access controlled by group and user rights

• Leveraging SmartQuotas to set resource limits and report usage

3.5 EMC Greenplum Distributed Computing Appliance (DCA)

Combining Isilon and Greenplum HD provides the best of both worlds for Big Data Analytics.

Greenplum Database (GPDB) with Hadoop delivers a solution for the analytics of structured,

semi-structured, and unstructured data.36 The Greenplum DCA is a massively parallel

architecture and the GPDB is a scalable analytic database. It features “shared nothing,” in

contrast to Oracle and DB2. Operations are extremely simple – once data is loaded,

Greenplum’s automated parallelization and tuning provide the rest; no partitioning is required.

To scale, simply add nodes (Greenplum DCA fully leverages the industry standard x86

platform); storage, performance, and load bandwidth are managed entirely in software.

Users can perform complex, high-speed, interactive analytics using GPDB, as well as stream

the data directly from Hadoop into GPDB to incorporate unstructured or semi-structured data in

the above analyses within GPDB. Hadoop also can be used to transform unstructured and

semi-structured data into a structured format that can then be fed into GPDB for high speed,

interactive querying.31

3.6 NetApp Storage for Hadoop

The NetApp Open Solution for Hadoop preserves the shared nothing architectural model. It

provides DAS storage in the form of a NetApp E-series array (for example, E2660) to each Data

Node within the Hadoop cluster.37 Compute and storage resources are decoupled with SAS-

attached NetApp E2660 arrays and the recoverability of a failed Hadoop NameNode is

improved with an NFS-attached FAS2040. The E2660 array is configured as four volumes of

DAS so that each Data Node has its own non‐shared set of disks and each Data Node “sees”

only its share of disk. The FAS2040 is used as storage for the NameNode, mitigating loss of

cluster metadata due to NameNode failure. It functions as a single, unified repository of cluster

metadata that supports faster recovery from disk failure. It also serves as a repository for other

cluster software including scripts. Instead of three-way data mirroring consuming storage

capacity and network bandwidth, data is mirrored to a direct attached NetApp E2660 array via 6

Gb/s SAS connections.

3.7 Object-Based Storage for Big Data

3.7.1 Why Is Object-based Storage for Big Data Gaining Popularity?

The challenge of managing traditional block-based storage for Big Data has sparked many organizations' interest in object storage, which can use the same types of hardware systems as

the traditional approach but stores data as objects that are self-contained groups of logically

related data.

While block-based storage stores data in groups of blocks with only a minimal amount of metadata kept with the content, object-based storage stores data as an object with a unique

global identifier (128-bit Universally Unique ID [UUID]) that is used for data access or retrieval.

The Object-based Storage Device (OSD) is a new disk interface technology being standardized

by the ANSI T10 technical committee (Fig. 5). Metadata that includes everything needed to

manage content is attached to the primary data and is stored contiguously with the object. The

object can be any unstructured data, file, or group of files, for example, audio, documents,

email, images, and video files. By combining metadata with content, objects are never locked to

a physical location on a disk, enabling automation and massive scalability required for cloud and

Big Data solutions. Incorporation of metadata into objects simplifies the use of data

management (preservation, retention, and deletion) policies and, therefore, reduces the

management overhead. To applications, all of the information appears as one big pool of data.

With a flat address-space design, there is no need to use file systems (file systems have an

average overhead of 25%) or to manage LUNs and RAID groups. Access to object-based

storage is provided using web protocols such as REST and SOAP. Object-based systems

typically secure information via Kerberos, Simple Authentication and Security Layer, or some

other Lightweight Directory Access Protocol-based authentication mechanism.
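The sketch below illustrates the REST access style in plain Java: an object is stored with an HTTP PUT and retrieved with a GET. The endpoint, URL scheme, and metadata header are purely illustrative, since each platform (Atmos, S3, CDMI, Swift) defines its own namespace, headers, and authentication, which a real client would have to supply.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class ObjectStorageRestSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and object path; real systems also require
        // authentication headers (for example, a signed Authorization header).
        URL objectUrl = new URL(
                "https://objectstore.example.com/rest/namespace/logs/2014/01/clicks.json");

        // PUT: create the object together with illustrative user metadata.
        HttpURLConnection put = (HttpURLConnection) objectUrl.openConnection();
        put.setRequestMethod("PUT");
        put.setDoOutput(true);
        put.setRequestProperty("Content-Type", "application/json");
        put.setRequestProperty("x-meta-retention", "7-years"); // hypothetical metadata header
        try (OutputStream body = put.getOutputStream()) {
            body.write("{\"user\":42,\"url\":\"/home\"}".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("PUT status: " + put.getResponseCode());

        // GET: retrieve the object by the same flat, location-independent identifier.
        HttpURLConnection get = (HttpURLConnection) objectUrl.openConnection();
        get.setRequestMethod("GET");
        try (Scanner scanner = new Scanner(get.getInputStream(), "UTF-8")) {
            System.out.println("GET body: " + scanner.useDelimiter("\\A").next());
        }
    }
}
```

The key point is that the application addresses the object by a flat identifier plus metadata, never by a LUN, file system path, or physical location.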

Object-based storage was brought to the market first as content-addressed storage (CAS)

systems such as EMC Centera®. The main goal has been to provide regulatory compliance for

archived data. Cloud-oriented object-based systems appeared in 2009 and have become the

next generation of object-based storage. These cloud-oriented systems have to support data

transfer across wide geographical areas, such as global content distribution of primary storage,

and function as low-cost storage for backup and archiving in the cloud.

The advantages of using object-based storage in cloud infrastructures are exemplified in

solutions offered by the biggest cloud service providers such as Amazon and Google since

object-based storage can simplify many cloud operations.

Figure 5: Block-Based vs. Object-Based Storage Models (from Ref. 38)

Cleversafe announced plans to build the Dispersed Compute Storage solution by combining the

power of Hadoop MapReduce with Cleversafe’s Dispersed Storage System.39 The object-based

Dispersed Storage systems will be able to capture data at 1 TB per second at exabyte capacity.

Combining MapReduce with the Dispersed Storage Network system on the same platform and

replacing HDFS, which relies on three copies to protect data, will significantly improve

reliability and enable analytics at a scale previously unattainable through traditional HDFS

configurations.

The DataDirect Networks (DDN) object storage solution is built on a software and hardware stack

that has been tuned for on-premise cloud/Big Data storage, self-service data access, and

reduced time to insight. DDN Web Object Scaler (WOS) software (WOS 3.0) and WOS7000

storage appliance can federate a group of WOS clusters to achieve management of up to 983

petabytes and 32 trillion unique objects.40 Such an implementation requires 32 federated WOS

clusters, each of which supports up to 1 trillion objects and is made up of 256 WOS object

storage servers. According to DDN,40 the platform can retrieve 256 million objects per second

with sub-50 millisecond latency and achieve throughput performance of 10 TB per second. A

fully populated rack of preconfigured WOS 3.0-powered WOS7000 appliances offers 2.5 PB of

storage capacity.

3.7.2 EMC Atmos

Atmos®, EMC’s software solution for object-based storage, uses the Infiniflex hardware platform to

deliver cloud and Big Data storage services.41 It is the first multi-petabyte information

management solution designed to help automatically manage and optimize the delivery of rich,

unstructured information across large-scale, global cloud storage environments. For example,

eBay used EMC Atmos to manage over 500 million objects per day.42 New applications can be

introduced to the Atmos cloud without having to specifically tie them to storage systems or

locations.

At EMC World 2012, EMC announced a suite of enhancements to the EMC Atmos Cloud

platform that transforms how service providers and enterprises manage Big Data in large,

globally distributed cloud storage environments. EMC also announced new Atmos Cloud

Accelerators that make it even easier and faster to move data in and out of Atmos-powered

clouds.41 As a platform, Atmos version 2 offers several other features that tie cloud-based

storage to Big Data and application use. The Big Data paradigm is supported by the scalability

of the Atmos platform, on which sites can be added to increase storage relatively simply. Atmos

GeoDrive further enables support for Big Data and data resilience by providing access to a

storage cloud instantly from any Microsoft Windows desktop or server anywhere, without writing

a single line of code. Atmos 2.1.4, announced in September 2013, further extends features such

as GeoParity, age-based Policy Management, S3 API support, and a host of cloud-delivered

services with new capabilities.

Other vendors of object-based storage products include Amazon (S3), NetApp (StorageGRID),

Dell (DX Object Storage Platform), HDS (HCP), Caringo (CAStor), Cleversafe (Dispersed

Storage), DataDirect Networks (Web Object Scaler) (both Cleversafe Dispersed Storage and

DataDirect Networks WOS are reviewed in Section 3.7.1), NEC (Hydrastor), and Amplidata

(AmpliStor).

3.8 Fabric Storage for Big Data: SAN Functionality at DAS Pricing

Scale-out fabric storage offered by AMD SeaMicro as a Big Data storage solution provides

massive scale-out capacity with commodity drives.43 Decoupling storage from compute and

network to grow storage independently enables moving from DAS, with a rigid storage-to-

compute ratio, to flexible scale-out fabric storage up to 5 PB and creating pools of compute,

storage, and network I/O that can be the right size for an Apache Hadoop deployment.

According to AMD SeaMicro43, the SM15000 Server Platform optimized for Big Data and cloud

can reduce power dissipation by half and supply SAN functionality at DAS pricing by coupling

data storage through a "Freedom Fabric" switch that removes the constraints of traditional

servers. Unlike the industry standard model, where disk storage is located remotely from

processing nodes, SeaMicro has worked out a networking switched fabric that connects servers

to the “in rack” disk drives and is extensible beyond the SM15000 rack frame, allowing

construction of cumulatively very large systems.

3.9 Virtualization of Hadoop

VMware vSphere Big Data Extensions (BDE 1.0), announced in September 2013, allow

vSphere 5.1 or later to manage Hadoop clusters.44 BDE development has been enabled by

Serengeti, an open source project initiated by VMware to automate deployment and

management of Apache Hadoop clusters on virtualized environments such as vSphere. BDE is

a downloadable virtual appliance with a plug-in for vCenter Server, and its deployment is simple: download the OVA and import it into the existing vSphere environment.

The Serengeti virtual appliance includes two virtual machines: the Serengeti Management

Server and the Hadoop Template Server. The creation of a Hadoop cluster, including creation

and configuration of the virtual machines, is managed by the Serengeti Management Server.

The Hadoop Template virtual machine installs the Hadoop distribution software and configures

the Hadoop parameters based on the specified cluster configuration settings. Once the Hadoop

cluster creation is complete, the Serengeti Management Server starts the Hadoop service. BDE

is controlled and monitored through the vCenter server. By default, the basic Apache

Foundation distribution of Hadoop is included, but VMware BDE also supports other major

Hadoop distributions including Cloudera, Pivotal HD, Hortonworks, and MapR.
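
To make the cluster configuration settings concrete, the following is a minimal sketch of a Serengeti-style cluster specification, written here as a Python dictionary; the group names, sizes, and field names are illustrative assumptions rather than an exact Serengeti schema.

    # Minimal sketch of a cluster specification such as the Serengeti Management
    # Server might consume; field names and values are illustrative assumptions.
    import json

    cluster_spec = {
        "name": "analytics-cluster",      # hypothetical cluster name
        "distro": "apache",               # Apache Hadoop distribution bundled with BDE
        "nodeGroups": [
            {"name": "master", "roles": ["namenode", "jobtracker"],
             "instanceNum": 1, "cpuNum": 4, "memCapacityMB": 16384,
             "storage": {"type": "SHARED", "sizeGB": 100}},
            {"name": "worker", "roles": ["datanode", "tasktracker"],
             "instanceNum": 8, "cpuNum": 2, "memCapacityMB": 8192,
             "storage": {"type": "LOCAL", "sizeGB": 500}},
        ],
    }

    # The specification would typically be serialized to JSON before being
    # handed to the management tooling.
    print(json.dumps(cluster_spec, indent=2))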


Hadoop virtualization dramatically accelerates adoption of Big Data Analytics by making it affordable for companies of various sizes. Virtualized Hadoop clusters can be provisioned on demand and elastically expanded or shrunk using the service catalog. Benchmark results show that virtualized Hadoop performs comparably to a physical configuration.45

By decoupling Hadoop nodes from the underlying physical infrastructure, VMware can bring the

benefits of cloud infrastructure7 – rapid deployment, high availability, optimal resource utilization,

elasticity, and secure multi-tenancy – to Hadoop. VMware defines the Big Data Extensions core

value propositions44 as:

1. Operational Simplicity with Performance: BDE automates configuration and deployment

of Hadoop clusters, and IT departments can provide self-service tools. As Hadoop

deployment requires configuration of multiple cluster nodes, vSphere tools and

capabilities such as cloning, templates, and resource allocation significantly simplify and

accelerate deployment of Hadoop. The integration with vCloud Automation Center can

be used to create Hadoop-as-a-Service, enabling users to select pre-configured templates and customize them according to their requirements.

2. Maximization of Resource Utilization on New or Existing Hardware: Big Data Extensions

enable IT departments to lower the total cost of ownership (TCO) by maximizing

resource utilization on new or existing infrastructure. Virtualizing Hadoop can improve

data center efficiency by increasing the types of mixed workloads that can be run on a

virtualized infrastructure. This includes running different versions of Hadoop itself on the

same cluster or running Hadoop along with other customer applications, forming an

elastic environment. Shared resources lead to higher consolidation ratios that result in cost savings, as less hardware, software, and infrastructure is required to run a given set of business applications. To facilitate elasticity, BDE can automatically scale the number of

compute virtual machines in a Hadoop cluster based on contention from other workloads

running on the same shared physical infrastructure. Compute virtual machines are

added to or removed from the Hadoop cluster as needed, giving Hadoop the best performance when required and making resources available to other applications or Hadoop clusters at other times. Isolating different tenants running Hadoop in separate VMs

provides stronger resource and security isolation for multi-tenancy. Multi-tenancy can be

provided by deploying separate compute clusters for different tenants sharing HDFS.

Users can run mixed workloads simultaneously on a single physical host. Additional


efficiency can be achieved by running Hadoop and non-Hadoop applications on the

same physical cluster.

3. Architect Scalable and Flexible Big Data Platforms: Big Data Extensions is designed to

support multiple Hadoop distributions and hardware architectures.

To provide data-locality or “rack-awareness” (see Section 3.1) to virtualized Hadoop clusters,

VMware has contributed the Hadoop Virtualization Extensions (HVE) to Apache Hadoop 1.2. HVEs

help Hadoop nodes become "data locality"-aware in a virtual environment. Data locality

knowledge is important to keep compute tasks close to the required data. Native Hadoop understands data locality at the node and rack level; with the extensions, Hadoop becomes "virtualization aware" through the concept of "node groups", which correspond to the set of Hadoop virtual nodes running on each physical hypervisor host. Pivotal HD is the first Hadoop

distribution to include HVE plug-ins, enabling easy deployment of virtualized Hadoop.
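
As a rough illustration of how locality information reaches Hadoop, the following is a minimal sketch of a topology script of the kind Hadoop invokes through its net.topology.script.file.name setting; with HVE-style node groups, the returned path gains an extra level (rack plus node group) so that tasks prefer data on virtual machines that share a hypervisor. The host-to-location mapping shown is entirely hypothetical.

    #!/usr/bin/env python
    # Hypothetical Hadoop topology script: Hadoop passes host names or IPs as
    # arguments and expects one network path per argument on stdout.
    # With node-group (HVE-style) awareness the path carries an extra level,
    # /<rack>/<nodegroup>, where a node group maps to one hypervisor host.
    import sys

    # Illustrative mapping; in practice this would be generated from inventory data.
    LOCATIONS = {
        "10.0.1.11": "/rack1/esx-host-a",
        "10.0.1.12": "/rack1/esx-host-a",
        "10.0.1.21": "/rack1/esx-host-b",
        "10.0.2.11": "/rack2/esx-host-c",
    }

    DEFAULT = "/default-rack/default-nodegroup"

    if __name__ == "__main__":
        for host in sys.argv[1:]:
            print(LOCATIONS.get(host, DEFAULT))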

Figure 6: Scenarios of Virtual Hadoop Deployment (Ref. 46).

An interesting option is to run compute nodes and data storage nodes as separate VMs to

support orthogonal scaling and optimal usage of each resource (Fig. 6). Another option is to

leverage SAN storage. Extending the concept of data-compute separation, multiple tenants can

be accommodated on the virtualized Hadoop cluster by running multiple Hadoop compute

clusters against the same data service.46


4. Cloud Computing and Big Data

Large data volumes, along with variety of data types and complexity, are features that Big Data shares with cloud storage.7 Storage architectures are therefore the place where Big Data meets cloud data.

The use of cloud infrastructure for Big Data applications faces many challenges, the first of which are performance and data transport. Adoption of cloud-based solutions for Big Data demands technologies for moving data into and out of the cloud. How much data needs to be

moved, and at what cost? Moving large volumes of data to and from the cloud may be cost-

prohibitive. Real-time data requires enormous resources to manage, and data that streams

nonstop may be better processed locally.
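
A back-of-the-envelope calculation makes the transport problem concrete; the figures below (100 TB over a 1 Gb/s link with 60% effective utilization) are assumptions chosen purely for illustration.

    # Rough estimate of bulk-transfer time into a cloud over a WAN link.
    # All inputs are illustrative assumptions.
    data_tb = 100                    # data set size in terabytes
    link_gbps = 1.0                  # nominal link speed in gigabits per second
    efficiency = 0.6                 # assumed effective utilization (TCP over a long WAN)

    data_bits = data_tb * 1e12 * 8   # terabytes -> bits (decimal TB)
    seconds = data_bits / (link_gbps * 1e9 * efficiency)
    days = seconds / 86400
    print(f"~{days:.1f} days to move {data_tb} TB at {link_gbps} Gb/s "
          f"with {efficiency:.0%} effective utilization")
    # Roughly 15 days in this example, which is why providers suggest shipping
    # portable storage for the initial load and why WAN-optimized protocols matter.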

Cloud providers of Big Data services, such as Amazon and Rackspace, suggest that customers ship data on portable storage devices for the initial bulk transfer, followed by data synchronization via storage gateways. This approach has its own challenges, as shipments can be delayed and storage devices can be damaged or lost in transit.

These challenges have led to development of new technologies for Big Data transport. For

example, Aspera has developed its fasp™ transfer technology to offer a suite of On Demand Transfer Products that addresses both the technical problems of the WAN and the cloud I/O bottleneck, delivering efficient transfer of large files, or large collections of files, into and out of the cloud.47

According to Aspera, file transfer times can be guaranteed regardless of the network distance

and conditions, including transfers over satellite, wireless, and unreliable long-distance

international links. Security is built-in, including secure endpoint authentication, on-the-fly data

encryption, and integrity verification.

5. Big Data Backups

5.1 Challenges of Big Data Backups and How They Can Be Addressed

The challenges of providing storage for growing volumes of Big Data may overshadow those related to Big Data protection and recovery. However, data protection should be a key component of the enterprise strategy for Big Data lifecycle management.48

The needs for Big Data backups can be categorized based on the Big Data definition:

Velocity: data needs to be protected quickly

Volume: data requires deduplication to be protected efficiently


Variety: both structured and unstructured files need to be protected

The huge data volumes typical of Big Data are the first challenge for Big Data backup solutions. Petabyte-size datastores do not allow backups to complete within accepted backup windows. Furthermore, traditional backup is not designed to handle millions of small

files, which can be common for Big Data environments. The challenge becomes manageable if

we understand that not all Big Data information may need to be backed up. We have to review

which part of these data volumes really needs to be backed up. If data can be easily

regenerated from another system that is already being backed up, there is no need to back up

these data sets at all. When we compare the cost of protecting data with the cost of regenerating it, we may find that, in many instances, the source data needs to be protected while post-processed data is less expensive to reproduce by rerunning the process than to protect.
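
A simple worked comparison, using entirely assumed numbers, shows how this trade-off can be evaluated:

    # Compare the ongoing cost of protecting post-processed data with the
    # one-off cost of regenerating it from protected source data.
    # All numbers are illustrative assumptions.
    derived_tb = 200                    # size of the post-processed (derived) data
    backup_cost_per_tb_year = 150.0     # assumed cost to keep 1 TB protected per year
    retention_years = 3
    regenerate_compute_hours = 500      # assumed cluster hours to rerun the pipeline
    cost_per_compute_hour = 2.0

    protect_cost = derived_tb * backup_cost_per_tb_year * retention_years
    regenerate_cost = regenerate_compute_hours * cost_per_compute_hour

    print(f"Protecting the derived data: ${protect_cost:,.0f}")
    print(f"Regenerating it on demand:   ${regenerate_cost:,.0f}")
    # 90,000 vs. 1,000 in this example: protecting only the source data and
    # rerunning the pipeline when needed can be far cheaper.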

Therefore, the real problem is backup of the unique data that cannot be recreated. This is often

machine-generated data coming from devices or sensors (for example, the Internet of Things

discussed earlier). It is essentially point-in-time data that cannot be regenerated. Because data is often copied within the Big Data environment so that it can be analyzed safely, some redundancy results. Data deduplication therefore becomes critical to eliminate redundancy and compress much of the data, optimizing backup capacity. Since the Hadoop file system appends data rather than updating or deleting it, applying deduplication achieves large storage savings.
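
The following minimal sketch illustrates the general idea of content-based deduplication, in which identical chunks are stored once and referenced by their hash; it is a conceptual example, not how any particular product implements deduplication.

    import hashlib

    CHUNK_SIZE = 4 * 1024 * 1024   # 4 MiB fixed-size chunks; real products vary

    def dedup_store(streams):
        """Store unique chunks once; return per-stream chunk recipes and the dedup ratio."""
        store = {}                  # hash -> chunk bytes (the deduplicated pool)
        recipes = []                # one list of chunk hashes per input stream
        for data in streams:
            recipe = []
            for i in range(0, len(data), CHUNK_SIZE):
                chunk = data[i:i + CHUNK_SIZE]
                digest = hashlib.sha256(chunk).hexdigest()
                store.setdefault(digest, chunk)   # written only if unseen
                recipe.append(digest)
            recipes.append(recipe)
        logical = sum(len(d) for d in streams)
        physical = sum(len(c) for c in store.values())
        return recipes, logical / max(physical, 1)

    # Two copies of the same data set (a common situation when data is copied
    # for analysis) deduplicate to a ~2x ratio in this toy example.
    dataset = b"".join(bytes([i]) * CHUNK_SIZE for i in range(5))  # 5 distinct chunks
    _, ratio = dedup_store([dataset, dataset])
    print(f"deduplication ratio ~{ratio:.1f}x")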

As mentioned, many Big Data environments consist of millions or even billions of small files. By

using backup software products such as Symantec NetBackup Accelerator,49 a very large file

system with millions or billions of files can be fully backed up within the amount of time required

for an incremental backup. NetBackup Accelerator uses change tracking to reduce the file

system overhead associated with traversing a large file system, identifying and accessing only

changed data. An optimized synthetic full backup is created and catalogued inline, providing full

restore capabilities and shortened Recovery Time Objective (RTO).
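
The general idea behind change tracking and synthetic full backups can be sketched as follows; this is a conceptual illustration under assumed data structures, not the NetBackup Accelerator implementation.

    import time

    def synthetic_full(previous_catalog, change_journal, backup_segment):
        """Build a synthetic full backup catalog without rescanning the file system.

        previous_catalog: dict mapping path -> location of its data in older backups
        change_journal:   iterable of (path, operation) records collected by a
                          change-tracking mechanism since the last backup
        backup_segment:   identifier of the segment where newly read data is written
        """
        catalog = dict(previous_catalog)
        for path, op in change_journal:
            if op == "deleted":
                catalog.pop(path, None)
            else:                      # created or modified: only this data is read
                catalog[path] = f"{backup_segment}:{path}"
        return catalog

    # Illustrative use: only two files changed out of 100,000, so only those are
    # read from primary storage; the rest is referenced from earlier backups.
    previous = {f"/data/file{i}": "full-0001" for i in range(100_000)}
    journal = [("/data/file42", "modified"), ("/data/new.log", "created")]
    new_full = synthetic_full(previous, journal, f"incr-{int(time.time())}")
    print(len(new_full), "entries in the synthetic full catalog")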

One more issue Big Data backup systems face is scanning the file systems each time the

backup solutions start their jobs. For file systems in Big Data environments, this can be time-

consuming. One of the solutions for the scanning issue is the OnePass feature developed by

Commvault for its Simpana data protection software.50 The OnePass feature is a converged process


for backup, archive, and reporting from a single data collection. The created single catalog is

then shared, accessed, and used by each of Simpana software's archiving, backup, and

reporting modules.

Which storage media should be used for Big Data backups: disks or tapes? In most cases, the answer is both. Deduplicated data can be stored on low-cost, high-capacity disks (for

example, on Data Domain® appliances, discussed later) as near-term data sets that are not

being analyzed at that moment or on tape for long-term storage of less frequently accessed

data (to write deduplicated data to tape, the data should be “rehydrated” first and then written in

the original form). Many Big Data projects may not be cost effective if tape is not integrated into

the solution. Tapes last longer than disks. Physical lifetimes for digital magnetic tape are at least

10 to 20 years.51 Self-contained Information Retention Format (SIRF), discussed later, is a way

to keep data on tapes retrievable while transitioning to the future technologies. For disks, the

median lifespan is six years.52 Tape can be leveraged as part of the access tier through the use

of an Active Archive. Active archiving is the ability to combine a high-performance primary disk tier with a secondary disk tier and then tape to create a single, fully integrated access point (see Big Data Archive in the next section). The Active Archive software automatically moves data between the tiers based on access, or the movement can be pre-programmed into the application. In the Active Archive process, new data can be copied to disk and tape simultaneously, meaning that backups happen as data is received. An Active Archive also helps with a major restore: instead of having to restore the entire data set, only the data that is currently needed must be recovered. Tape libraries like those from Spectra Logic can leverage

the Active Archive technology, and Linear Tape File System (LTFS) for data transfer can

become a major part of the Big Data infrastructure.53
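
A minimal sketch of the kind of policy an Active Archive might apply is shown below; the tier names and age thresholds are assumptions chosen purely for illustration.

    from datetime import datetime, timedelta

    # Illustrative tier thresholds; real policies would be configurable.
    TIERS = [
        ("primary-disk",   timedelta(days=30)),
        ("secondary-disk", timedelta(days=180)),
        ("tape",           None),             # catch-all for everything older
    ]

    def select_tier(last_access: datetime, now: datetime) -> str:
        """Pick a storage tier based on how long ago the data was last accessed."""
        age = now - last_access
        for tier, limit in TIERS:
            if limit is None or age <= limit:
                return tier
        return TIERS[-1][0]

    now = datetime(2014, 1, 1)
    print(select_tier(datetime(2013, 12, 20), now))   # primary-disk
    print(select_tier(datetime(2013, 9, 1), now))     # secondary-disk
    print(select_tier(datetime(2012, 1, 1), now))     # tape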

5.2 EMC Data Domain as a Solution for Big Data Backups

EMC Data Domain deduplication storage systems are ideal for Big Data backups, as they

overcome the Big Data backup challenges48 (Table 4). Data Domain systems are purpose-built

backup appliances used in conjunction with EMC or any third-party backup application or native backup utility (Fig. 7). Regardless of the Big Data system (EMC Greenplum, EMC Isilon,

Teradata, Oracle Exadata, etc.), Data Domain systems offer advanced integration with effective

backup tools for that environment to provide fast backup and recovery.


Backup Challenge | Data Domain Solution
Data volume | High-speed inline deduplication
Performance | Up to 248 TB can be backed up in less than 8 hours (31 TB/hr)
Scale | Protects up to 65 PB of logical capacity in a single system
Data islands | Simultaneously supports NFS, CIFS, VTL, DD Boost, and NDMP
Integration with major backup software | Qualified with leading backup and archive applications

Table 4: Data Domain Solutions for Big Data Backup Challenges

Data Domain provides high-speed inline deduplication, leading to a 10 to 30x reduction in the backup storage required and enabling Big Data backups to complete within backup windows. For example, the DD990 offers up to 31 TB/hour ingest; it can back up 248 TB in less than 8 hours. In

addition, Data Domain systems can protect up to 65 PB of logical capacity in a single system.

Data Domain systems eliminate data islands by enabling backup of the entire environment

(using NFS, CIFS, VTL, DD Boost, and/or NDMP) to a single Data Domain system.

Figure 7: Big Data Backup Using Data Domain (Ref. 54)


6. Big Data Retention

6.1 General Considerations for Big Data Archiving

6.1.1 Backup vs. Archive

Archiving and backup are sometimes considered very similar data retention solutions, as an

assumption is made that a backup can be a substitute for an archive. However, there are

significant differences between these two categories of data management. Backup is a data

protection solution for operational purposes, whereas data archiving objectives are information

retrieval, regulatory compliance, and data footprint reduction.

Backups are a secondary copy of data and are used for operational data recovery to restore

data that may have been lost, corrupted, or destroyed. Backup retention periods usually are

relatively short – days, weeks, months. Conversely, a data archive is a primary copy of

information and archived data are typically retained long term (years, decades, or forever) and

maintained for analysis or compliance as a managed repository. Archiving provides data

footprint reduction capabilities by deleting fixed content and duplicate data. An archive is used

to meet regulatory requirements by enforcing retention policies. Financial, healthcare, and other

industries can have archive retention periods of 10-15 years or even up to 100 years.

6.1.2 Why Is Archiving Needed for Big Data?

The reasons for Big Data archiving are similar to those for traditional data archiving and include

regulatory compliance (for example, X-rays are stored for periods of 75 years). Retention of 20

years or more is required by 70% of repositories.55

Many archiving technologies have advanced file deduplication and compression techniques,

which significantly reduce Big Data footprint. For example, a database archiving solution

supports moving and converting the inactive data from production databases to an optimized file

archive that is highly compressed. Archiving moves inactive data to a lower-cost infrastructure,

providing cost reduction through storage tiering. As a result, archiving solutions make it possible to keep data easily accessible without the need to locate it on tape and restore it.

6.1.3 Pre-requisites for Implementing Big Data Archiving

1. Data classification. Data should be classified according to the business value

(determined by the current data position in the Big Data lifecycle) and security

requirements.


2. Review to ensure that regulatory requirements do not prevent the use of data

deduplication techniques for streamlining both data and data access.

3. Storage tiering policies, which determine data access latency by placing data on different disk types and/or different storage systems, are developed based on the data classification. Because it leads to data footprint reduction, storage tiering reduces energy costs and data center raised-floor use.

6.1.4 Specifics of Big Data Archiving

The differentiating feature of Big Data retention is the need to continually reanalyze the same

machine-generated data sets. Data scientists need to identify patterns with timeframes of hours,

days, months, and years. Companies are not just retaining Big Data; they reuse it.

While the massively parallel processing (MPP) systems for Big Data analytics solutions are

designed to run complex large-scale analytics where performance is the prime objective, these

systems are not suitable targets for long-term retention of Big Data content. Big Data archiving

solutions should be cost effective. If it is cost prohibitive to retain the needed historical data or

too difficult to organize the data for timely ad hoc retrieval, companies would not be able to

extract value from their collected information. The key question is whether the current storage

environment can handle this new data explosion and the Big Data retention challenges resulting

from such data growth.

The primary data management challenge associated with Big Data is to ensure that the data is

retained (to satisfy compliance needs at the lowest possible costs) while also keeping up with

the unique and fast-evolving scaling requirements associated with new business analytics

efforts. Companies that achieve this balance will increase efficiency, reduce data storage cost,

and be in a far better position to capitalize on Big Data Analytics.

Security is a challenge for Big Data archiving, as traditional database management systems

support security policies that are quite granular, whereas Big Data applications generally have

no such security controls. Companies including any sensitive data in Big Data operations must

ensure that the data itself is secure and that the same data security policies that apply to the

data when it exists in databases or files are also enforced in the Big Data archives (see also

Section 3.3).


6.1.5 Archiving Solution Components

An archiving solution (for example, Symantec Enterprise Vault) typically includes:

Archiving software that automates the movement of data from primary storage to archival

storage based on policies established in the data classification and rationalization process.

Archive software can delete files at the end of their retention period.

E-discovery software that uses archiving software as a base to provide advanced search

features that enable users and administrators to quickly search all files, emails, texts, and other

data related to a specific topic for use in data mining services or in response to legal inquiries.

Some applications combine e-discovery and archive into purpose-built platforms such as an

email-archive solution or document and records management solutions.

Physical media for data archiving are hard disks and tapes. As the IDC survey56 shows, companies are increasing their use of disk-based storage for long-term data retention. However, the

growth of disk-based archives does not mean that tape-based archives are becoming relics.

There are still some financial and practical reasons to choose tape storage for Big Data

archives:

Longer media life expectancy

Low cost per TB over time

As data storage technologies change over time, how can we be sure that the archived data can

be retrieved 20-40 years from now? To address this challenge, the SNIA Long Term Retention

Workgroup has developed Self-contained Information Retention Format (SIRF).57 SIRF is a

logical data format of a storage container that is self-describing (can be interpreted by different

systems), self-contained (all data needed for the interpretation is in the container), and

extensible so it can meet future needs. Therefore, SIRF provides a way for collecting all the

information that will be needed for transition to new technologies in the future. Development of

SIRF serialization for Linear Tape File System (LTFS) makes it possible to provide economically

scalable containers for long-term retention of Big Data.
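
As a rough illustration of the self-describing idea behind SIRF (not the actual SIRF specification), a retention container can bundle preserved objects with a catalog recording their formats, integrity checksums, and retention metadata:

    import hashlib
    import json
    from datetime import date

    def build_container_catalog(objects):
        """Build a self-describing catalog for a retention container.

        `objects` maps a logical name to (payload bytes, media type).
        This mirrors the spirit of SIRF (self-describing, self-contained);
        the field names and format are hypothetical.
        """
        entries = []
        for name, (payload, media_type) in objects.items():
            entries.append({
                "name": name,
                "mediaType": media_type,                        # how to interpret the bytes
                "sha256": hashlib.sha256(payload).hexdigest(),  # integrity check
                "length": len(payload),
                "retainUntil": "indefinite",                    # illustrative policy field
            })
        return {
            "containerFormat": "example-retention-container/1.0",  # hypothetical
            "created": date.today().isoformat(),
            "objects": entries,
        }

    catalog = build_container_catalog({
        "sensor-2013.csv": (b"ts,value\n1,42\n", "text/csv"),
    })
    print(json.dumps(catalog, indent=2))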


6.1.6 Checklist for Selecting Big Data Archiving Solution

Big Data active archival storage solutions should address the following major requirements:

Ability to rapidly and intelligently move data from primary storage into the active archive system.

This ability ensures that the source application continues to run at maximum efficiency in terms

of performance and reliability.

Flexibility in data ingestion capability. The amount of data to be archived according to the

archiving policy can vary significantly from time to time, depending on the amount of activity. For

example, financial trade monitoring systems can experience very high levels of activity due to a

market event that in turn could trigger a sudden surge in the number of trades. The active

archive target should be able to manage such workload variations and be able to ingest data at

different rates as required.

Rapid, non-disruptive scalability of archival storage capacity and I/O performance. The solution

should be capable of scaling out by non-disruptively adding more units. The archive may outgrow an individual archive system module, but it should not outgrow the capacity and I/O performance of the archival platform. When the data size is in the range of hundreds of terabytes to multiple petabytes, migrating to a new platform should be a last-resort option.

Data portability for future technology changes. The selected solution should provide capability to

transition to new technologies developed in the future – see the above discussion of SIRF.

6.2 Big Data Archiving with EMC Isilon

EMC offers a Big Data archive solution that is based on the Isilon scale-out NAS platform and

meets the criteria for selecting archive solutions reviewed above. The solution can meet the

large-scale data retention needs of enterprises, reduce costs, and help customers comply with

governance or regulatory requirements.58 Isilon scale-out NAS delivers efficient utilization of

capacity, reducing the overall storage footprint and delivering significant savings for capital

expenditures (CapEx) and operating expenditures (OpEx) (see also Section 3.4). It provides

over 15 petabytes of capacity per cluster. The ability of Isilon NAS to scale quickly,

easily, and without disrupting users makes it an attractive platform for large-scale data archives

– see the discussion of the criteria for selecting archive solutions above.


The following features make Isilon an excellent archive solution for Big Data:

Performance: Isilon clusters provide instant access to large data archives with scalable

performance to over 100 GB/s of throughput. Automatic load-balancing of the archive

servers’ access across the cluster using SmartConnect maximizes performance and

utilization of cluster resources (CPU, memory, and network). SmartConnect

automatically balances incoming client connections across all available interfaces on the

Isilon storage cluster.

Automanagement and self-healing: Isilon clusters utilize an automatic management and provisioning capability that monitors system health and automatically corrects any failures.

Cost-efficiency based on storage tiering: Isilon clusters provide automigration between

working sets and archives using a policy-based approach available with Isilon

SmartPools software. SmartPools is tightly integrated with Isilon OneFS, so all data,

regardless of physical location, is in the same single file system. This means that

SmartPools data movements are completely transparent to the end user application,

removing management, backup, and other issues related to stub-based tiering

architectures such as those present in hierarchical storage management (HSM)

implementations.

Scalability: Data can be moved seamlessly and automatically as new nodes are

introduced or as capacity is added. This enables very long-term archiving without the

problems inherent in moving to new systems.

Flexibility: A single Isilon cluster supports the concurrent use of write-protected archival

data alongside online, active content. WORM and non-WORM data can be mixed in one

general-purpose system. Retention defaults can be set at the directory and file level.

SmartLock® software adds a layer of “write once, read many” (WORM) data protection

and security, which protects archived data against accidental, premature, or malicious

alteration or deletion.

Remote replication: Isilon SnapshotIQ™ and Isilon SyncIQ® can be leveraged to

efficiently replicate the archive among multiple remote sites for business continuity and

disaster recovery.


6.3 RainStor and Dell Archive Solution for Big Data

Dell's Big Data retention solution combines the Dell DX Object Storage Platform with RainStor's specialized database and compression technology to help significantly reduce the cost of retaining Big Data through extreme data reduction.59,60 RainStor provides a Big Data database that runs natively on Hadoop. As data volumes grow in Hadoop, RainStor's compression and deduplication capabilities increase the effective capacity of each node, reducing the storage footprint by as much as 20-40 times. RainStor's built-in data deduplication

and compression dramatically speed up query and analysis by as much as 10-100 times and

provide high-speed data load at rates of up to multi-billions of records per day.

The RainStor database, which provides online data retention at a massive scale, can be

deployed on any combination of Dell servers and storage, on premises or in a cloud

configuration. When paired with the Dell DX Object Storage platform, the solution provides a

single system for retaining structured, semi-structured, and unstructured data across various

data sources, formats, and types, providing cost savings.61

7. Conclusions

The 3V+V characteristics of Big Data have required development of new types of storage

architecture. Big Data storage architectures span a broad spectrum, from shared-nothing DAS to shared storage (scale-out NAS, object-based storage) to hybrid and converged storage. As Big Data processing is part of the new integrated enterprise data warehouse (EDW)

environment, an ideal storage solution should be able to support all storage functionality needed

for this integrated environment, including data discovery, capture, processing, load, analysis,

data protection, and retention. Such an integrated EDW environment assumes co-existence and symbiosis of traditional EDW storage architectures and new, evolving storage technologies with distributed, massively parallel processing architectures for parsing large data sets. As the business

value of Big Data changes over time, it is important to implement Big Data lifecycle

management. This would control the storage cost by stemming the data growth with data

retention and archiving policies. Along with Big Data protection (backup) solutions, it will also

address the security and regulatory requirements for Big Data.

The scope of the Big Data project determines the type of storage solution. It may be

implemented as Big Data appliances integrating server, networking, and storage resources into

a single enclosure and running analytics software. Or it may be a large multi-system

environment storing and processing hundreds of terabytes or tens of petabytes of data. In all


these cases, incorporation of Big Data architectures into the existing EDW environment should

comply with the company policies for data management and data governance.

As discussed, cost is not the primary factor in Big Data solution selection, as a storage solution can turn Big Data into actionable, time-to-value business knowledge that provides an impressive ROI. However, Big Data does not necessarily mean a big budget. To be able to choose the best

solutions, we as users have to review with vendors – both established EDW vendors and Big

Data startups offering emerging technologies – their development roadmaps for Big Data and

Big Data Analytics so that we can design our own cost-effective Big Data service strategy.


8. References

1. S. Rosenbush and M. Totty. How Big Data Is Changing the Whole Equation for

Business. Wall Street Journal (http://online.wsj.com), March 10, 2013.

2. K. Krishnan. Data Warehousing in the Age of Big Data. Morgan Kaufmann Publishers.

2013.

3. M. A. Beyer and D. Laney. The Importance of 'Big Data': A Definition. Gartner, 2012.

4. Bringing Smarter Computing to Big Data. IBM, 2011.

5. Gartner Inc. Gartner Says the Internet of Things Installed Base Will Grow to 26 Billion

Units By 2020. Press Release. December 12, 2013.

6. New Intelligent Enterprise. The MIT Sloan Management Review and the IBM Institute for

Business Value, 2011.

7. M. Gloukhovtsev. Does the Advent of Cloud Storage Mean “Creation by Destruction” of

Traditional Storage? EMC Proven Professional Knowledge Sharing, 2013.

8. Data Science and Big Data Analytics. EMC Educational Course. EMC, 2012.

9. http://www.gopivotal.com/press-center/11122013-pivotal-one

10. M. Crutcher. Big and Fast Data: The Path To New Business Value. EMC World, 2013.

11. Ditch the Disk: Designing a High-Performance In-Memory Architecture. Terracotta, 2013.

12. SAP HANA Storage Requirements. White Paper. SAP, 2013.

13. Oracle Exalytics In-Memory Machine. Oracle White Paper. Oracle, 2013.

14. Hybrid Storage. White Paper EB-6743. Teradata, 2013.

15. SFA12K Product Family. Datasheet. DataDirect Networks, 2012.

16. M. Murugan. Big Data: A Storage Systems Perspective. 2013 SNIA Analytics and Big

Data Summit.

17. The Data Lake: Turning Big Data into Opportunity. Booz Allen Hamilton, 2012.

18. http://www.gopivotal.com/products/pivotal-hd.

19. http://www.pentahobigdata.com/ecosystem/platforms/hadoop

20. http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html

21. S. Childs and M. Adrian. Big Data Challenges for the IT Infrastructure Team. Gartner,

2012.

22. http://www.nutanix.com/products.html; Hadoop on Nutanix. Reference Architecture.

Nutanix, 2012.

23. http://wiki.apache.org/hadoop/ProjectDescription

24. Big Data Technologies for Near-Real-Time Results. White Paper. Intel, 2013.

25. J. Webster. Storage for Hadoop: A Four-Stage Model. SNW, October 2012.


26. S. Fineberg. Big Data Storage Options for Hadoop. SNW, October 2012.

27. https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces48

28. J. G. Kobielus. The Forrester Wave: Enterprise Hadoop Solutions. Forrester, 2012.

29. S. K Krishnamurthy. Pivotal: Hadoop for Powerful Processing of Unstructured Data for

Valuable Insights. EMC World, 2013.

30. Cisco Common Platform Architecture for Big Data. Solution Brief. Cisco, 2013.

31. B. Yellin. Leveraging Big Data to Battle Cyber Threats – A New Security Paradigm?

EMC Proven Professional Knowledge Sharing, 2013.

32. J. Shih. Hadoop Security Overview. Hadoop & Big Data Technology Conference, 2012.

33. Fast, Low-Overhead Encryption for Apache Hadoop. Solution Brief. Intel, 2013.

34. https://github.com/intel-hadoop/project-rhino

35. EMC Isilon Scale-out NAS for Enterprise Big Data and Hadoop. EMC Forum, 2013.

36. EMC Big Data Storage and Analytics Solution. Solution Overview. EMC, 2012.

37. http://www.netapp.com/us/solutions/big-data/hadoop.aspx.

38. Virtualized Data Center and Cloud Infrastructure. EMC Educational Course for Cloud

Architects. EMC, 2011.

39. http://www.cleversafe.com/overview/how-cleversafe-works

40. https://www.ddn.com/products

41. Atmos Cloud Storage Platform for Big Data in Cloud. EMC World, 2012.

42. D. Robb. EMC World Continues Focus on Big Data, Cloud and Flash. Infostor, May

2011.

43. S. Nanniyur. Fabric Architecture: A Big Idea for the Big Data infrastructure. SNW, April

2012.

44. http://www.vmware.com/products/big-data-extensions

45. Virtualized Hadoop Performance with VMware vSphere 5.1. Technical White Paper.

VMware, 2013.

46. J. Yang and D. Baskett, Virtualize Big Data to Make the Elephant Dance. EMC World,

2013.

47. http://cloud.asperasoft.com/big-data-cloud/

48. S. Manjrekar and G. Maxwell. Big Data Backup Strategies with Data Domain for EMC

Greenplum, EMC Isilon, Teradata & Oracle Exadata. EMC World, 2013.

49. Better Backup for Big Data. Solution Overview. Symantec, 2012.

50. CommVault Simpana OnePass™ Feature. Datasheet. Commvault, 2012.


51. The lifespan of data stored on LTO tape is usually quoted as 30 years. However, tape is

extremely sensitive to storage conditions, and the life expectancy numbers cited by tape

manufacturers assume ideal storage conditions.

52. B. Beach. How long do disk drives last? http://blog.backblaze.com/2013/11/12/how-long-do-disk-drives-last.

53. http://www.spectralogic.com/

54. EMC Backup Meets Big Data. EMC World, 2012.

55. SNIA – 100 Year Archive Requirement Survey. 2007.

56. Adoption Patterns of Disk-Based Backup. IDC Survey. IDC, 2010.

57. D. Pease. Long Term Retention of Big Data. SNIA Analytics and Big Data Summit, 2012.

58. Archive Solutions for the Enterprise with EMC Isilon Scale-out NAS. White Paper. EMC,

2012.

59. RainStor for Hadoop. Solution Brief, RainStor, 2013.

60. M. Cusack. Making the Most of Hadoop with Optimized Data Compression. SNW 2012.

61. R. L. Villars and M. Amaldas. Rethinking Your Data Retention Strategy to Better Exploit

the Big Data Explosion. IDC, 2011.

EMC believes the information in this publication is accurate as of its publication date. The

information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION

MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO

THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED

WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an

applicable software license.