White Paper

UNDERSTANDING BIG DATA AND THE QFABRIC SYSTEM

QFabric System Enables a High-Performance, Scalable Big Data Infrastructure with Simplicity

Copyright © 2012, Juniper Networks, Inc.


The QFabric System from Juniper Networks Enables a High-Performance, Scalable Big Data Infrastructure with Simplicity of Management.


Table of Contents

Executive Summary
Introduction
Big Data Use Cases in Healthcare
Apache Hadoop: The Big Player
The Network’s Role in Hadoop
Data Reliability
Performance
Network Operation
QFabric System Support for Big Data
Big Data Infrastructure Strategy
Operational Simplicity at Scale
Data Reliability at Scale
Performance at Scale
Conclusion
About Juniper Networks

List of Figures

Figure 1: Big data and Hadoop cluster
Figure 2: The network’s role in Hadoop
Figure 3: Comparison of a chassis switch and a QFabric System
Figure 4: An optimized mid-size Hadoop cluster with QFabric System


Executive Summary

In today’s increasingly complex and volatile business environment, organizations need to constantly adopt innovative technologies to compete. Big data refers to a collection of data that is beyond the ability of typical database software tools to collect, process, and deliver. When analyzed properly, big data can provide new business insights, open new markets, and create new competitive advantage in many industries. According to the McKinsey Global Institute (MGI),¹ for example, retailers can realize a potential sixty percent increase in operating margins by fully harnessing big data. In the healthcare industry, big data can reduce costs and enhance patient outcomes in diagnosis and treatment by driving efficiency, transparency, and quality. IDC expects the big data technology and services market to grow from $3.2 billion in 2010 to $16.9 billion in 2015, making it one of the fastest growing areas in the overall information and communication technology (ICT) market.²

Big data has recently become an issue for organizations due to the dramatic increase in data creation and data gathering, driven by a number of technological innovations. The rise of mobile users has increased the aggregation of user statistics in the enterprise and, if properly synthesized and analyzed, these same statistics can provide highly relevant and competitive business intelligence. The increasing use of sensors for everything from traffic patterns and purchasing behaviors to real-time inventory management is another good example of the significant increase in large data sets. Much of this data, gathered in real time, can provide unique and powerful intelligence, especially if it can be analyzed and acted upon quickly.

In contrast to structured data, which has historically been stored in data warehouses and analyzed with Structured Query Language (SQL) analysis tools, big data requires a flat, horizontally scalable database, often with unique query tools that work in real time (as opposed to time-delineated snapshots). IT organizations must invest in new technologies and architectures in order to best leverage and gain advantage from the power of these new massive real-time data streams.

In short, the big data phenomenon raises a challenging question for CIOs and CTOs: What is the best big data infrastructure strategy? Fortunately, as big data pilots launch and business cases solidify, a number of changes are occurring in network architectures that can enhance and help integrate big data processing and insights. Just as big data applications represent a new way of collecting, analyzing, and taking action on business data, the underlying network foundation of big data projects should be considered in a new light. Network architectures can either enhance or inhibit the ability to easily launch, grow, and integrate big data initiatives from pilot projects to large-scale production.

Consider Apache Hadoop, the de facto big data platform. To manage and process data in a server cluster that must scale to thousands of servers, the performance and manageability of the network are critical. In fact, most data center infrastructures, especially those based on multitier networking, face operational and performance challenges in storing and analyzing big data in a mid-size or large Hadoop cluster, which can interconnect thousands of servers. Organizations need a network solution that overcomes these issues and enables them to gain the considerable business intelligence benefits of big data analytics.

Organizations can achieve simplified management, improved performance, and optimized data reliability by using the Juniper Networks® QFabric™ family of products. As a single logical switch, the QFabric System supports over 6,000 10GbE ports and provides a consistent, extremely low latency of less than 5 microseconds even under load.³ This paper provides an overview of big data use cases, discusses the network’s role in Hadoop clusters, and describes how QFabric technology supports a high-performance and scalable big data infrastructure.

¹ For further information on big data use cases, see the McKinsey Global Institute article, Big data: The next frontier for innovation, competition, and productivity, at: www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation.

² For further information concerning the IDC Worldwide Big Data Technology and Services 2012-2015 Forecast, visit www.idc.com/getdoc.jsp?containerId=233485.

³ For further information concerning the QFabric System, visit www.juniper.net/us/en/products-services/switching/qfx-series/qfabric-system/#literature.


Introduction

Big data refers to a volume of data that is beyond the ability of typical database software tools to collect, process, and deliver. When analyzed properly, big data can deliver new business insights, open new markets, and create competitive advantage. Compared to structured data, which historically has been stored in data warehouses and analyzed with SQL analysis tools, big data has three major attributes: variety, volume, and velocity.

First, big data extends beyond structured data and includes semi-structured or unstructured data of all varieties. This can include event data from active machine log files, text from social networking, data streams from financial data services, click streams from customers accessing web applications, activity data from machine-to-machine interchange, and even audio and video files.

Second, big data comes in large data sets because its predictive and analytic power relies on sufficient data points, covering everything from traffic patterns and purchasing behaviors to real-time inventory management. Organizations are awash with data, easily amassing hundreds of terabytes or even petabytes of information.

Third, organizations can maximize their data’s business value by analyzing streaming data as it arrives. The rise of the security event manager (SEM) industry is an example: it centers on gathering, analyzing, and proactively responding to event data from active machine log files in real time, providing unique and powerful business intelligence.

Driven by a combination of technology innovation, ubiquitous social networking, and pervasive mobile devices, the rise of big data has created an inflection point for organizations as they look for innovative ways to do business effectively and economically.

Big Data Use Cases in Healthcare

According to a recent study,⁴ using big data to address spiraling healthcare costs and to make intelligent decisions by analyzing information across departments and providers will greatly improve business efficiency and, in the end, improve patient outcomes. Besides the clear economic benefit, big data is compelling in healthcare because of the sheer volume of raw data generated and the ability to rapidly analyze that data to enhance knowledge and improve the quality of patient care.

• Improved Diagnosis and Treatment. Leveraging big data can greatly aid in diagnosing and treating illnesses, improving patient outcomes, and in many cases saving lives. Considering that up to 25 percent of all diagnoses are not supported by any data analytics, and that 100,000 Americans die as a direct result of medical errors every year, improving patient outcomes is a key focus for the industry.

• Disease Trend Analysis. The global nature of society has had a profound impact on the spread of disease. Regional diseases can quickly develop into global pandemics, which is why the industry is leveraging big data to offer much-needed disease trend analysis. The world’s governments had genuine concern over the H1N1 influenza pandemic of 2009, and while the confirmed deaths (approximately 15,000) were fewer than feared, other diseases such as West Nile virus, mad cow disease, and tuberculosis are not only a present-day concern but unfortunately a serious concern for the future as well. Big data can be used to predict and reduce future pandemics.

• Predictive Analysis. How many lives would have been saved if the industry had been able to see the correlation between Vioxx and heart attacks more quickly? In congressional testimony regarding Vioxx, Dr. David Graham stated that, conservatively, 100,000 people have had heart attacks as a result of using Vioxx, leading to between 30,000 and 40,000 deaths.

The way big data can positively impact healthcare is clear. In fact, the McKinsey Global Institute report (refer to footnote 1) estimated that the healthcare sector could create more than $300 billion in value every year by implementing big data analytics. A significant amount of data is available and ready for analysis, so the time is right to make sure that big data delivers this predicted value.

⁴ For further information, see the article How Big Data Can Mend Our Broken Healthcare System, Ewing Marion Kauffman Foundation (April 2012), at www.smartplanet.com/blog/business-brains/how-big-data-can-mend-our-broken-healthcare-system-study/23728.


Apache Hadoop: The Big Player

Apache Hadoop is a widely deployed platform for managing and processing big data. Many companies, such as IBM, MapR Technologies, and Cloudera, provide commercially licensed Hadoop software stacks with improved performance and/or value-added features. Hadoop includes the following two key functions:

• Hadoop Distributed File System (HDFS) is a large-scale distributed file system that consists of hundreds or thousands of servers, each storing part of the file system’s data. HDFS not only provides high aggregate data bandwidth to handle large datasets, but also tolerates the frequent hardware failures that occur in a large deployment.

• MapReduce is a distributed analytics framework designed for big data. It maps and processes input records on each server, collects and shuffles intermediate results through the network, and reduces them to a final result. The execution time of a MapReduce task, measured from start to completion, can vary from minutes to hours depending on data complexity and the size of the Hadoop cluster.
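The map, shuffle, and reduce phases described above can be illustrated with a minimal single-process sketch in Python. This is not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the word-count job are assumptions chosen for illustration. In a real cluster each phase runs in parallel across many servers, and the shuffle step is the part that crosses the network.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group intermediate values by key.

    In Hadoop this grouping moves data between servers over the network;
    here it is simulated with an in-memory dictionary.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse the values of each key into one final result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    """Run the three phases in order on a list of text records."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    # Hypothetical log lines standing in for machine event data.
    logs = ["error disk failed", "warning disk slow", "error network down"]
    print(word_count(logs))
```

The sketch keeps the phases as separate functions so that the network-dependent step (the shuffle) is easy to identify.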

In a typical Hadoop cluster, as shown in Figure 1, each server may take any combination of the following four roles:

• The Client collects data from internal data sources within the data center or from external data sources on cloud services, submits MapReduce tasks to the Job Tracker, and delivers the analytics results to business applications.

• The Job Tracker schedules the MapReduce tasks and ultimately produces analytics results.

• The Name Node manages HDFS metadata by keeping track of the location of each file’s data blocks throughout the cluster. The default data block size is 64 MB, but it can be set to 512 MB or 1 GB depending on network performance.

• The Data Node holds HDFS data blocks on its local drives and also executes tasks assigned by the Job Tracker. The Hadoop cluster acts as a unified storage and analytics pool with hundreds or thousands of Data Nodes.
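Because the Name Node tracks the location of every block, the block size determines how many blocks it has to manage for a given file. The following sketch shows that relationship for the three block sizes mentioned above. The helper name `block_count`, the use of binary units (1 MB = 2^20 bytes), and the 1 TB example file are assumptions made for illustration.

```python
MB = 1024 ** 2  # assumption: binary megabytes
GB = 1024 ** 3
TB = 1024 ** 4


def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to hold one file (the last block may be partial)."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


if __name__ == "__main__":
    file_size = 1 * TB  # hypothetical input file
    for label, block_size in (("64 MB", 64 * MB), ("512 MB", 512 * MB), ("1 GB", 1 * GB)):
        print(f"{label} blocks: {block_count(file_size, block_size)}")
```

Larger blocks mean fewer metadata entries per file, but each block then represents more data to move when it is written or re-replicated, which is why the choice is tied to network performance.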

Hadoop has been tested and deployed in some of the world’s largest data centers. For example, Yahoo! Inc. launched the Yahoo! Search Webmap in 2008, a Hadoop production application that ran on a Linux cluster of more than 10,000 cores with over 5 petabytes of raw disk space. The analytics result was used in every Yahoo! Web search query.⁵

Figure 1: Big data and Hadoop cluster

[Figure: the big data process flow runs from data collection (sources such as apps, video, log data, multimedia, finance, web, financial market data, and social network data) through storage and analytics in the Hadoop cluster (Clients, Name Node, Job Tracker, and Data Nodes linked by Hadoop network traffic) to delivery of results.]

⁵ For further information, see Yahoo! Launches World’s Largest Hadoop Production Application at http://developer.yahoo.com/blogs/hadoop/posts/2008/02/yahoo-worlds-largest-production-hadoop/.


The Network’s Role in Hadoop

Hadoop can run on most data center networks. However, legacy network architectures are not designed to handle modern distributed application architectures, nor can they deliver the reliability and performance at scale demanded by big data. In fact, in a legacy multitier network design, the three main limiting factors that adversely impact the performance and operation of a mid-size to large Hadoop cluster are network complexity, degraded bandwidth, and inconsistent latency. Fortunately, organizations can achieve better results and reduce risk by running Hadoop on a network designed for a world where server-to-server and server-to-storage traffic outweighs client-to-server traffic.

Figure 2: The network’s role in Hadoop

Data Reliability

A Hadoop cluster is built on a fleet of low-cost, high-capacity disk drives with a network-based data replication and fault tolerance system, simply because today’s RAID technology does not provide the required cost-effective data storage scalability. With replication set to three copies, HDFS places one replica on a Data Node in one rack and the other two copies on two different Data Nodes in a different rack, thereby preventing data loss in the event of rack failure (see B1, B1’, and B1’’ in Figure 2). This suboptimal data placement policy is a response to the significantly lower bandwidth of inter-rack communication compared with intra-rack server-to-server bandwidth.
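The placement rule above can be sketched in a few lines of Python. This is a simplified model, not HDFS source code. It assumes that the two remote copies share a single rack and that replicas are written in a pipeline (writer to first replica, first to second, second to third); the function names and rack labels are hypothetical.

```python
import random


def default_rack_placement(writer_rack, racks, rng=random):
    """Return the rack chosen for each of three replicas.

    Assumed policy: replica 1 stays in the writer's rack; replicas 2 and 3
    go to two different Data Nodes that share one other rack.
    """
    remote_racks = [rack for rack in racks if rack != writer_rack]
    if not remote_racks:
        raise ValueError("rack-aware placement needs at least two racks")
    remote = rng.choice(remote_racks)
    return [writer_rack, remote, remote]


def survives_rack_failure(placement, failed_rack):
    """True if at least one replica lives outside the failed rack."""
    return any(rack != failed_rack for rack in placement)


def inter_rack_transfers(writer_rack, placement):
    """Count pipeline steps (writer -> replica 1 -> 2 -> 3) that change racks."""
    path = [writer_rack] + list(placement)
    return sum(1 for src, dst in zip(path, path[1:]) if src != dst)


if __name__ == "__main__":
    racks = ["rack1", "rack2", "rack3"]
    placement = default_rack_placement("rack1", racks)
    print(placement, inter_rack_transfers("rack1", placement))
```

Under these assumptions only one pipeline step crosses racks, while a policy that used three unique racks would need two. That is the trade-off the paragraph describes: the block survives the loss of any one rack, and inter-rack traffic is kept to a minimum on a network where that bandwidth is scarce.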

HDFS relies on the network to maintain data reliability in the event of failures, which can be a disk drive, server, or network device failure, or a combination of these. When a disk drive fails, the data re-replication needed to restore data reliability can take hours. For example, it takes approximately 66 minutes to transfer 3 TB of data on a 1GbE network without considering network latency. It takes 4 hours and 24 minutes to transfer data when a server that employs four 3 TB disk drives fails. When a subset of Data Nodes loses connectivity with the Name Node, HDFS may become unreliable because the network does not have sufficient bandwidth to re-replicate large amounts of data.

[Figure 2 diagram: a multitier network with core, distribution, and top-of-rack (TOR) tiers connecting Racks 1 to 3 and 8 to 10, each holding Client, Name Node, Job Tracker, and Data Node servers. Numbered network hops (1 to 5) mark the path of Hadoop data replication among block replicas B1, B1’, and B1’’.]


Performance

In a multitier network, network performance characteristics such as latency and bandwidth depend on design considerations and devices. As a result, inter-rack server-to-server network performance varies widely from one network to another of similar size. Although the latency of a typical 10GbE top-of-rack (TOR) switch is around 1 microsecond, the latency of intermediate switches at the distribution and core tiers is significantly higher.

Also, a multitier network does not provide efficient inter-rack server-to-server bandwidth for a Hadoop cluster because of the compounded oversubscription introduced by switches at the distribution and core tiers. For example, when each Data Node is configured with a 20 Gbps connection, the intra-rack server-to-server network bandwidth is 20 Gbps. The average inter-rack server-to-server network bandwidth between Rack 1 and Rack 2 (Figure 2) is 8 Gbps when the TOR switch operates at an oversubscription of 2.5:1, which means that total server communication bandwidth is 2.5 times the inter-rack network bandwidth. With an additional oversubscription of 4:1 on a distribution switch, the compounded oversubscription becomes 10:1, a common deployment scenario in most three-tier networks. As a result, the average inter-rack server-to-server bandwidth for replicating B1 and B1’ can be as low as 2 Gbps.
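The oversubscription arithmetic in this example can be reproduced with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the ratios of successive tiers simply multiply and that bandwidth is shared evenly among servers.

```python
def compounded_oversubscription(ratios):
    """Multiply the oversubscription ratio of each tier on the path."""
    total = 1.0
    for ratio in ratios:
        if ratio < 1:
            raise ValueError("an oversubscription ratio cannot be below 1:1")
        total *= ratio
    return total


def average_inter_rack_gbps(server_link_gbps, ratios):
    """Average per-server bandwidth once every tier's ratio is applied."""
    return server_link_gbps / compounded_oversubscription(ratios)


if __name__ == "__main__":
    # TOR tier only: 20 Gbps at 2.5:1
    print(average_inter_rack_gbps(20, [2.5]))     # 8.0
    # TOR plus distribution tier: 2.5:1 compounded with 4:1 gives 10:1
    print(average_inter_rack_gbps(20, [2.5, 4]))  # 2.0
```

Passing the tiers as a list makes it easy to see how each additional tier on the path reduces the bandwidth available for inter-rack replication.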

Network Operation

A multitier network significantly increases network management complexity. In a multitier network such as the one shown in Figure 2, the TOR switch interconnects servers within the rack. To meet the demand of an increasing number of servers, many intermediate switches are required to build a tree topology of hierarchical switches. Since each switch represents a management endpoint, network redesign is often required for performance assurance, high availability, and capacity planning as the size of the Hadoop cluster grows. Network management also becomes increasingly complex because of the large number of endpoints and the increased number of devices involved in server-to-server connectivity and performance troubleshooting. For example, network provisioning or troubleshooting between B1 and B1’ may involve up to five devices, as shown in Figure 2.

In addition to storing and processing big data, the Hadoop cluster needs to collect data. Most unstructured data, such as event data from active machine log files, can be staged in file systems on an array of servers. However, structured data, such as purchasing transactions and real-time inventory tracking, commonly resides on a Fibre Channel (FC)-based disk array. To rapidly collect the combination of structured and unstructured data in real time, the client nodes (shown in Figure 2) need a converged network that supports direct, optimal access to data through the storage area network (SAN). However, a multitier network may not support such a rich set of storage protocols for rapid data collection.

In an ideal network, the replicated copies of data can be placed in different, unique racks, and the large-scale network can be managed as simply as a single physical switch.


QFabric System Support for Big Data

The Juniper Networks QFX3500 switch can be deployed for a small Hadoop cluster as a high-performance, ultra-low latency switch. The QFX3500 switch is a versatile, compact, high-density 10 Gbps platform in a 1 U form factor that runs the same Juniper Networks Junos® operating system software as other Juniper switches, routers, and security platforms. The QFX3500 delivers feature-rich L2 and L3 connectivity to networked devices such as rack and blade servers, storage systems, and other switches in highly demanding, high-performance data center environments. The QFX3500 offers standards-based Fibre Channel over Ethernet (FCoE) ports to directly access data stored in the Fibre Channel-based SAN.

When deployed with other components of the Juniper Networks QFabric System, the QFX3500 delivers a fabric-ready QFabric Node edge solution.

Figure 3: Comparison of a chassis switch and a QFabric System

The QFabric System has the unique ability to support an entire data center (up to 6,144 10GbE ports) with a single converged Ethernet switch. As shown in Figure 3, similar to a standalone modular switch chassis that has three main functional components (line cards, switch fabric, and routing engines), a QFabric System is composed of three separate components in a distributed architecture.⁶

• QFabric Node: Line card component of a QFabric System, which acts as the entry and exit into the fabric. Up to 128 QFabric Nodes can be interconnected in a single QFabric System.

• QFabric Interconnect: High-speed transport device for interconnecting QFabric Nodes. A QFabric System supports up to 4 QFabric Interconnects.

• QFabric Director: Device controller and services manager that delivers a common window for managing all components as a single device.

The QFabric System is environmentally conscious, allowing enterprises to optimize every facet of the data center network while consuming less power, requiring less cooling, and producing a fraction of the carbon footprint of multitier data center networks. For achieving performance and economies of scale in Hadoop, QFabric technology, with its simplified operation and consistent low latency, is an ideal solution for building a big data network infrastructure that meets different organizations’ needs.

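[Figure 3 diagram: a chassis switch (I/O modules, fabric, and routing engine) shown beside the distributed QFabric switch, where the QFabric Node corresponds to the I/O modules, the QFabric Interconnect to the fabric, and the QFabric Director to the route engine.]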

⁶ For further information, refer to the QFabric Architecture–Implementing a Flat Data Center Network, at www.juniper.net/us/en/local/pdf/whitepapers/2000443-en.pdf.


Big Data Infrastructure Strategy

As big data pilots launch and business cases solidify, a number of changes are occurring in network architectures that can enhance and help integrate big data processing and insights. Just as big data applications represent new ways of collecting, making sense of, and taking action on business data, the underlying network foundation of big data projects should now be considered in a new light. Network architectures can either enhance or inhibit the ability to easily begin, grow, and integrate big data initiatives from pilot to large-scale production.

Table 1 lists three Hadoop cluster deployment sizes and the network capabilities that represent the growing demands for big data infrastructure. In a typical small Hadoop cluster, the QFX3500 provides the following network capabilities for 20 servers in a single rack:

• 40 non-blocking 10GbE ports to interconnect 20 servers, each with dual 10 Gbps connections and NIC teaming

• Any-to-any server bandwidth of up to 20 Gbps

• One network hop between any two servers and extremely low any-to-any latency of less than 1 microsecond

• Feature-rich L2 and L3 connectivity

• One network management platform powered by Junos OS

Table 1: Hadoop Deployments and Network Capability Powered by QFabric System

Hadoop Deployment                                      Small Size Hadoop      Mid-size Hadoop        Large Size Hadoop
HDFS size                                              240 terabytes⁷         2.4 petabytes          24 petabytes
MapReduce processing capability                        240 cores              2,400 cores            24,000 cores

Equipment
Servers⁸                                               20                     200                    2,000
Network devices                                        1 standalone QFX3500   Large QFabric System   Large QFabric System
  QFabric Nodes                                        -                      10                     100
  QFabric Interconnects                                -                      2                      4
  QFabric Directors                                    -                      2                      2
Deployment footprint                                   Up to 1 rack           10 racks (average)     100 racks and more

Network Summary
Hops between servers                                   1                      1                      1
Average inter-rack server communication bandwidth      NA                     8 Gbps                 8 Gbps
Intra-rack server-to-server communication bandwidth    20 Gbps                20 Gbps                20 Gbps
Maximum server-to-server latency                       1 microsecond⁹         5 microseconds         5 microseconds

Operational Simplicity at Scale

As shown in Figure 4, one QFabric System, which consists of 10 QFabric Nodes interconnected by 2 QFabric Interconnects, can support a typical mid-size Hadoop cluster with 200 servers, each with two 10 Gbps connections to QFabric Nodes. The same QFabric System can easily scale up to a large Hadoop cluster of 2,000 servers, which provides 10 times the storage and processing capacity of a mid-size Hadoop cluster, just by adding two QFabric Interconnects and 90 QFabric Nodes.
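The cluster sizes in Table 1 follow directly from the per-server hardware given in the footnotes (two six-core CPUs and four 3 TB drives per server) and from a density of 20 servers per rack. The sketch below recomputes them. The class and function names are hypothetical, and the storage figure is raw disk capacity before any replication overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    """Per-server hardware, taken from the Table 1 footnote."""
    cpus: int = 2
    cores_per_cpu: int = 6
    drives: int = 4
    drive_tb: int = 3


def cluster_summary(servers, spec=ServerSpec(), servers_per_rack=20):
    """Raw storage, core count, and rack count for a cluster of a given size."""
    racks = -(-servers // servers_per_rack)  # ceiling division
    return {
        "raw_hdfs_tb": servers * spec.drives * spec.drive_tb,
        "cores": servers * spec.cpus * spec.cores_per_cpu,
        "racks": racks,
    }


if __name__ == "__main__":
    for name, servers in (("small", 20), ("mid-size", 200), ("large", 2000)):
        print(name, cluster_summary(servers))
```

With 1 petabyte taken as 1,000 terabytes, the results for 200 and 2,000 servers correspond to the 2.4 and 24 petabyte figures in the table, and the rack counts match the number of QFabric Nodes (one per rack) in the mid-size and large deployments.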

⁷ 1 petabyte = 1,000 terabytes.

⁸ Each server is a 2 RU server with two six-core CPUs, 288 GB RAM, four 3 TB hard drives, and a dual 10 Gbps NIC.

⁹ 1 microsecond = 1/1000 millisecond.


Figure 4: An optimized mid-size Hadoop cluster with QFabric System

With a QFabric System, the large Hadoop cluster provides the same performance as a mid-size Hadoop cluster does:

• Intra-rack server communication bandwidth of 20 Gbps (average inter-rack server communication bandwidth is 8 Gbps)

• Feature-rich L2 and L3 connectivity to interconnect servers and other network devices such as routers and firewalls

• One network hop between any two servers and extremely low any-to-any latency of less than 5 microseconds

• Standards-based FCoE ports to directly access data stored in the FC-based SAN

• One network management platform powered by Junos OS

Because all QFabric Nodes are part of one logical switch, network operations such as provisioning and troubleshooting are greatly simplified. For example, network operation between B1 and B1’ involves only one logical device, as shown in Figure 4. In addition to this operational simplicity, the QFabric System also offers benefits in power, cooling, space, and CapEx in mid-size and large Hadoop deployments.

The QFabric System allows a Hadoop cluster to collect data from an FC-based SAN through a converged network. When a QFabric Node is configured as an FCoE transit switch or FCoE gateway, the client nodes can use an Ethernet-based Converged Network Adapter (CNA) to rapidly collect data in the FC-based SAN. This saves the cost of investing in an FC host bus adapter (HBA) for each client node and greatly simplifies network management.

Data Reliability at Scale

Supported by the QFabric System, HDFS can adopt a new data placement policy that places three copies of data into three unique racks without affecting write performance, because inter-rack bandwidth is significantly improved. For example, the average inter-rack network bandwidth between B1 and B1’ is improved to 8 Gbps. The new data placement policy also improves read performance, since data copies can be read in parallel from three different racks. With the QFabric System, organizations can also consider increasing the block size to 256 MB or 512 MB for better performance.

High-performance networks can rapidly restore data reliability and reduce the risk of failure. When a 3 TB disk drive fails, data re-replication takes approximately 7 minutes on a 10GbE network without considering network latency. It takes 27 minutes to re-replicate data when a server that employs four 3 TB disk drives fails. In addition, the lossless 10 Gbps network architecture and high availability features provided by QFabric technology reduce the risk of network failure.

[Figure 4 diagram: one QFabric System in which QFabric Nodes 1 to 3 and 8 to 10 serve Racks 1 to 3 and 8 to 10 (Client, Name Node, Job Tracker, and Data Node servers) and are joined through the QFabric Interconnect. Hadoop data replication among block replicas B1, B1’, and B1’’ crosses a single network hop.]


Performance at Scale

Linear performance scalability of Hadoop allows organizations to predict and plan their infrastructure: by doubling the size of a cluster, organizations can process twice the amount of data in a given time or halve the execution time for a given amount of data. QFabric architecture supports such performance scalability with consistent, extremely low, any-to-any latency and efficient inter-rack bandwidth. By eliminating the intermediate switches, the QFabric system operates at an oversubscription of 2.5:1, which means 400 Gbps of bandwidth within the rack and 160 Gbps of inter-rack bandwidth in a mid-size or large Hadoop cluster that consists of 20 servers per rack, with each server employing two 10 Gbps NICs (shown in Table 1). The intra-rack server communication bandwidth is 20 Gbps, while the average inter-rack server-to-server communication bandwidth is 8 Gbps.
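The bandwidth figures in this paragraph follow from the stated rack layout and oversubscription ratio. The short sketch below reproduces the arithmetic; the variable names are illustrative only.

```python
# Inputs quoted in the text.
servers_per_rack = 20
nics_per_server = 2
nic_gbps = 10
oversubscription = 2.5   # ratio of server-facing to inter-rack bandwidth

# Each server can send 2 x 10 Gbps inside its rack.
server_gbps = nics_per_server * nic_gbps

# Total server-facing bandwidth in one rack.
intra_rack_gbps = servers_per_rack * server_gbps

# Bandwidth leaving the rack at 2.5:1 oversubscription.
inter_rack_gbps = intra_rack_gbps / oversubscription

# Average share of the inter-rack bandwidth per server.
per_server_inter_rack_gbps = inter_rack_gbps / servers_per_rack

print(server_gbps, intra_rack_gbps, inter_rack_gbps, per_server_inter_rack_gbps)
```

Changing `oversubscription` or `servers_per_rack` shows how the per-server inter-rack share moves for other rack designs.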

Conclusion

Hadoop can run on most data center networks. However, legacy network architectures are not designed to handle modern distributed application architectures, nor can they deliver the reliability and performance at scale demanded by big data. Just as big data applications represent a new way of collecting, analyzing, and taking action on business data, using the Juniper Networks QFabric system as the underlying network foundation of big data projects should be considered in a new light. Network architectures can either enhance or inhibit the ability to easily initiate, grow, and integrate big data initiatives from pilot to large-scale production. Thus, organizations should further consider the following key questions:

• If a pilot is successful, how big will the cluster become?

• What is the easiest way to add compute capacity without adding complexity and cost to running a cluster at scale?

• Over the lifetime of the cluster, which Hadoop or other applications will be running on the cluster?

• How do we extend data output or inputs to legacy or other applications?

With the QFabric system, organizations can easily build small, mid-size, and large Hadoop clusters. Compared to the multitiered data center network approach, the QFabric system helps businesses simplify network operation, improve Hadoop performance, and optimize Hadoop data reliability, as shown in Table 2.

Table 2: Hadoop Benefits Comparing Multitier Network Approach and QFabric System

Hadoop Features | Multitiered Network | QFabric System
Operation at scale | Complexity grows as the size of the cluster grows | Simplified
Reliability at scale | Current data placement policy is constrained by the limited inter-rack bandwidth | Optimized
Performance at scale | Suboptimal performance due to ad hoc design | Optimized

With the evolving trend in big data combined with continued growth in data creation, big data analytics demand an elastic data center infrastructure to effectively collect big data, process big data, and deliver actionable information in real time. Juniper Networks QFabric system is a data center solution that offers a high-performance, scalable, big data infrastructure with simplified management. With QFabric technology, CIOs and CTOs no longer need to worry about disruptive transformations in their big data initiatives.


2000483-001-EN June 2012

Copyright 2012 Juniper Networks, Inc. All rights reserved. Juniper Networks, the Juniper Networks logo, Junos, NetScreen, and ScreenOS are registered trademarks of Juniper Networks, Inc. in the United States and other countries. All other trademarks, service marks, registered marks, or registered service marks are the property of their respective owners. Juniper Networks assumes no responsibility for any inaccuracies in this document. Juniper Networks reserves the right to change, modify, transfer, or otherwise revise this publication without notice.

EMEA Headquarters
Juniper Networks Ireland
Airside Business Park
Swords, County Dublin, Ireland
Phone: 35.31.8903.600
EMEA Sales: 00800.4586.4737
Fax: 35.31.8903.601

APAC Headquarters
Juniper Networks (Hong Kong)
26/F, Cityplaza One
1111 King’s Road
Taikoo Shing, Hong Kong
Phone: 852.2332.3636
Fax: 852.2574.7803

Corporate and Sales Headquarters
Juniper Networks, Inc.
1194 North Mathilda Avenue
Sunnyvale, CA 94089 USA
Phone: 888.JUNIPER (888.586.4737) or 408.745.2000
Fax: 408.745.2100
www.juniper.net

To purchase Juniper Networks solutions, please contact your Juniper Networks representative at 1-866-298-6428 or authorized reseller.


About Juniper Networks

Juniper Networks is in the business of network innovation. From devices to data centers, from consumers to cloud providers, Juniper Networks delivers the software, silicon and systems that transform the experience and economics of networking. The company serves customers and partners worldwide. Additional information can be found at www.juniper.net.