
Big Data BOOK



Index

1. About the Workshop

2. The Objectives of Big Data

3. Introduction to Big Data
   3.1 What is Big Data?
   3.2 Why Big Data?
   3.3 3 Vs of Big Data – Volume, Velocity and Variety

4. Big Data – Basics of Big Data Architecture
   4.1 What is Hadoop?
   4.2 What is MapReduce?
   4.3 What is HDFS?

5. Why Learn Big Data?

6. Overview of Big Data Analytics

7. Relationship between Small Data and Big Data

8. Social Media Analysis Including Sentiment Analysis in Big Data

9. Applications of Big Data to Security, DHS Web, Social Networks, Smart Grid

10. Tools for Big Data

11. Projects on Big Data

12. Future of Big Data


About the Workshop

The Big Data workshop is designed to provide the knowledge and skills needed to become a successful Hadoop developer. Big data is no longer an "industry buzzword"; the analytics we glean from it are an "industry necessity" for success. Big data analytics is the process of examining big data to uncover hidden patterns, unknown correlations and other useful information that can be used to make better decisions. The five-day faculty development programme on Big Data aims:

(i) to understand the architectural challenges of storing and accessing Big Data;

(ii) to perform analytics on Big Data for data-intensive applications;

(iii) to analyze Hadoop and related NoSQL tools that provide SQL-like access to unstructured data, examining their critical features, data consistency and ability to scale to extreme volumes;

(iv) to introduce applications of Big Data science in various challenging areas.

It paves the way for academicians and professionals to uncover research issues in the storage and analysis of huge volumes of data from various sources. Let us discover how big Big Data is!


Course Objectives

After completing the 'Big Data and Hadoop' course, you should be able to:

• Master the concepts of the Hadoop Distributed File System and the MapReduce framework.

• Understand the Hadoop 2.x architecture -- HDFS Federation and NameNode High Availability.

• Set up a Hadoop cluster.

• Understand data loading techniques using Sqoop and Flume.

• Program in MapReduce and learn to write complex MapReduce programs.

• Perform data analytics using Pig and Hive.

• Implement HBase, MapReduce integration, advanced usage and advanced indexing.

• Apply best practices for Hadoop development.

• Implement a Hadoop project.

• Work on a real-life Big Data analytics project and gain hands-on project experience.

Who should go for this course?

Big Data is not only an industry buzzword but also a hot research topic, directed towards understanding the numerous techniques for deriving structured data from unstructured text and for performing data analytics. This course is designed for professionals aspiring to make a career in Big Data analytics using the Hadoop framework. Software professionals, analytics professionals, ETL developers, project managers and testing professionals are the key beneficiaries of this course. Other professionals who are looking to acquire a solid foundation in Hadoop architecture can also opt for this course. The Big Data subject has also been introduced in the curriculum, so we feel that this workshop, delivered by eminent faculty from IITs, NITs and industry experts, will be an eye-opener for all faculty. It will also give scope to faculty who are doing, or are willing to do, research in this area and who need to teach the subject and guide major projects in it.


Introduction to Big Data

What is Big Data?

We want to learn Big Data, but we have no clue where and how to start learning about it. Does Big Data really mean the data is big? What tools and software do we need to know to learn Big Data? We often have these questions in our minds. They are good questions, and honestly, when we search online it is hard to find authoritative and authentic answers.

Over the next five days we will understand what is so big about Big Data.

"Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data."

What is BIG DATA?

• Walmart handles more than 1 million customer transactions every hour.

• Facebook handles 40 billion photos from its user base.

• Decoding the human genome originally took 10 years to process; now it can be achieved in one week.

Big Data – Big Thing!

Big Data is becoming one of the most talked-about technology trends nowadays. The real challenge for big organizations is to get the maximum out of the data already available and to predict what kind of data to collect in the future. How to take existing data and make it meaningful so that it provides accurate insight into the past is one of the key discussion points in many executive meetings. With the explosion of data, the challenge has gone to the next level, and Big Data is now becoming a reality in many organizations.


Why Big Data

• Growth of Big Data is driven by:

   – increases in storage capacity

   – increases in processing power

   – availability of data (of many different types)

   – the fact that every day we create 2.5 quintillion bytes of data; 90% of the data in the world today has been created in the last two years alone


• Facebook generates 10 TB of data daily.

• Twitter generates 7 TB of data daily.

• IBM claims 90% of today's stored data was generated in just the last two years.


How Is Big Data Different?

1) Automatically generated by a machine (e.g. a sensor embedded in an engine)

2) Typically an entirely new source of data (e.g. use of the internet)

3) Not designed to be friendly (e.g. text streams)

4) May not have much value
   • Need to focus on the important part


What does Big Data trigger?


Big Data – A Rubik’s Cube

Compare big data with a Rubik's cube and you will see that they have many similarities. Just like a Rubik's cube, it has many different solutions. Let us visualize a Rubik's cube solving challenge with many experts participating. If you take five Rubik's cubes, mix them up the same way and give them to five different experts to solve, it is quite possible that all five will solve the cube in a fraction of the time. But if you pay close attention, you will notice that even though the final outcome is the same, the route taken to solve the Rubik's cube is not. Every expert will start at a different place and will try to solve it with different methods. Some will solve one color first and others will solve another color first. Even though they follow the same kind of algorithm to solve the puzzle, they will start and end at different places, and their moves will differ on many occasions. It is nearly impossible for two experts to take the exact same route.

3 Vs of Big Data – Volume, Velocity and Variety

Data is forever. Think about it – it is indeed true. Are you using any application, as is, that was built 10 years ago? Are you using any piece of hardware that was built 10 years ago? The answer is most certainly no. However, if I ask you whether you are using any data that was captured 50 years ago, the answer is most certainly yes. For example, look at the history of our nation. I am from India, and we have documented history that goes back thousands of years. Or just look at our birthday data – at the very least we are still using it today. Data never gets old, and it is going to stay there forever. The applications that interpret and analyze data have changed, but the data has remained in its purest format in most cases.

As organizations have grown, the data associated with them has also grown exponentially, and today there are lots of complexities to their data. Most big organizations have data in multiple applications and in different formats. The data is also spread out so much that it is hard to categorize with a single algorithm or piece of logic. The mobile revolution that we are experiencing right now has completely changed how we capture data and build intelligent systems. Big organizations are indeed facing challenges in keeping all the data on a platform that gives them a single consistent view of their data. This unique challenge of making sense of all the data coming in from different sources and deriving useful, actionable information out of it is the revolution the Big Data world is facing.


Defining Big Data

The 3Vs that define Big Data are Variety, Velocity and Volume.

Volume

We currently see exponential growth in data storage because data is now more than just text. We find data in the form of videos, music and large images on our social media channels. It is very common for enterprises to have storage systems of terabytes and even petabytes. As the database grows, the applications and architecture built to support the data need to be re-evaluated quite often. Sometimes the same data is re-evaluated from multiple angles, and even though the original data is the same, the newly found intelligence creates an explosion of data. This big volume indeed represents Big Data.

Velocity

The data growth and social media explosion have changed how we look at data. There was a time when we believed that yesterday's data was recent. As a matter of fact, newspapers still follow that logic. However, news channels and radio have changed how fast we receive news. Today, people rely on social media to keep them updated with the latest happenings. On social media, a message that is only a few seconds old (a tweet, a status update, etc.) is sometimes no longer of interest to users. They often discard old messages and pay attention to recent updates. The data movement is now almost real time, and the update window has been reduced to fractions of a second. This high-velocity data represents Big Data.

Variety

Data can be stored in multiple formats: for example, in a database, in Excel, CSV or Access files, or, for that matter, in a simple text file. Sometimes the data is not even in a traditional format as we assume; it may be in the form of video, SMS, PDF or something we might not have thought about. The organization needs to arrange it and make it meaningful. This would be easy if all the data were in the same format, but that is rarely the case. The real world has data in many different formats, and that is the challenge we need to overcome with Big Data. This variety of data represents Big Data.

Big Data in Simple Words

Big Data is not just about lots of data; it is actually a concept that provides an opportunity to find new insight into your existing data, as well as guidelines for capturing and analyzing your future data. It makes any business more agile and robust, so it can adapt and overcome business challenges.

Data in Flat File

In earlier days, data was stored in flat files, and there was no structure to them. If any data had to be retrieved from a flat file, it was a project by itself. There was no way to retrieve data efficiently, and data integrity was just a term discussed without any modeling or structure around it. Databases residing in flat files had more issues than we would like to discuss in today's world; it was more like a nightmare whenever any data processing was involved in an application. Though the applications developed at that time were not that advanced, the need for data was always there, and there was always a need for proper data management.


Edgar F Codd and 12 Rules

Edgar Frank Codd was a British computer scientist who, while working for IBM, invented the relational model for database management, the theoretical basis for relational databases. He presented 12 rules for the relational database, and suddenly the chaotic world of databases seemed to find discipline in those rules. The relational database was a promised land for all the unstructured database users. It brought in relationships between data as well as improved performance of data retrieval. The database world immediately saw a major transformation, and every vendor and database user suddenly started to adopt relational database models.

Relational Database Management Systems

After Edgar F. Codd proposed his 12 rules for the RDBMS, many different vendors started to build applications and tools to support relationships within databases. This was indeed a learning curve for many developers who had never worked with database modeling before. However, as time passed, pretty much everybody accepted relational databases and started to evolve products that perform at their best within the boundaries of the RDBMS concepts. This was the best era for databases, and it gave the world great experts as well as some of the best products. The entity–relationship model also evolved at the same time: in software engineering, an entity–relationship model (ER model) is a data model for describing a database in an abstract way.

Enormous Data Growth

Well, everything was going fine with the RDBMS in the database world. As there were no major challenges, the adoption of RDBMS applications and tools was pretty much universal. At times there was a race to make the developer's life much easier with RDBMS management tools. Due to the extreme popularity and ease of use of these systems, pretty much all data was stored in RDBMS systems. New-age applications were built and social media took the world by storm. Every organization was feeling pressure to provide the best experience for its users based on the data it had. While all this was going on, data was growing in pretty much every organization and application.


Data Warehousing

The enormous data growth now presented a big challenge for organizations that wanted to build intelligent systems based on the data and provide a near-real-time, superior user experience to their customers. Various organizations immediately started building data warehousing solutions where the data was stored and processed.

The trend of business intelligence became an everyday need. Data was received from the transaction systems and processed overnight to build intelligent reports from it. Though this was a great solution, it had its own set of challenges. The relational database model and data warehousing concepts were all built with traditional relational database modeling in mind, and they still face many challenges when unstructured data is present.

Interesting Challenge

Every organization had the expertise to manage structured data, but the world had already moved to unstructured data. There was intelligence in videos, photos, SMS, text, social media messages and various other data sources. All of these now needed to be brought onto a single platform to build a uniform system that does what businesses need.

The way we do business has also changed. There was a time when users only got the features that technology supported; now users ask for features and technology is built to support them. Real-time intelligence from fast-paced data flows is now becoming a necessity.

A large amount (Volume) of diverse (Variety), high-speed data (Velocity): these are the properties of this data. Traditional database systems have limits in resolving the challenges this new kind of data presents; hence the need for Big Data science. We need innovation in how we handle and manage data, and creative ways to capture data and present it to users. Big Data is a reality!


Big Data – Basics of Big Data Architecture

We will now understand the basics of Big Data architecture.

Big Data Cycle

Just like every other database-related application, a big data project has its own development cycle, and the three Vs certainly play an important role in deciding the architecture of Big Data projects. Like every other project, a Big Data project also goes through similar phases of capturing, transforming, integrating and analyzing data, and building actionable reporting on top of the data.

While the process looks almost the same, due to the nature of the data the architecture is often totally different. Here are a few of the questions everyone should ask before going ahead with a Big Data architecture.

Questions to Ask

• How big is your total database?

• What are your reporting requirements in terms of time – real time, semi-real time or at frequent intervals?

• How important is data availability, and what is the plan for disaster recovery?

• What are the plans for network and physical security of the data?

• What platform will be the driving force behind the data, and what are the different service level agreements for the infrastructure?

These are just basic questions, but based on your application and business needs you should come up with a custom list of questions to ask. As mentioned earlier, these questions may look quite simple, but the answers will not be simple. When we are talking about a Big Data implementation, there are many other important aspects to consider when deciding on the architecture.

Building Blocks of Big Data Architecture

It is absolutely impossible to discuss and nail down the most optimal architecture for any Big Data solution in a single page; however, we can discuss the basic building blocks of big data architecture and how they work together. Big data can be stored, acquired, processed, and analyzed in many ways. Every big data source has different characteristics, including the frequency, volume, velocity, type, and veracity of the data. When big data is processed and stored, additional dimensions come into play, such as governance, security, and policies. Choosing an architecture and building an appropriate big data solution is challenging because so many factors have to be considered.


This "Big data architecture and patterns" series presents a structured and pattern-based approach to simplify the task of defining an overall big data architecture. Because it is important to assess whether a business scenario is a big data problem, we include pointers to help determine which business problems are good candidates for big data solutions.


In a Big Data architecture, the various components are closely associated with each other. Because many different data sources are part of the architecture, extract, transform and integration form one of the most essential layers. Most of the data is stored in relational as well as non-relational data marts and data warehousing solutions. As per business needs, the data is processed and converted into proper reports and visualizations for end users. Just like the software, the hardware is almost the most important part of the Big Data architecture: hardware infrastructure is extremely important, and failover instances as well as redundant physical infrastructure are usually implemented.

NoSQL in Data Management

NoSQL is a very famous buzzword, and it really means "Not Only SQL" (sometimes read as "not relational SQL"). This is because in a Big Data architecture the data can be in any format: unstructured, relational, or in any other format or from any other data source. To bring all the data together, relational technology is not enough; hence new tools, architectures and algorithms have been invented that take care of all kinds of data. These are collectively called NoSQL.


What is NoSQL?

NoSQL stands for "Not Only SQL" (sometimes "not relational SQL"). Lots of people think that NoSQL means there is no SQL, which is not true – the two sound the same but the meanings are totally different. NoSQL can use SQL, but it uses more than SQL to achieve its goal. As per Wikipedia's definition: "A NoSQL database provides a mechanism for storage and retrieval of data that uses looser consistency models than traditional relational databases."

Why use NoSQL?

A traditional relational database usually deals with predictable, structured data. As the world has moved forward with unstructured data, we often see the limitations of the traditional relational database in dealing with it. For example, nowadays we have data in the form of SMS messages, wave files, photos and videos. It is a bit difficult to manage them using a traditional relational database. I often see people using BLOB fields to store such data. A BLOB can store the data, but when we have to retrieve or process it, the BLOB is extremely slow at handling the unstructured data. A NoSQL database is the type of database that can handle the unstructured, unorganized and unpredictable data that our business needs.

Along with support for unstructured data, the other advantages of a NoSQL database are high performance and high availability.

Eventual Consistency

Additionally, note that a NoSQL database may not provide 100% ACID (Atomicity, Consistency, Isolation, Durability) compliance. Though many NoSQL databases do not support ACID, they provide eventual consistency. That means that over a period of time all updates can be expected to propagate through the system, and the data will become consistent.
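
To make the idea concrete, here is a minimal Python sketch, not tied to any particular NoSQL product, that simulates a primary replica accepting writes and follower replicas that only catch up when a propagation step runs; reads from a follower can be stale until then. The class, key names and replica count are all made up for illustration.

```python
# Toy illustration of eventual consistency (not a real NoSQL client).
# Writes land on the primary immediately; followers receive them only
# when propagate() runs, so reads from a follower may be stale for a while.

class EventuallyConsistentStore:
    def __init__(self, num_followers=2):
        self.primary = {}
        self.followers = [{} for _ in range(num_followers)]

    def write(self, key, value):
        # The primary acknowledges the write right away.
        self.primary[key] = value

    def read(self, key, replica=0):
        # Clients may be routed to any follower replica.
        return self.followers[replica].get(key, "<not yet replicated>")

    def propagate(self):
        # Background replication: copy the primary's state to every follower.
        for follower in self.followers:
            follower.update(self.primary)


store = EventuallyConsistentStore()
store.write("user:42:status", "gold")

print(store.read("user:42:status"))   # <not yet replicated>  (stale read)
store.propagate()                      # replication eventually catches up
print(store.read("user:42:status"))   # gold  (now consistent everywhere)
```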

Taxonomy

Taxonomy is the practice of classifying things or concepts and the principles behind that classification. The NoSQL taxonomy covers column stores, document stores, key-value stores, and graph databases. We will discuss the taxonomy in detail later. Here are a few examples of each NoSQL category.


• Column: HBase, Cassandra, Accumulo

• Document: MongoDB, Couchbase, Raven

• Key-value: Dynamo, Riak, Azure, Redis, Cache, GT.m

• Graph: Neo4j, Allegro, Virtuoso, Bigdata
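
As a tiny, hedged illustration of one of these categories, the sketch below is a toy in-memory document store written in plain Python (it is not any of the products listed above): documents are free-form dictionaries, so two records need not share a schema, and a query simply filters on whatever fields are present.

```python
# Toy in-memory "document store" to illustrate the NoSQL category above:
# documents are free-form dictionaries, so two records need not share a schema,
# and queries filter on whatever fields happen to be present.

class TinyDocumentStore:
    def __init__(self):
        self.documents = []

    def insert(self, document: dict):
        self.documents.append(document)

    def find(self, **criteria):
        # Return every document whose fields match all of the criteria.
        return [doc for doc in self.documents
                if all(doc.get(field) == value for field, value in criteria.items())]


store = TinyDocumentStore()
store.insert({"type": "tweet", "user": "alice", "text": "loving #bigdata"})
store.insert({"type": "photo", "user": "bob", "width": 1024, "height": 768})  # different fields

print(store.find(user="alice"))
```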

What is Hadoop?

Apache Hadoop is an open-source, free, Java-based software framework that offers a powerful distributed platform to store and manage Big Data. It is licensed under the Apache v2 license. It runs applications on large clusters of commodity hardware and can process thousands of terabytes of data across thousands of nodes. Hadoop is inspired by Google's MapReduce and Google File System (GFS) papers. A major advantage of the Hadoop framework is that it provides reliability and high availability.

What are the core components of Hadoop?

There are two major components of the Hadoop framework, and each of them performs one of its most important tasks.

Hadoop MapReduce is the method of splitting a larger data problem into smaller chunks and distributing them to many different commodity servers. Each server has its own set of resources and processes its chunk locally. Once a commodity server has processed its data, it sends the result back to the main server, where the results are collected. This is effectively a process for handling large data effectively and efficiently. (We will look at this in more detail later.)

Hadoop Distributed File System (HDFS) is a virtual file system, and there is a big difference between it and other file systems. When we move a file onto HDFS, it is automatically split into many small pieces. These small chunks of the file are replicated and stored on other servers (usually three) for fault tolerance and high availability. (We will also cover this in detail later.)

Besides the above two core components, the Hadoop project also contains the following modules:

• Hadoop Common: common utilities for the other Hadoop modules

• Hadoop YARN: a framework for job scheduling and cluster resource management


There are a few other projects related to Hadoop (like Pig and Hive) as well, which we will gradually explore later.

A Multi-node Hadoop Cluster Architecture

Now let us quickly look at the architecture of a multi-node Hadoop cluster.

A small Hadoop cluster includes a single master node and multiple worker or slave nodes. As discussed earlier, the entire cluster contains two layers: the MapReduce layer and the HDFS layer, and each of these layers has its own relevant components. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node consists of a DataNode and a TaskTracker. It is also possible for a slave or worker node to be a data-only or compute-only node; in fact, that is one of the key features of Hadoop.

Why Use Hadoop?

There are many advantages of using Hadoop. Let me quickly list them here:

• Robust and Scalable – We can add new nodes as needed, as well as modify them.

• Affordable and Cost Effective – We do not need any special hardware to run Hadoop; we can just use commodity servers.

• Adaptive and Flexible – Hadoop is built keeping in mind that it will handle structured and unstructured data.

• Highly Available and Fault Tolerant – When a node fails, the Hadoop framework automatically fails over to another node.


Why is Hadoop named Hadoop?

Hadoop was created in 2005 by Doug Cutting and Mike Cafarella. Doug Cutting, who later worked on it at Yahoo, named Hadoop after his son's toy elephant.

What is MapReduce?

MapReduce was designed by Google as a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Though MapReduce was originally a proprietary Google technology, it has become quite a generalized term in recent times.

MapReduce comprises a Map() and a Reduce() procedure. The Map() procedure performs filtering and sorting operations on data, whereas the Reduce() procedure performs a summary operation on the data. This model is based on modified concepts of the map and reduce functions commonly available in functional programming. Libraries implementing the Map() and Reduce() procedures have been written in many different languages. The most popular free implementation of MapReduce is Apache Hadoop, which we explore throughout this workshop.
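
As a minimal illustration of these two procedures, the sketch below (plain Python, no Hadoop required, with made-up sample lines) applies a map function that emits (word, 1) pairs and a reduce function that sums the counts for each word; the grouping step in between mimics the shuffle that a real MapReduce framework performs automatically.

```python
from collections import defaultdict

# Map(): filter/transform each input record into intermediate (key, value) pairs.
def map_fn(line):
    for word in line.lower().split():
        yield (word, 1)

# Reduce(): summarize all values that share the same key.
def reduce_fn(word, counts):
    return (word, sum(counts))

documents = [
    "big data is big",
    "hadoop handles big data",
]

# Group intermediate pairs by key -- the "shuffle" a real framework performs for us.
grouped = defaultdict(list)
for line in documents:
    for word, count in map_fn(line):
        grouped[word].append(count)

word_totals = [reduce_fn(word, counts) for word, counts in grouped.items()]
print(sorted(word_totals))   # [('big', 3), ('data', 2), ...]
```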


Advantages of MapReduce Procedures

The MapReduce framework usually spans distributed servers and runs various tasks in parallel with each other. There are various components that manage the communication between the various data nodes and provide high availability and fault tolerance.

Programs written in the MapReduce functional style are automatically parallelized and executed on commodity machines. The MapReduce framework takes care of the details of partitioning the data and executing the processes on distributed servers at run time. If there is any failure during this process, the framework provides high availability, and other available nodes take over the responsibility of the failed node.

As you can clearly see, the MapReduce framework provides much more than just the Map() and Reduce() procedures; it provides scalability and fault tolerance as well. A typical implementation of the MapReduce framework processes many petabytes of data across thousands of processing machines.

How Does the MapReduce Framework Work?

A typical MapReduce deployment handles petabytes of data across thousands of nodes. Here is a basic explanation of the MapReduce procedures, which use this massive pool of commodity servers.

Map() Procedure

There is always a master node in this infrastructure which takes the input. Right after taking the input, the master node divides it into smaller sub-inputs or sub-problems. These sub-problems are distributed to worker nodes. A worker node then processes them and does the necessary analysis. Once a worker node has finished with its sub-problem, it returns the result to the master node.

Reduce() Procedure

All the worker nodes return the answers to the sub-problems assigned to them to the master node. The master node collects these answers and aggregates them into the answer to the original big problem that was assigned to it.

The MapReduce framework runs the above Map() and Reduce() procedures in parallel and independently of each other. All the Map() procedures can run in parallel, and once each worker node has completed its task it can send the result back to the master node to be compiled into a single answer. This procedure can be very effective when it is implemented on a very large amount of data (Big Data).

The MapReduce Framework has five different steps:

1. Preparing the Map() input
2. Executing the user-provided Map() code
3. Shuffling the Map output to the Reduce processors
4. Executing the user-provided Reduce() code
5. Producing the final output

Here is the dataflow of the MapReduce framework (a minimal sketch in Python follows below):

Input Reader → Map Function → Partition Function → Compare Function → Reduce Function → Output Writer
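
The following Python sketch walks through that dataflow end to end on a single machine, using a multiprocessing pool as a stand-in for the worker nodes. The input lines are made up, and the input reader, partition (shuffle), sort and output writer steps are simplified local analogues of what a real MapReduce framework does across a cluster.

```python
from collections import defaultdict
from multiprocessing import Pool

# Map step: turn one input line into intermediate (key, value) pairs.
def map_words(line):
    return [(word, 1) for word in line.lower().split()]

# Reduce step: combine all values that were shuffled to the same key.
def reduce_counts(item):
    word, counts = item
    return (word, sum(counts))

if __name__ == "__main__":
    # Input reader: in a real job this would read splits from HDFS.
    lines = [
        "big data is big",
        "hadoop processes big data",
        "mapreduce scales out",
    ]

    with Pool() as pool:
        # Map functions run in parallel across worker processes.
        mapped = pool.map(map_words, lines)

        # Partition/compare (shuffle and sort): group intermediate pairs by key.
        grouped = defaultdict(list)
        for pairs in mapped:
            for word, count in pairs:
                grouped[word].append(count)

        # Reduce functions also run in parallel, one key per task.
        reduced = pool.map(reduce_counts, sorted(grouped.items()))

    # Output writer: here we just print the final (word, count) records.
    for word, total in reduced:
        print(word, total)
```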

MapReduce in a Single Statement

MapReduce is roughly equivalent to SELECT and GROUP BY in a relational database, applied to a very large dataset.
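
A quick, self-contained way to see this equivalence is to run the same aggregation both ways: once as SQL with GROUP BY (using Python's built-in sqlite3 module as a stand-in relational database) and once as a hand-rolled map-and-reduce over the same rows. The table name and sample rows are invented for the example.

```python
import sqlite3
from collections import defaultdict

rows = [("alice", 120), ("bob", 80), ("alice", 40), ("carol", 300), ("bob", 20)]

# --- The relational way: SELECT ... GROUP BY on a tiny in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
sql_totals = dict(conn.execute(
    "SELECT customer, SUM(amount) FROM purchases GROUP BY customer"))

# --- The MapReduce way: map each row to (key, value), then reduce per key.
grouped = defaultdict(list)
for customer, amount in rows:          # map: emit (customer, amount)
    grouped[customer].append(amount)
mr_totals = {customer: sum(amounts)    # reduce: sum the values per key
             for customer, amounts in grouped.items()}

print(sql_totals == mr_totals)  # True -- same result, two execution models
```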

What is HDFS?

HDFS stands for Hadoop Distributed File System, and it is the primary storage system used by Hadoop. It provides high-performance access to data across Hadoop clusters. It is usually deployed on low-cost commodity hardware, where server failures are very common; for that reason HDFS is built to have high fault tolerance. The data transfer rate between compute nodes in HDFS is very high, which reduces the impact of any failure.

HDFS breaks big data into smaller pieces (blocks) and distributes them across different nodes. It also copies each smaller piece multiple times onto different nodes. Hence, when any node holding the data crashes, the system is automatically able to use the data from a different node and continue processing. This is the key feature of the HDFS system.
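
The sketch below is a toy Python simulation of that idea, not the real HDFS code: it splits a byte string into fixed-size blocks, assigns each block to three of the available nodes, and shows that losing one node still leaves every block readable. The 64-byte block size and the node names are made up for illustration; real HDFS blocks are typically 128 MB.

```python
import itertools

BLOCK_SIZE = 64          # toy value; real HDFS blocks are typically 128 MB
REPLICATION_FACTOR = 3   # HDFS default: each block is stored on 3 DataNodes

nodes = ["datanode-1", "datanode-2", "datanode-3", "datanode-4"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` nodes using simple round-robin placement."""
    placement = {}
    node_cycle = itertools.cycle(nodes)
    for block_id, _ in enumerate(blocks):
        placement[block_id] = {next(node_cycle) for _ in range(replication)}
    return placement

file_data = b"x" * 300                      # a pretend 300-byte "big" file
blocks = split_into_blocks(file_data)
placement = place_blocks(blocks, nodes)
print(placement)                            # e.g. {0: {'datanode-1', ...}, ...}

# Simulate a crash: even with one node gone, every block still has a live replica.
failed = "datanode-2"
still_readable = all(replicas - {failed} for replicas in placement.values())
print("all blocks still readable after failure:", still_readable)
```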


Architecture of HDFS

The architecture of HDFS is a master/slave architecture. An HDFS cluster always consists of a single NameNode. This single NameNode is a master server; it manages the file system namespace and regulates access to the various files. In addition to the NameNode there are multiple DataNodes, with one DataNode for each data server. In HDFS, a big file is split into one or more blocks, and those blocks are stored in a set of DataNodes.

The primary task of the NameNode is to open, close or rename files and directories and to regulate access to the file system, whereas the primary task of the DataNode is to read from and write to the file system. The DataNode is also responsible for the creation, deletion and replication of data based on instructions from the NameNode.

In reality, the NameNode and DataNode are pieces of software, written in Java, designed to run on commodity machines.

Visual Representation of HDFS Architecture

Let us walk through how HDFS works. A client application (the HDFS client) connects to the NameNode as well as to the DataNodes. The client's access to the DataNodes is regulated by the NameNode, which then allows the client to connect to the appropriate DataNodes directly. A big data file is divided into multiple data blocks (let us assume the data chunks are A, B, C and D). The client application then writes the data blocks directly to the DataNodes. The client does not have to write directly to all the nodes; it just has to write to any one of them, and the NameNode decides on which other DataNodes the data will be replicated. In our example the client writes directly to DataNode 1 and DataNode 3, and the data chunks are automatically replicated to other nodes. All the bookkeeping information, such as which data block is placed on which DataNode, is written back to the NameNode.
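
Here is a small Python sketch of just that bookkeeping role, a toy rather than the real NameNode protocol: a NameNode object picks which DataNodes should hold each block and records the mapping, while the client only needs to contact the nodes it is told about. The node names, file name and block-id format are all made up for the example.

```python
import random

# Toy NameNode: decides block placement and remembers which DataNode holds what.
class ToyNameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = replication
        self.block_map = {}  # block id -> list of DataNode names

    def allocate_block(self, filename, block_index):
        block_id = f"{filename}#blk{block_index}"
        # Choose distinct DataNodes to hold the replicas of this block.
        targets = random.sample(self.datanodes, self.replication)
        self.block_map[block_id] = targets
        return block_id, targets

    def locate(self, block_id):
        return self.block_map[block_id]


namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4", "dn5"])

# The client asks the NameNode where to put blocks A-D, then writes to those nodes.
for i, chunk in enumerate(["A", "B", "C", "D"]):
    block_id, targets = namenode.allocate_block("bigfile.dat", i)
    print(f"chunk {chunk}: write {block_id} to {targets}")

# Later, a reader asks the NameNode where a block lives.
print(namenode.locate("bigfile.dat#blk0"))
```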


High Availability During Disaster

Now, as multiple DataNodes have the same data blocks, if any DataNode faces a disaster, the entire process will continue: another DataNode will assume the role of serving the specific data blocks that were on the failed node. This system provides very high tolerance to disaster and high availability.

If you notice, there is only a single NameNode in our architecture. If that node fails, our entire Hadoop application will stop working, as it is the single node where we store all the metadata. Because this node is so critical, it is usually replicated onto another machine, often on another data rack. Though that replica node is not operational in the architecture, it has all the necessary data to take over the task of the NameNode in case the NameNode fails.

The entire Hadoop architecture is built to function smoothly even when there are node failures or hardware malfunctions. It is built on the simple premise that the data is so big that it is impossible to come up with a single piece of hardware that can manage it properly. We need lots of commodity (cheap) hardware to manage our big data, and hardware failure is a fact of life with commodity servers. To reduce the impact of hardware failure, the Hadoop architecture is built to work around non-functioning hardware.

Big Questions

Here are a few questions we have often received since the beginning of this Big Data series:

• Does the relational database have no place in the story of Big Data?

• Is the relational database no longer relevant as Big Data evolves?

• Is the relational database not capable of handling Big Data?

• Is it true that one no longer has to learn about relational data if Big Data is the final destination?


Well, time and again we hear that someone wants to learn about Big Data and is no longer interested in learning about relational databases.

It is very clear that anyone aspiring to become a Big Data scientist or Big Data expert should also learn about relational databases.

NoSQL Movement

The reason for the NoSQL movement in recent times is the two important advantages of NoSQL databases:

1. Performance
2. Flexible schema

In my personal experience I have found both of the above advantages when I use a NoSQL database. There are instances when I have found the relational database too restrictive, because my data was unstructured or was of a datatype my relational database did not support, and in those same cases I have found NoSQL solutions performing much better than relational databases. I must say that I am a big fan of NoSQL solutions these days, but I have also seen occasions and situations where the relational database is still the perfect fit, even though the database is growing rapidly and has all the symptoms of big data.

Situations Where the Relational Database Outperforms

Ad-hoc reporting is one of the most common scenarios where NoSQL does not have an optimal solution. For example, reporting queries often need to aggregate on columns that are not indexed and that are computed while the report is running; in this kind of scenario NoSQL databases (document stores, distributed key-value stores) often do not perform well. In the case of ad-hoc reporting I have often found it much easier to work with relational databases.

SQL is one of the most popular computer languages of all time. We have been using it for well over 10 years, and we feel that we will be using it for a long time to come. There are plenty of tools, connectors and general awareness of the SQL language in the industry. Pretty much every programming language has drivers written for SQL, and most developers learned the language during their school or college years. In many cases, writing a query in SQL is much easier than writing queries in NoSQL-supported languages. I believe this is the current situation, but in the future it could reverse when NoSQL query languages become equally popular.


ACID (Atomicity, Consistency, Isolation, Durability) – Not all NoSQL solutions offer ACID compliance. There are always situations (for example, banking transactions, e-commerce shopping carts, etc.) where, without ACID, operations can be invalid and database integrity can be at risk. Even though the data volume may indeed qualify as Big Data, there are always operations in the application that absolutely need a mature, ACID-compliant system.

The Mixed Bag

We have often heard the argument that all the big social media sites nowadays have moved away from relational databases. Actually, this is not entirely true. While researching Big Data and relational databases, I have found that many of the popular social media sites use Big Data solutions along with relational databases. Many are using relational databases to deliver results to end users at run time, and many still use a relational database as their major backbone.

Here are a few examples:

• Facebook uses MySQL to display the timeline.

• Twitter uses MySQL.

• Tumblr uses sharded MySQL.

• Wikipedia uses MySQL for data storage.

There are many more prominent organizations running large-scale applications that use a relational database along with various Big Data frameworks to satisfy their various business needs.

I believe that the RDBMS is like vanilla ice cream: everybody loves it and everybody has it. NoSQL and other solutions are like chocolate or custom ice cream – there is a huge base that loves and wants them, but not every ice cream maker can make it just right for everyone's taste. No matter how fancy an ice cream store is, plain vanilla ice cream is always available there. In just the same way, there are always cases and situations in the Big Data story where the traditional relational database is part of the whole picture. In real-world scenarios there will always be cases where relational database concepts and their ideology are needed. It is extremely important to accept the relational database as one of the key components of Big Data instead of treating it as a substandard technology.

Ray of Hope – NewSQL

In this module we discussed that there are places where we need ACID compliance from our Big Data application, and NoSQL will not support that out of the box. A new term has been coined for applications and tools that support most of the properties of the traditional RDBMS while also supporting Big Data infrastructure – NewSQL.


What is NewSQL?

NewSQL stands for the new scalable, high-performance SQL database vendors. The products sold by NewSQL vendors are horizontally scalable. NewSQL is not a kind of database; it is about vendors who support emerging data products with relational database properties (like ACID and transactions) along with high performance. Products from NewSQL vendors usually keep data in memory for speedy access and offer immediate scalability. The term NewSQL was coined by 451 Group analyst Matthew Aslett.

On the definition of NewSQL, Aslett writes: "NewSQL is our shorthand for the various new scalable/high performance SQL database vendors. We have previously referred to these products as 'ScalableSQL' to differentiate them from the incumbent relational database products. Since this implies horizontal scalability, which is not necessarily a feature of all the products, we adopted the term 'NewSQL' in the new report. And to clarify, like NoSQL, NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is the vendor, not the SQL."

In other words, NewSQL incorporates the concepts and principles of Structured Query Language (SQL) and NoSQL languages. It combines the reliability of SQL with the speed and performance of NoSQL.

Categories of NewSQL

There are three major categories of NewSQL:

• New architectures – In these systems each node owns a subset of the data, and queries are split into smaller queries that are sent to the nodes to process the data. E.g. NuoDB, Clustrix, VoltDB.

• MySQL engines – Highly optimized storage engines for SQL that keep the MySQL interface. E.g. InnoDB, Akiban.

• Transparent sharding – These systems automatically split the database across multiple nodes. E.g. ScaleArc.


Why Learn Big Data?

Big Data! A Worldwide Problem?

Billions of Internet users and machine-to-machine connections are causing a tsunami of data growth. Utilizing big data requires transforming your information infrastructure into a more flexible, distributed, and open environment. Intel offers a choice of big data solutions based on industry-standard chips, servers, and the Apache Hadoop framework.

According to Wikipedia, "Big data is a collection of large and complex data sets which becomes difficult to process using on-hand database management tools or traditional data processing applications." In simpler terms, Big Data is a term given to the large volumes of data that organizations store and process. However, it is becoming very difficult for companies to store, retrieve and process the ever-increasing data. If any company gets a handle on managing its data well, nothing can stop it from becoming the next BIG success!

Big data is going to change the way you do things in the future, how you gain insight, and how you make decisions (the change isn't going to be a replacement, but rather a synergy and extension). The problem lies in the use of traditional systems to store enormous data. Though these systems were a success a few years ago, with the increasing amount and complexity of data they are quickly becoming obsolete. The good news is Hadoop, which is nothing less than a panacea for all those companies working with BIG DATA in a variety of applications, and which has become an integral part of storing, handling, evaluating and retrieving hundreds of terabytes or even petabytes of data.

Apache Hadoop! A Solution for Big Data! 

Hadoop is an open source software framework that supports data-intensive distributed applications. Hadoop is licensed under the Apache v2 license and is therefore generally known as Apache Hadoop. Hadoop was developed based on a paper originally written by Google on the MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and is a top-level Apache project built and used by a global community of contributors. Hadoop was developed by Doug Cutting and Michael J. Cafarella. And don't overlook the charming yellow elephant logo, which is named after Doug's son's toy elephant!

Some of the top companies using Hadoop: 

The importance of Hadoop is evident from the fact that many global MNCs use Hadoop and consider it an integral part of their functioning – companies like Yahoo! and Facebook. On February 19, 2008, Yahoo! Inc. launched what was then the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and generates data that is used in every Yahoo! Web search query.


Facebook, a $5.1 billion company, had over 1 billion active users in 2012, according to Wikipedia. Storing and managing data of such magnitude could have been a problem, even for a company like Facebook – but thanks to Apache Hadoop it is not. Facebook uses Hadoop to keep track of each and every profile it hosts, as well as all the related data such as images, posts, comments and videos.

Opportunities for Hadoopers! 

Opportunities for Hadoopers are infinite – from Hadoop developer, to Hadoop tester, to Hadoop architect, and so on. If cracking and managing Big Data is your passion in life, then think no more: join Edureka's Hadoop online course and carve a niche for yourself! Happy Hadooping!


Overview of Big Data Analytics

Big data analytics is the process of examining large amounts of data of a variety of types to uncover hidden patterns, unknown correlations, and other useful information. Such information can provide competitive advantages over rival organizations and result in business benefits, such as more effective marketing and increased revenue. New methods of working with big data, such as Hadoop and MapReduce, offer alternatives to traditional data warehousing.

Big Data analytics with R and Hadoop focuses on techniques for integrating R and Hadoop through various tools such as RHIPE and RHadoop. A powerful data analytics engine can be built that processes analytics algorithms over a large-scale dataset in a scalable manner. This can be implemented through the data analytics operations of R combined with the MapReduce and HDFS components of Hadoop.

Big Data sources: users, applications, systems and sensors, all producing large and growing files (big data files).

The primary goal of big data analytics is to help companies make more informed business decisions by enabling data scientists, predictive modelers and other analytics professionals to analyze large volumes of transaction data, as well as other forms of data that may be untapped by conventional business intelligence (BI) programs. That could include Web server logs and Internet clickstream data, social media content and social network activity reports, text from customer emails and survey responses, mobile-phone call detail records and machine data captured by sensors connected to the Internet of Things. Some people exclusively associate big data with semi-structured and unstructured data of that sort, but consulting firms like Gartner Inc. and Forrester Research Inc. also consider transactions and other structured data to be valid components of big data analytics applications.

Big data can be analyzed with the software tools commonly used as part of advanced analytics disciplines such as predictive analytics, data mining, text analytics and statistical analysis. Mainstream BI software and data visualization tools can also play a role in the analysis process. But the semi-structured and unstructured data may not fit well in traditional data warehouses based on relational databases. Furthermore, data warehouses may not be able to handle the processing demands posed by sets of big data that need to be updated frequently or even continually -- for example, real-time data on the performance of mobile applications or of oil and gas pipelines. As a result, many organizations looking to collect, process and analyze big data have turned to a newer class of technologies that includes Hadoop and related tools such as YARN, MapReduce, Spark, Hive and Pig, as well as NoSQL databases. Those technologies form the core of an open source software framework that supports the processing of large and diverse data sets across clustered systems. In some cases, Hadoop clusters and NoSQL systems are being used as landing pads and staging areas for data before it gets loaded into a data warehouse for analysis, often in a summarized form that is more conducive to relational structures. Increasingly though, big data vendors are pushing the concept of a Hadoop data lake that serves as the central repository for an organization's incoming streams of raw data. In such architectures, subsets of the data can then be filtered for analysis in data warehouses and analytical databases, or they can be analyzed directly in Hadoop using batch query tools, stream processing software and SQL-on-Hadoop technologies that run interactive, ad hoc queries written in SQL.

Potential pitfalls that can trip up organizations on big data analytics initiatives include a lack of internal analytics skills and the high cost of hiring experienced analytics professionals. The amount of information that's typically involved, and its variety, can also cause data management headaches, including data quality and consistency issues. In addition, integrating Hadoop systems and data warehouses can be a challenge, although various vendors now offer software connectors between Hadoop and relational databases, as well as other data integration tools with big data capabilities.


Data generation points (examples): mobile devices, readers/scanners, science facilities, microphones, cameras, social media, programs/software.

Big Data Analytics

• Examining large amounts of data

• Extracting appropriate information

• Identifying hidden patterns and unknown correlations

• Competitive advantage

• Better business decisions: strategic and operational

• Effective marketing, customer satisfaction, increased revenue


Applications

Twitter Data Analysis: Twitter data analysis is used to understand the hottest trends by delving into Twitter data. Using Flume, data is fetched from Twitter into Hadoop in JSON format. Using a JSON SerDe, the Twitter data is read and fed into Hive tables so that we can run different analyses using Hive queries, for example finding the top 10 most popular tweets.
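
While the project above uses Flume and Hive, the same kind of question can be prototyped locally with a few lines of Python. The sketch below assumes a hypothetical tweets.json file of newline-delimited tweet objects (the "text" field name follows the usual Twitter JSON convention) and counts the ten most common hashtags.

```python
import json
from collections import Counter

# Hypothetical input: one JSON tweet object per line, as Flume would land it in HDFS.
TWEETS_FILE = "tweets.json"

hashtag_counts = Counter()
with open(TWEETS_FILE, encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        text = tweet.get("text", "")
        # Count every token that looks like a hashtag.
        hashtag_counts.update(
            token.lower() for token in text.split() if token.startswith("#")
        )

# "Top 10" analysis, analogous to a Hive query with GROUP BY + ORDER BY + LIMIT 10.
for tag, count in hashtag_counts.most_common(10):
    print(f"{tag}\t{count}")
```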

Stack Exchange Ranking and Percentile dataset: Stack Exchange is a place where you will find enormous amounts of open data from the multiple websites of the Stack Exchange network (such as Stack Overflow). It is a gold mine for people who want to build proof-of-concept projects and are searching for suitable datasets. There you can query out the data you are interested in, which can contain more than 50,000-odd records. For example, you can download Stack Overflow rank and percentile data and find the top 10 rankers.

Loan Dataset: This project is designed to find good and bad URL links based on the reviews given by users. The primary data will be highly unstructured. Using MapReduce jobs, the data will be transformed into a structured form and then pumped into Hive tables. Using Hive queries we can query out the information very easily. In phase two we will feed another dataset, which contains the corresponding cached web pages of the URLs, into HBase. Finally, the entire project is showcased in a UI where you can check the ranking of a URL and view its cached page.

Datasets published by government: these could be, for example, the Worker Population Ratio (per 1000) for persons aged 15-59 years according to the current weekly status approach for each state/UT.

Machine learning datasets such as the Badges dataset: such a dataset is for a system that encodes names, for example a +/- label followed by a person's name.


Weather Dataset: It has all the details of weather over a period of time, using which you may find the highest, lowest or average temperature.


Social Media analysis including sentiment analysis in Big Data

Today's consumers are heavily involved in social media, with users having accounts on multiple social media services. Social media gives users a platform to communicate effectively with friends, family, and colleagues, and also gives them a platform to talk about their favorite (and least favorite) brands. This "unstructured" conversation can give businesses valuable insight into how consumers perceive their brand, and allow them to actively make business decisions to maintain their image.

Social Media

Historically, unstructured data has been very difficult to analyze using traditional data warehousing technologies. New cost-effective solutions, such as Hadoop, are changing this and allowing data of high volume, velocity, and variety to be much more easily analyzed. Hadoop is a massively parallel technology designed to be cost effective by running on commodity hardware. Today businesses can use Microsoft's Hadoop implementation, HDInsight, and SQL Server 2012 to effectively understand and analyze unstructured data, such as social media feeds, alongside existing key performance indicators.

Where do I start?

Posts can be downloaded and loaded into Hadoop using familiar tools like SQL Server Integration Services, or purpose-built tools like Apache Flume. Data can often be gathered for free directly from a social media service's public application programming interfaces, though sometimes there are limitations, or from an aggregation service, such as DataSift, which pulls many sources together into a standard format.

Social networks like Twitter and Facebook manage hundreds of millions of interactions each day. Because of this large volume of traffic, the first step in analyzing social media is to understand the scope of data that needs to be collected for analysis. Quite often the data can be limited to certain hashtags, accounts, and keywords (see the sketch below).
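
Here is a small, hedged Python sketch of that scoping step. The file names, field names and watch lists are hypothetical; in practice the filtered output would be landed in HDFS (for example by Flume or an SSIS package) rather than written to a local file.

```python
import json

# Hypothetical input/output files; one JSON post per line, with a "text" field.
RAW_POSTS = "posts.json"
SCOPED_POSTS = "scoped_posts.json"

WATCHED_HASHTAGS = {"#ourbrand", "#ourproduct"}
WATCHED_KEYWORDS = {"warranty", "support", "recall"}

def in_scope(text: str) -> bool:
    """Keep only posts that mention a watched hashtag or keyword."""
    tokens = {token.lower() for token in text.split()}
    return bool(tokens & WATCHED_HASHTAGS) or any(
        keyword in text.lower() for keyword in WATCHED_KEYWORDS
    )

kept = 0
with open(RAW_POSTS, encoding="utf-8") as src, \
        open(SCOPED_POSTS, "w", encoding="utf-8") as dst:
    for line in src:
        if not line.strip():
            continue
        post = json.loads(line)
        if in_scope(post.get("text", "")):
            dst.write(json.dumps(post) + "\n")
            kept += 1

print(f"kept {kept} in-scope posts")
```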

MapReduce

Once the data is loaded into Hadoop, the next step is to transform it into a format that can be used for analysis. Data transformation in Hadoop is completed using a process called MapReduce. MapReduce jobs can be written in a number of programming languages, including .NET, Java, Python, and Ruby, or can be system-generated by tools such as Hive (a SQL-like language for Hadoop that many data analysts would be immediately comfortable with) or Pig (a procedural scripting language for Hadoop). MapReduce allows us to take unstructured data and transform (map) it into something meaningful, and then aggregate (reduce) it for reporting. All of this happens in parallel across all nodes in the Hadoop cluster.

A simple example of MapReduce could map social media posts to a list of words and a count of their occurrences, and then reduce that list to a count of occurrences of each word per day. In a more complex example, we could use a dictionary in the map process to cleanse the social media posts, and then use a statistical model to determine the tone of an individual post.
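
That simple example can be sketched as a Hadoop-Streaming-style script in Python: the mapper reads one JSON post per line (assuming hypothetical "created_at" and "text" fields) and emits tab-separated "date,word" keys with a count of 1, and the reducer sums the counts per key from the mapper output. The script name is made up; run locally, something like `cat posts.json | python wordcount.py map | sort | python wordcount.py reduce` mimics what the cluster would do.

```python
import json
import sys
from collections import defaultdict

def run_mapper(stdin):
    # Emit "date,word \t 1" for every word in every post.
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        post = json.loads(line)
        date = post.get("created_at", "")[:10]   # assume an ISO-style date prefix
        for word in post.get("text", "").lower().split():
            print(f"{date},{word}\t1")

def run_reducer(stdin):
    # Sum the counts for every key (a dict works even if the input is not sorted).
    totals = defaultdict(int)
    for line in stdin:
        key, _, count = line.rstrip("\n").partition("\t")
        if key:
            totals[key] += int(count or 1)
    for key, total in sorted(totals.items()):
        print(f"{key}\t{total}")

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    if mode == "map":
        run_mapper(sys.stdin)
    else:
        run_reducer(sys.stdin)
```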

So Now What?

Once MapReduce has done its magic, the meaningful data now stored in Hadoop can be loaded into an existing enterprise business intelligence (BI) platform or analyzed directly using powerful self-service tools like PowerPivot and Power View. Customers utilizing SQL Server as their enterprise BI platform have a variety of options for accessing their Hadoop data, including Sqoop, SQL Server Integration Services, and PolyBase (SQL Server PDW 2012 only).

Hadoop and SQL Server

Having social media data loaded into an existing enterprise BI platform allows dashboards to be created that give at-a-glance information on how customers feel about a brand. Imagine how powerful it would be to visualize how customer sentiment is affecting top-line sales over time! This type of powerful analysis gives businesses the insight needed to quickly adapt, and it's all made possible through Hadoop.


Relationship between Small Data and Big Data

Small Data = Big Opportunity

According to a recent KPMG survey of 144 CIOs, 69% stated that data and analytics were crucial or very important to their business. However, 85% also said that they don’t know how to analyze the data they have already collected and 54% said their greatest barrier to success was an inability to identify the data worth collecting.

In short, many businesses have simply bitten off more than they can chew when it comes to Big Data. And with all the hype surrounding Big Data analytics, the small data – data that is small enough in size for each of us to understand – often gets overlooked. While there is no debate that big data analytics can provide a wealth of valuable information to a business or organization, it is the small data – the actionable data – that provides the real opportunity for businesses.

Think of it this way: Big Data analytics can provide overall trends – like how many people are purchasing "x" between 5 and 8 pm – while small data is more focused on what you, specifically, are buying during those same hours.

According to IBM, Big Data spans four dimensions: Volume, Velocity, Variety, and Veracity.
Volume defines the amount of data.
Velocity defines the real-time processing and recognition of the data.
Variety is the type of data – both structured and unstructured, and from multiple sources.
Veracity is the authenticity and/or accuracy of the data.

However, small data adds a fifth dimension: Value. Small data is more focused on the end-user – what they do, what they need and what they can do with this information.


Small Data vs. Big Data

The key differentiator between big data and small data is the targeted nature of the information and the fact that it can be acted upon easily and quickly.

Individuals leave a significant number of digital traces in a single day – from check-ins to Facebook likes and comments, tweets, web searches, Instagram and Pinterest postings, reviews, and email sign-ups. Meanwhile, many businesses already collect significant amounts of small data directly from the customer, such as sales receipts, surveys and customer loyalty (‘rewards’) cards.

The key is to use this information in a way that is actionable and more importantly adds value to both the end user and the business.

Small Data is at the heart of CRM

All of this collected small data is at the heart of CRM (customer relationship management). By combining insights from all of this data, businesses can create a rich profile of each customer and better inform, motivate and connect with them.

According to Digital Clarity Group, there are four key principles for using small data:

Make it Simple: keep it as singular in focus as possible and use pictures, charts and infographics to convey the information

Make it Smart: make sure results are repeatable and trusted

Be Responsive: provide customers with the information they need for wherever they are

Be Social: make sure information can be shared socially


For instance, Road Runner Sports has a shoe-dog app on its website in which it asks a few questions and provides recommendations on which running shoes I should consider. Unfortunately, it provides A LOT of options with no real way to distinguish which shoes are considered my “best” option.

In this example, small data could make a difference if the app took into account my previous shoe purchases and my reviews of those shoes before providing a recommendation. Additionally, if I tend to buy the same brand and their customer service experts believe I would benefit from another brand, perhaps they could offer me a discount as an incentive. Finally, it could limit the number of options by showing only the highest-rated shoes or only those in a specific price range.

Wearables also provide some interesting opportunities – particularly those related to fitness. Combining exercise data with nutrition information could yield better results for users, which in turn could keep a user more engaged with both the device and the app. For instance, I am currently training for a half marathon; it would be great if, after a run, my device (based on my workout metrics – distance, speed, heart rate, calories, etc.) offered me some suggestions related to nutrition, such as “be sure to consume ‘x’ amount of water, protein, carbs, etc. within 30 minutes.” Or perhaps, based on my food log, it could make sure I am consuming the right amount of food for my level of exercise.

Finally, how great would it be if, every time you called into a customer help desk, they had a record of your past calls, tweets, or Facebook messages, so the CSR could quickly say, “I see that you have called about ‘x’ issue multiple times – are you having the same issue or is this a new issue?” This simple step could easily defuse an angry customer and go a long way toward building trust.

Small data is personal. Small data is local. The goal is to turn all of this readily available information into action and improve the customer experience. The opportunities are endless and apply across all industry segments, with no business being too small to use data analytics. And remember that bigger is not always better.


Big Data Analysis Tools

Types of tools used in Big Data:

• Where is the processing hosted? – Distributed servers / cloud (e.g., Amazon EC2)

• Where is the data stored? – Distributed storage (e.g., Amazon S3)

• What is the programming model? – Distributed processing (e.g., MapReduce)

• How is the data stored and indexed? – High-performance, schema-free databases (e.g., MongoDB)

• What operations are performed on the data? – Analytic / semantic processing

1. Hadoop
You simply can't talk about big data without mentioning Hadoop. The Apache distributed data processing software is so pervasive that the terms "Hadoop" and "big data" are often used synonymously. The Apache Foundation also sponsors a number of related projects that extend the capabilities of Hadoop, and many of them are mentioned below. In addition, numerous vendors offer supported versions of Hadoop and related technologies. Operating System: Windows, Linux, OS X.

2. MapReduce
Originally developed by Google, the MapReduce website describes it as "a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes." It's used by Hadoop, as well as many other data processing applications. Operating System: OS Independent.

3. GridGain
GridGain offers an alternative to Hadoop's MapReduce that is compatible with the Hadoop Distributed File System. It offers in-memory processing for fast analysis of real-time data. You can download the open source version from GitHub or purchase a commercially supported version. Operating System: Windows, Linux, OS X.

4. HPCC
Developed by LexisNexis Risk Solutions, HPCC is short for "high performance computing cluster." It claims to offer superior performance to Hadoop. Both free community versions and paid enterprise versions are available. Operating System: Linux.

5. Storm
Now owned by Twitter, Storm offers distributed real-time computation capabilities and is often described as the "Hadoop of realtime." It's highly scalable, robust, fault-tolerant and works with nearly all programming languages. Operating System: Linux.

Databases/Data Warehouses

6. Cassandra
Originally developed by Facebook, this NoSQL database is now managed by the Apache Foundation. It's used by many organizations with large, active datasets, including Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco and Digg. Commercial support and services are available through third-party vendors. Operating System: OS Independent.

7. HBase
Another Apache project, HBase is the non-relational data store for Hadoop. Features include linear and modular scalability, strictly consistent reads and writes, automatic failover support and much more. Operating System: OS Independent.

8. MongoDB
MongoDB was designed to support humongous databases. It's a NoSQL database with document-oriented storage, full index support, replication and high availability, and more. Commercial support is available through 10gen. Operating system: Windows, Linux, OS X, Solaris.
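As a small illustration of the document-oriented model, the following sketch uses the PyMongo driver against a local MongoDB instance; the database, collection, and field names are made up for the example.

```python
# Minimal sketch of document-oriented storage in MongoDB via the PyMongo
# driver, assuming a mongod instance is reachable on localhost:27017.
# The database/collection/field names are illustrative only.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
posts = client["social"]["posts"]

# Documents are schema-free: each post carries whatever fields it has.
posts.insert_one({"user": "alice", "text": "Loving #bigdata", "likes": 42})

# Secondary index on a field we expect to query frequently.
posts.create_index("user")

# Query documents back; find() returns a cursor of dicts.
for doc in posts.find({"likes": {"$gte": 10}}):
    print(doc["user"], doc["text"])
```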

9. Neo4j
The "world’s leading graph database," Neo4j boasts performance improvements of up to 1000x or more versus relational databases. Interested organizations can purchase advanced or enterprise versions from Neo Technology. Operating System: Windows, Linux.

10. CouchDB
Designed for the Web, CouchDB stores data in JSON documents that you can access via the Web or query using JavaScript. It offers distributed scaling with fault-tolerant storage. Operating system: Windows, Linux, OS X, Android.

11. OrientDB
This NoSQL database can store up to 150,000 documents per second and can load graphs in just milliseconds. It combines the flexibility of document databases with the power of graph databases, while supporting features such as ACID transactions, fast indexes, native and SQL queries, and JSON import and export. Operating system: OS Independent.

12. Terrastore
Based on Terracotta, Terrastore boasts "advanced scalability and elasticity features without sacrificing consistency." It supports custom data partitioning, event processing, push-down predicates, range queries, map/reduce querying and processing and server-side update functions. Operating System: OS Independent.

13. FlockDB
Best known as Twitter's database, FlockDB was designed to store social graphs (i.e., who is following whom and who is blocking whom). It offers horizontal scaling and very fast reads and writes. Operating System: OS Independent.

14. Hibari
Used by many telecom companies, Hibari is a key-value, big data store with strong consistency, high availability and fast performance. Support is available through Gemini Mobile. Operating System: OS Independent.

15. Riak
Riak humbly claims to be "the most powerful open-source, distributed database you'll ever put into production." Users include Comcast, Yammer, Voxer, Boeing, SEOMoz, Joyent, Kiip.me, Formspring, the Danish Government and many others. Operating System: Linux, OS X.

16. Hypertable
This NoSQL database offers efficiency and fast performance that result in cost savings versus similar databases. The code is 100 percent open source, but paid support is available. Operating System: Linux, OS X.


Big Data Analytics for Security

This section explains how Big Data is changing the analytics landscape. In particular, Big Data analytics can be leveraged to improve information security and situational awareness. For example, Big Data analytics can be employed to analyze financial transactions, log files, and network traffic to identify anomalies and suspicious activities, and to correlate multiple sources of information into a coherent view.

Data-driven information security dates back to bank fraud detection and anomaly-based intrusion detection systems. Fraud detection is one of the most visible uses for Big Data analytics. Credit card companies have conducted fraud detection for decades. However, the custom-built infrastructure to mine Big Data for fraud detection was not economical to adapt for other fraud detection uses. Off-the-shelf Big Data tools and techniques are now bringing attention to analytics for fraud detection in healthcare, insurance, and other fields.

Application areas of Big Data analytics include homeland security, smarter healthcare, multi-channel sales, telecom, manufacturing, traffic control, trading analytics, and search quality.

In the context of data analytics for intrusion detection, the following evolution is anticipated:

● 1st generation: Intrusion detection systems – Security architects realized the need for layered security (e.g., reactive security and breach response) because a system with 100% protective security is impossible.

● 2nd generation: Security information and event management (SIEM) – Managing alerts from different intrusion detection sensors and rules was a big challenge in enterprise settings. SIEM systems aggregate and filter alarms from many sources and present actionable information to security analysts.


● 3rd generation: Big Data analytics in security (2nd-generation SIEM) – Big Data tools have the potential to provide a significant advance in actionable security intelligence by reducing the time for correlating, consolidating, and contextualizing diverse security event information, and also for correlating long-term historical data for forensic purposes.

Analyzing logs, network packets, and system events for forensics and intrusion detection has traditionally been a significant problem; however, traditional technologies fail to provide the tools to support long-term, large-scale analytics for several reasons:

1. Storing and retaining a large quantity of data was not economically feasible. As a result, most event logs and other recorded computer activity were deleted after a fixed retention period (e.g., 60 days).

2. Performing analytics and complex queries on large, structured data sets was inefficient because traditional tools did not leverage Big Data technologies.

3. Traditional tools were not designed to analyze and manage unstructured data. As a result, traditional tools had rigid, defined schemas. Big Data tools (e.g., Pig Latin scripts and regular expressions) can query data in flexible formats.

4. Big Data systems use cluster computing infrastructures. As a result, the systems are more reliable and available, and provide guarantees that queries on the systems are processed to completion.

New Big Data technologies, such as databases related to the Hadoop ecosystem and stream processing, are enabling the storage and analysis of large heterogeneous data sets at an unprecedented scale and speed. These technologies will transform security analytics by: (a) collecting data at a massive scale from many internal enterprise sources and external sources such as vulnerability databases; (b) performing deeper analytics on the data; (c) providing a consolidated view of security-related information; and (d) achieving real-time analysis of streaming data. It is important to note that Big Data tools still require system architects and analysts to have a deep knowledge of their system in order to properly configure the Big Data analysis tools.

Network Security

In a recently published case study, Zions Bancorporation announced that it is using Hadoop clusters and business intelligence tools to parse more data more quickly than with traditional SIEM tools. In their experience, the quantity of data and the frequency of analysis of events are too much for traditional SIEMs to handle alone. In their traditional systems, searching among a month's worth of data could take between 20 minutes and an hour. In their new Hadoop system running queries with Hive, they get the same results in about one minute.

The security data warehouse driving this implementation enables users to mine meaningful security information not only from sources such as firewalls and security devices, but also from website traffic, business processes and other day-to-day transactions. This incorporation of unstructured data and multiple disparate data sets into a single analytical framework is one of the main promises of Big Data.


Enterprise Events Analytics

Enterprises routinely collect terabytes of security-relevant data (e.g., network events, software application events, and people action events) for several reasons, including the need for regulatory compliance and post-hoc forensic analysis. Unfortunately, this volume of data quickly becomes overwhelming. Enterprises can barely store the data, much less do anything useful with it. For example, it is estimated that an enterprise as large as HP currently (in 2013) generates 1 trillion events per day, or roughly 12 million events per second. These numbers will grow as enterprises enable event logging in more sources, hire more employees, deploy more devices, and run more software. Existing analytical techniques do not work well at this scale and typically produce so many false positives that their efficacy is undermined. The problem becomes worse as enterprises move to cloud architectures and collect much more data. As a result, the more data that is collected, the less actionable information is derived from the data.

The goal of a recent research effort at HP Labs is to move toward a scenario where more data leads to better analytics and more actionable information (Manadhata, Horne, & Rao, forthcoming). To do so, algorithms and systems must be designed and implemented in order to identify actionable security information from large enterprise data sets and drive false positive rates down to manageable levels. In this scenario, the more data that is collected, the more value can be derived from the data. However, many challenges must be overcome to realize the true potential of Big Data analysis.

Among these challenges are the legal, privacy, and technical issues regarding scalable data collection, transport, storage, analysis, and visualization.

Despite the challenges, the group at HP Labs has successfully addressed several Big Data analytics for security challenges, some of which are highlighted in this section. First, a large-scale graph inference approach was introduced to identify malware-infected hosts in an enterprise network and the malicious domains accessed by the enterprise's hosts. Specifically, a host-domain access graph was constructed from large enterprise event data sets by adding edges between every host in the enterprise and the domains visited by the host. The graph was then seeded with minimal ground truth information from a black list and a white list, and belief propagation was used to estimate the likelihood that a host or domain is malicious. Experiments on a 2 billion HTTP request data set collected at a large enterprise, a 1 billion DNS request data set collected at an ISP, and a 35 billion network intrusion detection system alert data set collected from over 900 enterprises worldwide showed that high true positive rates and low false positive rates can be achieved with minimal ground truth information (that is, having limited data labeled as normal events or attack events used to train anomaly detectors).
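The sketch below is a highly simplified, single-machine illustration of the host-domain graph idea (not the HP Labs system): it builds the access graph from a few toy (host, domain) pairs, seeds it with hypothetical black and white lists, and uses a basic neighbor-averaging loop as a crude stand-in for belief propagation. All data and parameters are invented for the example.

```python
# Minimal sketch (not the HP Labs system): build a host-domain access graph
# from access events and spread maliciousness scores from seed nodes with a
# simple iterative propagation, a crude stand-in for belief propagation.
import networkx as nx

# (host, domain) access pairs; in practice these come from proxy/DNS logs.
events = [
    ("host1", "update.example.com"),
    ("host1", "evil-c2.example.net"),
    ("host2", "evil-c2.example.net"),
    ("host2", "news.example.org"),
    ("host3", "news.example.org"),
]
blacklist = {"evil-c2.example.net"}   # seed: known-bad domains (illustrative)
whitelist = {"news.example.org"}      # seed: known-good domains (illustrative)

graph = nx.Graph()
graph.add_edges_from(events)

# Initial scores: 1.0 = malicious, 0.0 = benign, 0.5 = unknown.
score = {n: 0.5 for n in graph}
score.update({d: 1.0 for d in blacklist})
score.update({d: 0.0 for d in whitelist})

# Iteratively pull every unseeded node toward the mean score of its neighbors.
for _ in range(20):
    new_score = dict(score)
    for node in graph:
        if node in blacklist or node in whitelist:
            continue  # keep seed labels fixed
        neighbors = list(graph.neighbors(node))
        new_score[node] = sum(score[n] for n in neighbors) / len(neighbors)
    score = new_score

for node, s in sorted(score.items(), key=lambda kv: -kv[1]):
    print(f"{node:22s} suspicion={s:.2f}")
```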

Second, terabytes of DNS events consisting of billions of DNS requests and responses collected at an ISP were analyzed. The goal was to use the rich source of DNS information to identify botnets, malicious domains, and other malicious activities in a network. Specifically, features that are indicative of maliciousness were identified. For example, malicious fast-flux domains tend to last for a short time, whereas good domains such as hp.com last much longer and resolve to many geographically-distributed IPs. A varied set of features were computed, including ones derived from domain names, time stamps, and DNS response time-to-live values. Then, classification techniques (e.g., decision trees and support vector machines) were used to identify infected hosts and malicious domains. The analysis has already identified many malicious activities from the ISP data set.
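A toy illustration of this feature-plus-classifier approach is sketched below using scikit-learn; the features, domains, and labels are invented for the example and are not the ISP study's data.

```python
# Minimal sketch of feature-based domain classification with scikit-learn.
# The features (name length, digit ratio, TTL, distinct resolved IPs) and
# the toy labelled examples are illustrative, not the ISP study's data.
from sklearn.tree import DecisionTreeClassifier

def features(domain: str, ttl: int, distinct_ips: int) -> list:
    digits = sum(ch.isdigit() for ch in domain)
    return [len(domain), digits / len(domain), ttl, distinct_ips]

# Tiny labelled training set: 1 = malicious, 0 = benign (illustrative).
train = [
    (features("x9f3-update-now.biz", ttl=60, distinct_ips=45), 1),
    (features("qw1zzk2.info", ttl=30, distinct_ips=60), 1),
    (features("hp.com", ttl=86400, distinct_ips=8), 0),
    (features("example.org", ttl=43200, distinct_ips=4), 0),
]
X, y = zip(*train)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(list(X), list(y))

# Score a previously unseen domain observed in the DNS logs.
candidate = features("a1b2c3-flux.net", ttl=45, distinct_ips=70)
print("malicious" if clf.predict([candidate])[0] == 1 else "benign")
```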

Netflow Monitoring to Identify Botnets

This section summarizes the BotCloud research project, which leverages the MapReduce paradigm for analyzing enormous quantities of Netflow data to identify infected hosts participating in a botnet (François et al., 2011, November). The rationale for using MapReduce for this project stemmed from the large amount of Netflow data collected for analysis: 720 million Netflow records (77 GB) were collected in only 23 hours. Processing this data with traditional tools is challenging. However, Big Data solutions like MapReduce greatly enhance analytics by enabling an easy-to-deploy distributed computing paradigm.

BotCloud relies on BotTrack, which examines host relationships using a combination of PageRank and clustering algorithms to track the command-and-control (C&C) channels in the botnet (François et al., 2011, May). Botnet detection is divided into the following steps: dependency graph creation, the PageRank algorithm, and DBSCAN clustering. The dependency graph was constructed from Netflow records by representing each host (IP address) as a node. There is an edge from node A to B if, and only if, there is at least one Netflow record having A as the source address and B as the destination address. PageRank discovers patterns in this graph (assuming that P2P communications between bots have similar characteristics, since the bots are involved in the same type of activities), and the clustering phase then groups together hosts having the same pattern. Since PageRank is the most resource-consuming part, it is the only step implemented in MapReduce.
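The following single-machine sketch illustrates the same pipeline on toy data: a dependency graph is built from (source, destination) Netflow pairs, PageRank scores are computed, and hosts are grouped with DBSCAN. The flows and parameters are illustrative assumptions; in BotCloud the PageRank step itself runs as a MapReduce job over hundreds of millions of flows.

```python
# Single-machine sketch of a BotTrack-style pipeline: build a dependency
# graph from Netflow (src, dst) pairs, run PageRank, then cluster hosts by
# their scores with DBSCAN. Flows, eps, and min_samples are illustrative.
import networkx as nx
import numpy as np
from sklearn.cluster import DBSCAN

# Toy Netflow records as (source IP, destination IP) pairs.
flows = [
    ("10.0.0.1", "10.0.0.9"), ("10.0.0.2", "10.0.0.9"),
    ("10.0.0.3", "10.0.0.9"), ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.2", "10.0.0.3"), ("10.0.0.3", "10.0.0.1"),
    ("10.0.0.7", "10.0.0.8"),
]

graph = nx.DiGraph()
graph.add_edges_from(flows)

# PageRank over the graph and its reverse characterizes each host's role.
pr_in = nx.pagerank(graph, alpha=0.85)
pr_out = nx.pagerank(graph.reverse(copy=True), alpha=0.85)

hosts = sorted(graph.nodes())
X = np.array([[pr_in[h], pr_out[h]] for h in hosts])

# Group hosts with similar link patterns; label -1 means "no cluster".
labels = DBSCAN(eps=0.05, min_samples=2).fit_predict(X)
for host, label in zip(hosts, labels):
    print(f"{host}: cluster {label}")
```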

BotCloud used a small Hadoop cluster of 12 commodity nodes (11 slaves + 1 master): 6 Intel Core 2 Duo 2.13GHz nodes with 4 GB of memory and 6 Intel Pentium 4 3GHz nodes with 2GB of memory. The dataset contained about 16 million hosts and 720 million Netflow records. This leads to a dependency graph of 57 million edges.

The number of edges in the graph is the main parameter affecting the computational complexity. Since scores are propagated through the edges, the number of intermediate MapReduce key-value pairs depends on the number of links. Figure 5 shows the time to complete an iteration for different numbers of edges and cluster sizes.

Figure 5. Average execution time for a single PageRank iteration.


The results demonstrate that the time for analyzing the complete dataset (57 million edges) was reduced by a factor of seven by this small Hadoop cluster. Full results (including the accuracy of the algorithm for identifying botnets) are described in François et al. (2011, May).

Advanced Persistent Threats Detection

An Advanced Persistent Threat (APT) is a targeted attack against a high-value asset or a physical system. In contrast to mass-spreading malware, such as worms, viruses, and Trojans, APT attackers operate in “low-and-slow” mode. “Low mode” maintains a low profile in the networks and “slow mode” allows for long execution time. APT attackers often leverage stolen user credentials or zero-day exploits to avoid triggering alerts. As such, this type of attack can take place over an extended period of time while the victim organization remains oblivious to the intrusion.

The 2010 Verizon data breach investigation report concludes that in 86% of the cases, evidence about the data breach was recorded in the organization logs, but the detection mechanisms failed to raise security alarms (Verizon, 2010).

APTs are among the most serious information security threats that organizations face today. A common goal of an APT is to steal intellectual property (IP) from the targeted organization, to gain access to sensitive customer data, or to access strategic business information that could be used for financial gain, blackmail, embarrassment, data poisoning, illegal insider trading, or disruption of an organization's business. APTs are operated by highly skilled, well-funded and motivated attackers targeting sensitive information from specific organizations and operating over periods of months or years. APTs have become very sophisticated and diverse in the methods and technologies used, particularly in the ability to use organizations' own employees to penetrate the IT systems by using social engineering methods. They often trick users into opening spear-phishing messages that are customized for each victim (e.g., emails, SMS, and PUSH messages) and then downloading and installing specially crafted malware that may contain zero-day exploits (Verizon, 2010; Curry et al., 2011; Alperovitch, 2011).

Today, detection relies heavily on the expertise of human analysts to create custom signatures and perform manual investigation. This process is labor-intensive, difficult to generalize, and not scalable. Existing anomaly detection proposals commonly focus on obvious outliers (e.g., volume-based), but are ill-suited for stealthy APT attacks and suffer from high false positive rates.

Big Data analysis is a suitable approach for APT detection. A challenge in detecting APTs is the massive amount of data to sift through in search of anomalies. The data comes from an ever-increasing number of diverse information sources that have to be audited. This massive volume of data makes the detection task look like searching for a needle in a haystack (Giura & Wang, 2012). Due to the volume of data, traditional network perimeter defense systems can become ineffective in detecting targeted attacks, and they are not scalable to the increasing size of organizational networks. As a result, a new approach is required. Many enterprises collect data about users' and hosts' activities within the organization's network, as logged by firewalls, web proxies, domain controllers, intrusion detection systems, and VPN servers. While this data is typically used for compliance and forensic investigation, it also contains a wealth of information about user behavior that holds promise for detecting stealthy attacks.

Beehive: Behavior Profiling for APT Detection

At RSA Labs, the observation about APTs is that, however subtle the attack might be, the attacker’s behavior (in attempting to steal sensitive information or subvert system operations) should cause the compromised user’s actions to deviate from their usual pattern. Moreover, since APT attacks consist of multiple stages (e.g., exploitation, command-and-control, lateral movement, and objectives), each action by the attacker provides an opportunity to detect behavioral deviations from the norm.

Correlating these seemingly independent events can reveal evidence of the intrusion, exposing stealthy attacks that could not be identified with previous methods.

These detectors of behavioral deviations are referred to as “anomaly sensors,” with each sensor examining one aspect of the host’s or user’s activities within an enterprise’s network. For instance, a sensor may keep track of the external sites a host contacts in order to identify unusual connections (potential command-and-control channels), profile the set of machines each user logs into to find anomalous access patterns (potential “pivoting” behavior in the lateral movement stage), study users’ regular working hours to flag suspicious activities in the middle of the night, or track the flow of data between internal hosts to find unusual “sinks” where large amounts of data are gathered (potential staging servers before data exfiltration).
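To make the idea of an anomaly sensor concrete, the sketch below implements a toy "working hours" sensor in Python. The log format, window, and threshold are illustrative assumptions, not the RSA Beehive implementation.

```python
# Toy "working hours" anomaly sensor of the kind described above: it learns
# each user's typical login hours from historical logs and flags activity
# far outside that window. Log format and threshold are illustrative.
from collections import defaultdict
from statistics import mean, pstdev

# Historical events: (user, hour-of-day of an authentication event).
history = [
    ("alice", 9), ("alice", 10), ("alice", 11), ("alice", 14), ("alice", 17),
    ("bob", 8), ("bob", 9), ("bob", 9), ("bob", 16), ("bob", 18),
]

profiles = defaultdict(list)
for user, hour in history:
    profiles[user].append(hour)

def is_anomalous(user: str, hour: int, z_threshold: float = 2.5) -> bool:
    """Flag an event whose hour is far from the user's historical mean."""
    hours = profiles.get(user)
    if not hours or pstdev(hours) == 0:
        return False  # not enough history to judge
    z = abs(hour - mean(hours)) / pstdev(hours)
    return z > z_threshold

# New events streaming in from the log pipeline (illustrative).
for user, hour in [("alice", 3), ("bob", 10)]:
    if is_anomalous(user, hour):
        print(f"ALERT: unusual activity for {user} at {hour:02d}:00")
```

In a Beehive-like setting, many such sensors (unusual destinations, anomalous logins, off-hours activity, unusual data sinks) would run side by side, and their combined output would be what analysts review.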

While the triggering of one sensor indicates the presence of a singular unusual activity, the triggering of multiple sensors suggests more suspicious behavior. The human analyst is given the flexibility of combining multiple sensors according to known attack patterns (e.g., command-and-control communications followed by lateral movement) to look for abnormal events that may warrant investigation or to generate behavioral reports of a given user’s activities across time.

The prototype APT detection system at RSA Labs is named Beehive. The name refers to the multiple weak components (the "sensors") that work together to achieve a goal (APT detection), just as bees with differentiated roles cooperate to maintain a hive. Preliminary results showed that Beehive is able to process a day's worth of data (around a billion log messages) in an hour, and that it identified policy violations and malware infections that would otherwise have gone unnoticed (Yen et al., 2013).

In addition to detecting APTs, behavior profiling also supports other applications, including IT management (e.g., identifying critical services and unauthorized IT infrastructure within the organization by examining usage patterns), and behavior-based authentication (e.g., authenticating users based on their interaction with other users and hosts, the applications they typically access, or their regular working hours). Thus, Beehive provides insights into an organization’s environment for security and beyond.


Using Large-Scale Distributed Computing to Unveil APTs

Although an APT itself is not a large-scale exploit, the detection method should use large-scale methods and close-to-target monitoring algorithms in order to be effective and to cover all possible attack paths. In this regard, a successful APT detection methodology should model the APT as an attack pyramid, as introduced by Giura & Wang (2012). An attack pyramid should have the possible attack goal (e.g., sensitive data, high rank employees, and data servers) at the top and lateral planes representing the environments where the events associated with an attack can be recorded (e.g., user plane, network plane, application plane, or physical plane). The detection framework proposed by Giura & Wang groups all of the events recorded in an organization that could potentially be relevant for security using flexible correlation rules that can be redefined as the attack evolves. The framework implements the detection rules (e.g., signature based, anomaly based, or policy based) using various algorithms to detect possible malicious activities within each context and across contexts using a MapReduce paradigm.

There is no doubt that the data used as evidence of attacks is growing in volume, velocity, and variety, and that attacks are increasingly difficult to detect. In the case of APTs, there is no known bad item that an IDS could pick up or that could be found in traditional information retrieval systems or databases. By using a MapReduce implementation, an APT detection system can more efficiently handle highly unstructured data with arbitrary formats, captured by many types of sensors (e.g., Syslog, IDS, firewall, NetFlow, and DNS) over long periods of time. Moreover, the massively parallel processing mechanism of MapReduce can support much more sophisticated detection algorithms than the traditional SQL-based data systems that are designed for transactional workloads with highly structured data. Additionally, with MapReduce, users have the power and flexibility to incorporate any detection algorithms into the Map and Reduce functions. The functions can be tailored to work with specific data and make the distributed computing details transparent to the users. Finally, exploring the use of large-scale distributed systems has the potential to help analyze more data at once, to cover more attack paths and possible targets, and to reveal unknown threats in a context closer to the target, as is the case with APTs.

The WINE Platform for Experimenting with Big Data Analytics in Security

The Worldwide Intelligence Network Environment (WINE) provides a platform for conducting data analysis at scale, using field data collected at Symantec (e.g., anti-virus telemetry and file downloads), and promotes rigorous experimental methods (Dumitras & Shoue, 2011). WINE loads, samples, and aggregates data feeds originating from millions of hosts around the world and keeps them up-to-date. This allows researchers to conduct open-ended, reproducible experiments in order to, for example, validate new ideas on real-world data, conduct empirical studies, or compare the performance of different algorithms against reference data sets archived in WINE. WINE is currently used by Symantec’s engineers and by academic researchers.


WINE Analysis Example: Determining the Duration of Zero-Day Attacks

A zero-day attack exploits one or more vulnerabilities that have not been disclosed publicly. Knowledge of such vulnerabilities enables cyber criminals to attack any target undetected, from Fortune 500 companies to millions of consumer PCs around the world. The WINE platform was used to measure the duration of 18 zero-day attacks by combining the binary reputation and anti-virus telemetry data sets and by analyzing field data collected on 11 million hosts worldwide (Bilge & Dumitras, 2012). These attacks lasted between 19 days and 30 months, with a median of 8 months and an average of approximately 10 months (Figure 6). Moreover, 60% of the vulnerabilities identified in this study had not been previously identified as exploited in zero-day attacks. This suggests that such attacks are more common than previously thought. These insights have important implications for future security technologies because they focus attention on the attacks and vulnerabilities that matter most in the real world.

Figure 6. Analysis of zero-day attacks that go undetected.

The outcome of this analysis highlights the importance of Big Data techniques for security research. For more than a decade, the security community suspected that zero-day attacks are undetected for long periods of time, but past studies were unable to provide statistically significant evidence of this phenomenon. This is because zero-day attacks are rare events that are unlikely to be observed in honeypots or in lab experiments. For example, most of the zero-day attacks in the study showed up on fewer than 150 hosts out of the 11 million analyzed. Big Data platforms such as WINE provide unique insights about advanced cyber attacks and open up new avenues of research on next-generation security technologies.

Projects on Big Data


Areas of Big Data Applications in Action:

Consumer product companies and retail organizations are monitoring social media like Facebook and Twitter to get an unprecedented view into customer behavior, preferences, and product perception.

Manufacturers are monitoring minute vibration data from their equipment, which changes slightly as it wears down, to predict the optimal time to replace or maintain it. Replacing it too soon wastes money; replacing it too late triggers an expensive work stoppage.

Manufacturers are also monitoring social networks, but with a different goal than marketers: They are using it to detect aftermarket support issues before a warranty failure becomes publicly detrimental.

Financial Services organizations are using data mined from customer interactions to slice and dice their users into finely tuned segments. This enables these financial institutions to create increasingly relevant and sophisticated offers.

Advertising and marketing agencies are tracking social media to understand responsiveness to campaigns, promotions, and other advertising mediums.

Insurance companies are using Big Data analysis to see which home insurance applications can be immediately processed, and which ones need a validating in-person visit from an agent.

By embracing social media, retail organizations are engaging brand advocates, changing the perception of brand antagonists, and even enabling enthusiastic customers to sell their products.

Hospitals are analyzing medical data and patient records to predict those patients that are likely to seek readmission within a few months of discharge. The hospital can then intervene in hopes of preventing another costly hospital stay.

Web-based businesses are developing information products that combine data gathered from customers to offer more appealing recommendations and more successful coupon programs.

The government is making data public at the national, state, and city levels for users to develop new applications that can generate public good.

Sports teams are using data for tracking ticket sales and even for tracking team strategies.

Future of Big Data


According to global research firm Gartner, by 2015 nearly 4.4 million new jobs will be created globally by ‘Big Data’ demand, and only one-third of them will be filled. India, along with China, could be one of the biggest suppliers of manpower for the ‘Big Data’ industry. This is one of the "Top Predictions for 2013: Balancing Economics, Risk, Opportunity and Innovation" that Gartner released in Chennai. ‘Big Data’ spending is expected to exceed $130 billion by 2015, generating jobs. Companies and professionals should seek to acquire relevant skills to deal with high-volume, real-time data.

Every day 2.5 quintillion bytes of data are created; so much that 90 per cent of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. Specialists are required to analyse this ‘Big Data’ in a company, either to take corrective actions or to predict future trends.

Advanced information management/analytical skills and business expertise are growing in importance.
