Slide 1 © 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com HDFS and Big Data TDD Using PigUnit

Hadoop & Test Driven Development via Pig-Unit

Session Objectives

ᗍ Introduction to BIG Data & Hadoop

ᗍ Understand HDFS concepts

ᗍ Understand Big Data TDD Using PigUnit

ᗍ BIG Data & Hadoop Course Syllabus

ᗍ Webinar by Skillspeed

Get Started with BIG Data & Hadoop

Big Data and its Challenges

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

Systems and enterprises generate huge amounts of data, ranging from terabytes to petabytes of information.

Managing data at this scale is very difficult.

Who Generates Big Data?

Have you ever wondered how Google, Facebook or LinkedIn manage to store and use such huge volumes of data?

Today, managing such Big Data is becoming a problem for all of us.

Hadoop can be used to process such huge volumes of data easily.

We will see how.

Before that, let’s understand what Hadoop is.

Hadoop and its Characteristics

Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

It is an open-source data management technology with scale-out storage and distributed processing.

Hadoop Characteristics

ᗍ Flexible

ᗍ Reliable

ᗍ Economical

ᗍ Scalable

Hadoop Ecosystem

ᗍ Flume: import or export of unstructured or semi-structured data

ᗍ Sqoop: import or export of structured data

ᗍ Apache Oozie: workflow

ᗍ Pig Latin: data analysis

ᗍ Hive: DW system

ᗍ MapReduce framework

ᗍ HBase

ᗍ Other YARN frameworks (MPI, Giraph)

ᗍ YARN: cluster resource management

ᗍ HDFS (Hadoop Distributed File System)

HDFS

HDFS and its Components

Hadoop Distributed File System (HDFS) is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework. It has the following two components:

NameNode

ᗍ Storage-side master of the system

ᗍ It maintains, manages, and administers the data blocks present on the DataNodes

DataNodes

ᗍ Slave machines which provide the actual and redundant storage

ᗍ End points for client read and write operations

HDFS Architecture

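[Diagram: Clients send metadata ops to the NameNode, which holds the metadata (name, replicas, ...), e.g. /home/foo/data, 3,… Clients read and write blocks on the DataNodes, which sit in Rack 1 and Rack 2. The NameNode sends block ops to the DataNodes, and blocks are replicated between DataNodes.]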

HDFS NameNode

Keeps metadata in main memory

ᗍ The entire metadata is held in main memory

ᗍ FS metadata is served from memory rather than read from hard disk on each request

Metadata type

ᗍ Files in HDFS

ᗍ Data blocks for each file

ᗍ DataNodes for each block

ᗍ File attributes, e.g. access time, replication factor, access control
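
The metadata types listed above can be pictured as a few in-memory lookup tables. The following is a minimal sketch, not how HDFS is actually implemented: the class `ToyNameNode`, its methods, the DataNode names and the round-robin block placement are all assumptions made for illustration (real HDFS placement is rack-aware). Only the example path and the replication factor of 3 are taken from the architecture slide.

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    """Per-file attributes the NameNode tracks (illustrative subset)."""
    replication: int
    blocks: list = field(default_factory=list)  # ordered block IDs


class ToyNameNode:
    """Hypothetical model: holds metadata only; block contents live on DataNodes."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.files = {}            # path -> FileEntry
        self.block_locations = {}  # block ID -> list of DataNode names
        self._next_block = 0

    def add_file(self, path, num_blocks, replication=3):
        if replication > len(self.datanodes):
            raise ValueError("not enough DataNodes for requested replication")
        entry = FileEntry(replication=replication)
        for _ in range(num_blocks):
            block_id = f"blk_{self._next_block}"
            # Assumed round-robin placement, chosen only to keep the sketch short.
            start = self._next_block % len(self.datanodes)
            replicas = [self.datanodes[(start + i) % len(self.datanodes)]
                        for i in range(replication)]
            self.block_locations[block_id] = replicas
            entry.blocks.append(block_id)
            self._next_block += 1
        self.files[path] = entry
        return entry

    def locate(self, path):
        """Return [(block ID, [DataNodes])] so a client knows where to read."""
        return [(b, self.block_locations[b]) for b in self.files[path].blocks]


namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
namenode.add_file("/home/foo/data", num_blocks=2, replication=3)
for block_id, nodes in namenode.locate("/home/foo/data"):
    print(block_id, nodes)
```

In this sketch a client asks the NameNode only for locations and would then read the blocks from the listed DataNodes, which matches the split between metadata operations and read/write operations described for the two components.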

Secondary NameNode

ᗍ In Hadoop 1.0, it is not a hot standby for the NameNode

ᗍ By default, it connects to the NameNode every hour

ᗍ It does housekeeping and keeps a backup of the NameNode metadata

ᗍ The saved metadata can be used to rebuild a failed NameNode

[Diagram: the Secondary NameNode pulls metadata from the NameNode, saying “I’ll take metadata every hour and will make it secure”.]

Big Data TDD Using PigUnit

What is TDD?

ᗍ TDD stands for Test Driven Development

ᗍ Test Driven Development aims to shorten development cycles

ᗍ It follows a “get something working now and perfect it later” approach

ᗍ The typical process is the “RED-GREEN-REFACTOR” cycle: write a failing test, make it pass, then clean up the code

ᗍ It is part of a larger software development methodology, “Extreme Programming”

ᗍ Test Driven Development requires tests to be written before code itself!

ᗍ It leads to better code that is just enough to pass the tests

ᗍ Code coverage approaches 100% for TDD-based code, because code is only written to satisfy a test
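
To make the RED-GREEN-REFACTOR cycle concrete, here is a minimal sketch using Python's standard `unittest` module. The function `word_count` and the test class `WordCountTest` are hypothetical names chosen for illustration; they are not part of the original webinar.

```python
import unittest


# Step 1 (RED): the tests are written first. Run before word_count() exists,
# they fail, which is the expected starting point of the cycle.
class WordCountTest(unittest.TestCase):
    def test_counts_repeated_words(self):
        self.assertEqual(word_count("big data big"), {"big": 2, "data": 1})

    def test_empty_input_gives_empty_result(self):
        self.assertEqual(word_count(""), {})


# Step 2 (GREEN): just enough code to make the tests pass.
# Step 3 (REFACTOR): tidy the code while the tests stay green.
def word_count(text):
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


# Run the two tests explicitly so the result object can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(WordCountTest)
result = unittest.TextTestRunner().run(suite)
```

Deleting the body of `word_count` and re-running shows the RED state again, which is a quick way to confirm that the tests really exercise the code.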

I can’t follow TDD because…

ᗍ “It’s working! Let’s freeze it for now”

ᗍ The release date is quite aggressive!

ᗍ It slows down our development cycle

ᗍ We are already short-staffed

ᗍ What are testers supposed to do?

All of the reasons above (and possibly more) lead teams into “Technical Debt”.

“The most powerful force in the universe is compound interest”

- Albert Einstein

Time Taken to Fix Bugs

[Chart: time taken to fix bugs across the Design, Implementation, QA and Post-release phases; vertical axis from 0 to 1000.]

Traditional Development

Design → Implement → Test

TDD

Design → Test → Implement → Test

[Diagram: the TDD steps Design, Test, Implement, Test arranged as a repeating cycle.]

Why Unit Test Pig?

ᗍ Pig is NOT a programming language

ᗍ Pig is a Data Flow Language

ᗍ It just converts the Pig Latin data flow to Map-Reduce jobs

ᗍ The best use-case for Pig in Big Data projects is for “Data Factory” operations

ᗍ Since we are not talking about a “programming language”, does testing make sense?

ᗍ Pig already comes with diagnostic operators (DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE), so extra testing would be overhead!

Arguments like these lead to even bigger problems, because testing in the Big Data world is data-driven in nature.

What is PigUnit?

ᗍ PigUnit is the unit testing framework for Pig scripts

ᗍ It is not really an xUnit-style framework in its own right

ᗍ It’s a library which can be used within JUnit tests to:

• Run Pig scripts from within JUnit tests

• Override variables in Pig scripts to provide data from tests rather than from external sources such as HDFS

• Inspect the values of your Pig script relations

• Make your STORE statements into no-ops so that your Pig scripts run without side effects
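
PigUnit itself is a Java library driven from JUnit, so the following Python sketch is not PigUnit code and does not use its API. It only mimics the pattern behind the capabilities listed above: the test supplies the input instead of HDFS, the data flow runs inside the test, a named relation is inspected, and the store step is replaced by a no-op. The function `top_queries_flow`, the relation names `data` and `queries_limit`, and the sample queries are all hypothetical.

```python
from collections import Counter


def top_queries_flow(load, store, limit=2):
    """Hypothetical stand-in for a Pig script: LOAD, GROUP and COUNT, ORDER, LIMIT, STORE."""
    relations = {}
    relations["data"] = list(load())                  # input comes from the caller
    counts = Counter(relations["data"])               # group by query and count
    # Order by count descending, then by query text so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    relations["queries_limit"] = ordered[:limit]      # keep the top `limit` rows
    store(relations["queries_limit"])                 # side effect is injectable
    return relations


# In the test, input is provided by the test itself and the store step is a no-op.
test_input = ["hadoop", "pig", "hadoop", "hive", "pig", "hadoop"]
relations = top_queries_flow(load=lambda: test_input, store=lambda rows: None)

# Inspect a relation by name, as a test assertion would.
print(relations["queries_limit"])
```

Because both the input and the store step are passed in, the same flow can run against real storage in production and against a handful of in-memory rows in a unit test.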

Why SkillSpeed?

ᗍ Course curriculum from industry experts

ᗍ Instructor-led live virtual sessions

ᗍ Lifetime access to course content via LMS

ᗍ 100% placement assistance

ᗍ 24x7 support

Course Topics

Module 1: Introduction to Big Data and Hadoop

Module 2: HDFS Internals, Hadoop Configurations and Data Loading

Module 3: Introduction to Map Reduce

Module 4: Advanced Map Reduce Concepts

Module 5: Introduction to Pig

Module 6: Advanced Pig and Introduction to Hive

Module 7: Advanced Hive Concepts

Module 8: Extending Hive and HBase Introduction

Module 9: Advanced HBase and Oozie Introduction

Module 10: Project Set-up Discussion

Contact us

To know more about the course, please contact:

IND +91-90660-20904, USA 1866-607-6547 (Toll Free). Lines open 24/7.

Or reach us at [email protected]

References

http://en.wikipedia.org/wiki/Albert_Einstein

http://www.lincs.fr/research/areas/big-data/

Google Images: credit for the Google, Facebook and LinkedIn logos and snapshots

http://www.counsellingpages.co.uk/

http://langfordsconsultancy.com/langfords-training-support-package/

http://cbsepathshala.blogspot.in/2012/05/physics-class-x-chapter-electricity.html

http://mmatycoon.com/tycoontimes/tycoontimesstory.php?SID=1010
