Anand Hegde
Prerna Shraff
Performance Analysis of Lucene Index on HBase Environment
Group #13
Overview
•HBase vs BigTable
•The Problem
• Implementation
•Performance Analysis
•Survey
•Conclusion
HBase vs BigTable
BigTable
• Compressed, high-performance database system from Google
• Built on GFS, using the Chubby lock service, the SSTable file format, etc.
HBase
• Hadoop Database
• Open-source, distributed, versioned, column-oriented
•Modeled after BigTable
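Like BigTable, HBase stores data as a sparse, sorted map: row key → column family:qualifier → timestamped versions. A toy sketch of that data model in plain Python (illustrative only, not the real HBase client API):

```python
# Illustrative model of the BigTable/HBase data layout:
# row key -> "family:qualifier" -> {timestamp: value}.
# This is a toy sketch, not the HBase client API.
from collections import defaultdict

class ToyTable:
    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        self.rows[row][column][timestamp] = value

    def get(self, row, column):
        """Return the newest version, as HBase does by default."""
        versions = self.rows[row][column]
        return versions[max(versions)]

table = ToyTable()
table.put("doc00000004", "md:Title", "Geoffrey C. Fox Papers Collection 1990", timestamp=1)
table.put("doc00000004", "md:Title", "Geoffrey C. Fox Papers Collection", timestamp=2)
print(table.get("doc00000004", "md:Title"))  # newest version wins
```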
The Problem
•Data-intensive computing requires storage solutions for huge amounts of data.
•The requirement is to host very large tables on clusters of commodity hardware.
•HBase provides BigTable like capabilities on top of Hadoop.
•Current implementation in this field includes an experiment using Lucene Index on HBase in an HPC Environment. (Xiaoming Gao, Vaibhav Nachankar, Judy Qiu)
Architecture
Implementation
•Configured Hadoop and HBase on the Alamo cluster.
•Added scripts to run the program sequentially on multiple nodes.
•Modified scripts to record the size of the table.
•Modified scripts to record execution time for both sequential and parallel runs.
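One way to record the size of an HBase table directory is the `hadoop fs -du` shell command listed in the references, whose output pairs a byte count with each path. A hedged sketch that sums such output; the sample text and paths below are hypothetical, and a real script would capture the output via e.g. `subprocess.check_output(["hadoop", "fs", "-du", "/hbase/mytable"])`:

```python
# Sum the per-path sizes reported by `hadoop fs -du` for a table
# directory. Sample output and HDFS paths are hypothetical placeholders.
sample_du_output = """\
Found 3 items
12345678   hdfs://namenode/hbase/mytable/region-a
23456789   hdfs://namenode/hbase/mytable/region-b
3456       hdfs://namenode/hbase/mytable/.tableinfo
"""

def total_bytes(du_output):
    total = 0
    for line in du_output.splitlines():
        parts = line.split()
        if parts and parts[0].isdigit():  # skip the "Found N items" header
            total += int(parts[0])
    return total

print(total_bytes(sample_du_output))  # 35805923
```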
Performance Analysis
•Sequential execution on the same number of nodes with different data sizes.
•Sequential execution on different numbers of data nodes with the same data size.
•Parallel execution on the same number of nodes with different data sizes.
Analysis details
• Performed analysis on Alamo cluster on FutureGrid
• System type: Dell PowerEdge
•No. of CPUs: 192
•No. of cores: 768
• 3 ZooKeeper nodes + 1 HDFS-Master + 1 HBase-master
Analysis details
Sample data records (###-delimited fields):
00000004###md###Title###Geoffrey C. Fox Papers Collection 1990
00000004###md###Category###paper, proceedings collection
00000004###md###Authors###Geoffrey C. Fox, others
00000004###md###CreatedYear###1990
00000004###md###Publishers###California Institute of Technology CA
00000004###md###Location###California Institute of Technology CA
00000004###md###StartPage###1
00000004###md###CurrentPage###105
00000004###md###Additional###This is a paper collection of Geoffrey C. Fox
00000004###md###DirPath###Proceedings in a collection of papers from one conference/Fox
00000005###md###Title###C3P Related Papers - T.Barnes
00000005###md###Category###paper, proceedings collection
00000005###md###Authors###T.Barnes, others
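The records above carry four `###`-delimited fields: a document ID, a record type (`md`), a field name, and a value (the field semantics are assumed from the sample). A small parser sketch:

```python
# Parse the ###-delimited metadata records shown above into
# {doc_id: {field: value}} dictionaries.
def parse_records(lines):
    docs = {}
    for line in lines:
        # maxsplit=3 keeps any extra "#" characters inside the value intact
        doc_id, rec_type, field, value = line.split("###", 3)
        docs.setdefault(doc_id, {})[field] = value
    return docs

records = [
    "00000004###md###Title###Geoffrey C. Fox Papers Collection 1990",
    "00000004###md###CreatedYear###1990",
    "00000005###md###Title###C3P Related Papers - T.Barnes",
]
docs = parse_records(records)
print(docs["00000004"]["CreatedYear"])  # 1990
```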
Number of nodes: 11
[Chart: sequential execution — time in seconds (0–70) vs. size of data (100 MB to 1 GB)]
Size of data: 50 MB
[Chart: sequential execution — time in seconds (0–7) vs. number of nodes (11 to 19)]
Number of nodes: 13
[Chart: parallel execution — time in minutes (0–16) vs. size of data (1 GB to 30 GB)]
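The charts contrast sequential runs (measured in seconds) with parallel runs (measured in minutes); the standard way to compare them is speedup, T_sequential / T_parallel. A minimal sketch; the timing values below are hypothetical placeholders, not the measured results from the charts:

```python
# Speedup and per-node efficiency of parallel over sequential execution.
# All numbers here are hypothetical placeholders for illustration.
def speedup(t_sequential_s, t_parallel_s):
    return t_sequential_s / t_parallel_s

def efficiency(t_sequential_s, t_parallel_s, n_nodes):
    return speedup(t_sequential_s, t_parallel_s) / n_nodes

t_seq = 60.0   # sequential run, seconds (hypothetical)
t_par = 12.0   # parallel run, seconds (hypothetical)
print(speedup(t_seq, t_par))        # 5.0
print(efficiency(t_seq, t_par, 10)) # 0.5
```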
Survey
• Many load-testing frameworks can run distributed tests across multiple machines.
• Popular ones include Grinder, Apache JMeter, and LoadRunner.
•Compared these testing frameworks to choose the best one.
Why Survey?
•Gives an absolute measure of system response time.
• Targets regressions in the server and the application code.
• Examines the responses.
•Helps evaluate and compare middleware solutions from different vendors.
LoadRunner
•Commercial automated performance-testing product
• Supports scripting in JavaScript and C
•Windows platform only
•Aimed at automated test engineers
•Has a UI
Framework:
•Virtual User Scripts
•Controller
Apache JMeter
• Pure Java desktop application
• Designed to load-test functional behavior and measure performance
• Originally designed for testing web applications
•Highly extensible
Test Plan
• Thread Groups
• Controllers
• Samplers
• Listeners
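JMeter writes each sampler result (including its elapsed time) to a results file. A hedged sketch that computes the average response time per sampler label from such a CSV results file; the column names follow JMeter's default CSV output, and the sample rows below are hypothetical:

```python
# Average response time per sampler label from a JMeter CSV results
# file (.jtl). Sample rows are hypothetical; column names assume
# JMeter's default CSV output (timeStamp, elapsed, label, ...).
import csv
import io

sample_jtl = """\
timeStamp,elapsed,label,success
1330000000000,120,HBase scan,true
1330000000500,180,HBase scan,true
1330000001000,90,HBase get,true
"""

def mean_elapsed_by_label(jtl_text):
    sums, counts = {}, {}
    for row in csv.DictReader(io.StringIO(jtl_text)):
        label = row["label"]
        sums[label] = sums.get(label, 0) + int(row["elapsed"])
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

print(mean_elapsed_by_label(sample_jtl))  # {'HBase scan': 150.0, 'HBase get': 90.0}
```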
Grinder
•Open source
•Uses Jython
• Tests are run by defining them in the grinder.properties file
Framework:
•Console
•Agent
•Workers
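A minimal grinder.properties sketch wiring an agent's worker processes to a Jython test script (property names are from Grinder 3; the script name, host, and values are placeholders):

```properties
# Jython script defining the TestRunner class (placeholder name)
grinder.script = hbase_load_test.py
# Worker processes per agent, threads per process, runs per thread
grinder.processes = 2
grinder.threads = 10
grinder.runs = 100
# Console the agents report to (placeholder host)
grinder.consoleHost = console.example.org
```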
Comparison

Parameter             | LoadRunner                 | Grinder                      | JMeter
Server monitoring     | Strong for MS Windows      | Needs wrapper-based approach | No built-in monitoring
Amount of load        | Number of users restricted | Number of agents restricted  | Number of agents depends on available H/W
Able to run in batch? | No                         | No                           | Yes
Ease of installation  | Difficult                  | Moderate                     | Easy
Setting up tests      | Icon based                 | Uses Jython                  | Java based
Comparison

Parameter         | LoadRunner               | Grinder                      | JMeter
Running tests     | Complex                  | Moderate                     | Simple
Result generation | Integrated analysis tool | No integrated tool available | Can generate client-side graphs
Agent management  | Easy/automatic           | Manual                       | Real-time/dynamic
Cross-platform    | No (MS Windows only)     | Yes                          | Yes
Intended audience | Non-developers           | Developers                   | Non-developers
Stability         | Poor                     | Moderate                     | Poor
Cost              | Expensive                | Free (open source)           | Free (open source)
Roadmap
Study HBase
Study Lucene Indexing
Modify Scripts
Add Scripts
Study Testing Frameworks
Implement Grinder
Conclusion
• Sequential execution takes more time than parallel execution on HBase.
•Research indicates that HBase is not yet as robust as BigTable.
•Among the testing frameworks, we recommend Grinder: it is open source and well documented.
•Grinder also provides good real-time feedback.
References
• http://grinder.sourceforge.net/
• http://jmeter.apache.org/
• http://www8.hp.com/us/en/software/software-product.html?compURI=tcm:245-935779
• http://hpcdb.org/sites/hpcdb.org/files/gao_lucene.pdf
• http://hadoop.apache.org/common/docs/stable/file_system_shell.html#du
Thank you