Anand Hegde
Prerna Shraff
Performance Analysis of Lucene Index on HBase Environment
Group #13
Overview
•HBase vs BigTable
•The Problem
• Implementation
•Performance Analysis
•Survey
•Conclusion
HBase vs BigTable
BigTable
• Compressed, high-performance database system from Google
• Built on GFS, using the Chubby lock service, the SSTable file format, etc.
HBase
• Hadoop Database
• Open-source, distributed, versioned, column-oriented
•Modeled after BigTable
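Like BigTable, HBase stores data as a sparse, sorted map: row key → column family:qualifier → timestamped versions. A toy sketch of that data model in plain Python (illustrative only, not the real HBase client API):

```python
# Illustrative model of the BigTable/HBase data layout:
# row key -> "family:qualifier" -> {timestamp: value}.
# This is a toy sketch, not the HBase client API.
from collections import defaultdict

class ToyTable:
    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        self.rows[row][column][timestamp] = value

    def get(self, row, column):
        """Return the newest version, as HBase does by default."""
        versions = self.rows[row][column]
        return versions[max(versions)]

table = ToyTable()
table.put("doc00000004", "md:Title", "Geoffrey C. Fox Papers Collection 1990", timestamp=1)
table.put("doc00000004", "md:Title", "Geoffrey C. Fox Papers Collection", timestamp=2)
print(table.get("doc00000004", "md:Title"))  # newest version wins
```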
The Problem
•Data-intensive computing requires storage solutions for huge amounts of data.
•The requirement is to host very large tables on clusters of commodity hardware.
•HBase provides BigTable like capabilities on top of Hadoop.
•Current implementation in this field includes an experiment using Lucene Index on HBase in an HPC Environment. (Xiaoming Gao, Vaibhav Nachankar, Judy Qiu)
Architecture
Implementation
•Configured Hadoop and HBase on the Alamo cluster.
•Added scripts to run the program sequentially on multiple nodes.
•Modified scripts to record the size of the table.
•Modified scripts to record execution time for both sequential and parallel runs.
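One way to record the size of an HBase table directory is the `hadoop fs -du` shell command listed in the references, whose output pairs a byte count with each path. A hedged sketch that sums such output; the sample text and paths below are hypothetical, and a real script would capture the output via e.g. `subprocess.check_output(["hadoop", "fs", "-du", "/hbase/mytable"])`:

```python
# Sum the per-path sizes reported by `hadoop fs -du` for a table
# directory. Sample output and HDFS paths are hypothetical placeholders.
sample_du_output = """\
Found 3 items
12345678   hdfs://namenode/hbase/mytable/region-a
23456789   hdfs://namenode/hbase/mytable/region-b
3456       hdfs://namenode/hbase/mytable/.tableinfo
"""

def total_bytes(du_output):
    total = 0
    for line in du_output.splitlines():
        parts = line.split()
        if parts and parts[0].isdigit():  # skip the "Found N items" header
            total += int(parts[0])
    return total

print(total_bytes(sample_du_output))  # 35805923
```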
Performance Analysis
•Sequential execution on the same number of nodes with different data sizes.
•Sequential execution on different numbers of data nodes with the same data size.
•Parallel execution on the same number of nodes with different data sizes.
Analysis details
• Performed analysis on Alamo cluster on FutureGrid
• System type: Dell PowerEdge
•No. of CPUs: 192
•No. of cores: 768
• 3 ZooKeeper nodes + 1 HDFS-Master + 1 HBase-master
Analysis details
Sample data records (###-delimited fields):
00000004###md###Title###Geoffrey C. Fox Papers Collection 1990
00000004###md###Category###paper, proceedings collection
00000004###md###Authors###Geoffrey C. Fox, others
00000004###md###CreatedYear###1990
00000004###md###Publishers###California Institute of Technology CA
00000004###md###Location###California Institute of Technology CA
00000004###md###StartPage###1
00000004###md###CurrentPage###105
00000004###md###Additional###This is a paper collection of Geoffrey C. Fox
00000004###md###DirPath###Proceedings in a collection of papers from one conference/Fox
00000005###md###Title###C3P Related Papers - T.Barnes
00000005###md###Category###paper, proceedings collection
00000005###md###Authors###T.Barnes, others
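The records above carry four `###`-delimited fields: a document ID, a record type (`md`), a field name, and a value (the field semantics are assumed from the sample). A small parser sketch:

```python
# Parse the ###-delimited metadata records shown above into
# {doc_id: {field: value}} dictionaries.
def parse_records(lines):
    docs = {}
    for line in lines:
        # maxsplit=3 keeps any extra "#" characters inside the value intact
        doc_id, rec_type, field, value = line.split("###", 3)
        docs.setdefault(doc_id, {})[field] = value
    return docs

records = [
    "00000004###md###Title###Geoffrey C. Fox Papers Collection 1990",
    "00000004###md###CreatedYear###1990",
    "00000005###md###Title###C3P Related Papers - T.Barnes",
]
docs = parse_records(records)
print(docs["00000004"]["CreatedYear"])  # 1990
```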
Number of nodes: 11
[Chart: sequential execution — time in seconds (0–70) vs. size of data (100 MB to 1 GB)]
Size of data: 50 MB
[Chart: sequential execution — time in seconds (0–7) vs. number of nodes (11 to 19)]
Number of nodes: 13
[Chart: parallel execution — time in minutes (0–16) vs. size of data (1 GB to 30 GB)]
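The charts contrast sequential runs (measured in seconds) with parallel runs (measured in minutes); the standard way to compare them is speedup, T_sequential / T_parallel. A minimal sketch; the timing values below are hypothetical placeholders, not the measured results from the charts:

```python
# Speedup and per-node efficiency of parallel over sequential execution.
# All numbers here are hypothetical placeholders for illustration.
def speedup(t_sequential_s, t_parallel_s):
    return t_sequential_s / t_parallel_s

def efficiency(t_sequential_s, t_parallel_s, n_nodes):
    return speedup(t_sequential_s, t_parallel_s) / n_nodes

t_seq = 60.0   # sequential run, seconds (hypothetical)
t_par = 12.0   # parallel run, seconds (hypothetical)
print(speedup(t_seq, t_par))        # 5.0
print(efficiency(t_seq, t_par, 10)) # 0.5
```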
Survey
• Many load-testing frameworks can run distributed tests across multiple machines.
• Popular ones include Grinder, Apache JMeter, and LoadRunner.
•Compared these testing frameworks to choose the best one.
Why Survey?
•Gives an absolute measure of system response time.
• Targets regressions in the server and the application code.
• Examines the responses.
•Helps evaluate and compare middleware solutions from different vendors.
LoadRunner
•Commercial automated performance-testing product
• Supports scripting in JavaScript and C
•Windows platform only
•Aimed at automated test engineers
•Has a UI
Framework:
•Virtual User Scripts
•Controller
Apache JMeter
• Pure Java desktop application
• Designed to load-test functional behavior and measure performance
• Originally designed for testing web applications
•Highly extensible
Test Plan
• Thread Groups
• Controllers
• Samplers
• Listeners
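JMeter writes each sampler result (including its elapsed time) to a results file. A hedged sketch that computes the average response time per sampler label from such a CSV results file; the column names follow JMeter's default CSV output, and the sample rows below are hypothetical:

```python
# Average response time per sampler label from a JMeter CSV results
# file (.jtl). Sample rows are hypothetical; column names assume
# JMeter's default CSV output (timeStamp, elapsed, label, ...).
import csv
import io

sample_jtl = """\
timeStamp,elapsed,label,success
1330000000000,120,HBase scan,true
1330000000500,180,HBase scan,true
1330000001000,90,HBase get,true
"""

def mean_elapsed_by_label(jtl_text):
    sums, counts = {}, {}
    for row in csv.DictReader(io.StringIO(jtl_text)):
        label = row["label"]
        sums[label] = sums.get(label, 0) + int(row["elapsed"])
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

print(mean_elapsed_by_label(sample_jtl))  # {'HBase scan': 150.0, 'HBase get': 90.0}
```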
Grinder
•Open source
•Uses Jython
• Tests are run by defining them in the grinder.properties file
Framework:
•Console
•Agent
•Workers
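A minimal grinder.properties sketch wiring an agent's worker processes to a Jython test script (property names are from Grinder 3; the script name, host, and values are placeholders):

```properties
# Jython script defining the TestRunner class (placeholder name)
grinder.script = hbase_load_test.py
# Worker processes per agent, threads per process, runs per thread
grinder.processes = 2
grinder.threads = 10
grinder.runs = 100
# Console the agents report to (placeholder host)
grinder.consoleHost = console.example.org
```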
Comparison

Parameter             | LoadRunner                 | Grinder                      | JMeter
Server monitoring     | Strong for MS Windows      | Needs wrapper-based approach | No built-in monitoring
Amount of load        | Number of users restricted | Number of agents restricted  | Number of agents depends on available H/W
Able to run in batch? | No                         | No                           | Yes
Ease of installation  | Difficult                  | Moderate                     | Easy
Setting up tests      | Icon based                 | Uses Jython                  | Java based
Comparison

Parameter         | LoadRunner               | Grinder                      | JMeter
Running tests     | Complex                  | Moderate                     | Simple
Result generation | Integrated analysis tool | No integrated tool available | Can generate client-side graphs
Agent management  | Easy/automatic           | Manual                       | Real-time/dynamic
Cross-platform    | No (MS Windows only)     | Yes                          | Yes
Intended audience | Non-developers           | Developers                   | Non-developers
Stability         | Poor                     | Moderate                     | Poor
Cost              | Expensive                | Free (open source)           | Free (open source)
Roadmap
Study HBase
Study Lucene Indexing
Modify Scripts
Add Scripts
Study Testing Frameworks
Implement Grinder
Conclusion
• Sequential execution takes more time than parallel execution on HBase.
•Research indicates that HBase is not yet as robust as BigTable.
•Among the testing frameworks, we recommend Grinder: it is open source and well documented.
•Grinder also provides good real-time feedback.
References
• http://grinder.sourceforge.net/
• http://jmeter.apache.org/
• http://www8.hp.com/us/en/software/software-product.html?compURI=tcm:245-935779
• http://hpcdb.org/sites/hpcdb.org/files/gao_lucene.pdf
• http://hadoop.apache.org/common/docs/stable/file_system_shell.html#du
Thank you