Frequent Word Combinations Mining and Indexing on HBase
Hemanth Gokavarapu, Santhosh Kumar Saminathan

Introduction

Many projects use HBase to store large amounts of data for distributed computation.

Processing this data is a challenge for programmers.

Frequent terms are useful in many ways in machine learning.

E.g., frequently purchased items, frequently asked questions, etc.

Problem

These HBase-based projects create indexes on multiple data sets.

These indexes make it easy to find the frequency of a single word.

Finding the frequency of a combination of words is hard.

For example: “cloud computing”

Searching for these words separately may return results such as “scientific computing” and “cloud platform”.
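The gap between single-word and combination frequencies can be illustrated with a small sketch. The three documents below are made up for illustration and are not taken from the project's data; "frequency" here is assumed to mean the number of documents that contain the word or the pair.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; each string stands for one indexed document.
docs = [
    "cloud computing on a cloud platform",
    "scientific computing",
    "cloud platform pricing",
]

# Single-word document frequency: what a per-word index answers directly.
word_df = Counter(word for doc in docs for word in set(doc.split()))

# Pair document frequency: the two words must be counted together.
pair_df = Counter(
    pair
    for doc in docs
    for pair in combinations(sorted(set(doc.split())), 2)
)

print(word_df["cloud"], word_df["computing"])  # documents containing each word
print(pair_df[("cloud", "computing")])         # documents containing both words
```

Each word appears in two documents, yet only one document contains both, so the pair frequency cannot be read off the single-word counts alone.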

Objective

This project focuses on finding the frequency of a combination of words.

We use data mining concepts and the Apriori algorithm.

We use MapReduce and HBase for the implementation.

Survey Topics

Apriori Algorithm

HBase

MapReduce

Data Mining

What is Data Mining?

The process of analyzing data from different perspectives

and summarizing it into useful information.

Data Mining

How does data mining work?

Data mining analyzes relationships and patterns in stored transaction data based on open-ended user queries.

What technological infrastructure is needed?

Two critical technological drivers answer this question:

Size of the database

Query complexity

Apriori Algorithm

Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules.

Association rules are a widely applied data mining approach.

Association rules are derived from frequent itemsets.

Apriori uses a level-wise search based on the frequent itemset property: every subset of a frequent itemset must itself be frequent.
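The level-wise search can be sketched as a join step followed by a prune step. This is a minimal illustration, not the project's code; the function name `generate_candidates` and the sample itemsets are assumptions made for the example.

```python
from itertools import combinations

def generate_candidates(frequent_itemsets, k):
    """Join frequent (k-1)-itemsets into k-itemsets, then prune.

    A candidate survives only if every one of its (k-1)-subsets is
    frequent, which is the frequent itemset property.
    """
    frequent = set(frequent_itemsets)
    candidates = set()
    for a in frequent:
        for b in frequent:
            union = a | b
            if len(union) != k:
                continue
            if all(frozenset(s) in frequent for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates

# Illustrative frequent 2-itemsets (assumed values).
l2 = [frozenset(p) for p in [(1, 3), (2, 3), (2, 5), (3, 5)]]
print(generate_candidates(l2, 3))
```

Here {1, 2, 3} and {1, 3, 5} are pruned without scanning the data, because {1, 2} and {1, 5} are not frequent; only {2, 3, 5} remains as a candidate.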

Algorithm Flow

Apriori Algorithm & Problem Description

Transaction ID   Items Bought
1                Shoes, Shirt, Jacket
2                Shoes, Jacket
3                Shoes, Jeans
4                Shirt, Sweatshirt

If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support.

Frequent Itemset   Support
{Shoes}            75%
{Shirt}            50%
{Jacket}           50%
{Shoes, Jacket}    50%

If the minimum confidence is 50%, then the only two rules generated from this 2-itemset that have confidence greater than 50% are:

Shoes => Jacket: Support = 50%, Confidence = 66.7%
Jacket => Shoes: Support = 50%, Confidence = 100%
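The support and confidence figures above can be checked with a few lines of Python. The transactions come from the table; the helper names `support` and `confidence` are chosen for this sketch only.

```python
transactions = [
    {"Shoes", "Shirt", "Jacket"},
    {"Shoes", "Jacket"},
    {"Shoes", "Jeans"},
    {"Shirt", "Sweatshirt"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Support of both sides together divided by support of the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"Shoes", "Jacket"}))       # 0.5
print(confidence({"Shoes"}, {"Jacket"}))  # two thirds, about 0.667
print(confidence({"Jacket"}, {"Shoes"}))  # 1.0
```

The two rules differ in confidence because Shoes appears in three transactions while Jacket appears in only two, and both Jacket transactions also contain Shoes.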

Apriori Algorithm Example

Database D (Min support = 50%)

TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D to obtain C1:

itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1:

itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (generated from L1):

itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D to count C2:

itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2:

itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3 (generated from L2):

itemset
{2 3 5}

Scan D to obtain L3:

itemset   sup
{2 3 5}   2
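The whole example can be reproduced with a short single-machine sketch of the level-wise loop. It is an illustration under simplifying assumptions (everything fits in memory, no MapReduce), and the function name `apriori` is chosen for this sketch rather than taken from the project.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset that meets min_support."""
    min_count = min_support * len(transactions)
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        # One scan of the database per level to count each candidate.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets and keep only the unions of size k
        # whose (k-1)-subsets are all frequent.
        candidates = {
            a | b
            for a in level
            for b in level
            if len(a | b) == k
            and all(frozenset(s) in level for s in combinations(a | b, k - 1))
        }
    return frequent

database = [frozenset(t) for t in ([1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5])]
result = apriori(database, min_support=0.5)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

The loop stops when no candidates remain, which for database D happens after the 3-itemset {2 3 5} is found.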

Apriori Advantages & Disadvantages

ADVANTAGES:

Uses the large (frequent) itemset property

Easily Parallelized

Easy to Implement

DISADVANTAGES:

Assumes transaction database is memory resident

Requires many database scans

HBase

What is HBase?

A Hadoop Database

Non-relational

Open-source, distributed, versioned, column-oriented store

Modeled after Google's Bigtable

Runs on top of HDFS (Hadoop Distributed File System)
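What "versioned" and "column-oriented" mean can be pictured with a toy in-memory model. This is not the HBase API; the class `ToyTable`, the row key, and the column name `freq:count` are all hypothetical and exist only to show how a cell is addressed by row, column, and timestamp.

```python
import time
from collections import defaultdict

class ToyTable:
    """In-memory stand-in for a versioned, column-oriented table (not HBase)."""

    def __init__(self):
        # row key -> "family:qualifier" -> list of (timestamp, value), newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp=None):
        """Store a new version of one cell instead of overwriting the old one."""
        ts = time.time() if timestamp is None else timestamp
        cells = self._rows[row][column]
        cells.append((ts, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)

    def get(self, row, column, versions=1):
        """Return up to `versions` most recent values for one cell."""
        return [value for _, value in self._rows[row][column][:versions]]

table = ToyTable()
table.put("cloud computing", "freq:count", 3, timestamp=1)
table.put("cloud computing", "freq:count", 5, timestamp=2)
print(table.get("cloud computing", "freq:count"))              # newest value only
print(table.get("cloud computing", "freq:count", versions=2))  # newest first
```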

MapReduce

A framework for processing highly distributable problems across huge data sets using a large number of nodes (a cluster).

Processing can occur on data stored either in a file system (unstructured) or in a database (structured).
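The programming model can be imitated on one machine to show its three stages. The sketch below does not use Hadoop; the function names `map_phase`, `shuffle`, and `reduce_phase` and the input lines are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for each word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: combine all values that share one key."""
    return key, sum(values)

lines = ["cloud computing", "scientific computing", "cloud platform"]
emitted = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())
print(counts)
```

Because each mapper call sees only one line and each reducer call sees only one key, both stages can be spread across many nodes.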

Mapper and Reducer

Mappers:

FrequentItemsMap: finds the combinations and assigns a key-value pair to each combination

CandidateGenMap

AssociationRuleMap

Reducers:

FrequentItemsReduce

CandidateGenReduce

AssociationRuleReduce
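One way the first mapper and reducer pair might behave is sketched below. The slides do not show the implementation, so the Python functions `frequent_items_map`, `frequent_items_reduce`, and `run_job`, the parameter `min_count`, and the input lines are hypothetical stand-ins, not the project's classes.

```python
from collections import defaultdict
from itertools import combinations

def frequent_items_map(line, k):
    """Emit (combination, 1) for every k-word combination in one line."""
    words = sorted(set(line.lower().split()))
    return [(combo, 1) for combo in combinations(words, k)]

def frequent_items_reduce(combo, counts, min_count):
    """Sum the counts for one combination; keep it only if it is frequent."""
    total = sum(counts)
    return (combo, total) if total >= min_count else None

def run_job(lines, k, min_count):
    grouped = defaultdict(list)  # stands in for the framework's shuffle step
    for line in lines:
        for combo, one in frequent_items_map(line, k):
            grouped[combo].append(one)
    results = (frequent_items_reduce(c, v, min_count) for c, v in grouped.items())
    return dict(r for r in results if r is not None)

lines = [
    "cloud computing platform",
    "cloud computing research",
    "scientific computing",
]
print(run_job(lines, k=2, min_count=2))
```

Sorting the words before forming combinations gives each combination a single canonical key, so the same pair emitted from different lines reaches the same reducer call.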

Flow Chart

Start -> Find Frequent Items -> Find Candidate Itemsets -> Find Frequent Items -> Set Null?

Set Null? No: return to Find Candidate Itemsets

Set Null? Yes: Generate Association Rules

Schedule

1 week: talking to the experts at FutureGrid

1 week: survey of HBase and the Apriori algorithm

4 weeks: kick-start on implementing the Apriori algorithm

2 weeks: testing the code and getting the results

Results

Conclusion

Execution takes longer on a single node.

As the number of mappers increases, performance improves.

When the data is very large, single-node execution takes longer and behaves erratically.

Screenshot

Known Issues

When word frequencies are very low in a large data set, the reducer takes longer.

E.g., a text paragraph in which words are rarely repeated.

Future Work

The analysis can be done with Twister and other platforms.

The algorithm can be extended to other applications that use machine learning techniques.

Questions?