Frequent Itemset Mining(FIM) on BigData

A LITERATURE SURVEY ON :-

“FREQUENT ITEMSET MINING ON BIGDATA”

By :-

RAJU GUPTA (9028218451)

PURUSHOTAM SINGH

Big DataBig data usually includes data sets with sizes beyond the ability of commonly used software tools to capture,curate, manage, and process the data within a tolerable elapsed time.

Introduction :-

Frequent Itemset Mining (FIM)

Support The support supp(X) of an itemset X is defined as the proportion of

transactions in the data set which contain the itemset.

supp(X)= no. of transactions which contain the itemset X / total no. of transactions.

Confidence

conf(X->Y)= supp(X U Y)/supp(X).

Fig:- Example for support and confidence

Hadoop Framework :- Apache Hadoop is an open-source software framework for

storage and large-scale processing of data-sets on clusters of commodity hardware.

Hadoop Distributed File System (HDFS).

Hadoop MapReduce.

Map Reduce :-

Map :-

A mapper processes a part of data and generates a key-value pair.

Reduce :-

various key value pair are combined and fed to reducer which processes these parts and gives o/p.

MapReduce

Map

Key value pair

generation

Reduce

Give o/p

EXAMPLE1

EXAMPLE2

MAP REDUCE AND ITS ALGORITHM ..

• It is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster..

• Single pass counting utilizes a map reduce phase for each candidate generation and frequency counting steps..

MAP REDUCE(Cont..)

• Fixed pass combined counting starts to generate candidates with n different lengths after p phases and count their frequencies in one database scan.

• Dynamic passes counting is similar to fixed passes combined counting however n and p is determined dynamically at each phase by the number of generated candidates.

MAP REDUCE(Cont..)

• Fixed pass combined counting starts to generate candidates with n different lengths after p phases and count their frequencies in one database scan.

• Dynamic passes counting is similar to fixed passes combined counting however n and p is determined dynamically at each phase by the number of generated candidates.

MAP REDUCE(Cont..)

o Parallel FP Growth is a parallel version of well known FP Growth.. PFP groups the items and distributes their conditional databases to the mappers..

o The PARMA algorithm finds aproximate collections of frequent itemsets.

o TWISTER improves the performance between map reduce cycles or NIMBLE provides better programming tools for data mining jobs.

Search space distribution :-

The main challenge in adapting algorithms to the MapReduce Framework.

Task defined at start up.

Prefix tree:oTree Structure where each path represents an itemset.

oDivided into independent groups.

oEclat traverses the tree in the DFS manner to find FI’s

Running Time in Eclat.

Search space distribution (cont..) :-

To estimate the computation time of a subtree.o Total No. of items

o Order of frequency of items.

o Total Frequency of items.

Balanced Partitioning of prefix tree.

Engineering

Frequent Itemset Mining(FIM) on BigData