MapReduce Implementation of Spam Filtering Using a Naïve Bayesian Classifier
2008. 8. 27
Jaesun Han (한재선), CEO of NexR
[email protected]
www.nexr.co.kr


Target Application

Spam Filtering: is a given email spam OR ham?


Spam Email

This is the easiest, fastest, and most effective way to lose both pounds and inches permanently!!! This weight loss program is designed specifically to "boost" weight-loss efforts by assisting body metabolism, and helping the body's ability to manage weight. A powerful, safe, 30 Day Program. This is one program you won't feel starved on. Complete program for one amazing low price! Program includes: <b>BONUS AMAZING FAT ABSORBER CAPSULES, 30 DAY - WEIGHT REDUCTION PLAN, PROGRESS REPORT!</b><br><br>SPECIAL BONUS..."FAT ABSORBERS", AS SEEN ON TV With every order...AMAZING MELT AWAY FAT ABSORBER CAPSULES with directions ( Absolutely Free ) ...With these capsules you can eat what you enjoy, without the worry of fat in your diet. 2 to 3 capsules 15 minutes before eating or snack, and the fat will be absorbed and passed through the body without the digestion of fat into the body. <br><br>You will be losing by tomorrow! Don't Wait, visit our webpage below, and order now!

Basic Idea

Treat an email that contains many of the words that frequently appear in spam emails as a spam email!


Training

Classifying

Training: Concept

Feature Extraction (feature = the document's fingerprint): the sample spam email shown earlier is reduced to word counts.

program 9
price 8
reduction 8
bonus 7
amazing 7
diet 6
capsules 4
...
order 2
boost 1
manage 1
visit 1
tomorrow 1
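The feature extraction step can be sketched in plain Java. This is only a minimal illustration: the class name `FeatureExtractor` and the tokenization rules (strip HTML tags, lowercase, split on non-letters, keep words of three or more letters) are assumptions, since the slides do not say how words are extracted.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not from the slides): turns raw email text into word counts. */
final class FeatureExtractor {
    private FeatureExtractor() {}

    /** Assumed rules: drop HTML tags, lowercase, split on non-letters, keep words of 3+ letters. */
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        String withoutTags = text.replaceAll("<[^>]*>", " ");
        for (String token : withoutTags.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.length() >= 3) {
                counts.merge(token, 1, Integer::sum); // add 1 to the running count for this word
            }
        }
        return counts;
    }
}
```

Calling `FeatureExtractor.countWords(emailBody)` returns a sorted word-to-count map. The tokenizer is deliberately crude (for example, "don't" is split into "don" and "t"); a real filter would likely use better tokenization and a stop-word list.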

Training: Concept

2 categories (classes): a Spam training dataset and a Ham training dataset. Training produces one table of per-category word statistics:

word      freq in spam   Pr(word | spam)   freq in ham   Pr(word | ham)
bonus     3590           0.7               737           0.15
hadoop    252            0.05              1308          0.24
…         …              …                 …             …

Training: Implementation

1. Count word frequencies in each category (over the Spam emails and over the Ham emails).
2. Compute word probabilities in each category (over the Spam emails and over the Ham emails).

The result is the table of per-category word frequencies and probabilities (bonus, hadoop, …) shown for the training concept.

These steps are carried out with Map and Reduce.

Training: MapReduce

Input (Spam and Ham training emails): (spam, contents), (ham, contents)
Map (parsing): emits (spam::bonus, 1), (ham::bonus, 1)
Reduce (adding): emits (spam::bonus, 3590), (ham::bonus, 737)
Transform (frequency to probability): Use MapReduce?
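The counting job can be simulated without Hadoop to make the key/value flow concrete. The sketch below is an assumption-laden stand-in, not the talk's implementation: `TrainingSketch` and its method names are hypothetical, the tokenizer is simplistic, and a real reducer would be called once per key with the grouped values rather than over one flat list.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java stand-in for the training job; no Hadoop classes are used. */
final class TrainingSketch {
    private TrainingSketch() {}

    /** Map (parsing): (category, contents) -> one ("category::word", 1) pair per word. */
    static List<Map.Entry<String, Integer>> map(String category, String contents) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : contents.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(category + "::" + word, 1));
            }
        }
        return pairs;
    }

    /** Reduce (adding): sums the values that share a key. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }
}
```

Feeding the concatenated output of `map("spam", …)` and `map("ham", …)` into `reduce` yields totals keyed like `spam::bonus`, matching the pairs on the slide. The Transform step (frequency to probability) is left out because the slides do not define it.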

Classifying: Concept

Test email → feature extraction → look up each extracted word (w1, w2, …) in the table of per-category word probabilities produced by training (e.g., bonus: Pr(word | spam) = 0.7, Pr(word | ham) = 0.15).

Pr(email | spam) = Pr(w1 | spam) × Pr(w2 | spam) × …

Pr(email | ham) = Pr(w1 | ham) × Pr(w2 | ham) × …
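The product of per-word probabilities can be sketched as follows. Two things here are assumptions rather than anything stated in the slides: the product is computed as a sum of logarithms (a common way to avoid floating-point underflow on long emails), and words missing from the table fall back to a made-up default probability. `LikelihoodSketch` is a hypothetical name.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper: computes log Pr(email | category) under the naive independence assumption. */
final class LikelihoodSketch {
    private LikelihoodSketch() {}

    /** Assumed fallback for words missing from the table; the slides do not specify one. */
    static final double UNSEEN_WORD_PROBABILITY = 0.01;

    /** Returns log Pr(w1 | category) + log Pr(w2 | category) + ... for the given words. */
    static double logLikelihood(List<String> words, Map<String, Double> wordProbabilities) {
        double sum = 0.0;
        for (String word : words) {
            sum += Math.log(wordProbabilities.getOrDefault(word, UNSEEN_WORD_PROBABILITY));
        }
        return sum;
    }
}
```

Comparing `logLikelihood(words, spamTable)` with `logLikelihood(words, hamTable)` gives the same ordering as comparing the raw products, because the logarithm is monotonic.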

Classifying: Bayes Theorem

The probabilities we want are

Pr(spam | email) & Pr(ham | email)

Bayes Theorem

Pr(A | B) = Pr(B | A) * Pr(A) / Pr(B)

Pr(Cat | Email) = Pr(Email | Cat) * Pr(Cat) / Pr(Email)
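Written out for the two categories, Bayes' theorem shows why Pr(Email) never has to be computed when only the comparison between spam and ham matters: it is the same denominator in both posteriors and cancels in their ratio. The last line additionally uses the naive independence assumption, with w_i denoting the i-th extracted word.

```latex
\begin{align*}
\Pr(\text{spam} \mid \text{email}) &= \frac{\Pr(\text{email} \mid \text{spam})\,\Pr(\text{spam})}{\Pr(\text{email})}, \\
\Pr(\text{ham} \mid \text{email}) &= \frac{\Pr(\text{email} \mid \text{ham})\,\Pr(\text{ham})}{\Pr(\text{email})}, \\
\frac{\Pr(\text{spam} \mid \text{email})}{\Pr(\text{ham} \mid \text{email})}
  &= \frac{\Pr(\text{spam})\,\prod_i \Pr(w_i \mid \text{spam})}{\Pr(\text{ham})\,\prod_i \Pr(w_i \mid \text{ham})}.
\end{align*}
```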

Classifying: MapReduce

Setup (reading & instantiation): the training result (the table of per-category word frequencies and probabilities) is read and instantiated as Category objects.
Input (Unknown emails): (contents)
Map (parsing & calculation): emits a labeled pair such as (spam, contents)
Reduce: Identity
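The "reading & instantiation" step might look like the sketch below. The file layout is an assumption (one whitespace-separated line per word, with the same five columns as the training table), and `Category` and `TrainingResultReader` are hypothetical names; the slides only say that the result is loaded into Category objects.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical Category object: a label plus its Pr(word | category) table. */
final class Category {
    final String name;
    final Map<String, Double> wordProbabilities = new HashMap<>();

    Category(String name) {
        this.name = name;
    }
}

/** Hypothetical loader for lines of the assumed form "word freqInSpam prSpam freqInHam prHam". */
final class TrainingResultReader {
    private TrainingResultReader() {}

    /** Returns {spam, ham}; blank lines and lines without exactly five fields are skipped. */
    static Category[] read(List<String> lines) {
        Category spam = new Category("spam");
        Category ham = new Category("ham");
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length != 5) {
                continue;
            }
            spam.wordProbabilities.put(fields[0], Double.parseDouble(fields[2]));
            ham.wordProbabilities.put(fields[0], Double.parseDouble(fields[4]));
        }
        return new Category[] { spam, ham };
    }
}
```

In a Mapper this would run once in the setup phase, so each map call only performs lookups against the in-memory tables.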

Advanced: Sharing the Training Result Across Maps

HDFS

DistributedCache

HBase

RDBMS

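Advanced: DistributedCache

• Distribute application-specific large, read-only files efficiently

• Cache files (text, archives, jars, etc.)

– Only copied once per job

• Code Example

// Imports assumed by both snippets (classic org.apache.hadoop.mapred API)
import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// In the job configuration: register the file at cachePath as a cache file
JobConf conf = new JobConf(getConf(), NaiveBayesianClassifierMR.class);
DistributedCache.addCacheFile(new Path(cachePath).toUri(), conf);

// In the Mapper's configure(): open the local copy of the first cached file (may throw IOException)
Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
String cachedFile = localFiles[0].toString();
BufferedReader br = new BufferedReader(new FileReader(cachedFile));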

Advanced: Passing Custom Parameters

• Custom variables need to be passed to each Map

– e.g., thresholds for deciding between spam and ham

(a measure to reduce false positives, where a ham email is judged to be spam)

• conf.set() & conf.get()

– e.g., conf.set("nbc.spam_threshold", "3")

conf.set("nbc.ham_threshold", "1")
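One way such thresholds could be applied is sketched below. The interpretation is an assumption, since the slides do not define it: a category wins only if its score is more than threshold times the other category's score, and anything else is left undecided. With a spam threshold of 3, an email is marked spam only when the evidence for spam is more than three times the evidence for ham, which is what cuts false positives. `ThresholdDecision` and the "unknown" label are hypothetical.

```java
/** Hypothetical decision rule; the ratio-based reading of the thresholds is an assumption. */
final class ThresholdDecision {
    private ThresholdDecision() {}

    /**
     * spamScore and hamScore are the (unnormalized) posteriors of each category.
     * A category is chosen only if its score exceeds the other score times its threshold.
     */
    static String classify(double spamScore, double hamScore,
                           double spamThreshold, double hamThreshold) {
        if (spamScore > hamScore * spamThreshold) {
            return "spam";
        }
        if (hamScore > spamScore * hamThreshold) {
            return "ham";
        }
        return "unknown"; // neither category is convincing enough
    }
}
```

In a Mapper, the two thresholds would be read back from the job configuration (for example with conf.get and a numeric parse) before calling `classify`.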

Advanced: Complete ML Framework

[Diagram] Components: Filter, Validator, Classifier, Evaluator, Visualizer, Clustering, Recommendation, Input, Output, DataSink, DataSource