MapReduce Implementation of Spam Filtering Using a Naïve Bayesian Classifier
2008. 8. 27
Jaesun Han (한재선), CEO, NexR
www.nexr.co.kr
Target Application
Spam Filtering
Spam Email
This is the easiest, fastest, and most effective way to lose both pounds and inches permanently!!! This weight loss program is designed specifically to "boost" weight-loss efforts by assisting body metabolism, and helping the body's ability to manage weight. A powerful, safe, 30 Day Program. This is one program you won't feel starved on. Complete program for one amazing low price! Program includes: <b>BONUS AMAZING FAT ABSORBER CAPSULES, 30 DAY - WEIGHT REDUCTION PLAN, PROGRESS REPORT!</b><br><br>SPECIAL BONUS..."FAT ABSORBERS", AS SEEN ON TV With every order...AMAZING MELT AWAY FAT ABSORBER CAPSULES with directions ( Absolutely Free ) ...With these capsules you can eat what you enjoy, without the worry of fat in your diet. 2 to 3 capsules 15 minutes before eating or snack, and the fat will be absorbed and passed through the body without the digestion of fat into the body. <br><br>You will be losing by tomorrow! Don't Wait, visit our webpage below, and order now!
Basic Idea
Treat an email that contains many of the words
that appear frequently in spam email
as spam!
Training
Classifying
Training: Concept
Feature extraction (feature = the document's fingerprint) turns the spam email above into word counts:
program    9
price      8
reduction  8
bonus      7
amazing    7
diet       6
capsules   4
…
order      2
boost      1
manage     1
visit      1
tomorrow   1
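The feature extraction step can be sketched in plain Java as follows. This is only an illustration: the class name, the tag-stripping regex, and the `minLength` cutoff are assumptions of this sketch, and the slides do not say how the original tokenizer handled HTML, case, or short words.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not taken from the original slides.
final class FeatureExtractionSketch {

    /**
     * Strips HTML tags, lowercases the text, splits on anything that is not
     * a letter a-z, and counts every word of at least minLength letters.
     */
    static Map<String, Integer> wordCounts(String text, int minLength) {
        String plain = text.replaceAll("<[^>]*>", " ").toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : plain.split("[^a-z]+")) {
            if (token.length() >= minLength) {
                counts.merge(token, 1, Integer::sum); // add 1 to the running count
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String email = "AMAZING bonus! <b>Bonus capsules</b> for one amazing low price";
        System.out.println(wordCounts(email, 3));
    }
}
```

A real tokenizer would probably also drop stop words, which this sketch leaves out to stay short.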
Training: Concept
2 categories (classes), each with its own training dataset: Spam and Ham

word     freq in spam   Pr(word | spam)   freq in ham   Pr(word | ham)
bonus    3590           0.7               737           0.15
hadoop   252            0.05              1308          0.24
…        …              …                 …             …
Training: Implementation
Building the word table (freq in spam, Pr(word | spam), freq in ham, Pr(word | ham)) takes two steps, run separately over the spam set and the ham set:
1. Count word frequencies in each category
2. Compute word probabilities in each category
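Training: MapReduce
Input: the Spam and Ham datasets as (category, contents) records
  (spam, contents)
  (ham, contents)
Map (parsing): emit (category::word, 1) for each parsed word
  (spam::bonus, 1)
  (ham::bonus, 1)
Reduce (adding): sum the values for each key
  (spam::bonus, 3590)
  (ham::bonus, 737)
Transform (frequency to probability): use MapReduce?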
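The training flow above can be simulated without a cluster. The sketch below is plain Java, not Hadoop code, and the class and method names are hypothetical. The slides do not say what the frequency is divided by in the transform step, so the denominator is passed in as `categoryTotals` and is an assumption of this sketch.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// In-memory simulation of the training job; not the original Hadoop classes.
final class TrainingMapReduceSketch {

    /** Map: parse one (category, contents) record into ("category::word", 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String category, String contents) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : contents.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<String, Integer>(category + "::" + word, 1));
            }
        }
        return out;
    }

    /** Reduce: add up the values emitted for each key. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    /**
     * Transform: frequency -> probability. categoryTotals maps "spam"/"ham"
     * to the denominator for that category (an assumption of this sketch).
     */
    static Map<String, Double> toProbabilities(Map<String, Integer> totals,
                                               Map<String, Integer> categoryTotals) {
        Map<String, Double> probs = new TreeMap<>();
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            String category = e.getKey().substring(0, e.getKey().indexOf("::"));
            probs.put(e.getKey(), e.getValue() / categoryTotals.get(category).doubleValue());
        }
        return probs;
    }
}
```

Because the transform is a single pass over the already reduced counts, it could run as a small post-processing step rather than as another MapReduce job, which may be the trade-off behind the question on the slide.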
Classifying: Concept
Test email → feature extraction → look up each extracted word (w1, w2, …) in the trained word table:
Pr(email | spam) = Pr(w1 | spam) × Pr(w2 | spam) × …
Pr(email | ham) = Pr(w1 | ham) × Pr(w2 | ham) × …
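A minimal sketch of this product follows, in plain Java with hypothetical names. It sums logarithms instead of multiplying, because a product of many small probabilities can underflow a `double`. The `unseenProbability` fallback for words missing from the table is an assumption of this sketch, since the slides do not say how unseen words are handled.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not taken from the original slides.
final class DocumentProbabilitySketch {

    /**
     * Returns log Pr(email | category), i.e. the sum of log Pr(word | category)
     * over the email's words. Words missing from wordProb fall back to
     * unseenProbability so that one unseen word does not zero the product.
     */
    static double logDocumentProbability(List<String> words,
                                         Map<String, Double> wordProb,
                                         double unseenProbability) {
        double logProb = 0.0;
        for (String word : words) {
            logProb += Math.log(wordProb.getOrDefault(word, unseenProbability));
        }
        return logProb;
    }
}
```

With the two table rows shown in the slides, an email whose extracted words are "bonus" and "hadoop" gives 0.7 × 0.05 = 0.035 for spam and 0.15 × 0.24 = 0.036 for ham.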
Classifying: Bayes Theorem
The probabilities we want are
Pr(spam | email) & Pr(ham | email)
Bayes Theorem
Pr(A | B) = Pr(B | A) × Pr(A) / Pr(B)
Pr(Cat | Email) = Pr(Email | Cat) × Pr(Cat) / Pr(Email)
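One consequence worth spelling out: when the two categories are compared for the same email, the denominator Pr(Email) is identical on both sides and cancels, so it never has to be computed. Writing the comparison as a ratio, in the notation of the formula above:

```latex
\frac{\Pr(\text{spam} \mid \text{email})}{\Pr(\text{ham} \mid \text{email})}
  = \frac{\Pr(\text{email} \mid \text{spam}) \, \Pr(\text{spam}) / \Pr(\text{email})}
         {\Pr(\text{email} \mid \text{ham}) \, \Pr(\text{ham}) / \Pr(\text{email})}
  = \frac{\Pr(\text{email} \mid \text{spam}) \, \Pr(\text{spam})}
         {\Pr(\text{email} \mid \text{ham}) \, \Pr(\text{ham})}
```

Only Pr(email | Cat) from the word table and the prior Pr(Cat) are needed per category.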
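Classifying: MapReduce
Input: emails of unknown category as (contents) records
Map (parsing & calculation): parse each email, calculate its category, and emit (category, contents), e.g. (spam, contents)
Reduce: Identity
Before classifying, each Map reads the trained word table (word, freq in spam, Pr(word | spam), freq in ham, Pr(word | ham)) and instantiates Category objects from it (reading & instantiation).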
Advanced: Sharing Training Results Across Maps
HDFS
DistributedCache
HBase
RDBMS
Advanced: DistributedCache
• Distributes large, read-only, application-specific files efficiently
• Caches files (text, archives, jars, etc.)
– Copied only once per job
• Code Example
    // In the job setup: register the file to be cached
    JobConf conf = new JobConf(getConf(), NaiveBayesianClassifierMR.class);
    DistributedCache.addCacheFile(new Path(cachePath).toUri(), conf);

    // In the Map's configure(): open the local copy of the cached file
    // (getLocalCacheFiles() and FileReader can throw IOException)
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    String cachedFile = localFiles[0].toString();
    BufferedReader br = new BufferedReader(new FileReader(cachedFile));
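Once the cached file is open, each Map still has to turn it into an in-memory table. The sketch below shows one way this might look. The file layout (one tab-separated line per word: word, freq in spam, Pr(word | spam), freq in ham, Pr(word | ham)) and the class name are assumptions of this sketch, because the slides do not describe the format of the training output.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical loader; the tab-separated layout is assumed, not documented.
final class TrainingTableLoaderSketch {

    /** Returns word -> { Pr(word | spam), Pr(word | ham) }. */
    static Map<String, double[]> load(Reader source) {
        Map<String, double[]> table = new HashMap<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] fields = line.trim().split("\t");
                if (fields.length != 5) {
                    continue; // skip blank or malformed lines
                }
                double prSpam = Double.parseDouble(fields[2]);
                double prHam = Double.parseDouble(fields[4]);
                table.put(fields[0], new double[] {prSpam, prHam});
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return table;
    }
}
```

In a Map task the `Reader` would be the `FileReader` opened on the local cache path. Loading once in `configure()` rather than per record keeps the cost proportional to the number of tasks.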
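Advanced: Passing Custom Parameters
• Custom variables need to be passed to each Map
– e.g., thresholds for deciding between spam and ham
(a measure to reduce false positives, where ham is judged to be spam)
• conf.set() & conf.get() (values are strings; use the typed conf.setInt() & conf.getInt() for integers)
– e.g., conf.setInt("nbc.spam_threshold", 3)
conf.setInt("nbc.ham_threshold", 1)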
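The slides do not show how the two thresholds enter the decision. One plausible reading, sketched below, is that a category wins only if its score exceeds the other category's score multiplied by its own threshold, so a spam threshold of 3 demands stronger evidence before an email is labeled spam. That rule, the "unknown" fallback, and all names here are assumptions of this sketch.

```java
// Hypothetical decision rule; the multiplicative use of thresholds is assumed.
final class ThresholdDecisionSketch {

    /**
     * spamScore and hamScore stand for Pr(email | cat) * Pr(cat).
     * A category is chosen only when its score beats the other score
     * times its own threshold; otherwise the email stays "unknown".
     */
    static String decide(double spamScore, double hamScore,
                         double spamThreshold, double hamThreshold) {
        if (spamScore > hamScore * spamThreshold) {
            return "spam";
        }
        if (hamScore > spamScore * hamThreshold) {
            return "ham";
        }
        return "unknown";
    }
}
```

With thresholds 3 and 1, scores of 0.05 (spam) and 0.03 (ham) are not decisive enough for spam and the email is left unknown, which is the intended bias against false positives.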
Advanced: Complete ML Framework
Filter Validator
Classifier
Evaluator
Visualizer
Clustering
Recommendation
Input
Output
DataSink
DataSource