SR Text Mining
Akhilesh Aji
8/5/16
Educational Background
Slide 2
Education:
• Bachelor of Science in Computer Science
• Georgia Institute of Technology
• Expected graduation date: May 2019
• Big Data Club: entity tagging news sources
Project Goals
Slide 3
Objective:
To build a text mining model that indicates when the rate of top keywords changes or when a new keyword emerges.
Background:
• A Service Request (SR) is generated whenever an FSE works on a laser
• Some SRs do not replace any part
• An SR’s main free-text fields are: Customer Description, Problem Found, Task Description, and Resolution
Project Goal
Slide 4
[Flow diagram. Recoverable labels: SR; Part Replacement (Yes / No); EWI; Text Mining; Analysis; Automated Monitoring]
Data Pipeline
Slide 5
User filters which SRs to process
Extract SR’s text
• Customer Description
• Problem Found
• Task Description
Tokenize, group, and stem text
• MO PRA -> MO_PRA
• OPTIMIZING, OPTIMIZED, OPTIMIZES -> OPTIMIZ
Replace similar words
• NEON, NE -> NE
• SOFTWARE, SW, SOFT WARE -> SW
Remove weak words
• Dates
• Numbers
• Stopwords: AND, IS, BUT
• Selected words: DUE, END, GROUP
Save SR number with terms
Calculate frequency
Display in Spotfire
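The extract, save, and count steps of the pipeline above could be sketched roughly as follows. This is a minimal illustration, not the script from the project: the record layout, the field names, and the SR numbers are hypothetical, and plain whitespace splitting stands in for the real pre-processing.

```python
from collections import Counter

# Hypothetical field names for the three free-text bodies used by the pipeline.
TEXT_FIELDS = ("customer_description", "problem_found", "task_description")

def extract_text(sr):
    """Join the free-text fields of one SR record; missing or empty fields are skipped."""
    return " ".join(sr.get(field) or "" for field in TEXT_FIELDS)

def terms_by_sr(records):
    """Map each SR number to the set of (upper-cased) terms found in its text."""
    return {sr["sr_number"]: set(extract_text(sr).upper().split()) for sr in records}

def term_frequency(terms):
    """Count, for every term, how many SRs use it at least once."""
    counts = Counter()
    for sr_terms in terms.values():
        counts.update(sr_terms)
    return counts

# Made-up sample records, for illustration only.
SAMPLE_SRS = [
    {"sr_number": "1001", "customer_description": "DOSE ERROR",
     "problem_found": "DOSE ERROR ON CHAMBER", "task_description": "REPLACED CHAMBER"},
    {"sr_number": "1002", "customer_description": "SW FAULT",
     "problem_found": "DOSE ERROR", "task_description": None},
]

terms = terms_by_sr(SAMPLE_SRS)
freq = term_frequency(terms)
```

Counting each term once per SR (a set rather than a list) keeps a single verbose SR from inflating a keyword's frequency; whether the original script did the same is not stated on the slide.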
Pre-Processing
Slide 6
Tokenize
• Example: “DOSE ERROR COMMUNICATION …”
• Result: [“DOSE”, “ERROR”, “COMMUNICATION”, …]
Group
• Some words mean more as a group
• [“DOSE_ERROR”, “ERROR_COMMUNICATION”, …]
Stem
• Many words mean roughly the same thing
• Optimizing, optimized, optimal, optimize all become optimiz
© 2016 Cymer, LLC
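The three pre-processing steps can be illustrated with a short sketch. The slide does not say which tokenizer or stemmer was used, so everything here is an assumption: tokens are runs of letters and digits, groups are all adjacent pairs, and the stemmer is a toy suffix stripper (not the Porter algorithm) with a hypothetical suffix list.

```python
import re

# Assumed toy suffix list, checked in order; a real stemmer handles far more cases.
SUFFIXES = ("ING", "ES", "ED", "E", "S")
MIN_STEM_LENGTH = 4  # assumed guard so short words such as DOSE are left alone

def tokenize(text):
    """Split text into upper-case alphanumeric tokens."""
    return re.findall(r"[A-Z0-9]+", text.upper())

def group(tokens):
    """Join every pair of adjacent tokens with an underscore."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def stem(token):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM_LENGTH:
            return token[: -len(suffix)]
    return token

tokens = tokenize("DOSE ERROR COMMUNICATION")
pairs = group(tokens)
stems = {stem(w) for w in ("OPTIMIZING", "OPTIMIZED", "OPTIMIZES", "OPTIMIZE")}
```

With this toy rule set the four OPTIMIZ* forms collapse to one stem, while a word like "OPTIMAL" would not; handling such cases is what the later Replace step is for.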
Replace
Slide 7
Stemming doesn’t handle all derivations of a word
• NEON, NE -> NE
• SOFTWARE, SW, SFOT_WARE -> SW
Hand selection of similar words
Deep learning spell correction
• Not all words in SR have a dictionary spelling
• Find similarly used words according to word2vec (Python API)
• Compare spelling according to Levenshtein Distance
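A rough sketch of the replacement step, under stated assumptions: the hand-selected mapping below only contains the examples from the slide, the distance threshold of 2 is a made-up value, and the word2vec lookup is not reproduced, so the list of candidate words is simply passed in by the caller.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Hand-selected replacements; entries taken from the slide's examples.
REPLACEMENTS = {"NEON": "NE", "SOFTWARE": "SW", "SOFT_WARE": "SW", "SFOT_WARE": "SW"}

def replace_term(term):
    """Map a term to its canonical form, or return it unchanged."""
    return REPLACEMENTS.get(term, term)

def closest_candidate(term, candidates, max_distance=2):
    """Pick the candidate spelled most like `term`, if it is close enough.

    `candidates` stands in for the similarly used words that word2vec
    would supply; `max_distance` is an assumed threshold.
    """
    best = min(candidates, key=lambda word: levenshtein(term, word))
    return best if levenshtein(term, best) <= max_distance else term
```

For example, `closest_candidate("SOFTWRE", ["SOFTWARE", "NEON"])` picks `"SOFTWARE"` (one missing letter), which `replace_term` then maps to `"SW"`.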
Remove
Slide 8
Not all text adds meaning to the analysis
• Dates
• Numbers
• Stopwords
• Regex
Hand selected words that should be removed: GROUP, END
Words only to be used in pairs: INCREASE, MO
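The removal rules might look something like the sketch below. The regular expressions are assumptions about what dates and numbers look like in SR text, and the word sets contain only the examples given on the slides.

```python
import re

# Assumed patterns: dates such as 8/5/16 or 2016-08-05, and plain numbers.
DATE_RE = re.compile(r"^\d{1,4}[/-]\d{1,2}[/-]\d{1,4}$")
NUMBER_RE = re.compile(r"^\d+(\.\d+)?$")

STOPWORDS = {"AND", "IS", "BUT"}
HAND_REMOVED = {"DUE", "END", "GROUP"}
PAIR_ONLY = {"INCREASE", "MO"}  # kept only inside grouped terms such as MO_PRA

def is_weak(term):
    """True if the term is a date, a number, a stopword, or a hand-selected word."""
    if DATE_RE.match(term) or NUMBER_RE.match(term):
        return True
    return term in STOPWORDS or term in HAND_REMOVED or term in PAIR_ONLY

def remove_weak(terms):
    """Drop weak terms and keep the rest in their original order."""
    return [term for term in terms if not is_weak(term)]

kept = remove_weak(["DOSE", "8/5/16", "42", "AND", "GROUP", "MO", "MO_PRA", "INCREASE"])
```

Because grouped terms are separate strings, dropping a standalone `MO` leaves `MO_PRA` untouched, which matches the "only to be used in pairs" rule.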
Methodology
Slide 9
Recurring Keywords:
• Python script embedded in Spotfire
• Each word stored once for overall usage and once for its given month
• Word maps to a unique set of SRs that the word is used in
• The total and monthly numbers of SRs are kept
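The bookkeeping described for recurring keywords can be sketched as two word-to-SR mappings plus SR totals. This is an illustration under assumptions, not the script embedded in Spotfire: the input format (SR number, month label, list of terms) and all sample values are hypothetical.

```python
from collections import defaultdict

def build_index(sr_terms):
    """Index terms overall and per month.

    `sr_terms` is an iterable of (sr_number, month, terms) tuples.
    """
    overall = defaultdict(set)       # word -> SRs that use it
    monthly = defaultdict(set)       # (word, month) -> SRs that use it that month
    srs_total = set()                # every SR seen
    srs_by_month = defaultdict(set)  # month -> SRs seen that month
    for sr_number, month, terms in sr_terms:
        srs_total.add(sr_number)
        srs_by_month[month].add(sr_number)
        for word in set(terms):
            overall[word].add(sr_number)
            monthly[(word, month)].add(sr_number)
    return overall, monthly, srs_total, srs_by_month

def monthly_rate(word, month, monthly, srs_by_month):
    """Share of a month's SRs that use the word (0.0 for an empty month)."""
    n_srs = len(srs_by_month.get(month, ()))
    return len(monthly.get((word, month), ())) / n_srs if n_srs else 0.0

# Made-up sample input.
records = [
    ("1001", "2016-06", ["DOSE_ERROR", "SW"]),
    ("1002", "2016-07", ["DOSE_ERROR"]),
    ("1003", "2016-07", ["DOSE_ERROR", "NE"]),
]
overall, monthly, srs_total, srs_by_month = build_index(records)
```

Storing sets of SR numbers rather than raw counts means a word repeated inside one SR is counted once, and the SRs behind any keyword can be listed directly.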
Emerging Trends:
• R script embedded in Spotfire
• Hypergeometric test compares the most recent two months
• Same statistical test used for EWI
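One way a hypergeometric test could compare two months is sketched below, in Python rather than R. The setup is an assumption, since the slide does not spell it out: the SRs of both months form the population, the SRs containing the word are the "successes", the recent month's SRs are the draws, and the p-value is the chance of seeing at least the observed number of recent SRs with the word.

```python
from math import comb

def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for a hypergeometric distribution, computed exactly."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(comb(successes, i) * comb(population - successes, draws - i)
               for i in range(max(k, 0), upper + 1))
    return tail / total

def emergence_p_value(prev_with, prev_total, curr_with, curr_total):
    """p-value for a word being over-represented in the recent month.

    prev_with / curr_with: SRs containing the word in each month.
    prev_total / curr_total: all SRs in each month.
    """
    return hypergeom_sf(curr_with,
                        prev_total + curr_total,
                        prev_with + curr_with,
                        curr_total)
```

A small p-value suggests the word is used more in the recent month than chance would explain; the cutoff for flagging a keyword is a separate choice not given on the slide.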
Project Outcomes
Slide 10
Created Spotfire Dashboard:
• Pulls data from SQL
• Processes data with R and Python
• Interactive display
SR Script
Text Mining Extension: Background
Slide 11
Reliability manually classifies SRs into ~30 categories
• Each SR takes about 1 min
• Classifying SRs related to XL Immersion
• 13,063 classified SRs to date
Objective: To create and train a model that predicts the category for a given SR.
Text Mining Extension: Methodology
Slide 12
Methodology
• Count term usage
• TF-IDF: Term frequency – inverse document frequency
• Train an SVM classifier against pre-categorized SRs
Achieved 75% accuracy using a training set of 12,000 SRs and a testing set of 1,000 SRs
Example documents:
• This is an example document. This document means something
• This second document represents something else
Term counts:
[1, 2, 0, 1, 1, 1, 0, 0, 1, 2]
[0, 1, 1, 0, 0, 0, 1, 1, 1, 1]
TF-IDF weights:
[0.34, 0.48, 0.  , 0.34, …]
[0.  , 0.33, 0.47, 0.  , …]
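The count and TF-IDF vectors for the two example documents can be reproduced with a small sketch. The weighting variant is an assumption, since the slide does not say which was used: a smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row, with the vocabulary sorted alphabetically. The SVM training step is not covered here.

```python
import math
import re
from collections import Counter

DOCS = [
    "This is an example document. This document means something",
    "This second document represents something else",
]

def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def count_vectors(documents):
    """Return the sorted vocabulary and one term-count row per document."""
    tokenized = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*tokenized))
    return vocabulary, [[counts[term] for term in vocabulary] for counts in tokenized]

def tfidf(counts):
    """Weight counts by smoothed IDF, then L2-normalize each row."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    rows = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        rows.append([x / norm for x in weighted])
    return rows

vocabulary, counts = count_vectors(DOCS)
weights = tfidf(counts)
```

Under these assumptions the count rows equal the ones on the slide, and the leading TF-IDF entries come out close to the slide's values. Terms shared by both documents ("document", "this") get the lowest IDF, so terms unique to one document weigh more per occurrence.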