2016 Cymer Intern

SR Text Mining
Akhilesh Aji
8/5/16

Slide 2

Educational Background

Education:

• Bachelor of Science in Computer Science
• Georgia Institute of Technology
• Expected graduation date May 2019
• Big Data Club: entity tagging news sources


Project Goals
Slide 3

Objective:

To build a text mining model which indicates when the rate of top keywords changes or when a new keyword emerges.

Background:
• A Service Request (SR) is generated whenever an FSE works on a laser
• Some SRs do not replace any part
• The main free-text fields of an SR are: Customer Description, Problem Found, Task Description, and Resolution

Project Goal
Slide 4

[Flowchart; recoverable labels: SR, Part Replacement (Yes / No), EWI, Text Mining, Analysis, Automated Monitoring]

Data Pipeline
Slide 5

User filters which SRs to process

Extract SR’s text
• Customer Description
• Problem Found
• Task Description

Tokenize, group, and stem text
• MO PRA -> MO_PRA
• OPTIMIZING, OPTIMIZED, OPTIMIZES -> OPTIMIZ

Replace similar words
• NEON, NE -> NE
• SOFTWARE, SW, SOFT WARE -> SW

Remove weak words
• Dates
• Numbers
• Stopwords: AND, IS, BUT
• Selected words: DUE, END, GROUP

Save SR number with terms

Calculate frequency

Display in Spotfire

Pre-Processing
Slide 6

Tokenize
• Example: “DOSE ERROR COMMUNICATION …”
• Result: [“DOSE”, “ERROR”, “COMMUNICATION” …]

Group
• Some words mean more as a group
• [“DOSE_ERROR”, “ERROR_COMMUNICATION” …]

Stem
• Many words mean roughly the same thing
• Optimizing, optimized, optimal, optimize all become optimiz
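The three pre-processing steps can be sketched in a few lines of Python. This is only an illustration: the suffix list is a hypothetical stand-in for a real stemmer (it does not handle "optimal", for example), and the function names are not taken from the project's script.

```python
import re

# Hypothetical suffix list for illustration; a real stemmer has many more rules.
SUFFIXES = ("ING", "ED", "ES", "E", "S")

def tokenize(text):
    """Split free text into upper-case word tokens."""
    return re.findall(r"[A-Z0-9]+", text.upper())

def group(tokens):
    """Join each pair of adjacent tokens into one grouped term."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = tokenize("DOSE ERROR COMMUNICATION")
print(tokens)         # ['DOSE', 'ERROR', 'COMMUNICATION']
print(group(tokens))  # ['DOSE_ERROR', 'ERROR_COMMUNICATION']
print([stem(w) for w in ("OPTIMIZING", "OPTIMIZED", "OPTIMIZES", "OPTIMIZE")])
```

All four OPTIMIZ* forms reduce to OPTIMIZ with this toy rule set.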

© 2016 Cymer, LLC



Replace
Slide 7

Stemming doesn’t handle all derivations of a word
• NEON, NE -> NE
• SOFTWARE, SW, SOFT_WARE -> SW

Hand selection of similar words

Deep learning spell correction
• Not all words in SR have a dictionary spelling
• Find similarly used words according to word2vec (Python API)
• Compare spelling according to Levenshtein Distance
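A minimal sketch of the replace step, assuming a hand-built replacement table and a plain dynamic-programming Levenshtein distance. The `candidates` list stands in for the similarly used words that word2vec would return; it and the example words are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Hand-selected replacements, following the NEON and SOFTWARE examples.
REPLACEMENTS = {"NEON": "NE", "SOFTWARE": "SW", "SOFT_WARE": "SW"}

def replace_similar(tokens):
    """Map each token to its canonical form when one is defined."""
    return [REPLACEMENTS.get(t, t) for t in tokens]

def close_spellings(word, candidates, max_distance=1):
    """Keep candidates whose spelling is within max_distance edits of word."""
    return [c for c in candidates
            if c != word and levenshtein(word, c) <= max_distance]

print(replace_similar(["NEON", "SOFTWARE", "DOSE"]))
# Hypothetical neighbours, standing in for word2vec output.
print(close_spellings("PRESSURE", ["PRESURE", "TEMPERATURE", "PRESSSURE"]))
```

Combining the two signals (used in similar contexts, spelled almost the same) is what lets a non-dictionary misspelling be mapped to its intended word.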



Remove
Slide 8

Not all text adds meaning to the analysis
• Dates
• Numbers
• Stopwords
• Regex

Hand selected words that should be removed: GROUP, END

Words only to be used in pairs: INCREASE, MO
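The removal rules could look roughly like the sketch below. The date and number patterns are assumptions about what the text contains, and the stopword set holds only a few example words.

```python
import re

STOPWORDS = {"AND", "IS", "BUT"}   # a few example stopwords
REMOVED = {"GROUP", "END"}         # hand-selected words to drop
PAIR_ONLY = {"INCREASE", "MO"}     # kept only inside grouped terms

# Assumed formats: dates like 8/5/16, plain integers or decimals.
DATE = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")
NUMBER = re.compile(r"\d+(\.\d+)?")

def keep(term):
    """Return True when a term should stay in the analysis."""
    if DATE.fullmatch(term) or NUMBER.fullmatch(term):
        return False
    return term not in STOPWORDS | REMOVED | PAIR_ONLY

terms = ["DOSE", "AND", "8/5/16", "42", "MO", "MO_PRA",
         "GROUP", "INCREASE", "PRESSURE_INCREASE"]
print([t for t in terms if keep(t)])
# ['DOSE', 'MO_PRA', 'PRESSURE_INCREASE']
```

MO and INCREASE are dropped when they appear alone, but the grouped terms that contain them survive.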








Methodology
Slide 9

Recurring Keywords:
• Python script embedded in Spotfire
• Each word stored once for overall usage and once for its given month
• Word maps to a unique set of SRs that the word is used in
• Number of total and monthly SRs are kept
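One way to hold this bookkeeping is a pair of dictionaries that map each word to a set of SR numbers. The records, SR numbers, and the `rate` helper below are hypothetical; the actual Spotfire script may store things differently.

```python
from collections import defaultdict

# Hypothetical records: (SR number, month, terms after pre-processing).
records = [
    ("SR001", "2016-06", ["DOSE_ERROR", "NE"]),
    ("SR002", "2016-06", ["SW"]),
    ("SR003", "2016-07", ["DOSE_ERROR", "SW", "SW"]),
]

overall = defaultdict(set)        # term -> SRs that use it
monthly = defaultdict(set)        # (term, month) -> SRs that use it that month
srs_per_month = defaultdict(set)  # month -> all SRs in that month

for sr, month, terms in records:
    srs_per_month[month].add(sr)
    for term in terms:
        overall[term].add(sr)     # a set counts each SR once per term
        monthly[(term, month)].add(sr)

total_srs = len({sr for sr, _, _ in records})

def rate(term, month=None):
    """Fraction of SRs (overall, or within one month) that mention the term."""
    if month is None:
        return len(overall[term]) / total_srs
    return len(monthly[(term, month)]) / len(srs_per_month[month])

print(rate("SW"))             # share of all SRs mentioning SW
print(rate("SW", "2016-07"))  # share of July SRs mentioning SW
```

Because the values are sets, a word repeated within one SR (SW in SR003) is still counted once.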

Emerging Trends:
• R script embedded in Spotfire
• Hypergeometric test compares the most recent two months
• Same statistical test used for EWI
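The slide does not say how the test is parameterised, so the sketch below is one plausible setup and not the R script itself. It assumes the population is all SRs from the two most recent months, the "successes" are the SRs that mention the keyword, and the draw is the latest month. The 0.05 threshold is also an assumption. It is written in Python (3.8+ for `math.comb`) to match the other examples.

```python
from math import comb

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K are 'successes'."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1))
    return hits / total

def is_emerging(prev_with, prev_total, last_with, last_total, alpha=0.05):
    """Flag a keyword whose latest-month count is unusually high (assumed setup)."""
    N = prev_total + last_total  # SRs across both months
    K = prev_with + last_with    # SRs mentioning the keyword in both months
    p_value = hypergeom_upper_tail(last_with, N, K, last_total)
    return p_value < alpha, p_value

# Hypothetical counts: 5 of 100 SRs last month, 15 of 100 this month.
flag, p = is_emerging(prev_with=5, prev_total=100, last_with=15, last_total=100)
print(flag, p)
```

A small p-value means the latest month holds more of the keyword's SRs than chance alone would explain.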

Project Outcomes
Slide 10

Created Spotfire Dashboard:
• Pulls data from SQL
• Processes data with R and Python
• Interactive display

SR Script

Text Mining Extension: Background
Slide 11

Reliability manually classifies SRs into ~30 categories
• Each SR takes about 1 min
• Classifying SRs related to XL Immersion
• 13,063 classified SRs to date

Objective: To create and train a model that predicts the category for a given SR.

Text Mining Extension: Methodology
Slide 12

Methodology

• Count term usage
• TF-IDF: Term frequency – inverse document frequency
• Train an SVM classifier against pre-categorized SRs

Achieved 75% accuracy using a training set of 12,000 SRs and a testing set of 1,000 SRs

Documents
1. “This is an example document. This document means something”
2. “This second document represents something else”

Term counts
[1, 2, 0, 1, 1, 1, 0, 0, 1, 2]
[0, 1, 1, 0, 0, 0, 1, 1, 1, 1]

TF-IDF
[ 0.34, 0.48, 0. , 0.34 …]
[ 0. , 0.33, 0.47, 0. …]
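The count and TF-IDF vectors above can be reproduced approximately with the standard library alone. The weighting below is an assumption: a smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation, which is the default used by scikit-learn. The slide does not state which variant or library was used.

```python
import math
import re
from collections import Counter

docs = [
    "This is an example document. This document means something",
    "This second document represents something else",
]

def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

counts = [Counter(tokenize(d)) for d in docs]
vocab = sorted(set().union(*counts))  # alphabetical term order

# One count vector per document, one column per vocabulary term.
count_vectors = [[c[t] for t in vocab] for c in counts]

n_docs = len(docs)
df = {t: sum(1 for c in counts if t in c) for t in vocab}
# Assumed weighting: smoothed IDF, then L2 normalisation of each row.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf(counter):
    """Weight raw counts by IDF and scale the vector to unit length."""
    weights = [counter[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

tfidf_vectors = [tfidf(c) for c in counts]
print(vocab)
print(count_vectors)
print([round(w, 2) for w in tfidf_vectors[0][:4]])
```

The count vectors match the ones shown above exactly. The TF-IDF values agree to within about 0.01, which is consistent with rounding or truncation in the original figure. These weighted vectors are what a classifier such as an SVM would be trained on.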
