II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)

Organizing Data: The Step Before Visualization

Nils C. Newman

Director New Business Development at Search Technology

& UNU-MERIT

Dr. Alan L. Porter

Director R&D at Search Technology

& Emeritus Professor, Georgia Tech

The way it was…

• You would read information and filter the data through your mental framework, enabling discovery and synthesis

The way it is now…

• Too much information for you to process readily by reading…

Enter Text Analysis…

• If a computer can organize and present the data to you, then you can absorb more information faster than traditional reading

The challenge…

• How can a computer look at a collection of information and turn those data into something organized - into a framework that you understand?

Two main issues to consider…

• Do you want to impose order on the data?

• Do you want to let the data self-organize?

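The choice is important because it drives the math

[Diagram: a spectrum running from "Impose Order" to "Self Organize", with the methods LSA, PCA, TM, SVM, NLP, and AS/PI placed along it and two annotations, "Roots in Statistics" and "Roots in Machine Learning"]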

Within Machine Learning – resources impact the decision…

Supervised training

• Requires time and effort by subject matter expert(s)

Unsupervised training

• Requires suitable quantities of training material

• Computationally expensive

Within Statistics – data drive the decision…

Data Signal

• Requires data with sufficiently strong signal and relatively low noise

Data Homogeneity

• Requires that the records be sufficiently consistent (record to record)

Data Quality can help you make the decision…

[Diagram: a "Data Quality" axis running from "High Noise" to "High Signal", with Supervised Machine Learning, Unsupervised Machine Learning, and Statistics placed along it]

But as with most things, it is never that easy…

Reality is usually an engineered hybrid approach

[Diagram: the spectrum from "Impose Order" to "Self Organize"]

But the hybrid approach adds complexity

• The hybrid elements make things somewhat confusing but provide capabilities to address issues:

  ◦ Known noise can be removed

  ◦ Signal can be amplified

  ◦ Steps can be hard-coded to reduce computational variability

• As tool developers, we often hide these tweaks to make tools look simpler than they actually are

A hybrid example…

• A core analytical approach in VantagePoint is a modified version of Principal Components Analysis (PCA)

• We feed phrases created by a Natural Language Processing (NLP) algorithm into the PCA algorithm to self-organize data

• So we are already using a hybrid system

• However, a recently developed Topic Modeling (TM) algorithm looked like it would out-perform our PCA/NLP system

• So we devised a series of tests pitting our PCA/NLP against TM
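
The idea of feeding NLP-derived phrases into PCA can be sketched roughly as follows. This is a minimal illustration of plain PCA on a document-by-phrase count matrix, not VantagePoint's modified algorithm; the phrases, the counts, and the helper names `pca_components` and `top_phrases` are all hypothetical.

```python
import numpy as np

# Hypothetical toy data: rows are documents, columns are phrases that an
# NLP step is assumed to have extracted already.
phrases = ["solar cell", "dye sensitizer", "electrolyte",
           "gene expression", "protein folding"]
doc_phrase = np.array([
    [3, 2, 1, 0, 0],
    [2, 3, 0, 0, 0],
    [1, 1, 2, 0, 0],
    [0, 0, 0, 3, 2],
    [0, 0, 0, 2, 3],
], dtype=float)


def pca_components(matrix, n_components):
    """Plain PCA via SVD of the column-centered matrix.

    Returns the leading components (one row per component) and the share
    of total variance each one explains.
    """
    centered = matrix - matrix.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2
    return vt[:n_components], variance[:n_components] / variance.sum()


def top_phrases(component, k=3):
    """Phrases with the largest absolute loading on one component."""
    order = np.argsort(-np.abs(component))[:k]
    return [phrases[i] for i in order]


components, ratio = pca_components(doc_phrase, n_components=2)
for i, comp in enumerate(components):
    print(f"component {i}: {top_phrases(comp)} ({ratio[i]:.0%} of variance)")
```

Because each component loads on whole phrases rather than single words, the highest-loading phrases can double as candidate cluster labels.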

Round 1

• In round one, we compared our PCA/NLP approach to TM (Latent Dirichlet Allocation, LDA) by analyzing a set of ~4,000 Dye-Sensitized Solar Cell (DSSC) abstracts

• The LDA approach ran much faster, required less expertise to run, and gave reasonable results

• However, this “bag of words” approach means that labeling the resulting clusters requires significant topical expertise

• The PCA/NLP approach required more expertise to run but the results gave clearer answers (and reasonable cluster labels)

• Judges’ Decision - Tie
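
The labeling difference between a "bag of words" and phrase-based input can be illustrated with a small sketch. The two snippets below are made-up text, and adjacent word pairs (bigrams) are used only as a crude stand-in for real NLP phrase extraction.

```python
from collections import Counter

# Hypothetical abstract snippets, for illustration only.
docs = [
    "dye sensitized solar cell with titanium dioxide electrode",
    "titanium dioxide electrode improves dye sensitized solar cell efficiency",
]


def bag_of_words(text):
    """Count individual words; word order and adjacency are discarded."""
    return Counter(text.split())


def bigram_phrases(text):
    """Count adjacent word pairs as a rough proxy for extracted phrases."""
    words = text.split()
    return Counter(" ".join(pair) for pair in zip(words, words[1:]))


word_counts = sum((bag_of_words(d) for d in docs), Counter())
phrase_counts = sum((bigram_phrases(d) for d in docs), Counter())

print(word_counts.most_common(5))
print(phrase_counts.most_common(5))
```

A cluster described by isolated words such as "solar", "cell", and "dioxide" leaves the reader to reassemble the concept, whereas "solar cell" or "titanium dioxide" already reads like a label.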

Round 2

• In round two, we compared our PCA/NLP approach to several different TM approaches by analyzing a mixed set containing searches on 7 different topics

• The results were judged on precision and recall

• One particular TM approach worked really well

• It out-performed our PCA approach and all other TM approaches

• Judges’ Decision – TM variant a winner!
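
For a mixed set built from several searches, per-topic precision and recall might be computed along these lines. The sketch assumes each record's true topic is the search it came from and that each cluster has already been mapped to a topic name; the slides do not say how the judging was actually done, and the labels below are invented.

```python
def precision_recall(true_labels, predicted_labels, topic):
    """Precision and recall for one topic.

    precision = correct assignments to the topic / all assignments to it
    recall    = correct assignments to the topic / all records truly in it
    """
    pairs = list(zip(true_labels, predicted_labels))
    true_positive = sum(1 for t, p in pairs if t == topic and p == topic)
    predicted = sum(1 for _, p in pairs if p == topic)
    actual = sum(1 for t, _ in pairs if t == topic)
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    return precision, recall


# Hypothetical labels: the search each record came from vs. the topic of
# the cluster the method placed it in.
true_labels = ["solar", "solar", "solar", "graphene", "graphene", "biofuel"]
predicted_labels = ["solar", "solar", "graphene", "graphene", "graphene", "biofuel"]

for topic in sorted(set(true_labels)):
    p, r = precision_recall(true_labels, predicted_labels, topic)
    print(f"{topic}: precision={p:.2f} recall={r:.2f}")
```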

Round 3

• In round three, we tested the round two winner by analyzing a set of search results on similar topics

• The results were encouraging but not as clear-cut as round two

• Judges’ Decision – TM variant still a winner!

Round 4

• Not to be outdone by the TM team, our PCA team looked at the problem and decided that adding more tuning would be better than changing to TM

• They layered multiple “simple” techniques together to create a new, more powerful PCA hybrid

• The super hybrid system includes up to 10 different steps embodied in a single process:

  ◦ Stopword removal

  ◦ Acronym identification

  ◦ Common word removal

  ◦ Term pruning

  ◦ Association rule based removal

  ◦ Term consolidation

  ◦ etc.
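
A few of these cleanup steps can be chained into a single process, roughly as sketched below. This covers only stopword removal, term consolidation, common word removal, and term pruning; acronym identification and association-rule-based removal are left out. The word lists, thresholds, and function names are assumptions for illustration, not the actual VantagePoint steps.

```python
from collections import Counter

# Assumed, deliberately tiny lists; a real system would use far larger ones.
STOPWORDS = {"the", "of", "a", "and", "in", "for", "with"}
VARIANTS = {"cells": "cell", "dyes": "dye"}  # hypothetical consolidation map


def remove_stopwords(tokens):
    """Stopword removal: drop function words that carry no topical signal."""
    return [t for t in tokens if t not in STOPWORDS]


def consolidate(tokens):
    """Term consolidation: map known variants onto one canonical term."""
    return [VARIANTS.get(t, t) for t in tokens]


def prune(docs, min_docs, max_fraction):
    """Common word removal and term pruning by document frequency.

    Terms found in more than max_fraction of the documents are treated as
    too common; terms found in fewer than min_docs documents as too rare.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    limit = max_fraction * len(docs)
    keep = {t for t, n in doc_freq.items() if min_docs <= n <= limit}
    return [[t for t in tokens if t in keep] for tokens in docs]


def preprocess(raw_docs, min_docs=2, max_fraction=0.8):
    """Run the steps in a fixed order so results are repeatable."""
    docs = [consolidate(remove_stopwords(d.lower().split())) for d in raw_docs]
    return prune(docs, min_docs, max_fraction)


raw_docs = [
    "The study of dye solar cells",
    "A study of solar cell efficiency",
    "Study of dyes for the solar cell",
    "Study of graphene electrodes",
    "The study of graphene and dye",
]
cleaned = preprocess(raw_docs)
for tokens in cleaned:
    print(tokens)
```

Fixing the order of the steps inside one function is one way to hard-code the process so that repeated runs behave the same.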

The result?

• The fight is still ongoing but the improved PCA is looking to keep pace with TM while maintaining its dominance in Cluster naming

• The VantagePoint “Cluster Suite + PCA” approach is certainly ahead in usability

• We have the next bout scheduled for later this year

Who is ready to Byte?

Why tell you all this?

• I wanted to give you a little insight into how tool developers think

• The recent explosive growth in algorithms means that we have a lot of different approaches from which to choose

• The growth in computing power means we can operate at a scale unheard of a decade ago

• We are driven to make the tools more effective and easier to use

• However, doing so often makes tools more opaque to the user

What does all this mean to you?

• There is no “one size fits all” when it comes to text analytics

• Analytical techniques still need to be matched to your data and your problems

• The state of the art is rapidly evolving

• You need to have a good sense of what is going on “under the hood” of the tools that you use

Why bother?

• Understanding a little about how your tools work is critical BEFORE you confound the situation by adding visualization on top of the analysis

• Otherwise, you have to take it on faith that what we are doing suits your analytical situation

Questions?

Thank you!