II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA...


Organizing Data: The Step Before Visualization

Nils C. Newman

Director, New Business Development at Search Technology

& UNU-MERIT

Dr. Alan L. Porter

Director, R&D at Search Technology

& Emeritus Professor, Georgia Tech

The way it was…..

• You would read information and filter the data through your mental framework, enabling discovery and synthesis

The way it is now….

• Too much information for you to process readily by reading…

Enter Text Analysis…

• If a computer can organize and present the data to you, then you can absorb more information faster than traditional reading

The challenge…

• How can a computer look at a collection of information and turn those data into something organized - into a framework that you understand?

Two main issues to consider….

• Do you want to impose order on the data?

• Do you want to let the data self-organize?

The choice is important because it drives the math

[Slide diagram: a spectrum running from "Impose Order" to "Self Organize", placing techniques such as LSA, PCA, TM, SVM, NLP, and AS/PI along it and contrasting approaches with roots in statistics against approaches with roots in machine learning]
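To make the impose-order versus self-organize distinction concrete, here is a minimal sketch (not from the talk) that runs both choices on a few invented documents using scikit-learn: a supervised classifier stands in for imposing order, and k-means clustering stands in for letting the data self-organize.

```python
# Minimal sketch (not from the talk): the same toy documents handled both ways.
# Assumes scikit-learn; the texts and labels below are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cluster import KMeans

docs = [
    "dye sensitized solar cell efficiency improvements",
    "electrolyte stability in dye sensitized solar cells",
    "community detection in large social networks",
    "graph algorithms for social network analysis",
]
X = TfidfVectorizer().fit_transform(docs)

# Impose order: an analyst supplies the categories up front (supervised).
labels = ["energy", "energy", "networks", "networks"]
clf = MultinomialNB().fit(X, labels)
print(clf.predict(X))

# Self-organize: the algorithm groups records without predefined categories.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```

The first branch needs labels from someone who already knows the framework; the second needs enough consistent data to find structure on its own, which is exactly the trade-off the next slides walk through.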

Within Machine Learning – resources impact the decision…

Supervised training

• Requires time and effort by subject matter expert(s)

Unsupervised training

• Requires suitable quantities of training material

• Computationally expensive

Within Statistics – data drive the decision…

Data Signal

• Requires data with sufficiently strong signal and relatively low noise

Data Homogeneity

• Requires that the records be sufficiently consistent (record to record)

Data Quality can help you make the decision…

[Slide diagram: a data-quality spectrum from high noise to high signal, with supervised machine learning toward the high-noise end, unsupervised machine learning in the middle, and statistics toward the high-signal end]

But as with most things, it is never that easy…

Reality is usually an engineered hybrid approach

[Slide diagram: the hybrid approach blends "Impose Order" with "Self Organize"]

But the hybrid approach adds complexity

• The hybrid elements make things somewhat confusing but provide capabilities to address issues:

– Known noise can be removed

– Signal can be amplified

– Steps can be hard-coded to reduce computational variability

• As tool developers, we often hide these tweaks to make tools look simpler than they actually are

A hybrid example…

• A core analytical approach in VantagePoint is a modified version of Principal Components Analysis (PCA)

• We feed phrases created by a Natural Language Processing (NLP) algorithm into the PCA algorithm to self-organize data (a rough sketch of the idea follows below)

• So we are already using a hybrid system

• However, a recently developed Topic Modeling (TM) algorithm looked like it would out-perform our PCA/NLP system

• So we devised a series of tests pitting our PCA/NLP against TM
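As a rough illustration of the phrases-into-PCA idea (this is not VantagePoint's actual implementation), the sketch below uses scikit-learn bigrams as a stand-in for NLP-extracted phrases and truncated SVD as a stand-in for the modified PCA; each component groups phrases that tend to co-occur.

```python
# Rough sketch of the general phrases-into-PCA idea, not VantagePoint's code.
# Bigrams stand in for NLP-extracted phrases; TruncatedSVD stands in for the
# modified PCA. The toy abstracts below are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

abstracts = [
    "dye sensitized solar cell with titanium dioxide photoanode",
    "improved electrolyte for dye sensitized solar cells",
    "perovskite absorber layers for thin film photovoltaics",
    "thin film deposition of perovskite absorber materials",
]

# Phrase (bigram) counts per abstract
vec = CountVectorizer(ngram_range=(2, 2), stop_words="english")
X = vec.fit_transform(abstracts)
phrases = vec.get_feature_names_out()

# Factor the phrase-document matrix; each component groups co-occurring phrases
svd = TruncatedSVD(n_components=2, random_state=0).fit(X)
for i, comp in enumerate(svd.components_):
    top = comp.argsort()[::-1][:3]
    print(f"component {i}:", [phrases[j] for j in top])
```

Because the components load on multi-word phrases rather than isolated words, the top-loading phrases double as candidate cluster labels, which is the advantage claimed for the PCA/NLP route in the rounds below.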

Round 1

• In round one, we compared our PCA/NLP approach to TM (Latent Dirichlet Allocation – LDA) by analyzing a set of ~4,000 Dye-Sensitized Solar Cell (DSSC) abstracts (an illustrative LDA run is sketched after this slide)

• The LDA approach ran much faster, required less expertise to run, and gave reasonable results

• However, this “bag of words” approach means that labeling the resulting clusters requires significant topical expertise

• The PCA/NLP approach required more expertise to run but the results gave clearer answers (and reasonable cluster labels)

• Judges’ Decision - Tie
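For comparison, an LDA topic model can be run in a few lines. This is only an illustrative scikit-learn configuration on invented texts, not the actual round-one setup or the DSSC abstracts.

```python
# Illustrative LDA run (scikit-learn), not the configuration used in the comparison.
# The DSSC abstracts are not reproduced here; toy texts stand in.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "dye sensitized solar cell efficiency and electrolyte stability",
    "ruthenium dye adsorption on titanium dioxide photoanodes",
    "lithium ion battery electrode degradation mechanisms",
    "solid electrolyte interphase growth in lithium batteries",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(abstracts)
words = vec.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# LDA returns word distributions per topic ("bag of words"), so a person still
# has to read the top words and invent a sensible label for each cluster.
for i, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:5]
    print(f"topic {i}:", [words[j] for j in top])
```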

Round 2

• In round two, we compared our PCA/NLP approach to several different TM approaches by analyzing a mixed set containing searches on 7 different topics

• The results were judged on precision and recall (a scoring sketch follows below)

• One particular TM approach worked really well

• It out-performed our PCA approach and all other TM approaches

• Judges’ Decision – TM variant a winner!
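A rough sketch of the precision/recall scoring used to judge round two, with invented labels: each record's true topic is the search that retrieved it, and the predicted topic is whatever the algorithm assigned.

```python
# Rough scoring sketch; the topic names and assignments below are invented.
from sklearn.metrics import precision_score, recall_score

true_topic = ["solar", "solar", "battery", "battery", "fuel_cell", "fuel_cell"]
pred_topic = ["solar", "solar", "battery", "fuel_cell", "fuel_cell", "fuel_cell"]

# Macro-averaging treats every topic equally regardless of how many records it has.
print("precision:", precision_score(true_topic, pred_topic, average="macro"))
print("recall:   ", recall_score(true_topic, pred_topic, average="macro"))
```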

Round 3

• In round three, we tested the round two winner by analyzing a set of search results on similar topics

• The results were encouraging but not as clear-cut as round two

• Judges’ Decision – TM variant still a winner!

Round 4

• Not to be outdone by the TM team, our PCA team looked at the problem and decided that adding more tuning would be better than changing to TM

• They layered multiple “simple” techniques together to create a new more powerful PCA hybrid

• The super hybrid system includes up to 10 different steps embodied in a single process:

• Stopword removal

• Acronym identification

• Common word removal

• Term pruning

• Association rule based removal

• Term consolidation

• etc…
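A toy sketch of how a few of those "simple" steps can be layered into one pass. This is not the VantagePoint Cluster Suite; the stopword list, thresholds, and the naive plural merge are placeholders.

```python
# Layering simple cleanup steps into one pipeline (placeholder thresholds).
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "for", "in", "on", "with"}

def consolidate(term):
    # Term consolidation: a naive singular/plural merge stands in for the real step.
    return term[:-1] if term.endswith("s") and len(term) > 3 else term

def clean_terms(docs, min_count=2, max_doc_fraction=0.8):
    doc_terms = [[consolidate(w.lower()) for w in d.split()] for d in docs]
    # Stopword removal
    doc_terms = [[t for t in ts if t not in STOPWORDS] for ts in doc_terms]
    # Common-word removal: drop terms that appear in most of the documents
    doc_freq = Counter(t for ts in doc_terms for t in set(ts))
    too_common = {t for t, c in doc_freq.items() if c / len(docs) > max_doc_fraction}
    # Term pruning: drop terms that occur too rarely to matter
    totals = Counter(t for ts in doc_terms for t in ts)
    too_rare = {t for t, c in totals.items() if c < min_count}
    return [[t for t in ts if t not in too_common and t not in too_rare]
            for ts in doc_terms]

docs = ["Dye sensitized solar cells",
        "Solar cell electrolytes",
        "Solar cell dyes and stability"]
# "solar"/"cell" drop out as too common, one-off terms as too rare, leaving "dye".
print(clean_terms(docs))
```

Acronym identification and association-rule-based removal are not shown; the point is only that each step is simple on its own and the complexity comes from chaining them.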

The result?

• The fight is still ongoing, but the improved PCA is looking to keep pace with TM while maintaining its dominance in cluster naming

• The VantagePoint “Cluster Suite + PCA” approach is certainly ahead in usability

• We have the next bout scheduled for later this year. Who is ready to Byte?

Why tell you all this?

• I wanted to give you a little insight into how tool developers think

• The recent explosive growth in algorithms means that we have a lot of different approaches from which to choose

• The growth in computing power means we can operate at a scale unheard of a decade ago

• We are driven to make the tools more effective and easier to use

• However, doing so often makes tools more opaque to the user

What does all this mean to you?

• There is no “one size fits all” when it comes to text analytics

• Analytical techniques still need to be matched to your data and your problems

• The state of the art is rapidly evolving

• You need to have a good sense of what is going on “under the hood” of the tools that you use

Why bother?

• Understanding a little about how your tools work is critical BEFORE you confound the situation by adding visualization on top of the analysis

• Otherwise, you have to take it on faith that what we are doing suits your analytical situation

Questions?

Thank you!