Text Mining Wksp Auvil

Engineering Knowledge for the Humanities

Text Mining Workshop, April 26, 2008

Loretta Auvil, National Center for Supercomputing Applications (NCSA)

University of Illinois

No Formulas

More Visualizations

www.visualcomplexity.com

NoraVis OpenLaszlo

www.noraproject.org

NoraVis Backend

• Leverages D2K as a web service call for predictive modeling
  • Passing parameters for some options

• Known modeling problems:
  • Training on a very sparse set of words, so modeling can be improved by adding semantic information

MONK

www.monkproject.org

Challenges in Humanities Collaboration

• Understanding terminology and text mining capabilities
• Learning their needs
• Creating meaningful ways to display and present results
• Technology innocence
• Bridging different software tools
• Appreciating how long things take to develop
• Working collaboratively as a team

How To Address these Challenges

• Educate team on data and text mining approaches
• Demonstrate approaches with working examples
• Develop use cases that drive software development
• Create an environment/infrastructure that lets us create data flows that are component based
• Deploy web services for computations
• Develop web application for setting up problem and delivery of results

SEASR Project Highlights

• SEASR will employ a comprehensive environment that integrates two complementary and revolutionary technical advances, Service Oriented Architecture and Semantic Web, into a single computing architecture: Semantic Enabled Service Oriented Architecture

• SEASR will be enriched with a broad range of knowledge representation and reasoning capabilities

• SEASR addresses the challenges of transforming information into knowledge by constructing the software bridges that are required to move from the unstructured and semi-structured data world to the structured data world

What does this mean for the Humanities Community?

SEASR will:
• help scholars locate and access documents of interest in the sea of large data stores
• provide scholars with enhanced data synthesis and query analysis
  • from focused data retrieval and data integration
  • to intelligent human-computer interactions for knowledge access
  • to semantic data enrichment
  • to entity and relationship discovery
  • to knowledge discovery and hypothesis generation
• empower collaboration among scholars by enhancing and innovating virtual research environments

Specific Project Highlights

• Common Services Layer: Provide execution environment and supporting infrastructure that map from the problem solving layer to the resource layer
  • Designed and developed Meandre (semantic, web-driven data flow execution environment)
  • Developed the ability to define extensions for executing components in languages other than Java; extensions have already been created for Python and Common Lisp

• Problem Solving Layer: Visual environments that turn components and web services into a domain-specific problem solving environment

Workbench

Community Hub

Semantically Enabled SOA

Semantically Enabled SOA 2

A Problem from the MONK Project

• Analyze the repetition that occurs in “The Making of Americans” by Gertrude Stein

Repetition in The Making of Americans

Text Source               Making of Americans   Moby Dick   Uncle Tom’s Cabin
Total words (tokens)      517,207               220,254     190,906
Unique words (types)      5,329                 17,190      11,730
Average word frequency    97.06                 12.81       16.28
Total pages               ~900                  ~623        ~530
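The "average word frequency" row is consistent with total words (tokens) divided by unique words (types). The following is a minimal Python sketch of how such counts could be computed. The tokenization rule (lowercased runs of letters and apostrophes) and the names `corpus_stats` and `table` are assumptions for illustration, not the method used to build the slide.

```python
import re
from collections import Counter

def corpus_stats(text):
    """Return (token count, type count, average word frequency)."""
    # Assumption: a token is a lowercased run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    avg_frequency = n_tokens / n_types if n_types else 0.0
    return n_tokens, n_types, avg_frequency

# Token and type counts as reported on the slide.
table = {
    "Making of Americans": (517207, 5329),
    "Moby Dick": (220254, 17190),
    "Uncle Tom's Cabin": (190906, 11730),
}

for title, (n_tokens, n_types) in table.items():
    # Tokens per type reproduces the slide's average word frequency.
    print(title, round(n_tokens / n_types, 2))
```

A high average (about 97 for The Making of Americans against about 13 for Moby Dick) means a small vocabulary reused very often, which is the repetition the MONK problem sets out to analyze.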

Visualization Approach from ManyEyes

Many Eyes Website: http://services.alphaworks.ibm.com/manyeyes/view/S4ZIjIsOtha6H~kYwoKjI2~

Solution… came gradually

• Examine book by comparing each paragraph
• Create feature set based on moving window of n-grams (3-grams) across each paragraph
• Preprocess text
  • To Stem or Not
    – “I will throw the umbrella in the mud”
    – “Martha was throwing the umbrella in the mud”
  • To Keep Punctuation or Not
• Execute the CLOSET algorithm (from Jiawei Han et al.)

• Providing the following early results:

5:[a description of]:[1085|1087|1084|1082|1086]
4:[men and women]:[1085|1083|1084|1088]
4:[this is now|is now a]:[1087|1082|1086|1088]
3:[a description of|now a description|this is now|is now a]:[1087|1082|1086]
3:[kinds of men|of men and|men and women]:[1085|1083|1088]
3:[this is now|is now a|now a description|a description of]:[1087|1082|1086]

How do we make this meaningful to humanists?
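The steps above (a moving window of 3-grams per paragraph, then frequent closed pattern mining) can be sketched in Python as follows. This is a brute-force illustration, not the CLOSET algorithm itself: it intersects the 3-gram sets of every combination of paragraphs, which is only feasible for a handful of paragraphs. The sample paragraphs, their ids, and the names `trigrams`, `closed_patterns`, and `format_pattern` are hypothetical; stemming is omitted and punctuation is dropped.

```python
import re
from itertools import combinations

def trigrams(text, n=3):
    """Set of word n-grams from a moving window over one paragraph."""
    # Assumption: lowercase the text and drop punctuation; no stemming.
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def closed_patterns(paragraphs, min_support=2):
    """Map each frequent closed set of n-grams to the paragraph ids containing it.

    Brute force: the intersection of any group of paragraphs' n-gram sets
    is a closed itemset, so enumerate groups of at least min_support.
    """
    grams = {pid: trigrams(text) for pid, text in paragraphs.items()}
    ids = sorted(grams)
    closed = {}
    for k in range(min_support, len(ids) + 1):
        for combo in combinations(ids, k):
            shared = frozenset(set.intersection(*(grams[p] for p in combo)))
            if shared:
                # Support is every paragraph that contains the whole pattern.
                closed[shared] = tuple(p for p in ids if shared <= grams[p])
    return closed

def format_pattern(grams, support):
    """Render as support:[n-grams]:[paragraph ids], like the results above."""
    return f"{len(support)}:[{'|'.join(sorted(grams))}]:[{'|'.join(map(str, support))}]"

# Hypothetical paragraphs, not quoted from the novel.
paragraphs = {
    1: "This is now a description of men and women.",
    2: "This is now a description of them.",
    3: "Kinds of men and women.",
}

for grams, support in closed_patterns(paragraphs).items():
    print(format_pattern(grams, support))
```

Each output line groups the 3-grams that always occur together with the paragraphs that share them, which is the shape of the early results listed above.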

How to visualize… Trying existing tools

Brad Paley, TextArc

M. Wattenberg, Arc Diagrams

TimeSearcher

SpotFire

No context, No reading original text, No scale, No trends…

Custom Solution - FeatureLens

FeatureLens, an early MONK (Metadata Offer New Knowledge) application, uses the machine learning approach of frequent pattern mining to identify fuzzy repetition patterns in a data collection, with no initial human input.

• Organized into sections (in this case chapters)
• Rank frequent patterns by frequency and length
• Show frequent patterns of n-grams in context
• Rank frequent patterns by distribution trends, per collection and per section
• Compare multiple patterns on the same views: distributions, sections, paragraphs
• Read the text (with highlighting of patterns)
• Some options for handling scale for large data sets (e.g. each line is five paragraphs)
• Search for particular word
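Two of the features listed above, ranking patterns by frequency and length and showing their distribution per section, can be illustrated with a short Python sketch. The pattern data is copied from the early results shown earlier; the functions `rank`, `section_of`, and `distribution`, and the rule that a section is a block of five consecutive paragraph ids, are assumptions for illustration and not FeatureLens code.

```python
from collections import Counter

# (n-grams in the pattern, ids of the paragraphs where it occurs),
# taken from the early results shown earlier in this deck.
patterns = [
    (("a description of",), (1085, 1087, 1084, 1082, 1086)),
    (("men and women",), (1085, 1083, 1084, 1088)),
    (("this is now", "is now a"), (1087, 1082, 1086, 1088)),
    (("kinds of men", "of men and", "men and women"), (1085, 1083, 1088)),
]

def rank(patterns):
    """Sort by frequency (paragraph count), then by length (n-gram count)."""
    return sorted(patterns, key=lambda p: (len(p[1]), len(p[0])), reverse=True)

def section_of(paragraph_id, size=5):
    """Hypothetical sectioning: every `size` paragraph ids form one section."""
    return paragraph_id // size

def distribution(paragraph_ids):
    """Count how many occurrences of a pattern fall in each section."""
    return Counter(section_of(pid) for pid in paragraph_ids)

for grams, paragraph_ids in rank(patterns):
    print("|".join(grams), dict(distribution(paragraph_ids)))
```

In a real collection the section mapping would come from the chapter structure of the text rather than from arithmetic on paragraph ids.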

FeatureLens: Organized into sections (chapters)

Created by Anthony Don and team at http://www.cs.umd.edu/hcil/textvis/featurelens/.

FeatureLens: patterns sorted by frequency and length


FeatureLens: n-gram patterns in context


FeatureLens: distribution trends


FeatureLens: multiple patterns


The New Way to Read

• By visualizing certain patterns in this text (and, it follows, in larger collections in general), and by looking at the text “from a distance” through textual analytics and visualizations, one can “read” the novel in ways formerly impossible.

• Franco Moretti has argued that the solution to truly incorporating a more global perspective in our critical literary practices is not to read more of the vast amounts of literature available to us, but to read it differently by employing “distant reading.” “We know how to read texts,” he writes, “now let’s learn how not to read them.”
Franco Moretti, Conjectures on World Literature. New Left Review, 1 (Jan.-Feb. 2000): 68.

Stories buried in the repetition…

Massive Digitization Projects

• What can be done with these large digital text collections?
• How can we use these large digital text collections?

• Justify the use of computers and advanced techniques to process these collections, because we (humans) can’t read this much

• The point is not to save the reader from reading the individual texts or from making an independent judgment of each document's characteristics; rather, the point is to learn from the reader's holistic impression of the text and then, having done so, to show the reader what evidence correlates with these impressions

Transformational New Research Topics for Humanities

• Track patterns in morphology, syntax, and semantics across large stretches of time, space and culture
• Track topics or terminology across thousands of texts
• Track the social and economic influence of topics
• Study multi-lingual and cultural impacts
• Study literary inheritance
• Study the evolution of ideas
• and a lot more

Exploratory Analysis Environments

• Provide access to text
• Focus on specific passages
• Allow for comparative reading
• Provide enriched context for text and data analysis

References

• J. Pei, J. Han, and R. Mao, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets”, Proc. 2000 ACM-SIGMOD Int. Workshop on Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, May 2000.

• Tanya Clement, Anthony Don, Loretta Auvil, Catherine Plaisant, Greg Pape and Vered Goren, “‘Something that is interesting is interesting them’: Using text mining and visualizations to aid interpreting repetition in Gertrude Stein’s The Making of Americans”, Digital Humanities 2007.

Automated Learning Group / SEASR Team

Michael Welge
Bernie Ács
Boris Capitanu
Lily Dong
Peter Groves
Amit Kumar
Xavier Llorà
Chad Olson
Mary Pietrowicz
Duane Searsmith
Kelly Searsmith
David Tcheng