My presentation of our work at the Text Mining Workshop 2008, held in conjunction with the Eighth SIAM International Conference on Data Mining (SDM 2008) in Atlanta, GA, on April 26, 2008.
Engineering Knowledge for the Humanities
Text Mining Workshop
April 26, 2008
Loretta Auvil
National Center for Supercomputing Applications (NCSA)
University of Illinois
No Formulas
More Visualizations
www.visualcomplexity.com
NoraVis OpenLaszlo
www.noraproject.org
NoraVis Backend
• Leverages D2K through a web service call for predictive modeling
• Passes parameters for some options
• Known modeling problems:
  • Training on a very sparse set of words, so improvements in modeling can be achieved through semantic additions
MONK
www.monkproject.org
Challenges in Humanities Collaboration
• Understanding terminology and text mining capabilities
• Learning their needs
• Creating meaningful ways to display and present results
• Technology innocence
• Bridging different software tools
• Appreciating how long things take to develop
• Working collaboratively as a team
How To Address these Challenges
• Educate team on data and text mining approaches
• Demonstrate approaches with working examples
• Develop use cases that drive software development
• Create an environment/infrastructure that lets us create data flows that are component based
• Deploy web services for computations
• Develop web application for setting up problem and delivery of results
SEASR Project Highlights
• SEASR will employ a comprehensive environment that integrates two complementary and revolutionary technical advances, Service Oriented Architecture and Semantic Web, into a single computing architecture: Semantic Enabled Service Oriented Architecture
• SEASR will be enriched with a broad range of knowledge representation and reasoning capabilities
• SEASR addresses the challenges of transforming information into knowledge by constructing the software bridges that are required to move from the unstructured and semi-structured data world to the structured data world
What does this mean for the Humanities Community?
SEASR will:
• help scholars locate and access documents of interest in the sea of large data stores
• provide scholars with enhanced data synthesis and query analysis
  • from focused data retrieval and data integration
  • to intelligent human-computer interactions for knowledge access
  • to semantic data enrichment
  • to entity and relationship discovery
  • to knowledge discovery and hypothesis generation
• empower collaboration among scholars by enhancing and innovating virtual research environments
Specific Project Highlights
• Common Services Layer: Provide execution environment and supporting infrastructure that map from the problem solving layer to the resource layer
  • Designed and developed Meandre (semantic, web-driven data flow execution environment)
  • Developed the ability to define extensions for executing components in languages other than Java; extensions have already been created for Python and Common Lisp
• Problem Solving Layer: Visual environments that turn components and web services into a domain-specific problem solving environment
Workbench
Community Hub
Semantically Enabled SOA
Semantically Enabled SOA 2
A Problem from the MONK Project
• Analyze the repetition that occurs in “The Making of Americans” by Gertrude Stein
Repetition in The Making of Americans
Text Source              Making of Americans   Moby Dick   Uncle Tom’s Cabin
Total words (tokens)     517,207               220,254     190,906
Unique words (types)     5,329                 17,190      11,730
Average word frequency   97.06                 12.81       16.28
Total pages              ~900                  ~623        ~530
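The average word frequency in the table above is consistent with total words (tokens) divided by unique words (types). A minimal sketch of that calculation follows; the tokenizer (lowercased runs of letters and apostrophes) and the function names are assumptions for illustration, not the counting rules used for the slide.

```python
import re
from collections import Counter


def average_word_frequency(tokens, types):
    """Average number of occurrences per distinct word."""
    return tokens / types


def type_token_stats(text):
    """Return (tokens, types, average word frequency) for a text.

    Assumption: a word is a lowercased run of letters or apostrophes.
    """
    words = re.findall(r"[a-z']+", text.lower())
    tokens = len(words)
    types = len(Counter(words))
    avg = average_word_frequency(tokens, types) if types else 0.0
    return tokens, types, avg


# Toy sentence (made up), then the token and type counts from the table.
print(type_token_stats("one and one and one is one"))
for title, tokens, types in [
    ("Making of Americans", 517207, 5329),
    ("Moby Dick", 220254, 17190),
    ("Uncle Tom's Cabin", 190906, 11730),
]:
    print(title, round(average_word_frequency(tokens, types), 2))
```

A high average means a small vocabulary reused many times, which is what makes The Making of Americans stand out next to the other two novels.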
Visualization Approach from ManyEyes
Many Eyes Website: http://services.alphaworks.ibm.com/manyeyes/view/S4ZIjIsOtha6H~kYwoKjI2~
Solution… came gradually
• Examine book by comparing each paragraph
• Create feature set based on moving window of n-grams (3-grams) across each paragraph
• Preprocess text
  • To Stem or Not
    – “I will throw the umbrella in the mud”
    – “Martha was throwing the umbrella in the mud”
  • To Keep Punctuation or Not
• Execute the CLOSET algorithm (from Jiawei Han et al.)
• Providing the following early results:
5:[a description of]:[1085|1087|1084|1082|1086]
4:[men and women]:[1085|1083|1084|1088]
4:[this is now|is now a]:[1087|1082|1086|1088]
3:[a description of|now a description|this is now|is now a]:[1087|1082|1086]
3:[kinds of men|of men and|men and women]:[1085|1083|1088]
3:[this is now|is now a|now a description|a description of]:[1087|1082|1086]
How do we make this meaningful to humanists?
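The pipeline above can be sketched roughly as follows: slide a 3-word window over each paragraph, treat each paragraph as a set of 3-grams, and report the closed sets of 3-grams shared by at least two paragraphs. This is a brute-force stand-in for CLOSET, which computes the same kind of result far more efficiently. The paragraph texts are made up, the function names are hypothetical, and the output format (count, then 3-grams, then paragraph ids) is a reading of the early results shown above. Stemming is left out.

```python
import re


def ngrams(paragraph, n=3):
    """Set of word n-grams from a moving window (punctuation dropped)."""
    words = re.findall(r"[a-z']+", paragraph.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def closed_patterns(paragraphs, min_support=2, n=3):
    """Closed frequent n-gram sets, found by intersecting paragraphs.

    Every closed itemset is the intersection of some group of
    transactions, so intersections are accumulated one paragraph
    at a time. This is NOT the CLOSET algorithm.
    """
    transactions = {pid: frozenset(ngrams(text, n))
                    for pid, text in paragraphs.items()}
    closed = set()
    for items in transactions.values():
        candidates = {items} | {items & c for c in closed}
        closed |= {c for c in candidates if c}

    results = []
    for itemset in closed:
        pids = sorted(pid for pid, items in transactions.items()
                      if itemset <= items)
        if len(pids) >= min_support:
            results.append((len(pids), sorted(itemset), pids))
    # Most frequent first, then patterns with more n-grams.
    results.sort(key=lambda r: (-r[0], -len(r[1]), r[1]))
    return results


# Made-up paragraphs keyed by hypothetical paragraph ids.
paragraphs = {
    1082: "This is now a description of them.",
    1083: "Kinds of men and women.",
    1085: "A description of men and women.",
}

for support, grams, pids in closed_patterns(paragraphs):
    print(f"{support}:[{'|'.join(grams)}]:[{'|'.join(map(str, pids))}]")
```

Because no stemming is applied, “throw” and “throwing” would produce different 3-grams here, which is the trade-off raised by the “To Stem or Not” bullet.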
How to visualize… Trying existing tools
Brad Paley, TextArc
M. Wattenberg, Arc Diagrams
TimeSearcher
SpotFire
No context, No reading original text, No scale, No trends…
Custom Solution - FeatureLens
FeatureLens, an early MONK (Metadata Offer New Knowledge) application, uses the machine learning approach of frequent pattern mining to identify fuzzy repetition patterns in a data collection, with no initial human input.
• Organized into sections (in this case chapters)
• Rank frequent patterns by frequency and length
• Show frequent patterns of n-grams in context
• Rank frequent patterns by distribution trends, per collection and per section
• Compare multiple patterns on the same views: distributions, sections, paragraphs
• Read the text (with highlighting of patterns)
• Some options for handling scale for large data sets (e.g. each line is five paragraphs)
• Search for a particular word
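Ranking by frequency and length, and counting a pattern's occurrences per section, can be sketched as below. This is not FeatureLens code. The three patterns and their paragraph ids come from the early results shown earlier, while the chapter assignment, the function names, and the reading of “length” as the number of 3-grams in a pattern are assumptions for illustration.

```python
from collections import Counter

# Patterns and paragraph ids taken from the early results.
patterns = [
    {"ngrams": ["men and women"],
     "paragraphs": [1085, 1083, 1084, 1088]},
    {"ngrams": ["a description of"],
     "paragraphs": [1085, 1087, 1084, 1082, 1086]},
    {"ngrams": ["this is now", "is now a"],
     "paragraphs": [1087, 1082, 1086, 1088]},
]

# Hypothetical mapping from paragraph id to chapter.
section_of = {1082: "A", 1083: "A", 1084: "A", 1085: "A",
              1086: "B", 1087: "B", 1088: "B"}


def rank_patterns(patterns):
    """Most frequent first; ties broken by number of n-grams."""
    return sorted(patterns,
                  key=lambda p: (-len(p["paragraphs"]), -len(p["ngrams"])))


def distribution_by_section(pattern, section_of):
    """Number of paragraphs containing the pattern, per section."""
    return Counter(section_of[pid] for pid in pattern["paragraphs"])


for p in rank_patterns(patterns):
    print(len(p["paragraphs"]), p["ngrams"],
          dict(distribution_by_section(p, section_of)))
```

The per-section counts are the kind of numbers a distribution view would plot, one bar or line per chapter.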
FeatureLens: Organized into sections (chapters)
Created by Anthony Don and team at http://www.cs.umd.edu/hcil/textvis/featurelens/.
FeatureLens: patterns sorted by frequency and length
FeatureLens: n-gram patterns in context
FeatureLens: distribution trends
FeatureLens: multiple patterns
The New Way to Read
• By visualizing certain patterns in this text (and, it follows, in larger collections in general), and by looking at the text “from a distance” through textual analytics and visualizations, one can “read” the novel in ways formerly impossible.
• Franco Moretti has argued that the solution to truly incorporating a more global perspective in our critical literary practices is not to read more of the vast amounts of literature available to us, but to read it differently by employing “distant reading.” “We know how to read texts,” he writes, “now let’s learn how not to read them.”
Franco Moretti, Conjectures on World Literature. New Left Review, 1 (Jan.-Feb. 2000): 68.
Stories buried in the repetition…
Massive Digitization Projects
• What can be done with these large digital text collections?
• How can we use these large digital text collections?
• Justify the use of computers and advanced techniques to process these collections, because we (humans) can’t read this much
• The point is not to save the reader from reading the individual texts or from making an independent judgment of each document's characteristics; rather, the point is to learn from the reader's holistic impression of the text and then, having done so, to show the reader what evidence correlates with these impressions
Transformational New Research Topics for Humanities
• Track patterns in morphology, syntax, and semantics across large stretches of time, space and culture
• Track topics or terminology across thousands of texts
• Track the social and economic influence of topics
• Study multi-lingual and cultural impacts
• Study literary inheritance
• Study the evolution of ideas
• and a lot more
Exploratory Analysis Environments
• Provide access to text
• Focus on specific passages
• Allow for comparative reading
• Provide enriched context for text and data analysis
References
• J. Pei, J. Han, and R. Mao, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets”, Proc. 2000 ACM-SIGMOD Int. Workshop on Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, May 2000.
• Tanya Clement, Anthony Don, Loretta Auvil, Catherine Plaisant, Greg Pape and Vered Goren. “‘Something that is interesting is interesting them’: Using text mining and visualizations to aid interpreting repetition in Gertrude Stein’s The Making of Americans”, Digital Humanities 2007.
Automated Learning Group / SEASR Team
Michael Welge
Bernie Ács
Boris Capitanu
Lily Dong
Peter Groves
Amit Kumar
Xavier Llorà
Chad Olson
Mary Pietrowicz
Duane Searsmith
Kelly Searsmith
David Tcheng