Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
1
CS448B :: 9 Nov 2010
Text Visualization
Jeffrey Heer Stanford University
Why visualize text?
Why Visualize Text?
Understanding – get the “gist” of a document
Grouping – cluster for overview or classification
Compare – compare document collections, or inspect evolution of collection over time
Correlate – compare patterns in text to those in other data, e.g., correlate with social network
What is text data?
DocumentsArticles, books and novelsComputer programsE-mails, web pages, blogsTags, comments
Collection of documentsMessages (e-mail, blogs, tags, comments)Social networks (personal profiles)Academic collaborations (publications)
2
Challenge: Visualize Dissertations
You have 20 years of university Ph.D. theses:Text, Year, Dept., Author, Advisor, Committee
What questions might you want to answer?What visualizations might help?
A Concrete Example
What would help you gauge…
The topics in the document?Whether or not you should read it?Its relationship to other documents?
Tag Cloud: Word Counts Word Tree: Word Sequences
3
PhraseNet: “A the B”MIMIR – Jason Chuang
A Double Gulf of Evaluation
Many (most?) text visualizations do not represent the text directly… they represent a model (term statistics, clusters, etc).
1) Can you interpret the visualization? How well does it convey the properties of the model?
2) Do you trust the model? How does the model enable us to reason about the text?
Lessons for Text Visualization
Show (or provide access to) the source text. Let readers assess the model and use visualization as an index into the documents.
Find meaningful abstractions for grouping documents. Are clusters interpretable?
Where possible use text to represent text… but which terms are the most descriptive?
4
Topics
Text as DataVisualizing Document ContentEvolving DocumentsVisualizing ConversationDocument Collections
Text as Data
Words are (not) nominal?
High dimensional (10,000+)More than equality testsWords have meanings and relations
Correlations: Hong Kong, San Francisco, Bay Area
Order: April, February, January, June, March, May
Membership: Tennis, Running, Swimming, Hiking, Piano
Hierarchy, antonyms & synonyms, entities, …
Text Processing Pipeline
Tokenization: segment text into termsSpecial cases? e.g., “San Francisco”, “L’ensemble”, “U.S.A.”Remove stop words? e.g., “a”, “an”, “the”, “to”, “be”?
Stemming: one means of normalizing termsReduce terms to their “root”; Porter’s algorithm for Englishe.g., automate(s), automatic, automation all map to automatFor visualization, want to reverse stemming for labels
Simple solution: map from stem to the most frequent word
Result: ordered stream of terms
5
The Bag of Words Model
Ignore ordering relationships within the text
A document ≈ vector of term weightsEach dimension corresponds to a term (10,000+)Each value represents the relevance
For example, simple term counts
Aggregate into a document x term matrixDocument vector space model
Document x Term matrix
Each document is a vector of term weightsSimplest weighting is to just count occurrences
Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 157 73 0 0 0 0
Brutus 4 157 0 1 0 0
Caesar 232 227 0 2 1 1
Calpurnia 0 10 0 0 0 0
Cleopatra 57 0 0 0 0 0
mercy 2 0 3 5 5 1
worser 2 0 1 1 1 0
WordCount (Harris 2004)
http://wordcount.org
6
Weaknesses of Tag Clouds
Sub-optimal visual encoding (size vs. position)Inaccurate size encoding (long words are bigger)May not facilitate comparison (unstable layout)Term frequency may not be meaningfulDoes not show the structure of the text
Keyword Weighting
Term Frequencytftd = count(t) in dCan take log frequency: log(1 + tftd)Can normalize to show proportion: tftd / Σt tftd
Keyword Weighting
Term Frequencytftd = count(t) in d
TF.IDF: Term Freq by Inverse Document Freqtf.idftd = log(1 + tftd) × log(N/dft)dft = # docs containing t; N = # of docs
7
Keyword Weighting
Term Frequencytftd = count(t) in d
TF.IDF: Term Freq by Inverse Document Freqtf.idftd = log(1 + tftd) × log(N/dft)dft = # docs containing t; N = # of docs
G2: Probability of different word frequencyE1 = |d| × (tftd + tft(C-d)) / |C|E2 = |C-d| × (tftd + tft(C-d)) / |C|G2 = 2 × (tftd log(tftd/E1) + tft(C-d) log(tft(C-d)/E2))
Limitations of Frequency Statistics?
Typically focus on unigrams (single terms)
Often favors frequent (TF) or rare (IDF) termsNot clear that these provide best description
A “bag of words” ignores additional informationGrammar / part-of-speechPosition within documentRecognizable entities
8
How do people describe text?
We asked 69 subjects (all Ph.D. students) to read and describe dissertation abstracts.
Students were given 3 documents in sequence, they then described the collection as a whole.
Students were matched to both familiar and unfamiliar topics; topical diversity within a collection was varied systematically.
[Chuang, Heer & Manning, 2010]
Bigrams (phrases of 2 words) are the most common.
Phrase length declines with more docs & more diversity.
Term Commonness
log(tfw) / log(tfthe)
The normalized term frequency relative to the most frequent n-gram, e.g., the word “the”.
Measured across an entire corpus or across the entire English language (using Google n-grams)
9
Selected descriptive terms have medium commonness.Judges avoid both rare and common words.
Commonness increases with more docs & more diversity.
Scoring Terms with Freq, Grammar & Position
10
G2 Regression Model
Visualizing Document Content
TileBars [Hearst]
11
Visual Thesaurus [ThinkMap]
Concordance
What is the common local context of a term?
12
WordTree (Wattenberg et al) Filter infrequent runs
13
Recurrent themes in speech
Glimpses of structure
Concordances show local, repeated structureBut what about other types of patterns?
For example Lexical: <A> at <B> Syntactic: <Noun> <Verb> <Object>
Phrase Nets [van Ham et al]
Look for specific linking patterns in the text:‘A and B’, ‘A at B’, ‘A of B’, etcCould be output of regexp or parser.
Visualize extracted patterns in a node-link viewOccurrences Node sizePattern position Edge direction
14
Portrait of the Artist as a Young ManX and Y
Node Grouping
The BibleX begat Y
Pride & PrejudiceX at YLexical Parser, < 1sec running time
15
Pride & PrejudiceX at YSyntactic Parser, > 24 hours running time
18th & 19th Century NovelsX’s Y
X of Y X of Y
16
Administrivia
Interesting Talk Tomorrow
Data Visualization: Application in Industry
Eric Rodenbeck, CEO Stamen DesignWednesday Nov 10, 7-9pmART4, Cummings Art Building
More info at:http://www.meetup.com/VisualizeMyData/
Interesting Talk Tomorrow (2)
New Directions in Tableau 6.0
Jock Mackinlay, Tableau SoftwareWednesday Nov 10, 2:30-4pm392 Gates Hall
Final ProjectDesign a new visualization technique or systemMany options: new system, interaction technique, design study6-8 page paper in conference paper format2 Project Presentations
ScheduleProject Proposal: Tuesday, Nov 9 (end of day)Initial Presentation: Tuesday, Nov 16 & Thursday, Nov 18Poster Presentation: Tuesday, Dec 7 (4-6pm)Final Papers: Friday, Dec 10 (end of day)
LogisticsGroups of up to 3 people, graded individuallyClearly report responsibilities of each member
17
Evolving Documents
Visualizing Revision History
How to depict contributions over time?
Example: Wikipedia history log
Animated Traces (Ben Fry)
http://benfry.com/traces/
18
Diff
History Flow (Viégas et al)
Wikipedia History Flow (IBM)
19
Visualizing Conversation
Visualizing Conversation
Many dimensions to consider:Who (senders, receivers)What (the content of communication)When (temporal patterns)
Interesting cross-products:What x When Topic “Zeitgeist”Who x Who Social networkWho x Who x What x When Information flow
20
Usenet Visualization (Viégas & Smith)
Show correspondence patterns in text forumsInitiate vs. reply; size and duration of discussion
Newsgroup crowds / Authorlines
Mountain (Viégas)
Conversation by person over time (who x when).
21
Themail (Viégas et al)
One person over time, TF.IDF weighted terms
Enron E-Mail Corpus
Washington Lobbyist ?
22
Visualizing Document Collections
NewsMap: Google News Treemap (Marcos Weskamp)
10 x 10 News Map (Harris 2004) Named Entity Recognition
Identify and classify named entities in text:John Smith PERSONSoviet Union COUNTRY353 Serra St ADDRESS(555) 721-4312 PHONE NUMBER
Entity relations: how do the entities relate?Simple approach: do they co-occur in a small window of text?
23
Doc. Similarity & Clustering
In vector model, compute distance among docsFor TF.IDF, typically cosine distanceSimilarity measure can be used to cluster
Topic modeling approachesAssume documents are a mixture of topicsTopics are (roughly) a set of co-occurring termsLatent Semantic Analysis (LSA): reduce term matrixLatent Dirichlet Allocation (LDA): statistical model
ThemeRiver (Havre et al 99)
0
0.0005
0.001
0.0015
0.002
0.0025
0.003
0.0035
0.004
1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
Statistical Machine Learning in Pubmed
supervised machine learning
probabilistic reasoning
mcmc
dimensionality / kernels
clustering / similarity
bayesian learning
Track topic strengths over time
24
Parallel Tag Clouds (Collins et al 09) Lessons for Text Visualization
Show (or provide access to) the source text. Let readers assess the model and use visualization as an index into the documents.
Find meaningful abstractions for grouping documents. Are clusters interpretable?
Where possible use text to represent text… but which terms are the most descriptive?