CS448B :: 9 Nov 2010

Text Visualization

Jeffrey Heer Stanford University

Why visualize text?

Understanding – get the “gist” of a document

Grouping – cluster for overview or classification

Compare – compare document collections, or inspect evolution of collection over time

Correlate – compare patterns in text to those in other data, e.g., correlate with social network

What is text data?

Documents
  Articles, books and novels
  Computer programs
  E-mails, web pages, blogs
  Tags, comments

Collection of documents
  Messages (e-mail, blogs, tags, comments)
  Social networks (personal profiles)
  Academic collaborations (publications)

Challenge: Visualize Dissertations

You have 20 years of university Ph.D. theses:
  Text, Year, Dept., Author, Advisor, Committee

What questions might you want to answer?
What visualizations might help?

A Concrete Example

What would help you gauge…

  The topics in the document?
  Whether or not you should read it?
  Its relationship to other documents?

Tag Cloud: Word Counts

Word Tree: Word Sequences

PhraseNet: “A the B”

MIMIR – Jason Chuang

A Double Gulf of Evaluation

Many (most?) text visualizations do not represent the text directly… they represent a model (term statistics, clusters, etc).

1) Can you interpret the visualization? How well does it convey the properties of the model?

2) Do you trust the model? How does the model enable us to reason about the text?

Lessons for Text Visualization

Show (or provide access to) the source text. Let readers assess the model and use visualization as an index into the documents.

Find meaningful abstractions for grouping documents. Are clusters interpretable?

Where possible use text to represent text… but which terms are the most descriptive?

Topics

  Text as Data
  Visualizing Document Content
  Evolving Documents
  Visualizing Conversation
  Document Collections

Text as Data

Words are (not) nominal?

  High dimensional (10,000+)
  More than equality tests
  Words have meanings and relations

Correlations: Hong Kong, San Francisco, Bay Area

Order: April, February, January, June, March, May

Membership: Tennis, Running, Swimming, Hiking, Piano

Hierarchy, antonyms & synonyms, entities, …

Text Processing Pipeline

Tokenization: segment text into terms
  Special cases? e.g., “San Francisco”, “L’ensemble”, “U.S.A.”
  Remove stop words? e.g., “a”, “an”, “the”, “to”, “be”?

Stemming: one means of normalizing terms
  Reduce terms to their “root”; Porter’s algorithm for English
  e.g., automate(s), automatic, automation all map to automat
  For visualization, want to reverse stemming for labels
    Simple solution: map from stem to the most frequent word

Result: ordered stream of terms
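
The pipeline above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word set is just the slide's five examples, and `crude_stem` is a hypothetical toy suffix stripper standing in for Porter's algorithm, not an implementation of it. The names `tokenize`, `crude_stem` and `process` are made up for this sketch.

```python
import re
from collections import Counter, defaultdict

# Only the examples from the slide; real stop-word lists are much longer.
STOP_WORDS = {"a", "an", "the", "to", "be"}

def tokenize(text):
    """Lowercase and split into alphabetic tokens.
    Special cases such as "San Francisco" or "U.S.A." are NOT handled."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def crude_stem(term):
    """Toy suffix stripper (assumption: stand-in for Porter's algorithm)."""
    for suffix in ("ion", "ic", "es", "ed", "ing", "e", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 4:
            return term[: -len(suffix)]
    return term

def process(text):
    """Return the ordered stream of stems and a stem -> display label map."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    surface_forms = defaultdict(Counter)
    stems = []
    for tok in tokens:
        stem = crude_stem(tok)
        stems.append(stem)
        surface_forms[stem][tok] += 1
    # Reverse stemming for labels: pick the most frequent original word.
    labels = {s: forms.most_common(1)[0][0] for s, forms in surface_forms.items()}
    return stems, labels

stems, labels = process("Automation is automatic; the automation automates the factory.")
print(stems)              # ordered stream of terms
print(labels["automat"])  # label chosen for the stem "automat"
```

Here the stem "automat" is labeled with "automation" because that surface form occurs most often in the input.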

The Bag of Words Model

Ignore ordering relationships within the text

A document ≈ vector of term weights
  Each dimension corresponds to a term (10,000+)
  Each value represents the relevance

For example, simple term counts

Aggregate into a document x term matrix
  Document vector space model

Document x Term matrix

Each document is a vector of term weights
Simplest weighting is to just count occurrences

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony                      157             73            0       0        0        0
Brutus                        4            157            0       1        0        0
Caesar                      232            227            0       2        1        1
Calpurnia                     0             10            0       0        0        0
Cleopatra                    57              0            0       0        0        0
mercy                         2              0            3       5        5        1
worser                        2              0            1       1        1        0
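
A count matrix like the one above can be built directly from tokenized documents. The sketch below is a minimal illustration, assuming each document is already a list of terms; the function name `document_term_matrix` and the two toy documents are hypothetical and are not the Shakespeare data.

```python
from collections import Counter

def document_term_matrix(docs):
    """docs: dict of name -> list of terms.
    Returns (sorted vocabulary, dict of name -> count vector over that vocabulary)."""
    counts = {name: Counter(terms) for name, terms in docs.items()}
    vocab = sorted(set().union(*(c.keys() for c in counts.values())))
    # Counter returns 0 for missing terms, so absent words need no special case.
    matrix = {name: [c[t] for t in vocab] for name, c in counts.items()}
    return vocab, matrix

docs = {
    "doc1": "mercy caesar caesar brutus".split(),
    "doc2": "mercy mercy worser".split(),
}
vocab, matrix = document_term_matrix(docs)
print(vocab)
for name, row in matrix.items():
    print(name, row)
```

Note that this sketch stores one row per document, whereas the table above is laid out with one row per term; the two are transposes of each other.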

WordCount (Harris 2004)

http://wordcount.org

Weaknesses of Tag Clouds

  Sub-optimal visual encoding (size vs. position)
  Inaccurate size encoding (long words are bigger)
  May not facilitate comparison (unstable layout)
  Term frequency may not be meaningful
  Does not show the structure of the text

Keyword Weighting

Term Frequency
  tf_{t,d} = count(t) in d
  Can take log frequency: log(1 + tf_{t,d})
  Can normalize to show proportion: tf_{t,d} / Σ_t tf_{t,d}
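
As a small illustration, the three term-frequency variants can be computed side by side. This is a sketch that assumes the document is already tokenized; `term_frequencies` is a hypothetical helper name.

```python
import math
from collections import Counter

def term_frequencies(terms):
    """Return raw count, log frequency and proportion for each term in one document."""
    tf = Counter(terms)
    total = sum(tf.values())
    return {
        t: {"count": n, "log": math.log(1 + n), "proportion": n / total}
        for t, n in tf.items()
    }

weights = term_frequencies("to be or not to be".split())
print(weights["to"])
```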

Keyword Weighting

TF.IDF: Term Frequency by Inverse Document Frequency
  tf.idf_{t,d} = log(1 + tf_{t,d}) × log(N / df_t)
  df_t = # docs containing t; N = # of docs
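
The TF.IDF weighting above translates almost line for line into code. A minimal sketch, assuming documents are given as lists of terms; the function name `tf_idf` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of term lists. Returns one dict of term -> weight per document,
    using log(1 + tf) * log(N / df) as on the slide."""
    n_docs = len(docs)
    # df[t] = number of documents that contain t at least once.
    df = Counter(t for doc in docs for t in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append(
            {t: math.log(1 + n) * math.log(n_docs / df[t]) for t, n in tf.items()}
        )
    return weighted

docs = [
    ["the", "caesar", "caesar"],
    ["the", "brutus"],
    ["the", "mercy", "caesar"],
]
scores = tf_idf(docs)
print(scores[0])
```

A term that appears in every document (here "the") gets weight 0, because log(N / df_t) = log(1) = 0.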

Keyword Weighting

G2: Probability of different word frequency
  E1 = |d| × (tf_{t,d} + tf_{t,C-d}) / |C|
  E2 = |C-d| × (tf_{t,d} + tf_{t,C-d}) / |C|
  G2 = 2 × (tf_{t,d} log(tf_{t,d} / E1) + tf_{t,C-d} log(tf_{t,C-d} / E2))
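
The G2 formulas can be checked with a short function. This sketch makes two assumptions that the slide does not spell out: |d| and |C-d| are taken to be total term counts of the document and of the rest of the corpus, and a term with zero observed count contributes 0 to the sum. The name `g2` and its parameters are hypothetical.

```python
import math

def g2(tf_d, tf_rest, size_d, size_rest):
    """Log-likelihood ratio for one term.
    tf_d / tf_rest: count of the term in document d and in the rest of the corpus.
    size_d / size_rest: total term counts |d| and |C-d| (assumed interpretation)."""
    size_c = size_d + size_rest
    e1 = size_d * (tf_d + tf_rest) / size_c      # expected count in d
    e2 = size_rest * (tf_d + tf_rest) / size_c   # expected count in C-d

    def part(observed, expected):
        # Treat 0 * log(0) as 0 (assumption).
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    return 2 * (part(tf_d, e1) + part(tf_rest, e2))

print(g2(10, 10, 100, 100))  # same rate in both parts
print(g2(20, 5, 100, 100))   # over-represented in d
```

When a term occurs at the same rate in d and in the rest of the corpus, observed and expected counts coincide and G2 is 0; the score grows as the two rates diverge.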

Limitations of Frequency Statistics?

Typically focus on unigrams (single terms)

Often favors frequent (TF) or rare (IDF) terms
  Not clear that these provide best description

A “bag of words” ignores additional information
  Grammar / part-of-speech
  Position within document
  Recognizable entities

How do people describe text?

We asked 69 subjects (all Ph.D. students) to read and describe dissertation abstracts.

Students were given 3 documents in sequence; they then described the collection as a whole.

Students were matched to both familiar and unfamiliar topics; topical diversity within a collection was varied systematically.

[Chuang, Heer & Manning, 2010]

Bigrams (phrases of 2 words) are the most common.

Phrase length declines with more docs & more diversity.

Term Commonness

log(tf_w) / log(tf_the)

The normalized term frequency relative to the most frequent n-gram, e.g., the word “the”.

Measured across an entire corpus or across the entire English language (using Google n-grams)
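
A minimal sketch of the commonness score, assuming it is measured over a single tokenized corpus rather than Google n-grams. The most frequent term plays the role of “the”; the function name `commonness` and the toy corpus are illustrative.

```python
import math
from collections import Counter

def commonness(corpus_terms):
    """Return term -> log(tf_w) / log(tf_max), where tf_max is the count
    of the most frequent term in the corpus."""
    tf = Counter(corpus_terms)
    max_tf = max(tf.values())
    if max_tf < 2:
        raise ValueError("need a term occurring at least twice to normalize")
    return {t: math.log(n) / math.log(max_tf) for t, n in tf.items()}

scores = commonness("the cat and the dog and the bird the end".split())
print(scores["the"], scores["and"], scores["cat"])
```

The score runs from 0 for terms seen once to 1 for the most frequent term, so “medium commonness” corresponds to values in the middle of that range.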

Selected descriptive terms have medium commonness.
Judges avoid both rare and common words.

Commonness increases with more docs & more diversity.

Scoring Terms with Freq, Grammar & Position

G2 Regression Model

Visualizing Document Content

TileBars [Hearst]

Visual Thesaurus [ThinkMap]

Concordance

What is the common local context of a term?
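
A concordance (keyword in context) listing is straightforward to compute. The sketch below assumes pre-tokenized text and an exact-match keyword; `concordance` and the `width` parameter are hypothetical names for this illustration.

```python
def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence of keyword,
    with up to `width` tokens of context on each side."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows

tokens = "if love be rough with you be rough with love".split()
rows = concordance(tokens, "love", width=2)
for left, kw, right in rows:
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>12} | {kw} | {right}")
```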

WordTree (Wattenberg et al)

Filter infrequent runs

Recurrent themes in speech

Glimpses of structure

Concordances show local, repeated structure
But what about other types of patterns?

For example
  Lexical: <A> at <B>
  Syntactic: <Noun> <Verb> <Object>

Phrase Nets [van Ham et al]

Look for specific linking patterns in the text:
  ‘A and B’, ‘A at B’, ‘A of B’, etc.
  Could be output of regexp or parser.

Visualize extracted patterns in a node-link view
  Occurrences → Node size
  Pattern position → Edge direction
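
The regexp route can be sketched as follows. This is an illustration of the lexical approach only, not the Phrase Nets implementation: it matches single words on either side of a linking word, counts each directed pair as an edge, and sums edge counts per word as a stand-in for node size. The names `phrase_net_edges` and `node_sizes` are hypothetical.

```python
import re
from collections import Counter

def phrase_net_edges(text, link="and"):
    """Count directed (A, B) pairs for every 'A <link> B' in the text."""
    # B sits inside a lookahead so it can also serve as A of the next match
    # ("A and B and C" yields both (A, B) and (B, C)).
    pattern = re.compile(
        r"\b(\w+)\s+" + re.escape(link) + r"\s+(?=(\w+)\b)", re.IGNORECASE
    )
    edges = Counter()
    for a, b in pattern.findall(text):
        edges[(a.lower(), b.lower())] += 1
    return edges

def node_sizes(edges):
    """Occurrences per word across all edges, usable as node size."""
    sizes = Counter()
    for (a, b), n in edges.items():
        sizes[a] += n
        sizes[b] += n
    return sizes

text = "pride and prejudice and sense and sensibility. Pride and prejudice."
edges = phrase_net_edges(text)
sizes = node_sizes(edges)
print(edges)
print(sizes)
```

Edge direction follows pattern position: the word before the link is the source and the word after it is the target.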

Portrait of the Artist as a Young Man
  X and Y

Node Grouping

The Bible
  X begat Y

Pride & Prejudice
  X at Y
  Lexical Parser, < 1 sec running time

Pride & Prejudice
  X at Y
  Syntactic Parser, > 24 hours running time

18th & 19th Century Novels
  X’s Y

  X of Y

Administrivia

Interesting Talk Tomorrow

Data Visualization: Application in Industry

Eric Rodenbeck, CEO Stamen Design
Wednesday Nov 10, 7-9pm
ART4, Cummings Art Building

More info at: http://www.meetup.com/VisualizeMyData/

Interesting Talk Tomorrow (2)

New Directions in Tableau 6.0

Jock Mackinlay, Tableau Software
Wednesday Nov 10, 2:30-4pm
392 Gates Hall

Final Project
  Design a new visualization technique or system
  Many options: new system, interaction technique, design study
  6-8 page paper in conference paper format
  2 Project Presentations

Schedule
  Project Proposal: Tuesday, Nov 9 (end of day)
  Initial Presentation: Tuesday, Nov 16 & Thursday, Nov 18
  Poster Presentation: Tuesday, Dec 7 (4-6pm)
  Final Papers: Friday, Dec 10 (end of day)

Logistics
  Groups of up to 3 people, graded individually
  Clearly report responsibilities of each member

Evolving Documents

Visualizing Revision History

How to depict contributions over time?

Example: Wikipedia history log

Animated Traces (Ben Fry)

http://benfry.com/traces/

Diff
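
As a rough illustration of what a diff between two revisions contains, the sketch below uses Python's standard `difflib.SequenceMatcher` at word granularity. It is only one way to compute a diff and is not tied to any of the tools shown in this section; `word_diff` is a hypothetical helper name.

```python
import difflib

def word_diff(old, new):
    """Return the non-equal edit operations that turn `old` into `new`,
    as (operation, removed words, added words) tuples."""
    a, b = old.split(), new.split()
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
        if tag != "equal":
            ops.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return ops

changes = word_diff("text is hard to visualize", "text is fun to visualize well")
for op in changes:
    print(op)
```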

History Flow (Viégas et al)

Wikipedia History Flow (IBM)

Visualizing Conversation

Many dimensions to consider:
  Who (senders, receivers)
  What (the content of communication)
  When (temporal patterns)

Interesting cross-products:
  What x When → Topic “Zeitgeist”
  Who x Who → Social network
  Who x Who x What x When → Information flow

Usenet Visualization (Viégas & Smith)

Show correspondence patterns in text forums
Initiate vs. reply; size and duration of discussion

Newsgroup crowds / Authorlines

Mountain (Viégas)

Conversation by person over time (who x when).

Themail (Viégas et al)

One person over time, TF.IDF weighted terms

Enron E-Mail Corpus

Washington Lobbyist ?

Visualizing Document Collections

NewsMap: Google News Treemap (Marcos Weskamp)

10 x 10 News Map (Harris 2004)

Named Entity Recognition

Identify and classify named entities in text:
  John Smith → PERSON
  Soviet Union → COUNTRY
  353 Serra St → ADDRESS
  (555) 721-4312 → PHONE NUMBER

Entity relations: how do the entities relate?
  Simple approach: do they co-occur in a small window of text?
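
The co-occurrence approach can be sketched as below. The sketch assumes entity recognition has already been done and that each entity is a single token; the function name `cooccurrences`, the `window` parameter and the toy sentence are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrences(tokens, entities, window=5):
    """Count unordered entity pairs that appear within `window` tokens of each other."""
    positions = [(i, tok) for i, tok in enumerate(tokens) if tok in entities]
    pairs = Counter()
    # positions is in text order, so j > i for every combination.
    for (i, a), (j, b) in combinations(positions, 2):
        if a != b and j - i <= window:
            pairs[tuple(sorted((a, b)))] += 1
    return pairs

tokens = "antony met cleopatra in egypt while caesar stayed in rome".split()
entities = {"antony", "cleopatra", "caesar"}
pairs = cooccurrences(tokens, entities, window=3)
print(pairs)
```

The resulting pair counts can serve as edge weights in an entity graph; widening the window links entities that are farther apart in the text.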

Doc. Similarity & Clustering

In vector model, compute distance among docs
  For TF.IDF, typically cosine distance
  Similarity measure can be used to cluster

Topic modeling approaches
  Assume documents are a mixture of topics
  Topics are (roughly) a set of co-occurring terms
  Latent Semantic Analysis (LSA): reduce term matrix
  Latent Dirichlet Allocation (LDA): statistical model
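
For the vector-model distance mentioned above, a minimal cosine similarity sketch follows. It assumes each document is a sparse dict of term -> weight (raw counts here, though TF.IDF weights would work the same way); `cosine_similarity` and the toy vectors are illustrative names and data.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

a = {"caesar": 2, "brutus": 1}
b = {"caesar": 4, "brutus": 2}   # same direction as a, twice the length
c = {"mercy": 1}                 # no terms in common with a

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

Cosine distance is 1 minus this similarity. Because it ignores vector length, a long and a short document with the same term proportions count as identical, which is one reason it is a common choice for clustering.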

ThemeRiver (Havre et al 99)

[Figure: Statistical Machine Learning in Pubmed. Topic strengths by year, 1985 to 2008; vertical axis from 0 to 0.004. Topics: supervised machine learning, probabilistic reasoning, mcmc, dimensionality / kernels, clustering / similarity, bayesian learning.]

Track topic strengths over time

Parallel Tag Clouds (Collins et al 09)

Lessons for Text Visualization

Show (or provide access to) the source text. Let readers assess the model and use visualization as an index into the documents.

Find meaningful abstractions for grouping documents. Are clusters interpretable?

Where possible use text to represent text… but which terms are the most descriptive?