Visualizing Text: Tools & Techniques Eric E Monson, PhD (Duke VTG) Katherine de Vos Devine (Duke AAHVS) 8 Nov 2012

Eric E Monson, Text->Data 08 Nov 2012


Page 1: Eric E Monson, Text->Data 08 Nov 2012

Visualizing Text: Tools & Techniques
Eric E Monson, PhD (Duke VTG)
Katherine de Vos Devine (Duke AAHVS)
8 Nov 2012

Page 2: Eric E Monson, Text->Data 08 Nov 2012

Why do we visualize?

Page 3: Eric E Monson, Text->Data 08 Nov 2012

Why do we visualize?

To reveal patterns: clusters, trends, gaps & outliers

Page 4: Eric E Monson, Text->Data 08 Nov 2012

Anscombe’s Quartet

    I             II            III            IV
  x      y      x      y      x      y      x      y
 10   8.04     10   9.14     10   7.46      8   6.58
  8   6.95      8   8.14      8   6.77      8   5.76
 13   7.58     13   8.74     13  12.74      8   7.71
  9   8.81      9   8.77      9   7.11      8   8.84
 11   8.33     11   9.26     11   7.81      8   8.47
 14   9.96     14   8.10     14   8.84      8   7.04
  6   7.24      6   6.13      6   6.08      8   5.25
  4   4.26      4   3.10      4   5.39     19  12.50
 12  10.84     12   9.13     12   8.15      8   5.56
  7   4.82      7   7.26      7   6.42      8   7.91
  5   5.68      5   4.74      5   5.73      8   6.89

Summary statistics (the same for all four sets):

Mean of x: 9
Variance of x: 11
Mean of y: 7.50
Variance of y: 4.122 or 4.127
XY correlation: 0.816
Linear fit: y = 3.00 + 0.500x

Anscombe, F. J. (1973). "Graphs in Statistical Analysis". American Statistician 27 (1): 17–21.
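
The summary statistics on this slide can be checked directly from the table. The following is a minimal sketch in plain Python, not part of the original talk; the helper names (`mean`, `sample_variance`, `summarize`) are chosen here for illustration, and the variance is assumed to be the sample variance (dividing by n - 1), which is what reproduces the value 11 for x.

```python
from math import sqrt

# Anscombe's quartet, transcribed from the table on this slide.
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    # Assumption: sample variance (n - 1 in the denominator).
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def summarize(xs, ys):
    """Mean, variance, Pearson correlation and least-squares line for one set."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "var_x": sample_variance(xs),
        "mean_y": my,
        "var_y": sample_variance(ys),
        "corr": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": my - slope * mx,
    }


SUMMARY = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, stats in SUMMARY.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

All four sets print nearly identical numbers, which is the point of the example: the differences only show up when the data are plotted.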

Page 5: Eric E Monson, Text->Data 08 Nov 2012

[Figure: scatter plots of the four Anscombe data sets, panels I, II, III and IV]

modified from http://en.wikipedia.org/wiki/File:Anscombe%27s_quartet_3.svg

Page 6: Eric E Monson, Text->Data 08 Nov 2012

Form of presentation can reveal patterns
Cleveland (1994)

Page 7: Eric E Monson, Text->Data 08 Nov 2012

Why do we visualize?

Efficient human visual system

Page 8: Eric E Monson, Text->Data 08 Nov 2012

“Preattentive” cues – fast!
Chris Healey (NC State) web examples – http://www.csc.ncsu.edu/faculty/healey/PP/

Page 9: Eric E Monson, Text->Data 08 Nov 2012

Why do we visualize?

Communication & Exploration

Page 10: Eric E Monson, Text->Data 08 Nov 2012

Vis for Communication & Exploration

• Medium – Print, slide show, poster, web

• Telling a story – Clear & guided

Page 11: Eric E Monson, Text->Data 08 Nov 2012

Text Visualization – Difficult & Fascinating

• Not preattentive

• Difficult to abstract

• Occlusion destroys comprehension

• Context gives meaning

Page 12: Eric E Monson, Text->Data 08 Nov 2012

Text Processing (not covering)

Page 13: Eric E Monson, Text->Data 08 Nov 2012
Page 14: Eric E Monson, Text->Data 08 Nov 2012
Page 15: Eric E Monson, Text->Data 08 Nov 2012

Types of visualization

Page 16: Eric E Monson, Text->Data 08 Nov 2012

Term counting
Wordle

Page 17: Eric E Monson, Text->Data 08 Nov 2012

Term counting + context
NYTimes

Page 18: Eric E Monson, Text->Data 08 Nov 2012

Document Comparison
Juxta

Page 19: Eric E Monson, Text->Data 08 Nov 2012

Document Comparison
Juxta Commons

Page 20: Eric E Monson, Text->Data 08 Nov 2012

Document Comparison
Juxta Commons

Page 21: Eric E Monson, Text->Data 08 Nov 2012

Terms in context
PoemViz (Indiana SILS)

Page 22: Eric E Monson, Text->Data 08 Nov 2012

Document Network (entities)
Jigsaw Visual Analytics

Page 23: Eric E Monson, Text->Data 08 Nov 2012

Topic Visualization
Many Bills (IBM)

Page 24: Eric E Monson, Text->Data 08 Nov 2012

Many Eyes – IBM collaborative web vis

• Pros

- Some of the best vis people in the world did the original development

- Wide variety of visualizations, some that don’t exist anywhere else

- Best-practice graphics

- Nice model for crowd vis

• Cons

- Experimental, and not clear that IBM is still supporting it even though usage keeps increasing

Page 25: Eric E Monson, Text->Data 08 Nov 2012

Phrase Net
Many Eyes (* and *)

Page 26: Eric E Monson, Text->Data 08 Nov 2012

Word Tree
Many Eyes

Page 27: Eric E Monson, Text->Data 08 Nov 2012

Research questions driving this work
Katherine de Vos Devine

I am pursuing a JD/PhD in Art History, specializing in twentieth-century and contemporary art and fashion. Broadly, my research focuses on appropriation. Specifically, I focus on practices of avant-garde artists and fashion designers that are characterized as adapting, borrowing, or interpreting from other individuals, cultures, or the past, as well as the ways in which new technologies permit and encourage appropriation.

My dissertation will focus on the regulation and enforcement of appropriation through formal (legal) and informal (social) rules, contrast the market for fine art (high regulation of appropriation) with the market for high-end fashion (low regulation of appropriation), and explore the ways in which regulation of behaviors and technologies distinguishes these markets.

Page 28: Eric E Monson, Text->Data 08 Nov 2012

Research questions driving this work
Eric E Monson

Can we build some tools that will help Katherine explore her data more easily and quickly, while allowing her to ask new questions and see new patterns that would not have been possible without the technology?

– What tends to work is a combination of many small tools.

– Focus on prototypes so we can see what is actually useful

Page 29: Eric E Monson, Text->Data 08 Nov 2012

Two main pieces we worked on

• Text Archive

- Sources: Google Scholar court decisions for now

- Working archive

- Full-text & faceted searchable DB

• Visualizations

- Web interface

- Words in context

Page 30: Eric E Monson, Text->Data 08 Nov 2012

Text vis details: Concordance

In [6]: dec.concordance('piracy', width=120, lines=70)
Displaying 70 of 232 matches:
owledge could not be used without incurring the guilt of piracy of the book .' " Feist Publications , 499 U . S . at 350
cent infringement , noting that "[ o ] pen and unabashed piracy is not a mark of good faith " and that " in [ those ] ci
 reproductions of records or tapes , which is known as " piracy ,"[- 3 -] could be prosecuted or face civil liability fo
e substance of the Sound Recording Act of 1971 [- 3 -] " Piracy ," which refers to an unauthorized duplication of a perf
7 , 3129 n . 2 , 87 L . Ed . 2d 152 ( 1985 ) [- 4 -] See Piracy and Counterfeiting Amendments Act of 1982 , Pub . L . No
e here a conflict of policies : ( a ) that of preventing piracy of copyrighted matter and ( b ) that of enforcing the an
the plaintiff if we deny it relief . As the defendants ' piracy is unmistakably clear , while the plaintiffs ' infractio
th that last conclusion we disagree . Open and unabashed piracy is not a mark of good faith ; and we think the ' claimed
eld found that state laws on trade secrets and recording piracy were not preempted by the Copyright Act . See Kewanee Oi
e the review for it , such a use will be deemed in law a piracy . Id . at 550 ( quoting Folsom v . Marsh , 9 F . Cas . 3
making this factual determination , a layman must detect piracy " without any aid or suggestion or critical analysis by
so to treat the concept of " publication " as to prevent piracy . They tend to bear out Judge Putnam ' s suggestion in L
vely short " period of one year would actually encourage piracy by making it easier for malefactors to evade detection .
 Prostar is a Texas corporation suing for alleged signal piracy conducted in a Louisiana establishment . More generally
 conducted on a national and international scale . Cable piracy consequently differs from many of the cases where courts
nd that application of Louisiana conversion law to cable piracy claims brought under 47 U . S . C . ? ? 553 and 605 woul
sions " in their efforts to investigate and pursue cable piracy . A single federal standard would eliminate these practi
 S . position in trade negotiations with countries where piracy is not uncommon " and " rais [ ing ] the like [ li ] hoo
, Note , A Trade Based Response to Intellectual Property Piracy : A Comprehensive Plan to Aid the Motion Picture Industr
so to treat the concept of ` publication ' as to prevent piracy ." We think the authorities he cites and others warrant
 with more or less colorable alterations to disguise the piracy . Paraphrasing is copying and an infringement , if carri
ord convinces me of Millard ' s transparent and shocking piracy of plaintiff ' s publications . For the wrongful and cle
he trade as ' disklegging ,' ' bootlegging ' or record ' piracy .' Krug sold these records to dealer customers including
lions of dollars in losses suffered as a result of the " piracy and bootlegging " of the industry ' s products . Anderso
both to protect consumers and to prevent tape and record piracy . While tapes and records are doubtless speech , as Ande
e . Disclosure of the manufacturer also protects against piracy . Anderson contends that this latter interest of the sta
tion , which might adequately serve the state ' s anti - piracy interest , would largely defeat its consumer - protectio
ty . The primary purpose of Sec . 653w is to prevent the piracy of the works of these performers and manufacturers ; the
tition for writ of habeas corpus is AFFIRMED . [- 1 -] " Piracy is the term used for unauthorized duplication of origina
rized duplication of original commercial products ." See Piracy and Counterfeiting Amendments Act of 1982 , S . Rep . No
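
The call on this slide resembles NLTK's `Text.concordance(word, width, lines)`, though the slides do not say how `dec` was built. As a rough illustration of the keyword-in-context idea itself, here is a minimal pure-Python sketch; the function name `concordance`, its tokenizer, and the two sample sentences (adapted from the output above) are assumptions made for this example, not the talk's code.

```python
import re


def concordance(text, word, width=80, lines=25):
    """Return up to `lines` fixed-width rows with `word` aligned in one column."""
    # Split into words and single punctuation marks, similar to the spaced-out
    # tokens visible in the slide's output.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    half = (width - len(word) - 2) // 2  # characters of context on each side
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() != word.lower():
            continue
        # Each token is at least one character, so `half` tokens are enough
        # to fill `half` characters of context.
        left = " ".join(tokens[max(0, i - half):i])[-half:]
        right = " ".join(tokens[i + 1:i + 1 + half])[:half]
        rows.append(f"{left:>{half}} {token} {right}")
        if len(rows) == lines:
            break
    return rows


SAMPLE = (
    "Open and unabashed piracy is not a mark of good faith. "
    "Disclosure of the manufacturer also protects against piracy."
)
ROWS = concordance(SAMPLE, "piracy", width=40)

for row in ROWS:
    print(row)
```

Right-aligning the left context is what makes every match line up in a single column, so the eye can scan the words immediately before and after the term.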

Page 31: Eric E Monson, Text->Data 08 Nov 2012

Text vis overview: Word Tree

Page 32: Eric E Monson, Text->Data 08 Nov 2012

Data stages (iterative)

• Gather – Google Scholar scraping (shhh...)

• Parse – HTML content & metadata

• Clean – parsing mistakes & regularization

• Analyze / Transform – topic (subject) modeling

• Visualize – build online prototypes

Page 33: Eric E Monson, Text->Data 08 Nov 2012

New tools to learn (project as excuse)

• MongoDB – doc-centered NoSQL database

• Google Refine – data cleaning / regularization

• Apache Solr – Lucene-based search DB / server

• PHP

• D3.js – JavaScript data / DOM / vis library

• (already knew some: Python, BeautifulSoup, Mallet)

‣ So far, prototype with 18k+ copyright & trademark court decisions (1900-2011)

Page 34: Eric E Monson, Text->Data 08 Nov 2012
Page 35: Eric E Monson, Text->Data 08 Nov 2012

MongoDB – Working DB

• Pros

- Scalable, high-performance, open-source

- No Schema! – Easy!

- JSON – native object / dict in JS & Python

- Indexed queries, rich operators, geospatial

- GridFS for large binary files

- Easy dumps and CSV export

• Cons

- No in-DB joins

Page 36: Eric E Monson, Text->Data 08 Nov 2012

MongoDB – Working DB

> db.docs.findOne({referenced:{$size:4}})
{
  "_id" : ObjectId("4f406d8d47b2301618000091"),
  "content" : "\n121 F.2d 575 (1941)\nCORCORAN\r\nv.\r\nCOLUMBIA BROADCASTING SYSTEM, Inc., et al.\n No. 9664.\nCircuit Court of Appeals, Ninth Circuit.\nJune 30, 1941.\nBlase A. Bonpane, of Hollywood, Cal., for appellant.\nFrederick Leuschner and Richard Harper Graham, both of Los Angeles, Cal., for appellee Montgomery Ward & Co.\nBefore DENMAN, MATHEWS, and HEALY, Circuit Judges.\nHEALY, Circuit Judge.\nThe appeal is from a judgment awarding attorneys' fees in a suit for infringement of copyright, the allowance being made under the claimed authority of § 40 of the Copyright Act (Act of March 4, 1909, c. 320, 35 Stats. 1084, 17 U.S.C.A. § 40), providing that the court \"may award to the prevailing party a reasonable attorney's fee as part of [...] a consolidation of two cases.\n\n",
  "court" : "United States Court of Appeals, Ninth Circuit.",
  "court_level" : 4,
  "dates" : { "unlabeled" : ISODate("1941-06-30T00:00:00Z") },
  "docket" : "",
  "docket_url" : "",
  "file_ref" : { "$ref" : "fs.files", "$id" : ObjectId("4f35d3a88b4cff037f000122") },
  "filename" : "17703661253263627975.html",
  "media_type" : "google_scholar_case",
  "name" : "CORCORAN v. COLUMBIA BROADCASTING SYSTEM, Inc., et al.",
  "numbers" : [ "121 F.2d 575 (1941)" ],
  "ref_summary" : "Corcoran v. Columbia Broadcasting System, 121 F. 2d 575 - Circuit Court of Appeals, 9th Circuit 1941",
  "referenced" : [ "15974969240519593564", "7943512795249075682", "9986939737036121771", "5029666492191827803" ],
  "solr_term_freqs" : [ 2, 1, ..., 3, 1 ],
  "solr_term_list" : [ "1", "1084", ..., "work", "would" ],
  "subjects" : {
    "television" : 0,
    "fashion" : 0,
    "art" : 0,
    "publishing" : 0.0032362460624426603,
    "comics" : 0,
    "photography" : 0,
    "toys" : 0,
    "architecture" : 0,
    "sports" : 0,
    "maps" : 0,
    "theater" : 0,
    "music" : 0.0032362460624426603,
    "advertising" : 0,
    "internet" : 0,
    "videogames" : 0,
    "design" : 0,
    "film" : 0,
    "software" : 0
  },
  "tags" : [ "copyright" ],
  "url" : "scholar.google.com/scholar_case?case=17703661253263627975",
  "year" : 1941
}

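The `findOne` shell query on this slide translates almost directly to Python, since pymongo's `Collection.find_one` accepts the same filter document as a dict. The sketch below is an illustration only: `InMemoryCollection` is a hypothetical stand-in written for this example so the code runs without a database server, and it understands nothing beyond the `$size` filter used here. The sample documents are abbreviated from the record shown above, plus one invented record for contrast.

```python
class InMemoryCollection:
    """Hypothetical stand-in for a pymongo Collection.

    Supports only filters of the form {"field": {"$size": n}}.
    """

    def __init__(self, docs):
        self.docs = list(docs)

    def find_one(self, query):
        for doc in self.docs:
            if all(len(doc.get(field, [])) == cond["$size"]
                   for field, cond in query.items()):
                return doc
        return None


def find_case_with_n_references(collection, n):
    """Return one document whose `referenced` array has exactly n entries."""
    return collection.find_one({"referenced": {"$size": n}})


def nonzero_subjects(doc):
    """List (subject, weight) pairs with weight > 0, heaviest first."""
    pairs = [(name, w) for name, w in doc.get("subjects", {}).items() if w > 0]
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


DOCS = InMemoryCollection([
    {   # invented record, for contrast only
        "name": "EXAMPLE v. PLACEHOLDER",
        "year": 1950,
        "referenced": ["1"],
        "subjects": {"film": 0.1},
    },
    {   # abbreviated from the record on this slide
        "name": "CORCORAN v. COLUMBIA BROADCASTING SYSTEM, Inc., et al.",
        "year": 1941,
        "referenced": ["15974969240519593564", "7943512795249075682",
                       "9986939737036121771", "5029666492191827803"],
        "subjects": {"television": 0, "publishing": 0.0032362460624426603,
                     "music": 0.0032362460624426603, "film": 0},
    },
])

CASE = find_case_with_n_references(DOCS, 4)
print(CASE["name"], CASE["year"], nonzero_subjects(CASE))
```

Against a real server, the same helper should work unchanged if `DOCS` is replaced by a pymongo collection such as `MongoClient()["some_db"]["docs"]`, where the database name is a placeholder.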
Page 37: Eric E Monson, Text->Data 08 Nov 2012

Google Refine – Data cleaning

• Pros

- Free

- Useful

- Tools that no other package covers

- Training at Data & GIS Services

• Cons

- Clustering algorithms & parameters opaque
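
To make the clustering step a little less opaque: the simplest method Refine offers is key collision with a "fingerprint" key, where values are normalized and values sharing a key are proposed as one cluster. The sketch below approximates that idea in plain Python. It is not Refine's own code, the exact normalization steps may differ from the tool's, and the court-name variants are invented for illustration (only the first appears in the data shown earlier).

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Approximate fingerprint key: case, punctuation and word order are ignored."""
    text = value.strip().lower()
    # Fold accented characters to plain ASCII where possible.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^\w\s]", "", text)           # drop punctuation
    return " ".join(sorted(set(text.split())))   # unique tokens, sorted


def cluster(values):
    """Group values by fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


COURTS = [
    "United States Court of Appeals, Ninth Circuit.",
    "united states court of appeals ninth circuit",   # invented variant
    "Ninth Circuit, United States Court of Appeals",  # invented variant
    "Circuit Court of Appeals, Ninth Circuit.",       # different key: stays separate
]
CLUSTERS = cluster(COURTS)

for group in CLUSTERS:
    print(fingerprint(group[0]), "->", group)
```

The last entry shows the limit of this method: an older name for the same court produces a different key, so merging it takes a fuzzier method or a manual edit.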

Page 38: Eric E Monson, Text->Data 08 Nov 2012

Google Refine – Data cleaning

Page 39: Eric E Monson, Text->Data 08 Nov 2012

Google Refine – Data cleaning

Page 40: Eric E Monson, Text->Data 08 Nov 2012

Google Refine – Data cleaning

Page 41: Eric E Monson, Text->Data 08 Nov 2012

Google Refine – Data cleaning

Page 42: Eric E Monson, Text->Data 08 Nov 2012
Page 43: Eric E Monson, Text->Data 08 Nov 2012

Apache Solr – Searching

• Pros

- Lucene, fast & open-source

- Indexed full-text, faceted, “snippets” returned on searches

- Control over text processing (stemming)

- Rich document handling (PDF, Word)

• Cons

- Not as transparent or flexible as MongoDB (no command line, no embedded documents)

- Java running in a servlet container (Tomcat)

- Install & config a bit technical
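
The full-text, faceted and "snippet" features listed above map onto standard Solr request parameters (`q`, `facet`, `facet.field`, `hl`, `hl.fl`). The sketch below only builds such a request URL and does not contact a server. The base URL, the core name `decisions`, and the assumption that the Solr schema reuses the MongoDB field names `content`, `court` and `year` are all hypothetical.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, text, facet_fields, highlight_field, rows=10):
    """Build a Solr select URL with faceting and highlighting switched on."""
    params = [
        ("q", text),          # full-text query
        ("rows", rows),       # number of hits to return
        ("wt", "json"),       # ask for a JSON response
        ("facet", "true"),    # enable facet counts
    ]
    params += [("facet.field", field) for field in facet_fields]
    params += [
        ("hl", "true"),             # enable highlighted snippets
        ("hl.fl", highlight_field),  # field to take snippets from
    ]
    return base_url + "?" + urlencode(params)


# Hypothetical server location and core name.
BASE_URL = "http://localhost:8983/solr/decisions/select"
URL = build_solr_query(BASE_URL, "content:piracy", ["court", "year"], "content")
print(URL)
```

Repeating `facet.field` once per field is how Solr expects several facets to be requested, which is why the parameters are kept as a list of pairs rather than a dict.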

Page 44: Eric E Monson, Text->Data 08 Nov 2012
Page 45: Eric E Monson, Text->Data 08 Nov 2012

Topic modeling (LDA – mallet)

• Want to search / filter by topic

• Don’t want to manually label all cases

• Topic Modeling – Latent Dirichlet Allocation

- Topics are weighted groups of words

- Documents are weighted groups of topics

- Humans give topics names later
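
The two "weighted groups" statements above can be made concrete with a toy calculation: under LDA, the probability of a word in a document is the sum over topics of (topic weight in the document) times (word weight in the topic). The sketch below uses made-up numbers and made-up topic labels purely for illustration; it does not run MALLET or fit a model.

```python
# Topics are weighted groups of words (toy numbers; each topic sums to 1).
TOPICS = {
    "music": {"record": 0.5, "piracy": 0.3, "tape": 0.2},
    "publishing": {"book": 0.6, "author": 0.3, "piracy": 0.1},
}

# A document is a weighted group of topics (toy numbers; sums to 1).
DOC_TOPICS = {"music": 0.7, "publishing": 0.3}


def word_distribution(doc_topics, topics):
    """Mix the topic word lists according to the document's topic weights."""
    dist = {}
    for topic, topic_weight in doc_topics.items():
        for word, word_weight in topics[topic].items():
            dist[word] = dist.get(word, 0.0) + topic_weight * word_weight
    return dist


def top_words(topic, topics, n=2):
    """The heaviest words in a topic: what a human reads when naming it."""
    ranked = sorted(topics[topic].items(), key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in ranked[:n]]


WORD_DIST = word_distribution(DOC_TOPICS, TOPICS)

for word, prob in sorted(WORD_DIST.items(), key=lambda pair: -pair[1]):
    print(f"{word:8s} {prob:.2f}")
print("music topic top words:", top_words("music", TOPICS))
```

In a real run the model only produces numbered topics; labels such as "music" are attached afterwards by a person reading the top words, as the last bullet says.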

Page 46: Eric E Monson, Text->Data 08 Nov 2012

Topic modeling (LDA – mallet)

Page 47: Eric E Monson, Text->Data 08 Nov 2012

Topic modeling (LDA – mallet)

Page 48: Eric E Monson, Text->Data 08 Nov 2012
Page 49: Eric E Monson, Text->Data 08 Nov 2012

D3 – Visualization (Mike Bostock)

• Pros

- Lightweight & Fast

- Web (JavaScript & SVG)

- Almost infinite flexibility

- Attach data to DOM

- Transitions & Interactivity

- Free & open-source

• Cons

- Need programming expertise

- Learning curve & No pre-canned visualizations

Page 50: Eric E Monson, Text->Data 08 Nov 2012

Prototype examples

Page 51: Eric E Monson, Text->Data 08 Nov 2012
Page 52: Eric E Monson, Text->Data 08 Nov 2012
Page 53: Eric E Monson, Text->Data 08 Nov 2012
Page 54: Eric E Monson, Text->Data 08 Nov 2012
Page 55: Eric E Monson, Text->Data 08 Nov 2012
Page 56: Eric E Monson, Text->Data 08 Nov 2012

Future work

• More Sources!! – NYTimes & Vogue top priority

• Additional parsing – metadata

• Interaction & Tree Pruning – like Many Eyes

• Multi-tree comparisons

• Timelines – theme river