Linguistic Ethnography: Identifying Dominant Word Classes in Text

Rada Mihalcea, University of Michigan
Stephen Pulman, Oxford University

Source: web.eecs.umich.edu/~mihalcea/498IR/Lectures/LinguisticEthnography.pdf

Linguistic Ethnography?

•  Finding and understanding patterns in given types of text
   –  Find the characteristics of a text
   –  Reflective of behavior or style

•  Examples
   –  Female vs. male authored texts (gender)
   –  Texts describing happy vs. sad moods (mood)
   –  Humorous vs. non-humorous text (comic)
   –  Introvert vs. extrovert authors (psychology)

Linguistic Ethnography vs. Text Classification

•  Text classification:
   –  Automatic separation of classes of text
   –  Supervised or semi-supervised algorithms (Naïve Bayes, SVM, perceptron, etc.)
   –  Feature weighting and selection

•  Linguistic ethnography
   –  Identification of classes of words over salient features
   –  Understand the characteristics of the texts
   –  Insights into the properties and behaviors modeled by those texts

An Example: Finding Happiness

Well kids, I had an awesome birthday thanks to you. =D Just wanted to so thank you for coming and thanks for the gifts and junk. =) I have many pictures and I will post them later. hearts

mood:

Home alone for too many hours, all week long ... screaming child, headache, tears that just won’t let themselves loose.... and now I’ve lost my wedding band. I hate this.

mood:

Corpus-derived Happiness Factors

Word        Happiness factor    Word        Happiness factor
yay         86.67               goodbye     18.81
shopping    79.56               hurt        17.39
awesome     79.71               tears       14.35
birthday    78.37               cried       11.39
lovely      77.39               upset       11.12
concert     74.85               sad         11.11
cool        73.72               cry         10.56
cute        73.20               died        10.07
lunch       73.02               lonely       9.50
books       73.02               crying       5.50

Identifying Word Classes in Text

•  Foreground corpus: corpus of texts of interest
•  Background corpus: “neutral” texts
   –  Collection of texts that do not have the property shared by the foreground corpus
   –  Balanced corpus
      •  Mix of texts

•  Goal: identify word classes that are dominant in the foreground corpus

Word Class Dominance

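•  C = {W1, W2, …, Wn}

\[
\mathrm{Coverage}_F(C) = \frac{\sum_{W_i \in C} \mathrm{Frequency}_F(W_i)}{\mathrm{Size}_F}
\]

\[
\mathrm{Coverage}_B(C) = \frac{\sum_{W_i \in C} \mathrm{Frequency}_B(W_i)}{\mathrm{Size}_B}
\]

\[
\mathrm{Dominance}_F(C) = \frac{\mathrm{Coverage}_F(C)}{\mathrm{Coverage}_B(C)}
\]

•  Score significantly higher than 1: word classes that are dominant in the foreground corpus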
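
The coverage and dominance scores defined on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the tokenizer, the function names, and the two toy corpora are assumptions made for the example, and Size is taken to be the number of tokens in each corpus.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a text and split it into simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def coverage(word_class, counts, size):
    """Share of the corpus tokens that belong to the word class."""
    return sum(counts[w] for w in word_class) / size


def dominance(word_class, foreground_tokens, background_tokens):
    """Ratio of foreground coverage to background coverage for one class."""
    fg_counts = Counter(foreground_tokens)
    bg_counts = Counter(background_tokens)
    cov_f = coverage(word_class, fg_counts, len(foreground_tokens))
    cov_b = coverage(word_class, bg_counts, len(background_tokens))
    if cov_b == 0:
        # The class never occurs in the background corpus: the ratio is undefined.
        return float("inf") if cov_f > 0 else float("nan")
    return cov_f / cov_b


# Toy corpora, invented for illustration only.
foreground = tokenize("you know you are wrong when you think you are right")
background = tokenize(
    "the market closed higher and analysts think you may see gains today"
)

you_class = {"you", "thou", "thy", "thee"}
score = dominance(you_class, foreground, background)
print(f"dominance of the YOU class: {score:.2f}")
```

A score well above 1 means the class covers a larger share of the foreground corpus than of the background corpus. In practice the background corpus should be large enough that most classes have non-zero coverage in it.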

Lexical Resources for Word Classes

•  Roget
   –  Thesaurus of the English language
   –  100,000 words grouped based on synonymy and other semantic relations
•  Linguistic Inquiry and Word Count (LIWC)
   –  Lexicon developed for psycholinguistic analysis (Pennebaker et al.)
   –  2,200 words grouped into 70 classes
•  WordNet-Affect
   –  Resource built on top of WordNet
   –  Annotations with the emotions in the classification of Ortony
   –  Focus on: anger, disgust, fear, joy, sadness, surprise

Word Class Examples

•  Roget:
   –  PERFECTION: perfection, purity, integrity, impeccability, …
   –  MEDIOCRITY: mediocrity, dullness, indifference, inferiority, …

•  LIWC:
   –  OPTIMISM: accept, best, confidence, glorious, hope, …
   –  SOCIAL: adult, advice, affair, boy, buddies, comrade, …

•  WordNet-Affect:
   –  ANGER: offense, temper, irritation, fury, rage, …
   –  JOY: worship, adoration, sympathy, tenderness, respect, love, …

A Case Study: Verbal Humour

•  Gain insights into the “language of humour”
•  Find classes of words that are dominant in humorous text
•  Foreground corpus: humorous text
   –  Two types of verbal humour:
      •  One-liners
      •  Humorous news articles
•  Background corpus: non-humorous text
   –  A mix of data from non-humorous sources: Reuters newspapers, British National Corpus, proverbs, Open Mind Common Sense

Humorous Data: One-liners

•  “He who smiles in a crisis has found someone to blame”
•  Short sentence, simple syntax
•  Deliberate use of rhetorical devices (alliteration, rhyme)
•  Frequent use of creative language
•  Comic effect

•  Web-based bootstrapping
   •  Start with a few manually selected seeds
   •  Identify a list of Web pages including at least one seed
   •  Parse Web pages and find new one-liners
   •  Repeat
   –  16,000 one-liners
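
The bootstrapping loop on this slide can be sketched as follows. The slide does not say how pages are retrieved or how one-liners are recognized inside a page, so everything concrete here is an assumption: the "Web" is a hypothetical in-memory dictionary of pages, and any short line (at most `max_words` words) on a page that already contains a known one-liner is accepted as a new one.

```python
def bootstrap_one_liners(seeds, pages, max_words=15, max_rounds=10):
    """Grow a set of one-liners from seeds using pages that contain them."""
    found = set(seeds)
    for _ in range(max_rounds):
        new_items = set()
        for lines in pages.values():
            # Keep only pages that include at least one known one-liner.
            if not any(line in found for line in lines):
                continue
            # "Parse" the page: every other short line is a candidate (assumed heuristic).
            for line in lines:
                if line not in found and len(line.split()) <= max_words:
                    new_items.add(line)
        if not new_items:
            break  # nothing new was discovered, so stop repeating
        found |= new_items
    return found


# Hypothetical stand-in for the Web: URL -> lines of text on that page.
pages = {
    "http://example.org/jokes1": [
        "He who smiles in a crisis has found someone to blame",
        "Only adults have trouble with child-proof bottles.",
    ],
    "http://example.org/jokes2": [
        "Only adults have trouble with child-proof bottles.",
        "When everything comes your way, you are in the wrong lane.",
    ],
    "http://example.org/news": [
        "Markets closed higher on Monday.",
    ],
}

seeds = ["He who smiles in a crisis has found someone to blame"]
one_liners = bootstrap_one_liners(seeds, pages)
print(len(one_liners))
```

The second page is reached only in the second round, through a one-liner found in the first round, which is the point of repeating the process. A real crawl would also need filtering against noise, since pages that quote one joke often contain ordinary short sentences as well.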

Humorous Data: News stories

•  “The Onion”
   –  “the best source of humour out there” (Jeff Greenfield, CNN)

•  Canadian Prime Minister Jean Chrétien and Indian President Abdul Kalam held a subdued press conference in the Canadian Capitol building Monday to announce that the two nations have peacefully and sheepishly resolved a dispute over their common border. "We are - well, I guess proud isn't the word - relieved, I suppose, to restore friendly relations with India after the regrettable dispute over the exact coordinates of our shared border," said Chrétien, who refused to meet reporters' eyes as he nervously crumpled his prepared statement. "The border that, er... Well, I guess it turns out that we don't share a border after all." Chrétien then officially withdrew his country's demand that India hand over a 20-mile-wide stretch of land that was to have served as a demilitarized buffer zone between the two nations.

–  1,125 news articles from August 2005 to March 2006
   •  1,000-10,000 characters

Dominant Roget Word Classes in Humorous Text

•  anonymity 3.48 : you, person, cover, anonymous, unknown, unidentified, unspecified

•  odor 3.36 : nose, smell, strong, breath, inhale, stink, pong, perfume, flavor

•  secrecy 2.96 : close, wall, secret, meeting, apart, ourselves, security, censorship

•  wrong 2.83 : wrong, illegal, evil, terrible, shame, beam, incorrect, pity, horror

•  unorthodoxy 2.52 : error, non, err, wander, pagan, fallacy, atheism, erroneous, fallacious

•  overestimation 2.45 : think, exaggerate, overestimated, overestimate, exaggerated

•  disarrangement 2.18 : trouble, throw, ball, bug, insanity, confused, upset, mess, confuse

Dominant LIWC Word Classes in Humorous Text

•  you 3.17 : you, thou, thy, thee, thin
•  I 2.84 : myself, mine
•  swear 2.81 : hell, ass, butt, suck, dick, arse, bastard, sucked, sucks, boobs
•  self 2.23 : our, myself, mine, lets, ourselves, ours
•  sexual 2.07 : love, loves, loved, naked, butt, gay, dick, boobs, cock, horny, fairy
•  groom 2.06 : soap, shower, perfume, makeup
•  cause 1.99 : why, how, because, found, since, product, depends, thus, cos
•  humans 1.79 : man, men, person, children, human, child, kids, baby, girl, boy

Dominant WordNet-Affect Word Classes in Humorous Text

•  surprise 3.31 : stupid, wonder, wonderful, beat, surprised, surprise, amazing, terrific

Evaluation

•  How good are these classes?
•  Derive word classes from different data sets and measure correlation
   •  Split the one-liners in two: 8,000 one-liners vs. 8,000 one-liners
   •  Split the news stories in two: 550 stories vs. 550 stories
   •  16,000 one-liners vs. 1,100 news stories

                                 Roget   LIWC
one-liners vs. one-liners        0.95    0.96
news stories vs. news stories    0.84    0.88
one-liners vs. news stories      0.63    0.42
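
The evaluation compares the class scores obtained from two data sets. The slide does not name the correlation measure, so the sketch below assumes Pearson correlation over the dominance scores of the classes that both data sets share. The scores in `half_1` and `half_2` are hypothetical values made up for the example, not results from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def class_correlation(scores_a, scores_b):
    """Correlate dominance scores over the classes present in both sets."""
    shared = sorted(set(scores_a) & set(scores_b))
    return pearson([scores_a[c] for c in shared], [scores_b[c] for c in shared])


# Hypothetical dominance scores from two halves of a corpus (not from the slides).
half_1 = {"you": 3.2, "swear": 2.7, "self": 2.1, "cause": 1.9, "humans": 1.8}
half_2 = {"you": 3.0, "swear": 2.9, "self": 2.3, "cause": 2.0, "humans": 1.7}

r = class_correlation(half_1, half_2)
print(f"correlation between halves: {r:.2f}")
```

A rank-based measure such as Spearman correlation would be an equally reasonable reading of the slide; the choice matters mostly when a few classes have extreme scores.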

Characteristics of Verbal Humour

•  Observed by analyzing the word classes
•  Human-centeredness
   –  YOU, I, SELF, HUMANS
   •  you occurs in more than 25% of the one-liners
   •  “You can always find what you are not looking for.”
   •  professional communities
   •  “It was so cold last winter, that I saw a lawyer with his hands in his own pockets.”
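
A figure such as "you occurs in more than 25% of the one-liners" is a document-frequency count: the share of one-liners that contain the word at least once. A minimal sketch follows, assuming whole-word, case-insensitive matching (the slides do not state the matching rule) and using only one-liners quoted in these slides, which is far too small a sample for a real estimate.

```python
import re


def share_containing(word, texts):
    """Fraction of texts in which the word occurs at least once as a whole word."""
    if not texts:
        raise ValueError("texts must not be empty")
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    hits = sum(1 for text in texts if pattern.search(text))
    return hits / len(texts)


# One-liners quoted in these slides; used here only as toy input.
one_liners = [
    "You can always find what you are not looking for.",
    "It was so cold last winter, that I saw a lawyer with his hands in his own pockets.",
    "Only adults have trouble with child-proof bottles.",
    "When everything comes your way, you are in the wrong lane.",
]

print(share_containing("you", one_liners))
```

Whole-word matching keeps "your" from being counted as "you"; whether inflected or derived forms should count is a design choice the slides leave open.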

Characteristics of Verbal Humour

•  Negative polarity
   –  WRONG, UNORTHODOXY, DISARRANGEMENT
   •  “Only adults have trouble with child-proof bottles.”
   •  “When everything comes your way, you are in the wrong lane.”

Dominant Classes in Humour

– Human-centeredness: human-related semantic classes found dominant in humorous text as compared to non-humorous text

– Negative polarity: semantic classes with negative orientation

•  Humour as “natural therapy” where tensions related to negative scenarios concerning us humans are relieved through laughter

•  Correlation with empirical observations from previous work
   •  Human-centeredness, negative polarity, sexual vocabulary, swear words, surprise

Conclusions

•  Find the dominant word classes in types of text
   •  Reflective of behavior or style
   •  Systematic and portable

•  Case study on humour:
   •  Good correlation among classes derived from different corpora
   •  Correlation with empirical observations from previous work

A conclusion is simply the place where you got tired of thinking.