Intro to Text Processing (15383)
Lecture 8
Behrang Mohit

Some slides are borrowed or adapted from Noah Smith, Dan Jurafsky and Chris Manning lectures.

So far …

• Talked about:
  – Text processing
    • IE, doc classification
  – Basics of linguistics
  – Syntax
    • POS tagging


Words as symbols

• that empty sky was full of light
• that/DT empty/JJ sky/NN was/VBD full/JJ of/IN light/NN
• 22/DT 59/JJ 103/NN 4/VBD 58/JJ 9/IN 97/NN
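To a program, both the words and their tags are just integer symbols. A minimal sketch of such a mapping (the IDs above are arbitrary; this only illustrates the idea):

    # Assign each word an arbitrary integer ID, as a tagger or parser would.
    vocab = {}

    def word_id(w):
        """Return a stable integer symbol for word w, assigning new IDs on demand."""
        if w not in vocab:
            vocab[w] = len(vocab)
        return vocab[w]

    sentence = "that empty sky was full of light".split()
    print([word_id(w) for w in sentence])  # e.g. [0, 1, 2, 3, 4, 5, 6]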

Word Relationships

• We want computers to know how lexical items (words) like giraffe are related to each other.
  – e.g., so we can infer:
    • saw(Sara, giraffe) ⇒ saw(Sara, animal)
  – Disambiguate:
    • In my office there are two tables and one chair.
    • This year the chair of the conference is Prof. Ali.


Learning words

• Computers have no view of the world.

Learning words

• No understanding of dictionaries, Wikipedia, scientific text
• They understand symbols, statistics and graph structures
  – Lexical database


Lexical semantics

• Syntax
  – Word-level: POS tagging
  – Sentence-level: parse trees
• Semantics (meaning)
  – Lexical semantics: the study of the meaning of words and the connections between words

Words and their Relations

• Lexical item or word form: one or more words which form the basic elements of a language's lexicon (vocabulary)
  – horse
  – traffic light
• Sense: a meaning of a word form
  – Chair: an object used for sitting
  – Chair: the head of a group of humans


Lexical Matrix

• Many-to-many mappings between word forms and meanings.
• In the matrix figure (word forms crossed with meanings), word forms F1 and F2 map to the same meaning, so F1 and F2 are synonyms.

Lexical Matrix

• Chair and sofa are synonyms.
• Word relations:
  – Synonymy
  – Antonymy
  – [a-z]+nymy


Synonyms

• Synonym: a word that can be substituted for another without changing the meaning of the sentence.
  – e.g., drive(Mariem, car) vs. drive(Mariem, automobile)
  – Cases where words are not quite synonyms:
    • Ali likes to drink strong tea.
    • Ali likes to drink powerful tea.
  – Genre, style, etc. influence word usage.

Synonymy and word sense

• Synonymy is really between senses of words:
  – Zeinab likes fish and chips.
  – *Tariq likes fish and circuits.
  – The computer has many chips.
  – The computer has many circuits.


Antonyms

• Formal definition?
  – Not always "not-x":
    • not rich ≠ poor
  – Somewhere at the opposite ends of a scale:
    • sad ↔ happy
  – Or opposite changes:
    • rise/fall

[a-z]+nyms

• is-a relationship:
  – Hyponyms: specifications
    • Horse is a hyponym of animal.
  – Hypernyms: generalizations
    • Animal is a hypernym of horse.
• Part-whole relationship:
  – Meronym:
    • Wheel is a meronym of car.
  – Holonym:
    • Car is a holonym of wheel.


Hyponymy and Hypernymy

• One sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other:
  – car is a hyponym of vehicle
  – mango is a hyponym of fruit
• Conversely, hypernym/superordinate ("hyper is super"):
  – vehicle is a hypernym of car
  – fruit is a hypernym of mango

Hyponymy more formally

• Extensional:
  – The class denoted by the superordinate extensionally includes the class denoted by the hyponym.
• Entailment:
  – A sense A is a hyponym of sense B if being an A entails being a B.
• Hyponymy is usually transitive:
  – (A hypo B and B hypo C entails A hypo C)
• Another name: the IS-A hierarchy
  – A IS-A B (or A ISA B)
  – B subsumes A


WordNet

• Lexical database of English sense relations:
  – nouns, verbs, adjectives, adverbs
• Developed at Princeton University:
  – over 20 years of effort
  – Global WordNet: in many languages (e.g., Arabic)
• Organized by synsets:
  – sets of near-synonyms:
    • chair, sofa
    • chair, chairman, head

Senses of "bass" in WordNet

[Figure: WordNet entry listing the senses of "bass".]


Synsets for dog

• S: (n) dog, domestic dog, Canis familiaris (a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds) "the dog barked all night"
• S: (n) dog (informal term for a man) "you lucky dog"
• S: (n) cad, bounder, blackguard, dog, hound, heel (someone who is morally reprehensible) "you dirty dog"
• S: (n) frank, frankfurter, hotdog, hot dog, dog, wiener, wienerwurst, weenie (a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll)
• S: (n) pawl, detent, click, dog (a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward)
• S: (n) andiron, firedog, dog, dog-iron (metal supports for logs in a fireplace) "the andirons were too hot to touch"
• …

WordNet Noun Relations

[Figure: table of WordNet noun relations.]


WordNet 3.0

• Where it is:
  – http://wordnetweb.princeton.edu/perl/webwn
• Libraries:
  – Python: WordNet from NLTK
    • http://www.nltk.org/Home
  – Java:
    • JWNL, extJWNL on SourceForge
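For instance, a minimal NLTK session (assuming the WordNet data has been fetched once with nltk.download('wordnet')):

    from nltk.corpus import wordnet as wn

    # List the synsets (senses) of "dog" with their glosses
    for s in wn.synsets('dog'):
        print(s.name(), '-', s.definition())

    # Navigate sense relations from the first noun sense
    dog = wn.synset('dog.n.01')
    print(dog.hypernyms())       # more general synsets
    print(dog.part_meronyms())   # part-whole relations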

WordNet: hypernymy/hyponymy

[Figure: a WordNet hypernym/hyponym hierarchy.]


Synset

• MeSH (Medical Subject Headings)
  – 177,000 entry terms that correspond to 26,142 biomedical "headings"
• Hemoglobins
  – Entry terms: Eryhem, Ferrous Hemoglobin, Hemoglobin
  – Definition: The oxygen-carrying proteins of ERYTHROCYTES. They are found in all vertebrates and some invertebrates. The number of globin subunits in the hemoglobin quaternary structure differs between species. Structures range from monomeric to a variety of multimeric arrangements.

MeSH: Medical Subject Headings thesaurus from the National Library of Medicine

The MeSH Hierarchy

[Figure: the MeSH tree hierarchy.]


Uses of the MeSH Ontology

• Provide synonyms ("entry terms")
  – e.g., glucose and dextrose
• Provide hypernyms (from the hierarchy)
  – e.g., glucose ISA monosaccharide
• Indexing in the MEDLINE/PubMed database
  – NLM's bibliographic database:
    • 20 million journal articles
    • each article hand-assigned 10-20 MeSH terms

Word Sense Disambiguation (WSD)

• Which synset does a given word (and its context) belong to?
  – In my office there are two tables and one chair.
  – This year the chair of the conference is Prof. Ali.


How do we know when a word has more than one sense?

• The "zeugma" test: two senses of serve?
  – Which flights serve breakfast?
  – Does Lufthansa serve Dubai?
  – ?Does Lufthansa serve breakfast and Dubai?
• Since this conjunction sounds weird,
  – we say that these are two different senses of "serve".

Word Sense Disambiguation (WSD)

• Which synset does a given word (and its context) belong to?
  – In my office there are two tables and one chair.
  – This year the chair of the conference is Prof. Ali.
• Finding all word senses is a difficult problem:
  – We assume we have the collection of senses:
    • WordNet
  – Given a sentence, choose the right sense.
• Applications:
  – most areas of text processing
    • NER (e.g., Washington)


Approaches to WSD

• Dictionary-based:
  – unsupervised (e.g., the Lesk algorithm)
• Corpus-based approaches:
  – supervised algorithms (e.g., Ng & Lee 1996)

Lesk Algorithm (1986)

• Only requires a dictionary.
• The most likely sense for a word in a given context is decided by a measure of overlap between the definition of the target word and the definitions of the words in the context of the target word.


Lesk algorithm

[Figure: the Lesk algorithm.]

Lesk algorithm

• This year the chair of the conference is Prof. Ali.
• Chair: head or chief of an organization, meeting, …
• Conference: meeting, gathering, …
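Here the glosses of "chair" and "conference" share words such as "meeting", so the overlap measure favors the head-of-a-group sense. A minimal sketch of the simplified variant, which compares a sense's gloss against the surrounding words (real implementations filter stopwords and also use the glosses of the context words, as Lesk's original did; NLTK ships a ready-made nltk.wsd.lesk):

    from nltk.corpus import wordnet as wn

    def simplified_lesk(word, context_words):
        """Pick the sense of `word` whose gloss overlaps most with the context."""
        context = set(w.lower() for w in context_words)
        best, best_overlap = None, -1
        for sense in wn.synsets(word):
            gloss = set(sense.definition().lower().split())
            overlap = len(gloss & context)
            if overlap > best_overlap:
                best, best_overlap = sense, overlap
        return best

    sent = "This year the chair of the conference is Prof. Ali".split()
    print(simplified_lesk('chair', sent))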


LEXAS algorithm

• Uses supervised learning:
  – a human-labeled dataset where all terms are disambiguated
• For each sense of word w, a set of features is extracted from the training data.
• A test word t is compared with each of the training examples using a "distance" function.
• The sense of w which minimizes the distance to the test word t is selected.

LEXAS: Training phase

• A window of words: [L3, L2, L1, W, R1, R2, R3]
• Feature extraction: a feature vector of
  – POS of the words to the left and right
  – surrounding words
  – keywords that co-occur frequently with w
  – syntactic relationships …


LEXAS: Distance estimation

• Distance between feature vectors:
  – sum of the distances among the features
• Distance between two values v1 and v2 of feature f:

  d(v1, v2) = Σi |C1,i/C1 − C2,i/C2|,  summed over the n senses i

• C1,i: number of training examples with value v1 and sense i
• C1: number of training examples with value v1
• n: total number of senses for the word
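A small sketch of this value-difference computation (the data here is hypothetical, and feature extraction is assumed to have happened already):

    from collections import Counter

    def value_distance(examples, v1, v2, senses):
        """d(v1,v2) = sum over senses i of |C1,i/C1 - C2,i/C2|.
        `examples` is a list of (feature_value, sense) pairs."""
        c1 = Counter(s for v, s in examples if v == v1)
        c2 = Counter(s for v, s in examples if v == v2)
        n1, n2 = max(sum(c1.values()), 1), max(sum(c2.values()), 1)
        return sum(abs(c1[i] / n1 - c2[i] / n2) for i in senses)

    # Hypothetical labeled data: (left-neighbor word, sense of "chair")
    data = [('conference', 'person'), ('committee', 'person'),
            ('office', 'furniture'), ('wooden', 'furniture')]
    print(value_distance(data, 'conference', 'office', {'person', 'furniture'}))  # 2.0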

More supervised learning for WSD

• Availability of labeled data
  – Supervised algorithms:
    • Naïve Bayes
    • decision lists
    • …
  – Semi-supervised learning


WSD evaluation

• Intrinsic evaluation:
  – use gold-standard data to test and evaluate the algorithm itself
• Extrinsic evaluation:
  – employ the algorithm in a real-world application and see if it helps

WSD evaluation

• Senseval workshops: over 10 years of competition.
• Precision, recall, accuracy
• Baseline: most frequent sense
  – Not always great performance, even by humans:
    • on some polysemous words (e.g., title): ~70%
• Extrinsic evaluation: on MT, QA, IR
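WordNet lists a word's senses in order of tagged frequency, so the most-frequent-sense baseline is nearly a one-liner in NLTK (a sketch):

    from nltk.corpus import wordnet as wn

    def mfs_baseline(word, pos=None):
        """Most-frequent-sense baseline: WordNet orders synsets by frequency."""
        senses = wn.synsets(word, pos=pos)
        return senses[0] if senses else None

    print(mfs_baseline('title'))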


Word Similarity

• Synonymy: a binary relation
  – Two words are either synonymous or not.
• Similarity (or distance): a looser metric
  – Two words are more similar if they share more features of meaning.
• Similarity is properly a relation between senses:
  – The word "bank" is not similar to the word "slope".
  – Bank1 is similar to fund3.
  – Bank2 is similar to slope5.

Why word similarity

• Information retrieval
• Question answering
• Machine translation
• Language modeling
• Automatic essay grading
• Plagiarism detection


Word similarity and word relatedness

• We often distinguish word similarity from word relatedness:
  – Similar words: near-synonyms
  – Related words: can be related in any way
    • car, bicycle: similar
    • car, gasoline: related, not similar

Two classes of similarity algorithms

• Thesaurus-based algorithms:
  – Are words "nearby" in the hypernym hierarchy?
  – Do words have similar glosses (definitions)?
• Distributional algorithms:
  – Do words have similar distributional contexts?


Path-based similarity

• Two concepts (senses/synsets) are similar if they are near each other in the thesaurus hierarchy
  – i.e., have a short path between them
  – concepts have a path length of 1 to themselves

Refinements to path-based similarity

• pathlen(c1,c2) = 1 + number of edges in the shortest path in the hypernym graph between sense nodes c1 and c2

• simpath(c1,c2) = 1 / pathlen(c1,c2)
  – ranges from 0 to 1 (identity)

• wordsim(w1,w2) = max simpath(c1,c2), over c1 ∈ senses(w1), c2 ∈ senses(w2)


Example: path-based similarity

simpath(c1,c2) = 1/pathlen(c1,c2)

• simpath(nickel, coin) = 1/2 = .5
• simpath(fund, budget) = 1/2 = .5
• simpath(nickel, currency) = 1/4 = .25
• simpath(nickel, money) = 1/6 = .17
• simpath(coinage, Richter scale) = 1/6 = .17
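NLTK exposes this measure directly on synsets; a quick sketch (scores come from WordNet's actual hierarchy, so they won't match the toy numbers above exactly):

    from nltk.corpus import wordnet as wn

    def wordsim(w1, w2):
        """Max path similarity over all sense pairs of w1 and w2."""
        sims = [s1.path_similarity(s2) or 0
                for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
        return max(sims, default=0)

    print(wordsim('nickel', 'coin'))    # close in the hierarchy
    print(wordsim('nickel', 'money'))   # connected only higher up, so lower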

Problem with basic path-based similarity

• Assumes each link represents a uniform distance:
  – But nickel-to-money seems to us to be closer than nickel-to-standard.
  – Nodes high in the hierarchy are very abstract.
• We instead want a metric that:
  – represents the cost of each edge independently
  – makes words connected only through abstract nodes less similar


Information content similarity metrics

• Let's define P(c) as:
  – the probability that a randomly selected word in a corpus is an instance of concept c
  – Formally: there is a distinct random variable, ranging over words, associated with each concept in the hierarchy.
    • For a given concept, each observed noun is either:
      – a member of that concept, with probability P(c)
      – not a member of that concept, with probability 1 − P(c)
  – All words are members of the root node (Entity):
    • P(root) = 1
  – The lower a node is in the hierarchy, the lower its probability.

Information content similarity

• Train by counting in a corpus:
  – Each instance of hill counts toward the frequency of natural elevation, geological formation, entity, etc.
  – Let words(c) be the set of all words that are children of node c:
    • words("geo-formation") = {hill, ridge, grotto, coast, cave, shore, natural elevation}
    • words("natural elevation") = {hill, ridge}

  P(c) = ( Σ count(w), over w ∈ words(c) ) / N

[Figure: fragment of the WordNet hierarchy — entity > … > geological formation, with children natural elevation (hill, ridge), shore (coast), and cave (grotto).]


Information content similarity

• WordNet hierarchy augmented with probabilities P(c)

D. Lin. 1998. An Information-Theoretic Definition of Similarity. ICML 1998.
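NLTK ships precomputed information-content tables, so IC-based measures such as Lin's are available out of the box. A sketch (assumes nltk.download('wordnet_ic'); the sense indices below are illustrative, so check wn.synsets('nickel') first):

    from nltk.corpus import wordnet as wn, wordnet_ic

    brown_ic = wordnet_ic.ic('ic-brown.dat')   # P(c) estimated from the Brown corpus
    c1 = wn.synset('nickel.n.02')              # intended: the coin sense (index assumed)
    c2 = wn.synset('coin.n.01')
    print(c1.lin_similarity(c2, brown_ic))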

The (extended) Lesk Algorithm

• A thesaurus-based measure that looks at glosses.
• Two concepts are similar if their glosses contain similar words:
  – Drawing paper: paper that is specially prepared for use in drafting
  – Decal: the art of transferring designs from specially prepared paper to a wood or glass or metal surface
• For each n-word phrase that occurs in both glosses:
  – add a score of n²
  – "paper" and "specially prepared" → 1 + 2² = 5
• Compute overlap also for other relations:
  – glosses of hypernyms and hyponyms
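A rough sketch of that phrase-overlap scoring (naive n-gram matching; the full Banerjee & Pedersen implementation also removes words once they are part of a longer match, which this version does not):

    def gloss_overlap_score(gloss1, gloss2, max_n=3):
        """Score shared n-word phrases with n**2, so longer matches count more."""
        t1, t2 = gloss1.lower().split(), gloss2.lower().split()
        score = 0
        for n in range(max_n, 0, -1):
            grams2 = {tuple(t2[i:i + n]) for i in range(len(t2) - n + 1)}
            for i in range(len(t1) - n + 1):
                if tuple(t1[i:i + n]) in grams2:
                    score += n * n
        return score

    print(gloss_overlap_score(
        "paper that is specially prepared for use in drafting",
        "the art of transferring designs from specially prepared paper"))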


Libraries for computing thesaurus-based similarity

• NLTK
• WordNet::Similarity
  – http://wn-similarity.sourceforge.net/
  – Web-based interface:
    • http://marimba.d.umn.edu/cgi-bin/similarity/similarity.cgi

Problems with thesaurus-based meaning

• We don't have a thesaurus for every language.
• Even if we do, they have problems with recall:
  – Many words are missing.
  – Most (if not all) phrases are missing.
  – Some connections between senses are missing.
  – Thesauri work less well for verbs and adjectives:
    • Adjectives and verbs have less structured hyponymy relations.


Distributional models of meaning

• Zellig Harris (1954): "oculist and eye-doctor … occur in almost the same environments…. If A and B have almost identical environments we say that they are synonyms."
• Firth (1957): "You shall know a word by the company it keeps!"
• Offer much higher recall than hand-built thesauri:
  – although they tend to have lower precision

Intuition of distributional word similarity

A bottle of laban is on the table.
She likes laban.
Laban helps your digestion.
We make laban out of milk.

• From the context words, humans can guess that laban means:
  – a beverage like milk
• Intuition for the algorithm:
  – Two words are similar if they have similar word contexts.
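A toy sketch of that intuition: count each word's neighbors in a small window, then compare the count vectors with cosine similarity (real systems use large corpora and weighting schemes such as PPMI):

    from collections import Counter
    from math import sqrt

    def context_vector(target, sentences, window=2):
        """Count words appearing within `window` positions of `target`."""
        counts = Counter()
        for sent in sentences:
            toks = sent.lower().split()
            for i, tok in enumerate(toks):
                if tok == target:
                    lo, hi = max(0, i - window), i + window + 1
                    counts.update(t for t in toks[lo:hi] if t != target)
        return counts

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u)
        norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    corpus = ["a bottle of laban is on the table", "she likes laban",
              "laban helps your digestion", "we make laban out of milk",
              "a bottle of milk is on the table", "milk helps your digestion"]
    print(cosine(context_vector('laban', corpus), context_vector('milk', corpus)))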


Clustering related words

[Figure: clusters of distributionally related words.]

Summary

• We need to process and get at the meaning of text:
  – Lexical semantics: word-level meaning
    • usually in context
    • challenge: ambiguity (polysemy)
  – Semantic resources: WordNet
  – Word sense disambiguation
  – Word similarity