Cognitive Computation Group
Resources for Semantic Similarity
http://cogcomp.cs.illinois.edu
Textual Inference
Given a task like Question Answering… you have a large set of documents (e.g. all articles from the New York Times for 2011 and 2012) and a set of questions (e.g. “Who participated in the gubernatorial debates in January 2012?”). You must return excerpts of the documents that answer the questions.
What are the challenges?
Page 2
QA Example
Consider the following example question, and a sample document excerpt that might answer it:
Q. Where is the headquarters of the parent company of Solahart Services?
A. Aztec Solar, Inc. recently acquired Solahart Services of Stockton California. Aztec Solar, a Sacramento based residential and commercial solar company, is excited about acquiring Solahart's regional and national solar customers.
Page 3
QA Example
Given the QA pair on the previous page, a human reader might make the following inference steps:
1. Aztec Solar, Inc. recently acquired Solahart Services → Aztec Solar, Inc. is the parent company of Solahart Services.
2. “Aztec Solar, Inc.” looks like a company name.
3. Aztec Solar, a Sacramento based residential and commercial solar company → Aztec Solar is based in Sacramento.
4. “Aztec Solar” == “Aztec Solar, Inc.”
Page 4
QA Example
An automated system may use a matching process like this:
1. Rewrite the question: The headquarters of the parent company of Solahart Services is in <LOCATION>
2. Match question entities and tokens: LOCATION → Sacramento; company → Aztec Solar, Inc.
3. Apply structure-mapping rules: <LOCATION>-based <COMPANY> → <COMPANY> headquarters in <LOCATION>
4. This example can be easily perturbed to be more difficult (to thwart a shallow system)
Page 5
Outline
Introduction: Textual Inference example
Semantic Textual Similarity task
LLM: a baseline system
Comparators: overview; instances: WNSim, NESim
Annotators: POS, Chunk, NER, Coreference, SRL
Curator
Edison: data structures, calling Curator, feature extraction
Page 6
Textual Inference: Semantic Similarity
Grand NLP challenge: work at the level of the meaning of text. Do these two sentences mean the same thing?
1. John said he is considered a witness but not a suspect.
2. "He is not a suspect anymore," John said.
If they are different, how different are they? Rate similarity on a scale of 0…5: 0 == different topic; 5 == paraphrase.
http://www.cs.york.ac.uk/semeval-2012/task6/data/uploads/datasets/train-readme.txt
Page 7
Examples from STS training corpus
Nationally, the federal Centers for Disease Control and Prevention recorded 4,156 cases of West Nile, including 284 deaths.
There were 293 human cases of West Nile in Indiana in 2002, including 11 deaths statewide.
Score: 1.667
Chavez said investigators feel confident they've got "at least one of the fires resolved in that regard."
Albuquerque Mayor Martin Chavez said investigators felt confident that with the arrests they had "at least one of the fires resolved."
Score: 3.800
Page 8
CANDIDATE BASELINE:
LEXICAL LEVEL MATCHING (LLM)
Page 9
Words Matter
Approximate similarity of meaning via lexical overlap – how many words in common.
But this isn’t exactly fool-proof…
Page 10
Pair 1:
John Smith bought three cakes and two oranges
John bought two oranges
Pair 2:
John Smith bought three cakes and two oranges
John bought three oranges
LLM Scoring
Designed for Textual Entailment (inherently asymmetric): the proportion of matched Hypothesis tokens, normalized by the length of the shorter text.
Let T be the Text, containing tokens indexed by j.
Let H be the Hypothesis, with tokens indexed by i.
Let S(word1, word2) be a lexical similarity function that returns a value in the range [0,1].
Page 11
Score_{LLM}(T, H) = \frac{\sum_i \max_j S(h_i, t_j)}{|H|}
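This can be read off directly as code. Below is a minimal sketch of that computation, assuming tokenized inputs and a stand-in exact-match similarity for S; it is not the actual LlmComparator implementation (shown next).

// Minimal sketch of the LLM score above (not the actual LlmComparator code).
// sim(w1, w2) stands in for the lexical similarity function S; here it is a
// simple case-insensitive exact match returning 0 or 1.
static double sim( String w1, String w2 ) {
    return w1.equalsIgnoreCase( w2 ) ? 1.0 : 0.0;
}

static double llmScore( String[] textTokens, String[] hypTokens ) {
    double total = 0.0;
    for ( String h : hypTokens ) {            // i ranges over Hypothesis tokens
        double best = 0.0;
        for ( String t : textTokens ) {       // j ranges over Text tokens
            best = Math.max( best, sim( h, t ) );
        }
        total += best;                        // max_j S(h_i, t_j)
    }
    return total / hypTokens.length;          // normalize by |H|
}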
LLM code
http://cogcomp.cs.illinois.edu/page/software_view/LLM
import edu.illinois.cs.cogcomp.mrcs.comparators.LlmComparator;
String source = "Of the three kings referred to by their last names, Atawanaba was the oldest.";
String target = "Three kings were named in the lawsuit.";
// 'config' refers to the LlmComparator configuration; see the download page above for details
LlmComparator llm = new LlmComparator( config );
double result = llm.compareStrings( source, target );
Page 12
Can we do better? Depends on the application… a more advanced task may require more sophisticated patterns to separate classes.
Sparsity of features: many words/sequences of words may not occur very often, so a learned classifier may not generalize well; a more abstract representation can help.
Ambiguity of words – e.g. “terminal”, “moving”: additional information may help.
Meaning encoded in structure – e.g. “Matthew Smith, the Maverick’s last hope…”
NLP annotation tools generally abstract over underlying words so that features generalize better.
Page 13
COMPARATORS
Page 14
So you want to compare some text…
How similar are two words? Two strings? Two paragraphs?
Depends on what they are: string edit distance is usually a weak measure… think about coreference resolution…
Solution: specialized metrics
Page 15
String 1 | String 2 | Norm. edit sim.
Shiite | Shi’ ‘ite | 0.667
Mr. Smith | Mrs. Smith | 0.900
Wilbur T. Gobsmack | Mr. Gobsmack | 0.611
Frigid | Cold | 0.167
Wealth | Wreath | 0.667
Paris | France | 0.167
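For reference, the values above are consistent with a normalized edit similarity of the form 1 - editDistance / max(|s1|, |s2|). A rough sketch follows (this normalization is inferred from the table, not necessarily the CCG implementation):

// Rough sketch of a normalized edit similarity: 1 - levenshtein / max length.
// This normalization is an assumption inferred from the table above, not CCG code.
static double normEditSim( String a, String b ) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for ( int i = 0; i <= a.length(); i++ ) d[i][0] = i;
    for ( int j = 0; j <= b.length(); j++ ) d[0][j] = j;
    for ( int i = 1; i <= a.length(); i++ ) {
        for ( int j = 1; j <= b.length(); j++ ) {
            int sub = ( a.charAt( i - 1 ) == b.charAt( j - 1 ) ) ? 0 : 1;
            d[i][j] = Math.min( Math.min( d[i - 1][j] + 1, d[i][j - 1] + 1 ), d[i - 1][j - 1] + sub );
        }
    }
    int dist = d[a.length()][b.length()];
    return 1.0 - (double) dist / Math.max( a.length(), b.length() );
}
// e.g. normEditSim("Mr. Smith", "Mrs. Smith") = 1 - 1/10 = 0.900
//      normEditSim("Frigid", "Cold")         = 1 - 5/6  ≈ 0.167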
WNSim
Generates a table mapping terms linked in the WordNet ontology: synonymy, hypernymy, meronymy.
Score reflects distance (up to 3 edges, undirected – e.g. via lowest common subsumer).
Score is symmetric.
Page 16
String 1 | String 2 | WNSim similarity
Shiite | Shi’ ‘ite | 0
Mr. Smith | Mrs. Smith | 0
Wilbur T. Gobsmack | Mr. Gobsmack | 0
Frigid | Cold | 1
Wealth | Wreath | 0
Paris | France | 0
Using WNSim
Install and run the WNSim code: http://cogcomp.cs.illinois.edu/page/software_view/WNSim
Sets up an xmlrpc server.
Expects an xmlrpc ‘struct’ data structure (analogous to a Dictionary):
STRUCT { FIRST_STRING: aString; SECOND_STRING: anotherString }
Returns another xmlrpc data structure:
STRUCT { SCORE: aDouble; REASON: aString }
USE: call and cache (reduce network latency overhead). NOTE: the LLM code has a Java client…
Page 17
WNSim via Metric interface
String metricHost = "…";
int metricPort = …;
XmlRpcMetricClient client = new XmlRpcMetricClient( "WNSim",
                                                    metricHost,
                                                    metricPort );
MetricResponse response = client.compareStrings( source_, target_ );
double score = response.score;
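Following the “call and cache” advice on the previous slide, a simple memoizing wrapper might look like this (a sketch; the HashMap and key scheme are illustrative choices, not part of the WNSim distribution):

import java.util.HashMap;
import java.util.Map;

Map<String, Double> wnsimCache = new HashMap<String, Double>();

double cachedScore( XmlRpcMetricClient client, String s1, String s2 ) {
    // WNSim is symmetric, so order the pair canonically to get more cache hits
    String key = ( s1.compareTo( s2 ) <= 0 ) ? s1 + "|||" + s2 : s2 + "|||" + s1;
    Double cached = wnsimCache.get( key );
    if ( cached != null ) return cached;
    double score = client.compareStrings( s1, s2 ).score;
    wnsimCache.put( key, score );
    return score;
}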
Page 18
NESim
A set of entity-type-specific measures: acronyms, prefix/title rules, a distance metric.
Score reflects similarity based on type information.
Score is asymmetric.
Page 19
String 1 | String 2 | NESim similarity
Shiite | Shi’ ‘ite | 0.922
Joan Smith | John Smith | 0
Wilbur T. Gobsmack | Mr. Gobsmack | 0.95
Frigid | Cold | 0
Wealth | Wreath | 0.900
Paris | France | 0.411
Using NESim
NESim package from the CCG web site: http://cogcomp.cs.illinois.edu/page/software_view/NESim
NESim can use context to help determine similarity:
Specify token offsets of the NE string to indicate context (optional).
Specify Type as one of PER, LOC, ORG (optional).
[<Type>#]<original string>[#<start offset>#<end offset>]   (example strings below)
Note: offsets are inclusive, token-based, and zero-based.
Uses specialized resources depending on the type (if specified): rules/gazetteers for people’s names, acronyms for organizations.
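For concreteness, here are a few made-up strings in that format (the names and offsets are hypothetical):

// Hypothetical examples of the NESim input format described above.
String plain     = "Bill Gates";                                    // no type, no offsets
String typed     = "PER#Bill Gates";                                // type only
String inContext = "ORG#Microsoft announced a new deal today#0#0";  // type plus offsets:
// tokens 0..0 ("Microsoft") are the named entity; the rest of the string is context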
Page 20
Using NESim (cont’d)
Returns a score in [0, 1].
A threshold of 0.8 or higher is advised: weakly similar names are generally not semantically close.
Put the jar on the classpath and call programmatically. It loads large lists, so instantiate it only once.
import edu.illinois.cs.cogcomp.entityComparison.core.EntityComparison;
EntityComparison entityComparator = new EntityComparison();
entityComparator.compare( aName, anotherName );
double currentScore = entityComparator.getScore();
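Applying the advised threshold is then a one-line check (the 0.8 value comes from the guidance above):

// Accept the pair as a likely match only at or above the advised threshold.
boolean likelyMatch = currentScore >= 0.8;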
Problem: identifying NE boundaries, types
Page 21
ANNOTATORS
Page 22
Available from CCG
Tokenization/Sentence Splitting
Part of Speech
Chunking
Named Entity Recognition
Coreference
Semantic Role Labeling
Page 23
Tokenization and Sentence Segmentation
Given a document, find the sentence and token boundaries.

The police chased Mr. Smith of Pink Forest, Fla. all the way to Bethesda, where he lived. Smith had escaped after a shoot-out at his workplace, Machinery Inc.

Why?
Word counts may be important features.
Words may themselves be the object you want to classify.
“lived.” and “lived” should give the same information.
Different analyses need to align if you want to leverage multiple annotators from different sources/tasks.
Page 24
Tokenization and Sentence Segmentation ctd.
Believe it or not, this is an open problem.
No agreed standard for token-level segmentation: e.g. “American-led” vs. “American - led”? e.g. “$ 32 M” vs “$32 M” and “$32M”? Different tasks may use different standards.
No wildly successful sentence segmenter exists (see the excerpts in news aggregators for some nice errors).
Noisier text (e.g. online consumer reviews) → poorer performance (for reasons like inconsistent capitalization).
The LBJ distribution includes the Illinois tokenizer and sentence segmenter.
Page 25
Part of Speech (POS)
Allows simple abstraction for pattern detection.
Disambiguate a target, e.g. “make (a cake)” vs. “make (of car)”.
Specify more abstract patterns, e.g. Noun Phrase: ( DT JJ* NN ) – see the sketch below.
Specify context in an abstract way, e.g. “DT boy VBX” for “actions boys do”; this expression will catch “a boy cried”, “some boy ran”, …
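One cheap way to realize such tag patterns (a sketch, not one of the CCG tools) is to join the POS tags into a string and match a regular expression against it:

// Sketch: detect the NP pattern ( DT JJ* NN ) at the start of a tag sequence.
// The tags would normally come from the POS annotator; they are hard-coded here.
String[] tags = { "DT", "JJ", "JJ", "NN", "VBD" };   // e.g. "the big red ball bounced"
String tagString = String.join( " ", tags );
boolean startsWithNP = tagString.matches( "DT( JJ)* NN.*" );   // true for this sequence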
Page 26
POS:  DT   NN   VBD    PP   DT   JJ       NN
Word: The  boy  stood  on   the  burning  deck

POS:  DT   NN   VBD    PP   DT   JJ    NN
Word: A    boy  rode   on   a    red   bicycle
Chunking
Identifies phrase-level constituents in sentences:

[NP Boris] [ADVP regretfully] [VP told] [NP his wife] [SBAR that] [NP their child] [VP could not attend] [NP night school] [PP without] [NP permission] .

Useful for filtering: identify e.g. only noun phrases, or only verb phrases.
Groups modifiers with heads; useful for e.g. Mention Detection.
Used as a source of features, e.g. distance (abstracts away determiners and adjectives, for example), sequence, …
More efficient to compute than a full syntactic parse.
Applications in e.g. Information Extraction – getting (simple) information about concepts of interest from text documents.
Page 27
Named Entity Recognition
Identifies and classifies strings of characters representing proper nouns:

[PER Neil A. Armstrong] , the 38-year-old civilian commander, radioed to earth and the mission control room here: “[LOC Houston] , [ORG Tranquility] Base here; the Eagle has landed."

Useful for filtering documents: “I need to find news articles about organizations in which Bill Gates might be involved…”
Disambiguates tokens: “Chicago” (team) vs. “Chicago” (city).
Source of abstract features, e.g. “verbs that appear with entities that are Organizations”, or “documents that have a high proportion of Organizations”.
Page 28
Coreference
Identify all phrases that refer to each entity of interest – i.e., group mentions of concepts.

[Neil A. Armstrong] , [the 38-year-old civilian commander], radioed to [earth]. [He] said the famous words, “[the Eagle] has landed.”

The Named Entity recognizer only gets us part-way: if we ask, “what actions did Neil Armstrong perform?”, we will miss many instances (e.g. “He said…”).
A coreference resolver abstracts over different ways of referring to the same person.
Useful in feature extraction and information extraction.
Page 29
Semantic Role Labeler
SRL reveals relations and arguments in the sentence (where relations are expressed as verbs).
It cannot abstract over variability in expressing the relations – e.g. kill vs. murder vs. slay…
Page 30
CURATOR
Page 31
Big NLP
We introduced a lot of tools, some of them quite sophisticated.
The more complex, the bigger the memory requirement: NER: 1G; Coref: 1G; SRL: 4G ….
If you use tools from different sources, they may be in different languages, using different data structures.
If you run a lot of experiments on a single corpus, it would be nice to cache the results… and for your colleagues, nice if they can access that cache.
Curator is our solution to these problems.
Page 32
Curator
Supports distributed NLP resources:
Central point of contact
Single set of interfaces
Code generation in many languages (using Thrift)
Programmatic interface
Defines a set of common data structures used for interaction.
Caches processed data.
Enables a highly configurable NLP pipeline.
Overhead:
Annotation is all at the level of character offsets: normalization/mapping to token level required.
Need to wrap tools to provide the requisite data structures.
Page 33
Curator
Page 34
[Diagram: Curator as a central point of contact between client code, a cache, and annotators such as NER, SRL, POS, and the Chunker.]
Using Curator for Flexible NLP Pipeline
http://cogcomp.cs.illinois.edu/curator/demo/index_beta.html
For this class only: a dedicated Curator instance – a temporary instance with host and port accessible to class members.
http://cogcomp.cs.illinois.edu/trac/wiki/CuratorDataStructures
Recommended: access using the Edison library (next).
Page 35
Edison: An NLP Library
http://cogcomp.cs.illinois.edu/software/edison/
Convenient interface to Curator; converts to token-level indexing (often more convenient).
Supports feature extraction over trees: apply to syntactic parse/dependency trees, and to SRL/NOM.
E.g. see http://cogcomp.cs.illinois.edu/software/edison/FeaturesExample.html for examples of dependency path features.
Page 36
Serializing TextAnnotations

public void serializeAnnotations( List< TextAnnotation > annotations_, String outputFile_ ) throws Exception {
    try {
        ObjectOutputStream objOut = new ObjectOutputStream( new FileOutputStream( outputFile_ ) );
        objOut.writeObject( new Integer( annotations_.size() ) );
        for ( TextAnnotation ta : annotations_ ) {
            System.err.println( "serializing TA for text '" + ta.getText() + "'..." );
            objOut.writeObject( ta );
        }
        objOut.close();
    } catch ( IOException e ) {
        …
    }
    return;
}
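Reading the annotations back follows the same convention: the leading Integer gives the count, followed by that many TextAnnotation objects. A sketch of the counterpart, assuming the file layout produced above:

// Sketch of the matching deserialization (requires java.io.* and java.util.* imports).
public List< TextAnnotation > deserializeAnnotations( String inputFile_ ) throws Exception {
    List< TextAnnotation > annotations = new ArrayList< TextAnnotation >();
    ObjectInputStream objIn = new ObjectInputStream( new FileInputStream( inputFile_ ) );
    int numAnnotations = ( (Integer) objIn.readObject() ).intValue();   // count written first
    for ( int i = 0; i < numAnnotations; ++i ) {
        annotations.add( (TextAnnotation) objIn.readObject() );
    }
    objIn.close();
    return annotations;
}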
Page 37
K-best Views in Curator
The Charniak and Stanford parsers can be run in K-best mode.
These will be added to Curator with k=50; this will be quite disk-hungry, so these components will probably *not* be cached.
Curator uses a MultiParser interface for k-best parsers: it generates a parse view in the Record. The parse view is a List of Forests, where the k-th Forest contains the k-th best parse for all sentences in the record.
Edison does NOT yet directly support getting k-best parses from Curator, BUT…
Page 38
K-best views in Edison
Edison supports k-best views
List<View> topKParses = ...; // A list of top-k parses, say from Charniak
ta.addView(ViewNames.PARSE_CHARNIAK, topKParses);
List<View> parses = ta.getTopKViews(ViewNames.PARSE_CHARNIAK);
int tokenId = 17; // some token
Page 39
Edison k-best example cont’d
Constituent c = new Constituent( "", "", ta, tokenId, tokenId + 1 );
int treeId = 0;
for ( View parseTree : parses ) {
    for ( Constituent parseConstituent : parseTree.where( Queries.containsConstituent( c ) ) ) {
        // do something with parseConstituent belonging to tree "treeId"
    }
    treeId++;
}
Page 40
A FINAL WORD
Page 41
LLM and Semantic Similarity
LLM was designed for Textual Entailment, and is asymmetric by design.
This task is a little different: we are trying to assess the level of semantic equivalence of two sentences S1 and S2.
We still want to normalize (we don’t want all short sentence pairs to have lower scores than long sentence pairs), but consider evaluating for both (S1, S2) and for (S2, S1).
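With the LlmComparator from earlier, that might be as simple as scoring both directions and combining them; taking the max (or the mean) is one reasonable choice, not something the library prescribes:

// Score both directions with the (asymmetric) LLM and combine.
double forward   = llm.compareStrings( s1, s2 );
double backward  = llm.compareStrings( s2, s1 );
double symmetric = Math.max( forward, backward );   // or average the two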
Page 42
FIN
Page 43