Basis Technology – Human Language Technology Conference 2012
Things, not Strings: From Entity Extraction to Entity Resolution
David Murgatroyd
VP, Engineering
Basis Technology
Your job is to analyze reciprocal antagonism between Christian and Islamic extremists across the globe. You want to find information on the Internet on Christian extremist reaction to the killing of the U.S. Ambassador to Libya.
Motivation
That was a lot of work. Can text analytics help?
Help?
Filter out pages with the wrong guy?
Filter?
Filter Example
Add some filters (a/k/a facets)…
Filter?
Filter results by…
People <choice 1> <choice 2> <choice 3> …
But what can we use as choices?
Filter?
Filter results by…
People <choice 1> <choice 2> <choice 3> …
Find names of people, places, and organizations in a document.
Entity Extraction (Name Tagging)
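As a rough illustration of name tagging, the sketch below looks up names from a tiny, hypothetical gazetteer (`GAZETTEER` and `tag_names` are made up for this example). Production taggers typically rely on trained statistical models rather than a fixed list, so treat this only as a picture of the input and output.

```python
import re

# Hypothetical gazetteer for illustration only; not a real product's data.
GAZETTEER = {
    "Chris Stevens": "PERSON",
    "Dan Cathy": "PERSON",
    "Libya": "LOCATION",
    "Basis Technology": "ORGANIZATION",
}

def tag_names(text):
    """Return sorted (start, end, surface, type) tuples for gazetteer names in text."""
    mentions = []
    for name, entity_type in GAZETTEER.items():
        # Word boundaries keep us from matching inside longer words.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            mentions.append((match.start(), match.end(), name, entity_type))
    return sorted(mentions)

sample = "Chris Stevens was the U.S. Ambassador to Libya."
tags = tag_names(sample)
for start, end, surface, entity_type in tags:
    print(start, end, surface, entity_type)
```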
Group names referring to the same person, within a document.
In-document Coreference Resolution
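A minimal sketch of in-document grouping, assuming a simple heuristic that is not taken from the talk: mentions sharing the same final token (usually the surname) are placed in one chain. The function name `group_mentions` is hypothetical, and real coreference systems use far richer evidence.

```python
def group_mentions(mentions):
    """Group person mentions from ONE document into chains by shared final token."""
    chains = {}
    for mention in mentions:
        # Assumption: the last token is the surname and identifies the person
        # within a single document.
        surname = mention.split()[-1].lower()
        chains.setdefault(surname, []).append(mention)
    return list(chains.values())

doc_mentions = ["J. Christopher Stevens", "Stevens", "Dan Cathy", "Chris Stevens", "Cathy"]
chains = group_mentions(doc_mentions)
for chain in chains:
    print(chain)
```

The heuristic is only plausible inside one document; across documents it runs straight into the ambiguity and variety problems discussed later in the deck.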
But what can we use as choices?
Filter choices?
Filter results by…
People <choice 1> <choice 2> <choice 3> …
Choices: first way that each person was mentioned in each document?
Filter choices?
Filter results by…
Persons named: Kris Stephens, Chris Stephens, Dan Cathy, George Little, …
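To make the "first mention as facet label" idea concrete, here is a small sketch. It assumes each document has already been reduced to a list of coreference chains (lists of name strings); the function name `first_mention_facets` and the sample data are hypothetical.

```python
from collections import Counter

def first_mention_facets(docs):
    """Count documents per facet label, using the first mention of each chain as the label."""
    counts = Counter()
    for chains in docs:
        # A set ensures a document is counted once per label.
        labels = {chain[0] for chain in chains if chain}
        counts.update(labels)
    return counts

# Each document is a list of chains; each chain lists mentions in order of appearance.
docs = [
    [["Chris Stephens", "Stephens"], ["Dan Cathy"]],
    [["Kris Stephens"]],
    [["Chris Stephens"]],
]
facets = first_mention_facets(docs)
print(facets.most_common())
```

Because the labels are plain strings, two different people named "Chris Stephens" share one facet, while one person spelled two ways gets two facets.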
Choices: first name string for each person in each document?
Filter?
Add filters…
Persons named: Dan Cathy, George Little, …
Filtered by…
Persons named: Chris Stephens
Problem: Ambiguity – one name, many entities
Filter?
Add filters…
Persons named: Dan Cathy, George Little, …
Filtered by…
Persons named Chris Stephens
Problem: Variety – one person, many names
Filter?
Add filters…
Persons named: Dan Cathy, George Little, …, Chris Stevens, J. Christopher Stevens, …
Filtered by…
Persons named Chris Stephens
Where does your favorite data set fall?
[Chart: data sets positioned by Ambiguity and Variety, with # of documents on a scale of 1, Thousands, Millions, Billions]
Magically group names by person across documents.
Deal with ambiguity and variety?
Filter results by…
People <choice 1> <choice 2> <choice 3> …
But there’s still the problem of choices…
Labels for choices?
Filter results by…
People <choice 1> <choice 2> <choice 3> …
Use person’s name from highest ranked doc? Still some ambiguity.
Labels for choices?
Filter results by…
People: Kris Stephens, Chris Stephens 1, Chris Stephens 2, …
Entity Resolution: group and also link to a database of known entities (e.g., Wikipedia).
Labels for choices?
Filter results by…
People: Kris Stephens, Chris Stephens 1, Chris Stephens 2, …
Kris Stephens, J. Christopher Stevens, Chris Stephens, …
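The linking half of entity resolution can be sketched as a lookup followed by disambiguation. The example below is only a toy under stated assumptions: the one-entry knowledge base `KB`, its alias and context sets, and the `resolve` function are all hypothetical, and word overlap stands in for whatever evidence a real resolver would use.

```python
# Hypothetical one-entry knowledge base keyed by a Wikipedia-style title.
# The misspelled alias "chris stephens" is an assumption made for this example.
KB = {
    "J. Christopher Stevens": {
        "aliases": {"chris stevens", "chris stephens", "j. christopher stevens"},
        "context": {"ambassador", "libya", "benghazi"},
    },
}

def resolve(mention, context_words):
    """Return the KB title that best matches the mention in its context, or None."""
    name = mention.lower()
    words = {w.lower() for w in context_words}
    best_title, best_overlap = None, 0
    for title, entry in KB.items():
        if name not in entry["aliases"]:
            continue  # not a candidate for this mention
        overlap = len(entry["context"] & words)
        if overlap > best_overlap:
            best_title, best_overlap = title, overlap
    return best_title  # None means "not in the database"

print(resolve("Chris Stephens", ["ambassador", "Libya"]))
print(resolve("Chris Stephens", ["pastor", "church"]))
```

Returning `None` when no candidate shares any context is one simple way to flag entities that are missing from the database and still need a label.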
Labels for choices?
Filter results by…
For items not in the database, infer a unique label (e.g., for a hypothetical Wikipedia page).
People: Kris Stephens, J. Christopher Stevens, Chris Stephens, …
For items not in the database, infer a unique label (e.g., for a hypothetical Wikipedia page).
Filter?
Filter results by…
People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor)
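One possible way to infer such a label, shown purely as a sketch: pick the most frequent role word found near the person's mentions and append it in parentheses, Wikipedia-style. The `ROLE_WORDS` list and `infer_label` function are hypothetical; the talk does not say how labels are actually inferred.

```python
from collections import Counter

# Hypothetical list of role words that may serve as disambiguators.
ROLE_WORDS = {"pastor", "ambassador", "senator", "coach"}

def infer_label(name, context_words):
    """Build a label 'Name (role)' from the most frequent role word in the context."""
    roles = Counter(w.lower() for w in context_words if w.lower() in ROLE_WORDS)
    if not roles:
        return name  # nothing to disambiguate with; fall back to the bare name
    role, _count = roles.most_common(1)[0]
    return f"{name} ({role})"

print(infer_label("Kris Stephens", ["Pastor", "Kris", "pastor", "church"]))
print(infer_label("Dan Cathy", ["chicken", "restaurant"]))
```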
Let’s give it a try…
Filter.
Filter results by…
People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …
Let’s give it a try…
Filter.
Add filters…
People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …
Filtered by…
People J. Christopher Stevens
Does it work?
How do you measure?
Imagine this was the result of applying the filter with the name from Wikipedia.
How do you measure?
Add filters…
People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …
Filtered by…
People J. Christopher Stevens
Precision: for each document, how much of the stuff grouped with it is correct?
How do you measure?
Add filters…
People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …
Filtered by…
People: J. Christopher Stevens
✓ 2 / 3 = 67%
✓ 2 / 3 = 67%
✗ 1 / 3 = 33%
Recall: for each document, how much of the correct stuff is grouped with it?
How do you measure?
Add filters…
People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …
Filtered by…
People: J. Christopher Stevens
✓ 2 / 5 = 40%
✓ 2 / 5 = 40%
✗ ✗ ✗ (correct documents missing from the group)
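The per-document precision and recall figures on these two slides can be reproduced with a few lines of code. The slides do not name the metric; the calculation resembles the B-cubed clustering measure, which is an assumption here. The function name and the document ids below are hypothetical, and both clusterings are assumed to cover the same set of documents.

```python
def per_document_scores(predicted, gold):
    """Return {doc: (precision, recall)} for predicted vs. gold clusters of doc ids."""
    pred_of = {d: set(c) for c in predicted for d in c}
    gold_of = {d: set(c) for c in gold for d in c}
    scores = {}
    for doc in pred_of:
        # Documents that are grouped with `doc` AND truly belong with it.
        correct = pred_of[doc] & gold_of[doc]
        precision = len(correct) / len(pred_of[doc])
        recall = len(correct) / len(gold_of[doc])
        scores[doc] = (precision, recall)
    return scores

# Precision slide: the filter returns a, b (correct) and x (wrong person).
precision_case = per_document_scores([["a", "b", "x"]], [["a", "b"], ["x"]])
print({d: round(p, 2) for d, (p, r) in precision_case.items()})  # 2/3, 2/3, 1/3

# Recall slide: the filter returns a, b but misses c, d, e about the same person.
recall_case = per_document_scores(
    [["a", "b"], ["c"], ["d"], ["e"]], [["a", "b", "c", "d", "e"]]
)
print(round(recall_case["a"][1], 2))  # 2/5
```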
Does it work?
We often combine Precision and Recall measurements into a single measurement, called “F”.
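For reference, F is conventionally the weighted harmonic mean of precision and recall. The sketch below assumes the balanced case (beta = 1, often written F1) unless told otherwise; the slide does not say which weighting is used.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives the balanced F1)."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when both scores are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Using the 67% precision and 40% recall from the measurement slides:
f1 = f_measure(2 / 3, 2 / 5)
print(round(f1, 2))  # 0.5
```

The harmonic mean punishes imbalance: a system cannot reach a high F by excelling at only one of the two measures.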
Where does your favorite data lie?
[Chart: corpora positioned by Ambiguity and Variety, with # of documents on a scale of 1, Thousands, Millions, Billions; regions marked F>=70, F>=85, and F>=?]
Corpora: ACE 2005, WEPS-2, TAC pre-2012, TAC eng 2012, TAC zho 2012, TAC spa 2012, Basis Balanced, Basis Ambig, Basis Variance 1, Basis Variance 2
Let’s pretend you’re researching the pastors instead.
Trading off Errors
Filter results by…
People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …
What if you think there are too many (or too few)? Add a slider for making the filter finer (or coarser).
Trading off Errors
Add filters…
People: J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …
Filtered by…
People Kris Stephens (pastor)
Make the filter more fine.
Trading off Errors
Add filters…
People: J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …
Filtered by…
People Kris Stephens (pastor)
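The slider can be pictured as a similarity threshold on the grouping step: a higher threshold yields finer groups, a lower one coarser groups. The sketch below is an illustration under assumptions, not the demo's method. It uses plain string similarity from Python's `difflib` as a stand-in for real resolution evidence, and the greedy `cluster_names` function is hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_names(names, threshold):
    """Greedily place each name in the first cluster whose first member is similar enough."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])  # no cluster was close enough; start a new one
    return clusters

names = ["Kris Stephens", "Chris Stephens", "Dan Cathy"]
coarse = cluster_names(names, threshold=0.8)   # slider toward "coarse"
fine = cluster_names(names, threshold=0.95)    # slider toward "fine"
print(coarse)
print(fine)
```

Moving the threshold trades one error type for the other: a coarse setting merges distinct people (lower precision), while a fine setting splits one person across groups (lower recall).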
Demo
Questions
• Suggested questions:
– Doesn’t Google already do this?
– Speed? Scale?
– Multi-lingual?
– What other uses are there for entity resolution beyond faceted search?
For more information: Visit www.basistech.com
Write to [email protected]
Call 617-386-2090
Thank you!
Doesn’t Google already do this?
Some, when searching for famous entities.
Speed/Scale
• Support from BRAVE for scale in CY13!
• Research version:
– Tested up to 1m docs
– Sub-second per document
– Incremental updates (i.e., you see documents published minutes ago)
Other uses for entity resolution?
• Supporting relationship resolution by resolving the entities that participate in the relationships
• Knowledge base population
• Integrating disparate data sets
• Alerting
• Improving relevance of search results
• Predictive analytics