Muddy Boots
From Flickr user Garrulus: http://flickr.com/photos/garrulus/82714475/
Rattle Research: http://www.rattleresearch.com
Muddy Boots: tramping new trails through the BBC's existing pristine navigation paths
A brief interlude: Muddy Boots was developed as part of the Innovation Labs process. Rattle is a digital R&D agency specializing in social innovation on the web
Innovation Labs+
Identify Key Entities
(context) Find reference URIs from Wikipedia
Find URI context
from del.icio.us
Rank URIs via context ‘rating’
Original Workflow
Original Muddy Boots workflow for creating related links based on Wikipedia data. It produced interesting results, but ‘interesting’ isn’t testable or measurable, and we had one big problem
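The ranking step above can be sketched as follows. This is a minimal, hypothetical reconstruction: the real pipeline pulled context tags from del.icio.us, whereas here both the candidate URIs and their tags are made-up stand-ins.

```python
# Sketch of the original Muddy Boots ranking step (hypothetical data and
# scoring -- the real pipeline gathered context tags from del.icio.us).

def rank_uris(article_tags, candidates):
    """Rank candidate Wikipedia URIs by overlap between the article's
    context tags and the tags users applied to each URI."""
    def score(candidate):
        return len(set(article_tags) & set(candidate["tags"]))
    return sorted(candidates, key=score, reverse=True)

# Hypothetical example: two Wikipedia pages a story about "apple" might map to.
candidates = [
    {"uri": "http://en.wikipedia.org/wiki/Apple_Inc.",
     "tags": ["technology", "mac", "iphone", "business"]},
    {"uri": "http://en.wikipedia.org/wiki/Apple",
     "tags": ["fruit", "food", "gardening"]},
]

ranked = rank_uris(["technology", "business", "iphone"], candidates)
print(ranked[0]["uri"])  # the tech company ranks first for a tech story
```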
Ambiguity of term: how can you tell which ‘Apple’ is referenced in the article?
Apple Apple ?
Simplify the problem:
http://www.flickr.com/photos/donnagrayson/195244498/sizes/l/
Back to basics: only try to do one thing well, and solve the problem of ambiguity
Unambiguously identify the “main actors” in a news story
Then add semantic markup for them
Answers the “who” in who, what, where, why ...
One Possible Workflow:
Extract (& classify) entities
Find in DBpedia / Wikipedia
Extract required attributes
Parse content & mark up
Classify Entities via DBpedia
Entity extraction: many methods are available. Entity classification via DBpedia is very extensible. Finally, microformat markup is good for machines and for the semantic web as a whole
Entity Extraction (& classification ?)
Leveraging existing web services to perform entity extraction is useful, especially when employing a voting system. We also used a local named entity extraction service; this is more useful in the future, as we have some direction over its evolution
Yahoo term extraction
TagThe.net
Lingpipe
Voting System
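The voting idea can be sketched like this. The real system called Yahoo Term Extraction, TagThe.net and a local LingPipe service over the web; here each service's response is a hypothetical stub for the same story, and only entities proposed by a minimum number of services survive.

```python
from collections import Counter

# Hedged sketch of the voting system: each list below is a hypothetical
# stub standing in for one extraction service's response.

def vote(service_results, min_votes=2):
    """Keep only entities proposed by at least `min_votes` services."""
    counts = Counter()
    for entities in service_results:
        counts.update(set(entities))  # one vote per service, per entity
    return {entity for entity, n in counts.items() if n >= min_votes}

service_results = [
    ["Gordon Brown", "Downing Street", "economy"],   # stub: Yahoo term extraction
    ["Gordon Brown", "Downing Street"],              # stub: TagThe.net
    ["Gordon Brown", "Treasury"],                    # stub: local LingPipe service
]

print(vote(service_results))  # only entities with two or more votes remain
```

Requiring agreement between services trades recall for precision, which suits a pipeline that must not attach the wrong link to a news story.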
Using Wikipedia as a controlled vocabulary
Lucene vs DBpedia ‘disambiguates’ predicates: each has different advantages. Lucene finds something every time, whereas disambiguating with the predicates produces fewer false matches but also fewer results
conText
Uses a Lucene-based index to look up entities and find the ‘best match’ for an entity (no explicit disambiguation required)
Muddy Boots
Uses DBpedia to find a resource match, then disambiguates using ‘disambiguates’ predicates and by comparing the original story text to each resource
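The Muddy Boots approach can be sketched as below. The candidate resources and abstracts are a hypothetical, local stand-in for what a DBpedia ‘disambiguates’ query would return, and the story comparison is reduced to simple word overlap.

```python
# Sketch of disambiguation via DBpedia's 'disambiguates' predicate.
# Candidates and abstracts are illustrative stand-ins for DBpedia data.

disambiguates = {  # candidates a disambiguation page points at
    "Apple_Inc.": "consumer electronics company computers iphone mac",
    "Apple": "fruit tree rosaceae cultivated worldwide",
}

def disambiguate(story_text, candidates):
    """Pick the candidate whose abstract shares the most words with the story."""
    story_words = set(story_text.lower().split())
    def overlap(resource):
        return len(story_words & set(candidates[resource].split()))
    best = max(candidates, key=overlap)
    # Unlike a Lucene lookup, this may find nothing at all.
    return best if overlap(best) > 0 else None

story = "The company unveiled a new iphone to challenge rival computers"
print(disambiguate(story, disambiguates))  # Apple_Inc.
```

The `None` branch illustrates the trade-off from the notes: fewer false matches, but also fewer results.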
Entity classification and attribute selection
Example of classifying a person: use the predicates in DBpedia to perform classification; certain predicates only exist for a person
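A minimal sketch of that idea: if a resource carries a predicate that only ever appears on people (a birth place, a birth date), label it a person. The predicate names and resource data below are illustrative assumptions, not the exact properties the system checked.

```python
# Sketch of predicate-based classification. Predicate names and resource
# data are illustrative assumptions.

PERSON_PREDICATES = {"dbpedia-owl:birthPlace", "dbpedia-owl:birthDate"}

def classify(resource_predicates):
    """Label a resource 'Person' if it carries any person-only predicate."""
    if PERSON_PREDICATES & set(resource_predicates):
        return "Person"
    return "Unknown"

gordon_brown = ["dbpedia-owl:birthPlace", "dbpedia-owl:birthDate", "rdfs:label"]
downing_street = ["geo:lat", "geo:long", "rdfs:label"]

print(classify(gordon_brown))    # Person
print(classify(downing_street))  # Unknown
```

Extending to other entity types is just a matter of adding more predicate sets, which is why classification via DBpedia is described as very extensible.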
Sample of Muddy Boots output: classification of a BBC news article. Demonstrates ‘main actor’ discovery, automated microformatting, and inclusion of extra content from DBpedia in a ‘featured actors’ sidebar. The inclusion of microformats means machines can now query this page in a more granular fashion
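The markup step can be sketched by wrapping a discovered ‘main actor’ in hCard microformat markup. The class names (`vcard`, `fn`, `url`) follow the hCard specification; the name and URI are illustrative, and the real output contained more than this.

```python
# Sketch of microformatting a 'main actor' as an hCard, linking the
# name to a reference URI so machines can query the page.

def hcard(name, uri):
    """Render an entity as an hCard span with a name+URI link."""
    return ('<span class="vcard">'
            '<a class="fn url" href="%s">%s</a>'
            '</span>' % (uri, name))

card = hcard("Gordon Brown", "http://dbpedia.org/resource/Gordon_Brown")
print(card)
```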
Added bonus of creating semantic links and using ‘web-scale identifiers’. BBC Music beta aggregates around MusicBrainz identifiers, and DBpedia knows about MusicBrainz; therefore we can provide news feeds for any artist on BBC Music beta using this relationship
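One way to sketch that bridge is a SPARQL query against DBpedia looking for an artist's MusicBrainz link. This only builds the query string (no network call), and how MusicBrainz links are expressed in DBpedia is an assumption here: the filter just looks for any value containing a MusicBrainz artist URL.

```python
# Sketch of the DBpedia -> MusicBrainz bridge. We only construct the
# SPARQL query string; how DBpedia expresses the MusicBrainz link is an
# assumption, so the query matches any object containing an artist URL.

SPARQL_TEMPLATE = """
SELECT ?mb WHERE {
  <%(artist_uri)s> ?p ?mb .
  FILTER (regex(str(?mb), "musicbrainz.org/artist/"))
}
"""

def musicbrainz_query(artist_uri):
    """Build a SPARQL query finding MusicBrainz artist links for a resource."""
    return SPARQL_TEMPLATE % {"artist_uri": artist_uri}

query = musicbrainz_query("http://dbpedia.org/resource/Radiohead")
print(query)
```

With the MusicBrainz identifier in hand, the corresponding BBC Music beta page can be addressed directly, which is what makes per-artist news feeds possible.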
The problems
Incorrect data in DBpedia/Wikipedia
Time-sensitive data in DBpedia
“Really tricky” disambiguations
Query and response times
‘The Queen’ vs ‘Queen’ is still a problem
Next steps
Testing phase
Improved NE classification
Speed improvements
Add more entity types
Identify applications
http://www.flickr.com/photos/25094278@N02/2368194103/sizes/l/
Let’s move away from silos of data, towards a shared, linked vision of our data