Muddy Boots
From Flickr user Garrulus: http://flickr.com/photos/garrulus/82714475/
Rattle Research: http://www.rattleresearch.com
Muddy Boots: tramping new trails through the BBC's existing pristine navigation paths
A brief interlude: Muddy Boots was developed as part of the Innovation Labs process. Rattle is a digital R&D agency specializing in social innovation on the web
Innovation Labs+
Identify Key Entities
(context) Find reference URIs from Wikipedia
Find URI context
from del.icio.us
Rank URIs via context ‘rating’
Original Workflow
Original Muddy Boots workflow for creating related links based on Wikipedia data. It produced interesting results, but ‘interesting’ isn’t testable or measurable, and we had one big problem
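The ranking step above can be sketched as follows. This is a minimal, hypothetical reconstruction: the real pipeline pulled context tags from del.icio.us, whereas here both the candidate URIs and their tags are made-up stand-ins.

```python
# Sketch of the original Muddy Boots ranking step (hypothetical data and
# scoring -- the real pipeline gathered context tags from del.icio.us).

def rank_uris(article_tags, candidates):
    """Rank candidate Wikipedia URIs by overlap between the article's
    context tags and the tags users applied to each URI."""
    def score(candidate):
        return len(set(article_tags) & set(candidate["tags"]))
    return sorted(candidates, key=score, reverse=True)

# Hypothetical example: two Wikipedia pages a story about "apple" might map to.
candidates = [
    {"uri": "http://en.wikipedia.org/wiki/Apple_Inc.",
     "tags": ["technology", "mac", "iphone", "business"]},
    {"uri": "http://en.wikipedia.org/wiki/Apple",
     "tags": ["fruit", "food", "gardening"]},
]

ranked = rank_uris(["technology", "business", "iphone"], candidates)
print(ranked[0]["uri"])  # the tech company ranks first for a tech story
```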
Ambiguity of term: how can you tell which ‘Apple’ is referenced in the article?
Apple Apple ?
Simplify the problem:
http://www.flickr.com/photos/donnagrayson/195244498/sizes/l/
Back to basics: only try to do one thing well, and solve the problem of ambiguity
Unambiguously identify the “main actors” in a news story
Then add semantic markup for them
Answers the “who” in who, what, where, why ...
One Possible Workflow:
Extract (& classify) entities
Find in DBpedia / Wikipedia
Extract required attributes
Parse content & mark up
Classify Entities via DBpedia
Entity extraction: many methods are available. Entity classification via DBpedia is very extensible. Finally, microformat markup is good for machines and for the semantic web as a whole
Entity Extraction (& classification ?)
Leveraging existing web services to perform entity extraction is useful, especially when employing a voting system. We also used a local named entity extraction service; this is more useful in the future, as we have some direction over its evolution
Yahoo term extraction
TagThe.net
Lingpipe
Voting System
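The voting idea can be sketched like this. The real system called Yahoo Term Extraction, TagThe.net and a local LingPipe service over the web; here each service's response is a hypothetical stub for the same story, and only entities proposed by a minimum number of services survive.

```python
from collections import Counter

# Hedged sketch of the voting system: each list below is a hypothetical
# stub standing in for one extraction service's response.

def vote(service_results, min_votes=2):
    """Keep only entities proposed by at least `min_votes` services."""
    counts = Counter()
    for entities in service_results:
        counts.update(set(entities))  # one vote per service, per entity
    return {entity for entity, n in counts.items() if n >= min_votes}

service_results = [
    ["Gordon Brown", "Downing Street", "economy"],   # stub: Yahoo term extraction
    ["Gordon Brown", "Downing Street"],              # stub: TagThe.net
    ["Gordon Brown", "Treasury"],                    # stub: local LingPipe service
]

print(vote(service_results))  # only entities with two or more votes remain
```

Requiring agreement between services trades recall for precision, which suits a pipeline that must not attach the wrong link to a news story.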
Using Wikipedia as a controlled vocabulary
Lucene vs DBpedia ‘disambiguates’ predicates: each has different advantages. Lucene finds something every time, whereas disambiguating with the predicates produces fewer false matches but also fewer results
conText
Uses a Lucene-based index to look up entities and find the ‘best match’ for an entity (no explicit disambiguation required)
Muddy Boots
Uses DBpedia to find a resource match, then disambiguates using ‘disambiguates’ predicates and by comparing the original story text to each resource
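The Muddy Boots approach can be sketched as below. The candidate resources and abstracts are a hypothetical, local stand-in for what a DBpedia ‘disambiguates’ query would return, and the story comparison is reduced to simple word overlap.

```python
# Sketch of disambiguation via DBpedia's 'disambiguates' predicate.
# Candidates and abstracts are illustrative stand-ins for DBpedia data.

disambiguates = {  # candidates a disambiguation page points at
    "Apple_Inc.": "consumer electronics company computers iphone mac",
    "Apple": "fruit tree rosaceae cultivated worldwide",
}

def disambiguate(story_text, candidates):
    """Pick the candidate whose abstract shares the most words with the story."""
    story_words = set(story_text.lower().split())
    def overlap(resource):
        return len(story_words & set(candidates[resource].split()))
    best = max(candidates, key=overlap)
    # Unlike a Lucene lookup, this may find nothing at all.
    return best if overlap(best) > 0 else None

story = "The company unveiled a new iphone to challenge rival computers"
print(disambiguate(story, disambiguates))  # Apple_Inc.
```

The `None` branch illustrates the trade-off from the notes: fewer false matches, but also fewer results.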
Entity classification and attribute selection
Example of classifying a person: use the predicates in DBpedia to perform classification; certain predicates only exist for a person
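A minimal sketch of that idea: if a resource carries a predicate that only ever appears on people (a birth place, a birth date), label it a person. The predicate names and resource data below are illustrative assumptions, not the exact properties the system checked.

```python
# Sketch of predicate-based classification. Predicate names and resource
# data are illustrative assumptions.

PERSON_PREDICATES = {"dbpedia-owl:birthPlace", "dbpedia-owl:birthDate"}

def classify(resource_predicates):
    """Label a resource 'Person' if it carries any person-only predicate."""
    if PERSON_PREDICATES & set(resource_predicates):
        return "Person"
    return "Unknown"

gordon_brown = ["dbpedia-owl:birthPlace", "dbpedia-owl:birthDate", "rdfs:label"]
downing_street = ["geo:lat", "geo:long", "rdfs:label"]

print(classify(gordon_brown))    # Person
print(classify(downing_street))  # Unknown
```

Extending to other entity types is just a matter of adding more predicate sets, which is why classification via DBpedia is described as very extensible.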
Sample of Muddy Boots output: classification of a BBC news article. Demonstrates ‘main actor’ discovery, automated microformatting, and inclusion of extra content from DBpedia in a ‘featured actors’ sidebar. The inclusion of microformats means machines can now query this page in a more granular fashion
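The markup step can be sketched by wrapping a discovered ‘main actor’ in hCard microformat markup. The class names (`vcard`, `fn`, `url`) follow the hCard specification; the name and URI are illustrative, and the real output contained more than this.

```python
# Sketch of microformatting a 'main actor' as an hCard, linking the
# name to a reference URI so machines can query the page.

def hcard(name, uri):
    """Render an entity as an hCard span with a name+URI link."""
    return ('<span class="vcard">'
            '<a class="fn url" href="%s">%s</a>'
            '</span>' % (uri, name))

card = hcard("Gordon Brown", "http://dbpedia.org/resource/Gordon_Brown")
print(card)
```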
Added bonus of creating semantic links and using ‘web-scale identifiers’. BBC Music beta aggregates around MusicBrainz identifiers, and DBpedia knows about MusicBrainz; therefore we can provide news feeds for any artist on BBC Music beta using this relationship
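One way to sketch that bridge is a SPARQL query against DBpedia looking for an artist's MusicBrainz link. This only builds the query string (no network call), and how MusicBrainz links are expressed in DBpedia is an assumption here: the filter just looks for any value containing a MusicBrainz artist URL.

```python
# Sketch of the DBpedia -> MusicBrainz bridge. We only construct the
# SPARQL query string; how DBpedia expresses the MusicBrainz link is an
# assumption, so the query matches any object containing an artist URL.

SPARQL_TEMPLATE = """
SELECT ?mb WHERE {
  <%(artist_uri)s> ?p ?mb .
  FILTER (regex(str(?mb), "musicbrainz.org/artist/"))
}
"""

def musicbrainz_query(artist_uri):
    """Build a SPARQL query finding MusicBrainz artist links for a resource."""
    return SPARQL_TEMPLATE % {"artist_uri": artist_uri}

query = musicbrainz_query("http://dbpedia.org/resource/Radiohead")
print(query)
```

With the MusicBrainz identifier in hand, the corresponding BBC Music beta page can be addressed directly, which is what makes per-artist news feeds possible.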
The problems
Incorrect data in DBpedia/Wikipedia
Time-sensitive data in DBpedia
“Really tricky” disambiguations
Query and response times
‘The Queen’ vs ‘Queen’ is still a problem
Next steps
Testing phase
Improved NE classification
Speed improvements
Add more entity types
Identify applications
http://www.flickr.com/photos/25094278@N02/2368194103/sizes/l/
Let’s move away from silos of data, towards a shared, linked vision of our data