Semanticnews 230913-final

Slides presented about the SemanticNews project at the SemanticMedia@TheBritishLibrary event on September 23rd 2013.

Mark A Greenwood, Jonathon Hare, David R Newman, Wim Peters

SemanticMedia@TheBritishLibrary
Monday 23rd September 2013

The Project Vision
• Semantic News is a 6-month project:
  • June to November 2013
  • Two 50% FTEs (1 Southampton, 1 Sheffield)
• An interactive ‘second screen’ to provide contextual information on Question Time questions
• Use multiple data sources
• Perform named entity recognition
• Exploit Linked Open Datasets
• Towards an almost real-time system

Where is the Data? (1)
• Question Time in 2010
  • 34 episodes, 163 questions
• BBC Subtitles
  • XML encoded
  • Broadcast as the subtitles stream

Where is the Data? (2)
• BBC Programmes Data
  • XML encoded
  • Information about the programme (panellists, topics, broadcast dates, etc.)
• Tweets
  • Taken from the Twitter ‘Garden Hose’ (10% stream)

Pre-parsing Subtitles Data
• Raw XML subtitles
• Remove duplicate words
• Parse into CSV
  • time offset
  • sentence
• Break into questions
  • BBC Programmes data provides the time offset of each question
  • Compare with subtitles time offsets and split
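The splitting step above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the CSV column names, the sample rows, the question offsets, and the rule used for removing duplicate words (collapsing immediately repeated words) are all assumptions made for the example.

```python
import csv
import io
from bisect import bisect_right


def drop_repeated_words(text):
    """Collapse immediately repeated words (assumed form of the duplicates)."""
    kept = []
    for word in text.split():
        if not kept or word.lower() != kept[-1].lower():
            kept.append(word)
    return " ".join(kept)


def split_by_question(rows, question_offsets):
    """Group (offset_seconds, sentence) rows by the question under discussion.

    question_offsets must be sorted start offsets, in seconds.
    Sentences spoken before the first question are stored under index -1.
    """
    groups = {}
    for offset, sentence in rows:
        # Index of the last question that started at or before this offset.
        index = bisect_right(question_offsets, offset) - 1
        groups.setdefault(index, []).append(sentence)
    return groups


# Hypothetical CSV in the "time offset, sentence" layout.
sample_csv = """offset,sentence
5,Welcome welcome to Question Time.
62,Our first question tonight.
130,I think the panel is wrong.
905,Our second question please.
"""

rows = [
    (float(record["offset"]), drop_repeated_words(record["sentence"]))
    for record in csv.DictReader(io.StringIO(sample_csv))
]
# Hypothetical question start offsets taken from the programme data.
groups = split_by_question(rows, [60.0, 900.0])
```

Using a binary search over sorted question offsets keeps the lookup cheap even for a long subtitle stream, which matters if the same step is later run in near real time.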

Pre-parsing Twitter Data
• Twitter ‘Garden Hose’ dataset for 2010
• Used Apache Hadoop and filtered on:
  • @bbcqt, @bbcquestiontime
  • #bbcqt, #bbcquestiontime, #questiontime
  • “Question Time”, “David Dimbleby”
• Collated JSON results and imported into OpenRefine
  • Removed irrelevant fields
  • Filtered out tweets that did not contain “bbc”
  • Exported as CSV
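The filtering logic can be illustrated in a few lines of plain Python. This is only a sketch of the matching rules, standing in for the Hadoop job and the OpenRefine clean-up; it assumes one JSON tweet per line with a `text` field, treats the filter terms as alternatives (any one is enough), and the set of fields kept is a hypothetical choice.

```python
import json

# Filter terms from the slide, lower-cased for case-insensitive matching.
FILTER_TERMS = (
    "@bbcqt", "@bbcquestiontime",
    "#bbcqt", "#bbcquestiontime", "#questiontime",
    "question time", "david dimbleby",
)
# Hypothetical choice of fields to keep; everything else is dropped.
KEPT_FIELDS = ("id", "created_at", "text")


def matches_filter(text):
    """True if the tweet text mentions any of the filter terms."""
    lowered = text.lower()
    return any(term in lowered for term in FILTER_TERMS)


def filter_tweets(json_lines):
    """Keep tweets matching a filter term that also contain 'bbc'."""
    kept = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        text = tweet.get("text", "")
        if matches_filter(text) and "bbc" in text.lower():
            kept.append({field: tweet.get(field) for field in KEPT_FIELDS})
    return kept


sample_lines = [
    '{"id": 1, "text": "Watching #bbcqt tonight", "lang": "en"}',
    '{"id": 2, "text": "Question Time was good"}',
    '{"id": 3, "text": "Nothing to see here"}',
]
kept = filter_tweets(sample_lines)
```

In the sample, the second tweet matches a filter phrase but is dropped by the “bbc” check, which is the case that second filter is there to catch.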

Information Extraction with GATE
● General Architecture for Text Engineering (GATE)
● Developed by the University of Sheffield since 2000
● Used by many researchers, scientists and organisations all over the world
● Includes various components for language processing
  ● Parsers, machine learning tools, stemmers, IR tools, IE components for various languages...
● Also provides tools for visualising and manipulating text, annotations, ontologies, parse trees, etc., and tools for evaluation

Linguistic pre-processing
● Techniques
  ● Tokenization
  ● Sentence Splitting
  ● Language Identification
  ● POS tagging
  ● Morphological analysis
● Adapted for use with social media like Twitter
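To give a flavour of what adapting tokenization for Twitter means, here is a small regex-based sketch. It is not GATE's tokenizer or sentence splitter, just a simplified stand-in that keeps hashtags, mentions and URLs together as single tokens; the patterns are assumptions chosen for the example.

```python
import re

# Order matters: URLs first, then mentions/hashtags, then ordinary words,
# then any remaining single punctuation character.
TOKEN_RE = re.compile(r"https?://\S+|[@#]\w+|\w+(?:'\w+)*|[^\w\s]")

# Naive rule: a sentence ends at ., ! or ? followed by whitespace.
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")


def tokenize(text):
    """Split text into tokens, keeping #hashtags and @mentions whole."""
    return TOKEN_RE.findall(text)


def split_sentences(text):
    """Split text into sentences on terminal punctuation."""
    return [part for part in SENTENCE_END_RE.split(text.strip()) if part]


tokens = tokenize("Watching #bbcqt with @bbcquestiontime, it's great!")
sentences = split_sentences("Hello there. How are you? Fine!")
```

A general-purpose tokenizer would typically split `#bbcqt` into `#` and `bbcqt`, losing the fact that it is a hashtag, which is why social media text needs its own rules.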

Named Entity Recognition
● Approaches
  ● Gazetteer lookup
  ● JAPE grammars
  ● Co-reference
● Types
  ● Location: countries, regions, cities etc.
  ● Organisation: names of companies, government organisations, committees, agencies, universities, etc.
  ● Person: names of people
  ● Date: absolute dates like ‘October 2012’ or ‘2007’, as well as relative dates, such as ‘last year’
  ● Measurements: e.g. “8,596 km”, “one fifth”, percentages and probabilities
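Gazetteer lookup, the simplest of the approaches listed, amounts to matching text against lists of known names. The sketch below shows the idea in plain Python; it does not use GATE or JAPE, and the gazetteer entries are a tiny hypothetical sample chosen for the example.

```python
import re

# Hypothetical gazetteer: entity type -> known names.
GAZETTEER = {
    "Location": ["London", "United Kingdom"],
    "Organisation": ["BBC", "University of Sheffield"],
    "Person": ["David Dimbleby", "Ken Clarke"],
}


def build_matcher(gazetteer):
    """Compile one case-insensitive pattern covering every gazetteer entry."""
    entry_types = {}
    for entity_type, names in gazetteer.items():
        for name in names:
            entry_types[name.lower()] = entity_type
    # Longest names first so multi-word entries win over shorter overlaps.
    alternatives = sorted(entry_types, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern, entry_types


def annotate(text, pattern, entry_types):
    """Return (start, end, matched text, entity type) for each lookup hit."""
    return [
        (m.start(), m.end(), m.group(0), entry_types[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


pattern, entry_types = build_matcher(GAZETTEER)
annotations = annotate("Ken Clarke spoke to the BBC in London.", pattern, entry_types)
```

Lookup alone cannot handle names missing from the lists or ambiguous strings, which is where grammar rules and co-reference come in.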

Enrichment: LODIE
● Under constant development in various projects
● Associates the most probable LOD URI with named entities
● Disambiguation against DBpedia
● Various techniques to enhance recall

Enrichment: LODIE

“Ken Clarke: The Labour plotters hide behind the knife and stab with the cloak! Brilliant!!”

“Hain just lost Labour votes by supporting the £25k benefits of an extremist.”

Representing Extracted Information

Conceptualising a Question

http://www.youtube.com/watch?v=O3l9Mi-KylI

Show Me The Data!
• Use (Linked) Open Data Datasets
  • Crime Data
  • Election Data (constituencies, majorities, etc.)
  • MP voting records
  • School league tables
  • NHS performance league tables
  • Economic Figures (GDP, Inflation, Unemployment)
• Compare and contrast

Let’s have some questions from our audience.
