Chris Sizemore, Silver Oliver, BBC

Wikipedia as controlled vocabulary

DESCRIPTION

The Essentials of Metadata and Taxonomy - Henry Stewart Event

The Next Wave: Using Wikipedia as a Controlled Vocabulary

• Leveraging an online resource for internal use

• Integrating pre-existing unique identification numbers (UIDs)

• Inherited relations

• Capturing and cataloging

• Risks and remedies

Chris Sizemore and Silver Oliver, BBC Future Technology & Media

Page 1: Wikipedia as controlled vocabulary

Chris Sizemore, Silver Oliver, BBC

Wikipedia as controlled vocabulary

Page 2: Wikipedia as controlled vocabulary

I’m about ‘Victorians’

Page 4: Wikipedia as controlled vocabulary

BBC Topic Page

I’m about ‘Victorians’

viktorianisch

V잊도 r 이안

Ελληνικά

NY Times, Flickr, Wikipedia

Outside the BBC

BBC silo #1 BBC silo #3

BBC silo #2

Page 9: Wikipedia as controlled vocabulary

An index language exists primarily to:

• Allow an indexer to represent the subject matter of documents in a consistent way

• Bring the vocabulary used by the searcher into coincidence with the vocabulary used by the indexer

• Provide means whereby a searcher can modulate the search strategy to attain comprehensive or selective results as user needs dictate

F.W. Lancaster, Vocabulary control for information retrieval

Page 10: Wikipedia as controlled vocabulary

Could Wikipedia be used as a universal language for identifying subjects?

Page 11: Wikipedia as controlled vocabulary

Story of Wikipedia-as-CV

Page 14: Wikipedia as controlled vocabulary

Story of Wikipedia-as-CV: personal origins

We needed a system to categorise movie & TV reviews

Page 15: Wikipedia as controlled vocabulary

Story of Wikipedia-as-CV: personal origins

So of course we built a categorisation system from scratch -- including its own controlled vocab

Page 16: Wikipedia as controlled vocabulary

Story of Wikipedia-as-CV: personal origins

And when people saw the system, they always said: “Hey, that reminds me of Internet Movie Database…”

Page 18: Wikipedia as controlled vocabulary

Story of Wikipedia-as-CV: personal origins

It struck me that the way Internet Movie Database is set up isn’t dissimilar to the structure of a thesaurus or a very flat taxonomy…

Page 19: Wikipedia as controlled vocabulary

Story of Wikipedia-as-CV: personal origins

But it’s one where the emphasis is on “related to”, not broader/narrower, synonym, antonym, etc.

Page 21: Wikipedia as controlled vocabulary

Story of Wikipedia-as-CV: personal origins

From then, I couldn’t help but be drawn to websites where the structure is clearly: “a single primary Concept per page -- and pages for related Concepts link to each other”

Page 22: Wikipedia as controlled vocabulary

Story of Wikipedia-as-CV: personal origins

Could those “one Concept per page” webpages be used as “terms” as in a controlled vocabulary?

Page 23: Wikipedia as controlled vocabulary

Are some websites actually “indexing languages” in disguise?

Page 26: Wikipedia as controlled vocabulary

conText -- a Wikipedia-as-CV auto-categoriser prototype: http://sells.welcomebackstage.com:5000/item/submit
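The slides do not describe how conText works internally, so the following is only a toy sketch of the general idea of suggesting Wikipedia pages as tags for a piece of text. The `VOCAB` table and the `suggest_tags` function are hypothetical names invented for this illustration; a real auto-categoriser would draw its terms from Wikipedia itself and would need proper disambiguation.

```python
import re

# Hypothetical mini-vocabulary: surface forms mapped to Wikipedia page URLs.
# The entries are illustrative only; this is not conText's actual data.
VOCAB = {
    "science fiction": "http://en.wikipedia.org/wiki/Science_fiction",
    "time travel": "http://en.wikipedia.org/wiki/Time_travel",
    "bbc": "http://en.wikipedia.org/wiki/BBC",
    "tardis": "http://en.wikipedia.org/wiki/Tardis",
}

def suggest_tags(text):
    """Return, sorted, the Wikipedia URLs whose surface form occurs in the text."""
    # Lower-case and strip punctuation so "science-fiction" matches "science fiction".
    normalised = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    padded = f" {normalised} "
    # Whole-word matching: a term must be surrounded by spaces.
    return sorted(url for term, url in VOCAB.items() if f" {term} " in padded)

tags = suggest_tags("A BBC science-fiction series about time travel.")
```

The suggestions are meant for an editor to confirm or reject, which is what makes the tagging semi-automatic rather than automatic.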

Page 29: Wikipedia as controlled vocabulary

Demo of conText -- a Wikipedia-as-CV auto-categoriser prototype:

Take text from audience!

Page 31: Wikipedia as controlled vocabulary

Wikipedia is already being used across the Web as a form of subject identification & disambiguation, in a grassroots way:

in the form of hyperlinks embedded by authors in blog posts, news articles, music reviews, etc. everywhere!

Page 32: Wikipedia as controlled vocabulary

http://en.wikipedia.org/wiki/British

http://en.wikipedia.org/wiki/Science_fiction

http://en.wikipedia.org/wiki/BBC

http://en.wikipedia.org/wiki/Time_travel

http://en.wikipedia.org/wiki/Dr_who

http://en.wikipedia.org/wiki/Tardis
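Links like these can be read back out of a page as subject tags. The sketch below uses Python's standard `html.parser` and `urllib.parse` modules; the class name `WikipediaLinkCollector` and the function `concepts_in` are hypothetical names chosen for this illustration, and the sample markup in the usage note is invented.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, unquote

class WikipediaLinkCollector(HTMLParser):
    """Collect the page titles of Wikipedia links found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.concepts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parts = urlparse(href)
        # Treat any /wiki/<Title> link on a wikipedia.org host as a subject identifier.
        if parts.netloc.endswith("wikipedia.org") and parts.path.startswith("/wiki/"):
            title = unquote(parts.path[len("/wiki/"):])
            if title and title not in self.concepts:
                self.concepts.append(title)

def concepts_in(html):
    """Return the Wikipedia page titles linked from the given HTML, in order."""
    collector = WikipediaLinkCollector()
    collector.feed(html)
    return collector.concepts
```

For example, calling `concepts_in` on a review that links the words "Tardis" and "BBC" to their Wikipedia pages would return those two titles and ignore links to other sites.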

Page 34: Wikipedia as controlled vocabulary

These days, by convention, when you link to Wikipedia from your webpage, more than saying “go and have a look at this other page”, you are more likely giving a definition to a concept referred to in your content…

Also used in this way for specific domains are Internet Movie Database (for films & TV programmes), MySpace (for bands), Amazon (for books), etc.

Page 35: Wikipedia as controlled vocabulary

For general knowledge, though, Wikipedia is becoming the Web’s de facto controlled vocabulary

Page 36: Wikipedia as controlled vocabulary

http://en.wikipedia.org/wiki/Heerlen

http://en.wikipedia.org/wiki/Beethoven

http://en.wikipedia.org/wiki/Amsterdam

http://en.wikipedia.org/wiki/Van_Gogh_Museum

Page 37: Wikipedia as controlled vocabulary

An index language exists primarily to:

• Allow an indexer to represent the subject matter of documents in a consistent way

• Bring the vocabulary used by the searcher into coincidence with the vocabulary used by the indexer

• Provide means whereby a searcher can modulate the search strategy to attain comprehensive or selective results as user needs dictate

F.W. Lancaster, Vocabulary control for information retrieval

Page 40: Wikipedia as controlled vocabulary

Wikipedia pages provide the best scope notes in the world

Wikipedia-as-CV benefits from being developed through a social process, maintained and kept current by the Wikipedia community

Each concept represents a consensus view and its meaning can be understood simply by reading the associated Wikipedia page

Page 41: Wikipedia as controlled vocabulary

Wikipedia pages provide the best scope notes in the world

For each Concept, the document edit history, discussion around concept definition, & debate are important here…

Page 43: Wikipedia as controlled vocabulary

An index language exists primarily to:

• Allow an indexer to represent the subject matter of documents in a consistent way

• Bring the vocabulary used by the searcher into coincidence with the vocabulary used by the indexer

• Provide means whereby a searcher can modulate the search strategy to attain comprehensive or selective results as user needs dictate

F.W. Lancaster, Vocabulary control for information retrieval

Page 45: Wikipedia as controlled vocabulary

So, we can tag pretty accurately semi-automatically with globally unique subject identifiers using this approach…

So what?

Un-silo your content repository quickly and cheaply, by connecting it to the Web via Wikipedia
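One way to picture the un-siloing: once items in different repositories carry the same Wikipedia URL as a subject tag, grouping by that URL links them without any shared internal schema. This is a minimal sketch under that assumption; the `ITEMS` records, their field names, and `group_by_subject` are all hypothetical and invented for illustration.

```python
from collections import defaultdict

# Hypothetical records from separate repositories. Titles and sources are
# invented; only the Wikipedia URLs follow the pattern shown in the slides.
ITEMS = [
    {"source": "BBC silo #1", "title": "Programme page",
     "subjects": ["http://en.wikipedia.org/wiki/Time_travel"]},
    {"source": "BBC silo #2", "title": "Archive clip",
     "subjects": ["http://en.wikipedia.org/wiki/Time_travel",
                  "http://en.wikipedia.org/wiki/BBC"]},
    {"source": "Outside the BBC", "title": "Blog post",
     "subjects": ["http://en.wikipedia.org/wiki/BBC"]},
]

def group_by_subject(items):
    """Index items by subject URL so matching tags link across repositories."""
    index = defaultdict(list)
    for item in items:
        for subject in item["subjects"]:
            index[subject].append((item["source"], item["title"]))
    return dict(index)

linked = group_by_subject(ITEMS)
```

Here the Time_travel URL gathers items from two BBC silos, and the BBC URL connects an internal item with one from outside, purely because both sides used the same identifier.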

Page 53: Wikipedia as controlled vocabulary

Now playing vs. the Web

Why not bring BBC Archive materials into this service via Wikipedia-as-CV tagging and a linked data bridge between Wikipedia & MusicBrainz?

Page 56: Wikipedia as controlled vocabulary

By using Wikipedia-as-CV, you can get your repository onto this diagram quickly, for free

Page 58: Wikipedia as controlled vocabulary

An index language exists primarily to:

• Allow an indexer to represent the subject matter of documents in a consistent way

• Bring the vocabulary used by the searcher into coincidence with the vocabulary used by the indexer

• Provide means whereby a searcher can modulate the search strategy to attain comprehensive or selective results as user needs dictate

F.W. Lancaster, Vocabulary control for information retrieval

Page 63: Wikipedia as controlled vocabulary

A Web-scale, globally accessible index language accidentally exists:

• It encourages multiple indexers across the Web to represent the subject matter of any content in a consistent way

• It brings the vocabulary used by info seekers into coincidence with the vocabulary used by indexers -- the searchers ARE indexers, and vice versa

• It provides means whereby a searcher can modulate a search and/or browse strategy to attain comprehensive or selective results as user needs dictate

• It adds Web-scale navigation & cross-reference possibilities

Page 67: Wikipedia as controlled vocabulary

Chris Sizemore, Silver Oliver, BBC

Wikipedia as controlled vocabulary

Wikipedia is a controlled vocabulary

Much thanks!

Questions, comments, & constructive criticism?

Page 68: Wikipedia as controlled vocabulary

Chris Sizemore, Silver Oliver, BBC

Wikipedia as controlled vocabulary

http://flickr.com/photos/deniscollette/1817034358/