Talk given at Digital Humanities 2008 (DH2008) in Oulu, Finland on 27 June 2008.
Web site: http://www.scottishcorpus.ac.uk/corpus/bnc/
Abstract: http://www.ekl.oulu.fi/dh2008/Digital%20Humanities%202008%20Book%20of%20Abstracts.pdf
This paper demonstrates a web-based, interactive data visualisation that lets users quickly inspect and browse the collocational relationships present in a corpus. The software is inspired by tag clouds, first popularised by the online photograph-sharing website Flickr (www.flickr.com). A paper based on a prototype of this Collocate Cloud visualisation was given at Digital Resources for the Humanities and Arts 2007. The software has since matured, offering new ways of navigating and inspecting the source data. It has also been expanded to analyse additional corpora, such as the British National Corpus (http://www.natcorp.ox.ac.uk/), which will be the focus of this talk.
Glimpses through the clouds: Collocates in a new light
David Beavan, Department of English Language
What are clouds?
Cloud properties
Alphabetical listing of items
• Good for navigation
• Quickly locate or discount a known item
• Limited number of items (Flickr tag cloud = 150)
Font size shows popularity
• Good for browsing
• Often used tags ‘jump out at you’
• Limited usefulness if less popular terms are sought
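The properties above (a capped list, alphabetical order, font size tied to popularity) can be sketched roughly as follows. This is only an illustration, not the code behind Flickr's cloud or the software shown in this talk: the name `build_cloud`, the linear scaling and the point-size range are all assumptions.

```python
def build_cloud(counts, max_items=150, min_pt=10, max_pt=36):
    """Return (item, font_size) pairs in alphabetical order.

    counts: dict mapping item -> popularity (e.g. how often a tag is used).
    The defaults are illustrative; 150 merely echoes the Flickr figure above.
    """
    # Keep only the most popular items so the cloud stays a limited size.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_items]
    if not top:
        return []
    lo = min(c for _, c in top)
    hi = max(c for _, c in top)
    span = hi - lo
    cloud = []
    for item, c in top:
        # Linear scaling is an assumption; log scaling is a common alternative.
        scale = (c - lo) / span if span else 1.0
        cloud.append((item, round(min_pt + scale * (max_pt - min_pt))))
    # Alphabetical order lets a reader locate or discount a known item.
    return sorted(cloud, key=lambda pair: pair[0].lower())
```

Capping the list before scaling means the smallest font always belongs to the least popular item that made the cut, which is one reason less popular terms are hard to find in a cloud.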
Word frequency cloud
Shares properties with tag clouds
• Words listed alphabetically: good for navigation
• Font size shows frequency of word: good for browsing
Restricted view
• Summarises the document as a whole
• Does not give insight into the usage or context of each word; for this we need to look at co-occurrences/collocates
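A word frequency cloud needs nothing more than per-word counts. A minimal sketch, assuming a crude tokeniser (lowercased runs of letters and apostrophes) and a hypothetical function name `word_frequencies`:

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lowercased word tokens in a document."""
    # Crude tokeniser (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)
```

Because each word is counted in isolation, the result says how often a word occurs but nothing about the words around it, which is the restriction noted above.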
Our corpus
[Figure: dot diagram]
Collocates of ‘blue’
[Figure: dot diagram, repeated across four slides]
Co-occurrences as a cloud
Using the British National Corpus (BNC)
• Popular and well known
• 100 million word corpus
• British English
• Compiled in early 1990s
• Wide range of genres
• Written and spoken data
• 2007 XML edition
Co-occurrence clouds
Co-occurrence clouds
• 100 most frequent co-occurring word pairs
• Rendered as a cloud
• Inherit cloud benefits of navigation and exploration
• Allow user to create new clouds from visible words
What’s missing
• KWIC concordance of word pairs
• Measure of collocation strength
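Finding the most frequent co-occurring words for a chosen node word can be sketched as below. The talk does not state the span used, so the four-token window, the function name `cooccurrences` and the pre-tokenised input are assumptions made for illustration.

```python
from collections import Counter

def cooccurrences(tokens, node, window=4, top_n=100):
    """Count words found within `window` tokens either side of `node`.

    Returns the `top_n` most frequent (word, count) pairs.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - window)
        # Words to the left and right of this occurrence, excluding the node itself.
        neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
        counts.update(neighbours)
    return counts.most_common(top_n)
```

Calling the function again with any word from its own output as the new `node` mirrors the idea of creating new clouds from visible words.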
Our corpus
[Figure: dot diagram]
Collocates of ‘brown’
[Figure: dot diagram]
Collocate clouds
Collocate clouds
• 100 most frequent co-occurring word pairs
• Rendered as a cloud
• Inherit cloud benefits of navigation and exploration
• Allow user to create new clouds from visible words
• KWIC concordance of word pairs
• Measure of collocation strength
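The two additions, a KWIC concordance and a strength measure, might look something like the sketch below. The slides do not say which measure the software uses; pointwise mutual information (PMI) is swapped in here purely as a familiar example, and the function names, window and context widths are assumptions.

```python
import math

def kwic(tokens, node, collocate, window=4, context=5):
    """Return KWIC lines where `collocate` occurs within `window` tokens of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if collocate in nearby:
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            # Bracket the node word so it stands out in each line.
            lines.append(f"{left} [{tok}] {right}")
    return lines

def pmi(pair_count, node_count, collocate_count, corpus_size):
    """Pointwise mutual information (base 2) computed from raw counts."""
    # Co-occurrences expected if the two words were independent.
    expected = node_count * collocate_count / corpus_size
    return math.log2(pair_count / expected)
```

A strength score of this kind could drive a second visual property of each word in the cloud, alongside the raw frequency that sets font size.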
Future
Advantages
• Easy to interpret and use
• Lowers the barrier to corpus analysis
• Iterative nature promotes browsing and investigation
Improvements
• Allow use of stopwords / filter words
• Configure ‘size’ of cloud
• Show POS
• Group words under their headword
• Make your own?
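The first two proposed improvements, stopword filtering and a configurable cloud size, could be sketched as a small post-processing step. This is a hypothetical illustration rather than a planned design: `filter_collocates`, its parameters and the example stopword set are all assumptions.

```python
def filter_collocates(pairs, stopwords=frozenset(), size=100):
    """Drop stopwords from (word, count) pairs, then keep the `size` most frequent."""
    kept = [(w, c) for w, c in pairs if w.lower() not in stopwords]
    # Most frequent first; ties broken alphabetically for a stable result.
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    return kept[:size]
```

For example, `filter_collocates(pairs, stopwords={"the", "of"}, size=50)` would remove the two function words before cutting the list down to 50 entries, so content words are not crowded out of the cloud.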
Glimpses through the clouds: Collocates in a new light
David Beavan
[email protected]