
An Unsupervised Approach for the Detection of Outliers in Corpora

David Guthrie, Louise Guthrie, Yorick Wilks

The University of Sheffield

Corpora in CL

• Increasingly common in computational linguistics to use textual resources gathered automatically

o IR, scraping the Web, etc.

• Construct corpora from specific blogs, bulletin boards, websites (Wikipedia, RottenTomatoes)

Corpora Can Contain Errors

• IR and scraping can lead to errors in precision

• Can contain entries that might be considered spam:

o Advertising

o Gibberish messages

o (more subtly) information that is an opinion rather than a fact, rants about political figures

Difficult to verify

• The quality of corpora has a dramatic impact on the results of QA, ASR, TC, etc.

• Creation and validation of corpora has generally relied on humans

Goals

• Improve the consistency and quality of corpora

• Automatically identify and remove text from corpora that does not belong

Approach

• Treat the problem as a type of outlier detection

• We aim to find pieces of text in a corpus that differ significantly from the majority of text in that corpus and thus are ‘outliers’

Method

• Characterize each piece of text (document, segment, paragraph, …) in our corpus as a vector of features

• Use these vectors to construct a matrix, X, with one row per piece of text in the corpus and one column per feature (a minimal sketch follows)
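A minimal sketch (in Python, not the authors' code) of assembling such a matrix; extract_features here is only a placeholder for the 158-feature extractor described on the following slides:

import numpy as np

def extract_features(text):
    """Map one piece of text to a fixed-length feature vector (placeholder)."""
    words = text.split()
    return np.array([
        len(words),                                      # token count
        sum(len(w) for w in words) / max(len(words), 1)  # mean word length
    ])

def build_feature_matrix(texts):
    """Stack the per-text vectors into X: one row per text, one column per feature."""
    return np.vstack([extract_features(t) for t in texts])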

Feature Matrix

[Figure: the matrix X — each piece of text is represented as a row vector of features]

Characterizing Text

• 158 features computed for every piece of text, many of which have been used successfully for genre classification by Biber, Kessler, Argamon, … (a small extraction sketch follows the list)

o Simple Surface Features

o Readability Measures

o POS Distributions (RASP)

o Vocabulary Obscurity

o Emotional Affect (General Inquirer Dictionary)
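A rough illustration (assumed, not the authors' extractor) of a few such features: two surface features, a Flesch reading-ease score with a crude syllable count, and vocabulary obscurity as the fraction of tokens missing from a frequent-word list (common_words is a placeholder the user must supply):

import re

VOWEL_GROUPS = re.compile(r"[aeiouy]+")

def rough_syllables(word):
    # Crude approximation: count groups of vowels in the word.
    return max(1, len(VOWEL_GROUPS.findall(word.lower())))

def surface_and_readability_features(text, common_words=frozenset()):
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words, n_sents = max(len(words), 1), max(len(sentences), 1)
    mean_word_len = sum(len(w) for w in words) / n_words
    mean_sent_len = n_words / n_sents
    # Flesch reading ease: 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word)
    flesch = (206.835 - 1.015 * mean_sent_len
              - 84.6 * sum(rough_syllables(w) for w in words) / n_words)
    obscurity = sum(w.lower() not in common_words for w in words) / n_words
    return [mean_word_len, mean_sent_len, flesch, obscurity]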

Feature Matrix

X

Identify Outlying Text

Outliers are ‘hidden’

SDE

• Use the Stahel-Donoho Estimator (SDE) to identify outliers

o Project the data down to one dimension and measure the outlyingness of each piece of text in that dimension

o For every piece of text, the goal is to find a projection of the data that maximizes its robust z-score

o Especially suited to data with a large number of dimensions (features)

[Figure: example projections — in one direction the robust z-score of the furthest point is < 3; in another, the robust z-score of the outlying (triangle) points is > 12 standard deviations]

SDE

SD(x_i) = sup over unit vectors a of |x_i·a − med_j(x_j·a)| / mad_j(x_j·a)

• where a is a direction (unit-length vector),

• x_i·a is the projection of row x_i onto direction a, and

• med and mad are the median and the median absolute deviation, taken over the projections x_j·a of all rows
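A minimal sketch of this score using randomly sampled unit directions (a common approximation; the paper's exact direction-search strategy may differ):

import numpy as np

def stahel_donoho_outlyingness(X, n_directions=1000, seed=0):
    """Approximate SD(x_i) for every row of X by sampling random unit directions."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    A = rng.normal(size=(n_directions, p))
    A /= np.linalg.norm(A, axis=1, keepdims=True)      # unit-length directions a
    proj = A @ X.T                                      # proj[k, i] = x_i . a_k
    med = np.median(proj, axis=1, keepdims=True)
    mad = np.median(np.abs(proj - med), axis=1, keepdims=True)
    mad[mad == 0] = np.finfo(float).eps                 # avoid division by zero
    z = np.abs(proj - med) / mad                        # robust z-score per direction
    return z.max(axis=0)                                # sup over sampled directions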

Outliers have a large SD

• The outlyingness scores SD(x_i) for each piece of text are then sorted, and all pieces of text above a cutoff are marked as outliers (a small flagging sketch follows)
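A small flagging sketch; the cutoff value itself is left as a parameter (the paper's specific choice is not reproduced here), e.g. a high quantile of the scores:

def flag_outliers(sd_scores, cutoff):
    """Return indices of texts whose outlyingness exceeds the cutoff, most extreme first."""
    order = sd_scores.argsort()[::-1]
    return [int(i) for i in order if sd_scores[i] > cutoff]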

Experiments

• In each experiment we randomly select 50 segments of text from the Gigaword corpus of newswire and insert one piece of text from a different source to act as an ‘outlier’

• Measure the accuracy of automatically identifying the inserted segment as an outlier

• We varied the size of the pieces of text from 100 to 1,000 words (the test-collection construction is sketched below)
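A sketch of this protocol, reusing the helper functions from the earlier sketches; newswire_segments and other_source_segments are assumed to be lists of equal-size text segments prepared by the user (the names are illustrative, not from the original work):

import numpy as np

def run_trial(newswire_segments, outlier_segment, rng):
    """Build one 51-segment collection and check whether the inserted segment
    gets the highest outlyingness score (a simplification of the cutoff rule)."""
    chosen = rng.choice(len(newswire_segments), size=50, replace=False)
    texts = [newswire_segments[i] for i in chosen] + [outlier_segment]
    X = build_feature_matrix(texts)
    sd = stahel_donoho_outlyingness(X)
    return sd.argmax() == 50                            # index 50 is the inserted text

def accuracy(newswire_segments, other_source_segments, n_trials=200, seed=0):
    rng = np.random.default_rng(seed)
    hits = sum(run_trial(newswire_segments,
                         other_source_segments[rng.integers(len(other_source_segments))],
                         rng)
               for _ in range(n_trials))
    return hits / n_trials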

Anarchist Cookbook

• Very different genre from newswire. The writing is much more procedural (e.g. instructions to build telephone phreaking devices) and also very informal (e.g. "When the fuse contacts the balloon, watch out!!!")

• We randomly insert one segment from the Anarchist Cookbook and attempt to identify the outlier. This is repeated 200 times for each segment size (100, 500, and 1,000 words)

Cookbook Results

• Remember we are not using any training data, and there is only a 1-in-51 (1.96%) chance of guessing the outlier correctly

Machine Translations

• 35,000 words of Chinese news articles were hand-picked (by Wei Liu) and translated into English using Google's Chinese-to-English translation engine

• Similar in genre to English newswire, but the translations are far from perfect, so the language use is very odd

• 200 test collections are created for each segment size as before

MT Results

Conclusions and Future Work

• Outlier detection can be a valuable tool for corpus linguistics (if we want a homogeneous corpus)

o Automatically clean corpora

o Does not require training data or human annotation

• This method can be used reliably for relatively large pieces of text (1,000 words); the threshold could be adjusted to ensure high precision at the expense of recall

• We are looking at ways to increase accuracy by choosing the SDE projection directions and the outlier cutoff more intelligently
