
An Unsupervised Approach for the Detection of Outliers in Corpora

David Guthrie, Louise Guthrie, Yorick Wilks

The University of Sheffield

Corpora in CL

• Increasingly common in computational linguistics to use textual resources gathered automatically

o IR, scraping the Web, etc.

• Construct corpora from specific blogs, bulletin boards, websites (Wikipedia, RottenTomatoes)

Corpora Can Contain Errors

• IR and scraping can lead to errors in precision

• Can contain entries that might be considered spam:

o Advertising

o Gibberish messages

o (more subtly) information that is an opinion rather than a fact, rants about political figures

Difficult to verify

• The quality of corpora has a dramatic impact on the results of QA, ASR, TC, etc.

• Creation and validation of corpora has generally relied on humans

Goals

• Improve the consistency and quality of corpora

• Automatically identify and remove text from corpora that does not belong

Approach

• Treat the problem as a type of outlier detection

• We aim to find pieces of text in a corpus that differ significantly from the majority of text in that corpus and thus are ‘outliers’

Method

• Characterize each piece of text (document, segment, paragraph, …) in our corpus as a vector of features

• Use these vectors to construct a matrix, X, with one row per piece of text in the corpus and one column per feature (a minimal sketch follows)
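A minimal sketch (in Python, not the authors' code) of assembling such a matrix; extract_features here is only a placeholder for the 158-feature extractor described on the following slides:

import numpy as np

def extract_features(text):
    """Map one piece of text to a fixed-length feature vector (placeholder)."""
    words = text.split()
    return np.array([
        len(words),                                      # token count
        sum(len(w) for w in words) / max(len(words), 1)  # mean word length
    ])

def build_feature_matrix(texts):
    """Stack the per-text vectors into X: one row per text, one column per feature."""
    return np.vstack([extract_features(t) for t in texts])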

Feature Matrix

[Figure: the matrix X — each piece of text is represented as a row vector of features]

Characterizing Text

• 158 features computed for every piece of text, many of which have been used successfully for genre classification by Biber, Kessler, Argamon, … (a small extraction sketch follows the list)

o Simple Surface Features

o Readability Measures

o POS Distributions (RASP)

o Vocabulary Obscurity

o Emotional Affect (General Inquirer Dictionary)
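A rough illustration (assumed, not the authors' extractor) of a few such features: two surface features, a Flesch reading-ease score with a crude syllable count, and vocabulary obscurity as the fraction of tokens missing from a frequent-word list (common_words is a placeholder the user must supply):

import re

VOWEL_GROUPS = re.compile(r"[aeiouy]+")

def rough_syllables(word):
    # Crude approximation: count groups of vowels in the word.
    return max(1, len(VOWEL_GROUPS.findall(word.lower())))

def surface_and_readability_features(text, common_words=frozenset()):
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words, n_sents = max(len(words), 1), max(len(sentences), 1)
    mean_word_len = sum(len(w) for w in words) / n_words
    mean_sent_len = n_words / n_sents
    # Flesch reading ease: 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word)
    flesch = (206.835 - 1.015 * mean_sent_len
              - 84.6 * sum(rough_syllables(w) for w in words) / n_words)
    obscurity = sum(w.lower() not in common_words for w in words) / n_words
    return [mean_word_len, mean_sent_len, flesch, obscurity]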

Feature Matrix

X

Identify Outlying Text

Outliers are ‘hidden’

SDE

• Use the Stahel-Donoho Estimator (SDE) to identify outliers

o Project the data down to one dimension and measure the outlyingness of each piece of text in that dimension

o For every piece of text, the goal is to find a projection of the data that maximizes its robust z-score

o Especially suited to data with a large number of dimensions (features)

[Figure: example projections — in one direction the robust z-score of the furthest point is < 3; in another, the robust z-score of the outlying (triangle) points is > 12 standard deviations]

SDE

SD(x_i) = sup over unit vectors a of |x_i·a − med_j(x_j·a)| / mad_j(x_j·a)

• where a is a direction (unit-length vector),

• x_i·a is the projection of row x_i onto direction a, and

• med and mad are the median and the median absolute deviation, taken over the projections x_j·a of all rows
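A minimal sketch of this score using randomly sampled unit directions (a common approximation; the paper's exact direction-search strategy may differ):

import numpy as np

def stahel_donoho_outlyingness(X, n_directions=1000, seed=0):
    """Approximate SD(x_i) for every row of X by sampling random unit directions."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    A = rng.normal(size=(n_directions, p))
    A /= np.linalg.norm(A, axis=1, keepdims=True)      # unit-length directions a
    proj = A @ X.T                                      # proj[k, i] = x_i . a_k
    med = np.median(proj, axis=1, keepdims=True)
    mad = np.median(np.abs(proj - med), axis=1, keepdims=True)
    mad[mad == 0] = np.finfo(float).eps                 # avoid division by zero
    z = np.abs(proj - med) / mad                        # robust z-score per direction
    return z.max(axis=0)                                # sup over sampled directions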

Outliers have a large SD

• The outlyingness scores SD(x_i) for each piece of text are then sorted, and all pieces of text above a cutoff are marked as outliers (a small flagging sketch follows)
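A small flagging sketch; the cutoff value itself is left as a parameter (the paper's specific choice is not reproduced here), e.g. a high quantile of the scores:

def flag_outliers(sd_scores, cutoff):
    """Return indices of texts whose outlyingness exceeds the cutoff, most extreme first."""
    order = sd_scores.argsort()[::-1]
    return [int(i) for i in order if sd_scores[i] > cutoff]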

Experiments

• In each experiment we randomly select 50 segments of text from the Gigaword corpus of newswire and insert one piece of text from a different source to act as an ‘outlier’

• Measure the accuracy of automatically identifying the inserted segment as an outlier

• We varied the size of the pieces of text from 100 to 1,000 words (the test-collection construction is sketched below)
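A sketch of this protocol, reusing the helper functions from the earlier sketches; newswire_segments and other_source_segments are assumed to be lists of equal-size text segments prepared by the user (the names are illustrative, not from the original work):

import numpy as np

def run_trial(newswire_segments, outlier_segment, rng):
    """Build one 51-segment collection and check whether the inserted segment
    gets the highest outlyingness score (a simplification of the cutoff rule)."""
    chosen = rng.choice(len(newswire_segments), size=50, replace=False)
    texts = [newswire_segments[i] for i in chosen] + [outlier_segment]
    X = build_feature_matrix(texts)
    sd = stahel_donoho_outlyingness(X)
    return sd.argmax() == 50                            # index 50 is the inserted text

def accuracy(newswire_segments, other_source_segments, n_trials=200, seed=0):
    rng = np.random.default_rng(seed)
    hits = sum(run_trial(newswire_segments,
                         other_source_segments[rng.integers(len(other_source_segments))],
                         rng)
               for _ in range(n_trials))
    return hits / n_trials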

Anarchist Cookbook

• Very different genre from newswire. The writing is much more procedural (e.g. instructions to build telephone phreaking devices) and also very informal (e.g. "When the fuse contacts the balloon, watch out!!!")

• We randomly insert one segment from the Anarchist Cookbook and attempt to identify the outlier. This is repeated 200 times for each segment size (100, 500, and 1,000 words)

Cookbook Results

• Remember we are not using any training data, and there is only a 1-in-51 (1.96%) chance of guessing the outlier correctly

Machine Translations

• 35,000 words of Chinese news articles were hand-picked (by Wei Liu) and translated into English using Google's Chinese-to-English translation engine

• Similar in genre to English newswire, but the translations are far from perfect, so the language use is very odd

• 200 test collections are created for each segment size as before

MT Results

Conclusions and Future Work

• Outlier detection can be a valuable tool for corpus linguistics (if we want a homogeneous corpus)

o Automatically clean corpora

o Does not require training data or human annotation

• This method can be used reliably for relatively large pieces of text (1,000 words); the threshold could be adjusted to ensure high precision at the expense of recall

• We are looking at ways to increase accuracy by choosing the SDE projection directions and the outlier cutoff more intelligently
