Measuring the Quality of Web Content using Factual Information

Funded by the Competence Centers program

www.know-center.at

© Know-Center 2012

16 April 2012

WebQuality 2012 workshop at WWW 2012

Elisabeth Lex, Michael Voelske, Marcelo Errecalde, Edgardo Ferretti, Leticia Cagnina, Christopher Horn, Benno Stein and Michael Granitzer

Agenda

Motivation

Approach

Results

Summary and Outlook

Motivation

People's decisions are often based on Web content

Web content lacks quality control and verification

Inaccurate, incorrect information; no fact checking

Measures are needed to capture credibility and quality aspects

With respect to facts!

Approach

Measure information quality based on factual information

3 Approaches:

Use simple statistics about the facts obtained from text

Exploit relational information contained in facts

Use semantic relationships like meronymy and hypernymy

First approach:

Use simple statistical features about facts in a document

Indicates how informative a document is

Derive facts from Web content using Open Information Extraction

Definition of Factual Density

Fact Count

Factual Density
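
The formulas on this slide did not survive extraction. The following is a minimal sketch, assuming that fact count is the number of extracted (e1, relation, e2) triples in a document and that factual density is this count normalized by document length in words. The word-based normalization and the helper names are assumptions for illustration, not necessarily the authors' exact definition.

```python
from typing import List, Tuple

Fact = Tuple[str, str, str]  # (e1, relation, e2), as produced by an Open IE system


def fact_count(facts: List[Fact]) -> int:
    """Number of facts extracted from one document."""
    return len(facts)


def factual_density(facts: List[Fact], text: str) -> float:
    """Facts per word; normalizing by word count is an assumption."""
    n_words = len(text.split())
    if n_words == 0:
        return 0.0
    return fact_count(facts) / n_words


# Hypothetical extractor output for a two-sentence document
text = "Graz is the capital of Styria. Styria is a state of Austria."
facts = [("Graz", "is the capital of", "Styria"),
         ("Styria", "is a state of", "Austria")]
print(fact_count(facts), round(factual_density(facts, text), 3))
```

Dividing by length is what separates density from raw fact count: a long article with many facts and a short article with few can end up with the same density.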

Experiments

Wikipedia: 1000 Featured and Good articles versus 1000 Non-Featured (randomly selected)

Featured: a comprehensive coverage of the major facts in the context of the article’s subject

Baseline: Word Count [Blumenstock 2008]

Featured articles longer than non-featured

Bias: longer docs contain more facts

Evaluation: 2 Datasets

Unbalanced: articles differ in length

Balanced: articles similar in length
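
One way a length-balanced dataset could be assembled is to pair each featured/good article with an unused non-featured article of similar word count. This is a sketch under that assumption only: the slides do not say how the balanced set was built, and `balance_by_length` and its tolerance parameter are hypothetical.

```python
from typing import List, Tuple


def balance_by_length(featured: List[int], other: List[int],
                      tolerance: float = 0.1) -> List[Tuple[int, int]]:
    """Greedily pair word counts whose relative difference is within tolerance.

    Inputs are word counts per article; returns (featured, non_featured) pairs.
    """
    remaining = sorted(other)
    pairs = []
    for length in sorted(featured):
        # Closest still-unused non-featured article by word count
        best = min(remaining, key=lambda n: abs(n - length), default=None)
        if best is not None and abs(best - length) <= tolerance * length:
            pairs.append((length, best))
            remaining.remove(best)
    return pairs


# Illustrative word counts only
pairs = balance_by_length([5000, 3000, 8000], [400, 2900, 5200, 700])
print(pairs)
```

Articles without a close match are dropped, so word count alone can no longer separate the two classes in the resulting set.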

Distribution of documents in both datasets with respect to word count

Precision/Recall curves of Factual Density
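
A precision/recall curve for a single score such as factual density can be traced by sweeping a decision threshold and labeling every article at or above it as featured/good. The sketch below uses made-up scores; `pr_points` is a hypothetical helper, and the numbers are not results from the experiments.

```python
from typing import List, Tuple


def pr_points(scores: List[float],
              labels: List[bool]) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each distinct score as threshold."""
    total_pos = sum(labels)
    points = []
    for t in sorted(set(scores), reverse=True):
        predicted = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(predicted, labels))
        n_pred = sum(predicted)
        precision = tp / n_pred  # n_pred >= 1 because t is one of the scores
        recall = tp / total_pos if total_pos else 0.0
        points.append((t, precision, recall))
    return points


# Illustrative values only: factual density per article, True = featured/good
scores = [0.12, 0.10, 0.08, 0.05]
labels = [True, False, True, False]
for t, p, r in pr_points(scores, labels):
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```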

Results: Factual Density on balanced corpus

Experiments – Relational Features

Approach 2: exploiting relational information contained in facts

Extract relational features from articles

Use relations from ReVerb: binary relations (e1, relation, e2)

Use them to train a classifier to discriminate between featured/good and non-featured
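
A minimal sketch of one way to turn ReVerb-style triples into classifier input, assuming a bag-of-relations representation in which each document becomes a vector of relation-phrase counts. The representation, the vocabulary construction, and the helper names are assumptions; the slides do not specify the feature encoding or the classifier.

```python
from collections import Counter
from typing import List, Tuple

Fact = Tuple[str, str, str]  # (e1, relation, e2)


def relation_counts(facts: List[Fact]) -> Counter:
    """Count how often each (lowercased) relation phrase occurs in one document."""
    return Counter(rel.lower() for _, rel, _ in facts)


def vectorize(docs: List[List[Fact]]) -> Tuple[List[str], List[List[int]]]:
    """Build a shared relation vocabulary and one count vector per document."""
    counts = [relation_counts(d) for d in docs]
    vocab = sorted(set().union(*counts)) if counts else []
    return vocab, [[c[r] for r in vocab] for c in counts]


# Hypothetical triples for two documents
docs = [
    [("City A", "is located in", "Country B"),
     ("City A", "was founded by", "Person C")],
    [("Band D", "is located in", "City A")],
]
vocab, X = vectorize(docs)
print(vocab)
print(X)
```

The rows of `X`, together with featured/good versus non-featured labels, could then be passed to any standard classifier.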

Summary

Simple fact related measure: Factual Density

Based on Factual Density, featured/good articles can be separated from non-featured ones if article lengths are similar

If articles differ in length, use word count. Future work: combine both measures

Plan to incorporate edit history: more editors, higher factual density

Preliminary experiments with relational features

Promising results, more work in this direction

Goal here is to bring semantics into the field of Information Quality (IQ)

We expect this to unlock several IQ dimensions, e.g. generality vs specificity

Thank you for your attention!

Elisabeth Lex
