Internals Of An Aggregated Web News Feed


newsfeed.ijs.si

Mitja Trampuš and Blaž Novak
AI Lab, Jozef Stefan Institute

[Pipeline diagram: Monitor. Download. | Clean. Enrich. | Expose. Use.]

Monitor. Download.

• Sources: RSS, Google News, private feeds
  – 150 000 feeds
  – 15 000 publishers

• Sources of sources:
  – Bootstrap from public listings
  – Parse news articles for <link> entries
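Parsing articles for <link> entries can be sketched as follows. This is a minimal illustration, not the system's actual code: the class and function names are hypothetical, and it assumes feeds are advertised through the common `<link rel="alternate">` autodiscovery convention.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types that conventionally mark a feed in <link rel="alternate">.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised by <link> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = {k: (v or "") for k, v in attrs}
        rel = a.get("rel", "").lower().split()
        is_feed = a.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel and is_feed and a.get("href"):
            # Resolve relative hrefs against the article URL.
            self.found.append(urljoin(self.base_url, a["href"]))


def discover_feeds(html, base_url):
    """Return the feed URLs advertised in one downloaded article page."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.found
```

Each newly discovered URL could then be added to the list of monitored feeds if it is not already known.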

Monitor. Download.

• Quality management:
  – Punish technical errors
  – Adjustable crawl time

• Discovery delay for articles: 3 hours
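One way to combine error punishment with an adjustable crawl time is an adaptive polling interval per feed. The sketch below is an assumption about how this could look; the slides do not describe the actual policy, and all names, bounds and multipliers are hypothetical.

```python
from dataclasses import dataclass

MIN_INTERVAL = 5 * 60         # seconds; assumed lower bound
MAX_INTERVAL = 24 * 60 * 60   # seconds; assumed upper bound


@dataclass
class FeedState:
    url: str
    interval: float = 60 * 60  # current polling interval in seconds
    errors: int = 0            # consecutive technical errors


def update_schedule(feed, ok, new_articles=0):
    """Adjust the polling interval after one crawl attempt."""
    if not ok:
        # Punish technical errors: poll the feed half as often.
        feed.errors += 1
        feed.interval *= 2
    else:
        feed.errors = 0
        if new_articles > 0:
            feed.interval /= 2    # active feed: poll more often
        else:
            feed.interval *= 1.5  # quiet feed: back off slowly
    feed.interval = max(MIN_INTERVAL, min(MAX_INTERVAL, feed.interval))
    return feed.interval
```

With such a scheme, frequently updated feeds drift toward the lower bound, which is one lever on the discovery delay.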


Clean. 1/2

• Methods in published papers work great
  – If evaluated on 10 sites

• Heuristic: Find the first block-level HTML element with lots of <p>aragraphs
  – failing that, a <td> or <div> with lots of text
  – avoid elements with lots of markup
  – site-independent

• Support for rNews/Schema.org
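The first step of the heuristic (the first block-level element with many paragraphs) can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the threshold, the set of block tags and all names are hypothetical, the input is assumed to be reasonably well-formed, and the <td>/<div> fallback and the markup-density check are left out.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"body", "div", "td", "article", "section"}  # assumed set
MIN_PARAGRAPHS = 3                                        # assumed threshold


class Block:
    def __init__(self, tag):
        self.tag = tag
        self.paragraphs = []  # text of <p> elements directly inside this block


class ParagraphBlockFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_blocks = []  # stack of currently open blocks
        self.blocks = []       # all blocks, in document order
        self.in_p = False
        self.buf = []

    def _flush_paragraph(self):
        # Attach the buffered paragraph text to the innermost open block.
        if self.in_p:
            text = " ".join("".join(self.buf).split())
            if text and self.open_blocks:
                self.open_blocks[-1].paragraphs.append(text)
        self.in_p = False
        self.buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush_paragraph()
            block = Block(tag)
            self.open_blocks.append(block)
            self.blocks.append(block)
        elif tag == "p":
            self._flush_paragraph()  # tolerate unclosed <p>
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush_paragraph()
        elif tag in BLOCK_TAGS and self.open_blocks:
            self._flush_paragraph()
            self.open_blocks.pop()

    def handle_data(self, data):
        if self.in_p:
            self.buf.append(data)


def extract_cleartext(html):
    """Return the paragraphs of the first paragraph-rich block, or ""."""
    finder = ParagraphBlockFinder()
    finder.feed(html)
    finder.close()
    for block in finder.blocks:
        if len(block.paragraphs) >= MIN_PARAGRAPHS:
            return "\n".join(block.paragraphs)
    return ""  # treated as a page with no content
```

Because nothing here depends on class names or ids of a particular site, the sketch stays site-independent in the same sense as the heuristic above.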

Clean. 2/2

• Pitfalls
  – Pages with no content
  – Comments
  – Copyright notices

• Evaluation
  – 150 sites, one page per site
    • include content-less pages
  – 95% precision, 95% recall
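The slides do not say how precision and recall were measured. One plausible reading, shown here purely as an assumption, is token-level overlap between the extracted text and a hand-made gold standard. The function name and the handling of empty pages are hypothetical.

```python
from collections import Counter


def token_precision_recall(extracted, gold):
    """Token-level precision and recall of extracted text against gold text."""
    ext = Counter(extracted.split())
    ref = Counter(gold.split())
    overlap = sum((ext & ref).values())  # multiset intersection
    n_ext = sum(ext.values())
    n_ref = sum(ref.values())
    # Convention assumed here: extracting nothing yields perfect precision,
    # and a content-less gold page yields perfect recall.
    precision = overlap / n_ext if n_ext else 1.0
    recall = overlap / n_ref if n_ref else 1.0
    return precision, recall
```

Under this convention, a content-less page counts as fully correct only if nothing is extracted from it; any extracted comment or copyright notice lowers precision.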


Enrich. 1/2

• Language detection:
  – 50 common languages: Chromium CLD
  – Long tail: Naive Bayes on character trigrams

• Language stats:
  – English 52%, German 7%, Spanish 7%, French 4%, Russian 3%, ..., Chinese 1%, Slovene 0.2%
  – 40 languages with >100 articles daily
  – 99% accuracy
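Naive Bayes over character trigrams, as used for the long tail of languages, can be sketched in a few lines. This is a generic textbook version with add-one smoothing, not the authors' implementation; the class name, smoothing constant and training interface are assumptions.

```python
import math
from collections import Counter, defaultdict


def trigrams(text):
    """Character trigrams of lowercased, whitespace-normalized text."""
    text = " " + " ".join(text.lower().split()) + " "
    return [text[i:i + 3] for i in range(len(text) - 2)]


class TrigramNaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                  # smoothing constant (assumed)
        self.counts = defaultdict(Counter)  # language -> trigram counts
        self.totals = Counter()             # language -> number of trigrams
        self.docs = Counter()               # language -> number of documents
        self.vocab = set()

    def train(self, lang, text):
        grams = trigrams(text)
        self.counts[lang].update(grams)
        self.totals[lang] += len(grams)
        self.docs[lang] += 1
        self.vocab.update(grams)

    def predict(self, text):
        grams = trigrams(text)
        n_docs = sum(self.docs.values())
        v = len(self.vocab) + 1  # one extra bucket for unseen trigrams
        best_lang, best_score = None, -math.inf
        for lang in self.docs:
            # log prior plus sum of smoothed log likelihoods
            score = math.log(self.docs[lang] / n_docs)
            denom = self.totals[lang] + self.alpha * v
            for g in grams:
                score += math.log((self.counts[lang][g] + self.alpha) / denom)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang
```

In practice each language would be trained on far more text than a single sentence; short inputs remain the hardest case for any trigram model.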

Enrich. 2/2

• enrycher.ijs.si
  – DMOZ categorization
  – Named entity detection, resolution
  – (Sentiment)
  – (Deep parsing)
  – English, Slovene, more languages coming

• Geo-tagging
  – Publisher (WHOIS, public listings)
  – Content (named entities)


Expose. Use.

• XML, gzip filesystem cache
• HTTP service (polling)
• Command-line client

• Live demo, API: http://newsfeed.ijs.si/
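A gzip-compressed XML batch, as in the filesystem cache, can be written and read with the standard library. The element names below (`articles`, `article`, `title`, `body`) are hypothetical placeholders and not the actual newsfeed.ijs.si schema.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def write_batch(fileobj, articles):
    """Serialize article dicts as gzip-compressed XML (hypothetical schema)."""
    root = ET.Element("articles")
    for art in articles:
        el = ET.SubElement(root, "article", id=str(art["id"]))
        ET.SubElement(el, "title").text = art["title"]
        ET.SubElement(el, "body").text = art["body"]
    with gzip.GzipFile(fileobj=fileobj, mode="wb") as gz:
        gz.write(ET.tostring(root, encoding="utf-8"))


def read_batch(fileobj):
    """Read back a batch written by write_batch."""
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
        root = ET.fromstring(gz.read())
    return [
        {"id": el.get("id"),
         "title": el.findtext("title"),
         "body": el.findtext("body")}
        for el in root.iter("article")
    ]


# Round trip through an in-memory buffer instead of a real cache file.
buf = io.BytesIO()
write_batch(buf, [{"id": 1, "title": "Naslov š", "body": "Text"}])
buf.seek(0)
batch = read_batch(buf)
```

A polling client would fetch such batches periodically over HTTP and process them in the same way.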

Technology.

• Data volume: 100 000 articles/day
  Peak throughput: 10 articles/second

• One machine for semantic processing
  One machine for everything else

• Processing: Python, (Java, C++)
  Infrastructure: PostgreSQL, zeromq
  – Downloaders communicate through the DB
  – Processing strictly sequential, service-oriented
    • Each service: In case of errors, pass through

The Bright Future.

• Feed quality management

• Increase the number of sources– Non-western in particular

• Compute news clusters

Q&A
mitja.trampus@ijs.si, blaz.novak@ijs.si