Internals of an Aggregated Web News Feed
newsfeed.ijs.si
Mitja Trampuš and Blaž Novak
AI Lab, Jozef Stefan Institute
[Pipeline diagram: Monitor. Download. → Clean. Enrich. → Expose. Use.]
Monitor. Download.
• Sources: RSS, Google News, private feeds
  – 150 000 feeds
  – 15 000 publishers
• Sources of sources:
  – Bootstrap from public listings
  – Parse news articles for <link> entries
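The "parse news articles for <link> entries" step can be sketched as below. This is a minimal illustration, assuming the slide refers to standard feed autodiscovery tags (`<link rel="alternate" type="application/rss+xml" ...>`); the names `FeedLinkParser` and `discover_feeds` are hypothetical and not taken from the actual system.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types that commonly mark a feed autodiscovery link (assumption).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised via <link rel="alternate" ...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = {k: (v or "") for k, v in attrs}
        rel = a.get("rel", "").lower().split()
        is_feed = a.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel and is_feed and a.get("href"):
            # Resolve relative hrefs against the article URL.
            self.feeds.append(urljoin(self.base_url, a["href"]))


def discover_feeds(html, base_url):
    """Return the feed URLs announced in the <head> of an article page."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

Each newly discovered URL would then be added to the list of monitored feeds, so that the crawl of existing sources keeps producing new ones.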
Monitor. Download.
• Quality management:
  – Punish technical errors
  – Adjustable crawl time
• Discovery delay for articles: 3 hours
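One possible reading of "punish technical errors" and "adjustable crawl time" is a per-feed polling interval that grows on failures and shrinks when a feed is productive. The sketch below illustrates that idea only; the factors, bounds, and the names `FeedState` and `next_interval` are all assumptions, not the system's actual policy.

```python
from dataclasses import dataclass

MIN_INTERVAL = 5 * 60         # seconds; assumed lower bound
MAX_INTERVAL = 24 * 60 * 60   # seconds; assumed upper bound


@dataclass
class FeedState:
    interval: float = 60 * 60  # seconds until the next crawl; assumed default


def next_interval(state, had_error, new_articles):
    """Update and return the crawl interval after one crawl attempt."""
    if had_error:
        factor = 2.0    # punish technical errors: back off
    elif new_articles > 0:
        factor = 0.5    # productive feed: check more often
    else:
        factor = 1.5    # nothing new: check less often
    state.interval = min(MAX_INTERVAL, max(MIN_INTERVAL, state.interval * factor))
    return state.interval
```

Under such a scheme, the minimum interval bounds how quickly an article on an active feed can be discovered.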
Clean. 1/2
• Methods in published papers work great
  – If evaluated on 10 sites
• Heuristic: Find the first block-level HTML element with lots of <p>aragraphs
  – failing that, a <td> or <div> with lots of text
  – avoid elements with lots of markup
  – site-independent
• Support for rNews/Schema.org
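The heuristic above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: "lots of paragraphs" is taken to mean at least `min_paragraphs` direct `<p>` children, "lots of text, little markup" is scored as text length per contained tag, and the set of block-level tags is a guess. All names (`Node`, `TreeBuilder`, `extract_body`) are hypothetical.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}
BLOCK = {"article", "section", "div", "td", "body"}  # assumed candidate tags
SKIP = {"script", "style"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children = []  # child Nodes
        self.text = []      # text chunks directly inside this node

    def all_text(self):
        # Direct text first, then children; exact interleaving is not preserved.
        parts = self.text + [c.all_text() for c in self.children]
        return " ".join(p for p in parts if p)

    def count_tags(self):
        return len(self.children) + sum(c.count_tags() for c in self.children)

    def walk(self):
        yield self
        for c in self.children:
            yield from c.walk()


class TreeBuilder(HTMLParser):
    """Builds a very forgiving element tree from possibly sloppy HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self.cur.tag == "p":  # implicit </p>
            self.cur = self.cur.parent
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        n = self.cur
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:  # ignore stray end tags
            self.cur = n.parent

    def handle_data(self, data):
        if self.cur.tag not in SKIP and data.strip():
            self.cur.text.append(data.strip())


def extract_body(html, min_paragraphs=3):
    builder = TreeBuilder()
    builder.feed(html)
    nodes = list(builder.root.walk())  # document order
    # Step 1: first block-level element with enough direct <p> children.
    for n in nodes:
        if n.tag in BLOCK and sum(c.tag == "p" for c in n.children) >= min_paragraphs:
            return " ".join(c.all_text() for c in n.children if c.tag == "p")
    # Step 2: fall back to the <td>/<div> with the most text per contained tag.
    candidates = [n for n in nodes if n.tag in ("td", "div")]
    if not candidates:
        return ""
    best = max(candidates, key=lambda n: len(n.all_text()) / (1 + n.count_tags()))
    return best.all_text()
```

Nothing in the sketch depends on a particular site's templates, which is the point of the "site-independent" bullet.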
Clean. 2/2
• Pitfalls
  – Pages with no content
  – Comments
  – Copyright notices
• Evaluation
  – 150 sites, one page per site
    • include content-less pages
  – 95% precision, 95% recall
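The slides do not say at which granularity precision and recall were measured. As one plausible setup, the sketch below computes both at the token level against a hand-labeled gold text; that choice, the name `precision_recall`, and the convention for empty texts are assumptions.

```python
from collections import Counter


def precision_recall(extracted, gold):
    """Token-level precision and recall of extracted text against gold text."""
    ext, ref = Counter(extracted.split()), Counter(gold.split())
    overlap = sum((ext & ref).values())  # multiset intersection
    # Convention (assumed): an empty side counts as perfect, so a content-less
    # page scores (1.0, 1.0) only when nothing was extracted from it.
    precision = overlap / sum(ext.values()) if ext else 1.0
    recall = overlap / sum(ref.values()) if ref else 1.0
    return precision, recall
```

With this convention, extracting a copyright notice from a content-less page yields a precision of 0, which is why including such pages in the evaluation set matters.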
Enrich. 1/2
• Language detection:
  – 50 common languages: Chromium CLD
  – Long tail: Naive Bayes on character trigrams
• Language stats:
  – English 52%, German 7%, Spanish 7%, French 4%, Russian 3%, ..., Chinese 1%, Slovene 0.2%
  – 40 languages with >100 articles daily
  – 99% accuracy
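For the long-tail languages, the slide names only the technique: Naive Bayes on character trigrams. A minimal sketch of that technique follows; the padding scheme, the Laplace smoothing (`alpha`), and the class name `TrigramNB` are assumptions, and a real model would be trained on far more text per language.

```python
import math
from collections import Counter, defaultdict


def trigrams(text):
    """Lowercased character trigrams, padded so word edges are captured."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                  # Laplace smoothing (assumed)
        self.counts = defaultdict(Counter)  # language -> trigram counts
        self.docs = Counter()               # language -> number of training texts
        self.vocab = set()

    def train(self, lang, text):
        grams = trigrams(text)
        self.counts[lang].update(grams)
        self.docs[lang] += 1
        self.vocab.update(grams)

    def classify(self, text):
        grams = trigrams(text)
        total_docs = sum(self.docs.values())
        v = len(self.vocab)
        best_lang, best_score = None, -math.inf
        for lang, counts in self.counts.items():
            total = sum(counts.values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.docs[lang] / total_docs)
            for g in grams:
                score += math.log((counts[g] + self.alpha) / (total + self.alpha * v))
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang
```

Usage: call `train(lang, text)` once per labeled sample, then `classify(text)` returns the language with the highest posterior score.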
Enrich. 2/2
• enrycher.ijs.si
  – DMOZ categorization
  – Named entity detection, resolution
  – (Sentiment)
  – (Deep parsing)
  – English, Slovene, more languages coming
• Geo-tagging
  – Publisher (WHOIS, public listings)
  – Content (named entities)
Expose. Use.
• XML, gzip filesystem cache
• HTTP service (polling)
• Command-line client
• Live demo, API: http://newsfeed.ijs.si/
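A polling client for gzip-compressed XML batches might look like the sketch below. The slides do not document the XML schema or the HTTP parameters, so everything specific here is hypothetical: the `<article id="...">` and `<title>` element names, the `after` cursor, and the injected `fetch` callable that stands in for the actual HTTP request.

```python
import gzip
import xml.etree.ElementTree as ET


def parse_batch(gz_bytes):
    """Decompress one gzip'd XML batch and return a list of article dicts."""
    root = ET.fromstring(gzip.decompress(gz_bytes))
    return [
        # Element and attribute names are hypothetical placeholders.
        {"id": a.get("id"), "title": a.findtext("title", default="")}
        for a in root.iter("article")
    ]


def poll(fetch, after=None, max_rounds=3):
    """Call fetch(after) repeatedly, yielding articles and advancing the cursor."""
    for _ in range(max_rounds):
        batch = parse_batch(fetch(after))
        if not batch:
            break  # a long-running client would sleep here and retry later
        yield from batch
        after = batch[-1]["id"]  # only ask for newer articles next time
```

Passing `fetch` in as a parameter keeps the sketch independent of the real endpoint; a client would supply a function that performs the HTTP GET and returns the response body.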
Technology.
• Data volume: 100 000 articles/day
  Peak throughput: 10 articles/second
• One machine for semantic processing
  One machine for everything else
• Processing: Python, (Java, C++)
  Infrastructure: PostgreSQL, zeromq
  – Downloaders communicate through the DB
  – Processing strictly sequential, service-oriented
    • Each service: In case of errors, pass through
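The "strictly sequential, pass through on errors" design can be illustrated with a small sketch. Here plain in-process callables stand in for the services, which in the described system are separate processes connected via zeromq; the function name `run_pipeline` and the dict-based article representation are assumptions.

```python
import logging

log = logging.getLogger("pipeline")


def run_pipeline(article, services):
    """Apply each service in order; a failing service leaves the article as-is."""
    for service in services:
        try:
            article = service(article)
        except Exception:
            # Pass through: log the failure and hand the unmodified
            # article to the next service instead of dropping it.
            name = getattr(service, "__name__", repr(service))
            log.exception("%s failed; passing article through", name)
    return article
```

With this policy a failure in one enrichment step costs only that step's annotations; the article still reaches the output with whatever the other services added.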
The Bright Future.
• Feed quality management
• Increase the number of sources– Non-western in particular
• Compute news clusters