FeedMe - a semantic RSS aggregator

Nikola Ljubešić, Damir Boras, Mislav Cimperšak, Marija Tkalec

Faculty of Humanities and Social Sciences, University of Zagreb

8 June 2010


Overview

1. The basic idea

2. Our system

3. Statistical analysis of collected data

4. Usage examples

1. The basic idea

Aggregating news

• collecting news from different information sources and publishing them as a single source

• manual or automated

• automated aggregation repeats information, so the collected news needs analysis and organization


Existing aggregators

• Google News

• EMM NewsExplorer

• MondoPress


RSS

• RSS (Really Simple Syndication) - family of web feed formats used to publish frequently updated works

• XML file - readable by humans and machines

• RSS is structured, while (X)HTML today still is not - data harvesting is easier through RSS
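To illustrate why a structured feed is easy to harvest, here is a minimal sketch that reads the items of an RSS 2.0 document with Python's standard library. The sample feed, its URLs and the function name `parse_rss_items` are made up for illustration and are not taken from FeedMe.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document, used only for illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example portal</title>
    <link>http://example.org/</link>
    <description>Hypothetical feed</description>
    <item>
      <title>First headline</title>
      <link>http://example.org/news/1</link>
      <description>Short summary of the first story.</description>
      <pubDate>Mon, 07 Jun 2010 09:30:00 +0200</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.org/news/2</link>
      <description>Short summary of the second story.</description>
      <pubDate>Mon, 07 Jun 2010 10:00:00 +0200</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict per <item> element of an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.findall("./channel/item"):
        entries.append({
            # findtext returns the default when the child element is missing
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return entries


entries = parse_rss_items(SAMPLE_RSS)
for entry in entries:
    print(entry["published"], entry["title"], entry["link"])
```

Every field sits in its own named element, so no page-specific scraping rules are needed, which is the advantage over (X)HTML noted above.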


Google Reader

• on-line RSS aggregator

• problems

  • loss of information

  • repeating information

  • unwanted information


Our idea

• collect RSS server-side - no loss of entries

• cluster RSS entries by their content - composite entries, no duplicates

• enable users to filter information - “affirmate” or “negate” specific feeds
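The slides do not say how entries are clustered. As a rough sketch only, the example below groups entries whose titles share enough words, using Jaccard similarity and a greedy single pass. The similarity measure, the threshold of 0.5, the use of titles alone and all names (`tokens`, `jaccard`, `cluster_entries`) are assumptions made for illustration, not the FeedMe method.

```python
import re


def tokens(text):
    """Set of lowercased word tokens in a text."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_entries(entries, threshold=0.5):
    """Greedy single-pass clustering (an assumed stand-in technique).

    Each entry joins the most similar existing cluster if the similarity
    reaches the threshold; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: {"tokens": set, "entries": list}
    for entry in entries:
        entry_tokens = tokens(entry["title"])
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(entry_tokens, cluster["tokens"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["entries"].append(entry)
            best["tokens"] |= entry_tokens
        else:
            clusters.append({"tokens": set(entry_tokens), "entries": [entry]})
    return [cluster["entries"] for cluster in clusters]


# Hypothetical entries from three hypothetical feeds.
sample_entries = [
    {"feed": "portal-a", "title": "Government announces new budget plan"},
    {"feed": "portal-b", "title": "New budget plan announced by government"},
    {"feed": "portal-c", "title": "Local team wins the championship final"},
]
for cluster in cluster_entries(sample_entries):
    print([entry["feed"] for entry in cluster])
```

The first two titles share four of seven distinct words, so they end up in one composite entry, while the third starts its own cluster.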


Filtering

• publish only clustered entries containing n or more original feed entries

• “affirmate” feeds - publish only clustered entries containing at least one original entry from each “affirmative” feed

• “negate” feeds - do not publish clustered entries containing an original entry from any negated feed
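The three filtering rules above can be sketched as a single predicate over a cluster. The data layout (a cluster as a list of dicts with a `feed` key) and the names `passes_filter`, `min_entries`, `affirmed` and `negated` are hypothetical. The sketch reads the "affirmate" rule as requiring an entry from every affirmed feed, which is one interpretation of the slide.

```python
def passes_filter(cluster, min_entries=1, affirmed=(), negated=()):
    """Decide whether a clustered entry should be published.

    cluster: list of original entries, each a dict with a "feed" key
             (hypothetical layout, chosen for this sketch).
    """
    feeds_in_cluster = {entry["feed"] for entry in cluster}
    # Rule 1: the cluster must contain n or more original entries.
    if len(cluster) < min_entries:
        return False
    # Rule 2: every affirmed feed must contribute at least one entry
    # (assumed reading of the "affirmate" rule).
    if not set(affirmed) <= feeds_in_cluster:
        return False
    # Rule 3: no entry may come from a negated feed.
    if feeds_in_cluster & set(negated):
        return False
    return True


# Hypothetical clusters built from four hypothetical feeds a, b, c, d.
sample_clusters = [
    [{"feed": "a"}, {"feed": "b"}, {"feed": "c"}],
    [{"feed": "a"}, {"feed": "d"}],
    [{"feed": "b"}],
]
published = [
    cluster for cluster in sample_clusters
    if passes_filter(cluster, min_entries=2, affirmed={"a"}, negated={"d"})
]
print(len(published))
```

With these settings only the first cluster is published: the second contains the negated feed d, and the third is too small and lacks the affirmed feed a.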

2. Our system

FeedMe

• back-end - collects RSS entries every half hour and organizes them into clusters

• front-end - web application for

  • creating groups of feeds (filtering - minimum elements, affirmating, negating)

  • browsing the compiled groups

  • publishing groups as new RSS feeds
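As an illustration of the last front-end function, the sketch below serializes a group of clusters as a new RSS 2.0 feed. How FeedMe actually renders a clustered entry is not described in the slides. Taking the title and link from the first entry, listing all source links in the description, and the names `build_group_feed` and `demo_clusters` are assumptions.

```python
import xml.etree.ElementTree as ET


def build_group_feed(group_title, group_link, clusters):
    """Serialize clusters as an RSS 2.0 document, one <item> per cluster."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # title, link and description are the required channel elements in RSS 2.0
    ET.SubElement(channel, "title").text = group_title
    ET.SubElement(channel, "link").text = group_link
    ET.SubElement(channel, "description").text = "Clustered entries for " + group_title
    for cluster in clusters:
        item = ET.SubElement(channel, "item")
        # Assumption: the first entry of a cluster represents the cluster.
        ET.SubElement(item, "title").text = cluster[0]["title"]
        ET.SubElement(item, "link").text = cluster[0]["link"]
        sources = ", ".join(entry["link"] for entry in cluster)
        ET.SubElement(item, "description").text = (
            f"{len(cluster)} source entries: {sources}"
        )
    return ET.tostring(rss, encoding="unicode")


# Hypothetical clusters with hypothetical links.
demo_clusters = [
    [
        {"title": "Budget plan announced", "link": "http://example.org/a/1"},
        {"title": "New budget plan", "link": "http://example.org/b/7"},
    ],
    [
        {"title": "Championship final", "link": "http://example.org/c/3"},
    ],
]
print(build_group_feed("Politics", "http://example.org/groups/politics", demo_clusters))
```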

3. Statistical analysis of collected data

The collected data

• 388 RSS feeds

• 38 different portals

• collected since 2010-05-10

• more than 100,000 entries

• approx. 30,000 clusters


Distribution of documents by cluster size

[Figure: bar chart. x-axis: cluster size, 1 to 16; y-axis: 0 to 0.80.]
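A distribution of this kind can be computed from the list of cluster sizes. The sketch below assumes that the vertical axis is the share of all documents that fall into clusters of a given size. That reading, the function name `document_share_by_cluster_size` and the sample sizes are assumptions, not data from the talk.

```python
from collections import Counter


def document_share_by_cluster_size(cluster_sizes):
    """Map each cluster size to the share of documents in clusters of that size."""
    total_docs = sum(cluster_sizes)
    if total_docs == 0:
        return {}
    docs_per_size = Counter()
    for size in cluster_sizes:
        # A cluster of size k contributes k documents to bucket k.
        docs_per_size[size] += size
    return {
        size: docs / total_docs
        for size, docs in sorted(docs_per_size.items())
    }


# Hypothetical sizes: six singletons, one cluster of 2, one cluster of 4.
sample_sizes = [1, 1, 1, 1, 1, 1, 2, 4]
shares = document_share_by_cluster_size(sample_sizes)
for size, share in shares.items():
    print(size, round(share, 2))
```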


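Portals publishing on “large” events (>2)

[Figure: horizontal bar chart, axis from 0 to 80, one bar per portal: Net.hr, Monitor.hr, Tportal.hr, Index.hr, Dnevnik.hr, Nacional.hr, Jutarnji.hr, HRT.hr, 24sata.hr, Vecernji.hr, SlobodnaDalmacija.hr, RTL.hr. Bar values, sorted ascending and not paired with the portal order: 16, 19, 24, 27, 30, 45, 49, 54, 64, 66, 68, 77.]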


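Portals publishing new stories first

[Figure: horizontal bar chart, axis from 0 to 200, one bar per portal: Index.hr, Net.hr, Monitor.hr, Dnevnik.hr, Nacional.hr, Tportal.hr, Jutarnji.hr, Vecernji.hr, HRT.hr, SlobodnaDalmacija.hr, 24sata.hr, RTL.hr. Bar values, sorted ascending and not paired with the portal order: 31, 50, 51, 59, 62, 121, 122, 131, 143, 151, 161, 195.]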


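Portals publishing new stories first (normalized by portal size)

[Figure: horizontal bar chart, axis from 0 to 0.39, one bar per portal: Tportal.hr, Jutarnji.hr, Net.hr, HRT.hr, Vecernji.hr, Nacional.hr, Dnevnik.hr, Monitor.hr, RTL.hr, Index.hr, 24sata.hr, SlobodnaDalmacija.hr. Bar values, sorted ascending and not paired with the portal order: 0.31, 0.31, 0.31, 0.32, 0.32, 0.32, 0.32, 0.34, 0.35, 0.38, 0.38, 0.39.]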
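One way to obtain a "first to publish" statistic and its normalized variant is sketched below. The slides do not define "portal size". The sketch assumes it is the number of clusters in which a portal appears. The data layout and the name `first_publisher_stats` are likewise hypothetical.

```python
from collections import Counter
from datetime import datetime


def first_publisher_stats(clusters):
    """Return {portal: (first_count, first_count / clusters_with_portal)}."""
    first = Counter()    # clusters in which the portal published earliest
    present = Counter()  # clusters in which the portal appears at all
    for cluster in clusters:
        earliest = min(cluster, key=lambda entry: entry["published"])
        first[earliest["portal"]] += 1
        for portal in {entry["portal"] for entry in cluster}:
            present[portal] += 1
    # Assumption: "portal size" = number of clusters the portal takes part in.
    return {portal: (first[portal], first[portal] / present[portal])
            for portal in present}


# Hypothetical clusters from three hypothetical portals A, B and C.
sample = [
    [{"portal": "A", "published": datetime(2010, 6, 7, 8, 0)},
     {"portal": "B", "published": datetime(2010, 6, 7, 8, 30)}],
    [{"portal": "B", "published": datetime(2010, 6, 7, 9, 0)},
     {"portal": "A", "published": datetime(2010, 6, 7, 9, 10)},
     {"portal": "C", "published": datetime(2010, 6, 7, 9, 20)}],
    [{"portal": "A", "published": datetime(2010, 6, 7, 10, 0)},
     {"portal": "C", "published": datetime(2010, 6, 7, 10, 5)}],
]
stats = first_publisher_stats(sample)
for portal in sorted(stats):
    count, share = stats[portal]
    print(portal, count, round(share, 2))
```

Normalizing matters because a portal that publishes a lot is first more often in absolute terms simply by taking part in more clusters.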


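Plagiarism?

[Figure: horizontal bar chart, axis from 0 to 0.30, one bar per portal: Tportal.hr, Dnevnik.hr, Nacional.hr, Net.hr, Jutarnji.hr, Index.hr, Monitor.hr, SlobodnaDalmacija.hr, HRT.hr. Bar values, sorted ascending and not paired with the portal order: 0.01, 0.01, 0.01, 0.01, 0.02, 0.03, 0.06, 0.09, 0.24.]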

4. Usage examples

Filtering by minimum number of elements


Filtering by affirmating feeds


Filtering by negating feeds


Future steps

• user-defined RSS sources

• full-text news portals

• different sources - social networks

• topic tracking

• named entity identification

• sentiment analysis and mining


Thank you! Questions?
