Research Methodology

RSS Join Engine

Maroun Baydoun inf1312, OGL

Marwan Azzam inf1311, OGL

Thursday, 14 May, 2010

Contents

1 Introduction
2 Related Work
2.1 Joining RSS Feeds
2.2 Comparing Text Documents
2.2.1 Document Index Graph Model
2.3 Relating RSS Items (News)
2.4 Websites and Applications
3 Hypothesis
4 Architecture
5 Pseudo-code
6 Development
7 Implementation
8 Simulation
9 Consideration
9.1 Advantages
9.2 Disadvantages
9.3 Possible improvements
9.4 Other applications
10 Conclusion

1 Introduction

RSS (most commonly expanded as "Really Simple Syndication") is a family of web feed formats used to publish frequently updated works, such as blog entries, news headlines, audio, and video, in a standardized format.

Internet users often subscribe to many RSS feeds to stay up to date on the latest news. However, this information is spread out across many sources, which makes it difficult for users to keep track of the most important headlines. A tool is therefore needed that brings together news from different sources and presents the user with a single feed containing the intersection of all the other feeds.

2 Related Work

2.1 Joining RSS Feeds

To intersect different feeds, we first define semantic relatedness between RSS elements and items, so that relations are determined by the meaning of terms rather than by their syntactic properties alone.

XML documents can be compared according to:

• Their structure (structure-based similarity).

• Their content (content-based similarity).

• A combination of both (hybrid similarity).

RSS news items can be related in several ways:

• Inclusion: a news item can be completely included in another news item.

• Intersection: two news items might refer to similar concepts. We say the two news items intersect.

• Opposition: two news items might refer to the same topic but in opposite ways.

2.2 Comparing Text Documents

2.2.1 Document Index Graph Model

In order to relate RSS feeds, it is helpful to use a document clustering technique that segregates documents into groups, so that each group represents a topic different from those represented by the other groups.

Any clustering technique relies on four concepts: a data representation model, a similarity measure, a cluster model, and a clustering algorithm that builds the clusters based on the similarity measure and the data model.

Traditional document clustering techniques rely on single-term analysis and use the Vector Space Model. The Vector Space Model represents a document as a vector of terms. The vector contains the weights of the terms (for example, their frequencies). Similarity between documents can be measured by applying similarity measures (such as the cosine measure) to the vectors. This model uses single-term analysis only, with no word proximity or phrase-based analysis.
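As an illustration of the Vector Space Model, the following minimal Python sketch represents each document as a term-frequency vector and compares two documents with the cosine measure. The whitespace tokenization and the function names are assumptions made for this example; they are not part of the application described in this paper.

```python
import math
from collections import Counter


def term_vector(text):
    """Map each lowercase term to its frequency (used here as the term weight)."""
    return Counter(text.lower().split())


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term-frequency vectors of two documents."""
    vec_a, vec_b = term_vector(doc_a), term_vector(doc_b)
    dot = sum(weight * vec_b[term] for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Word order is ignored: both documents produce the same vector.
print(cosine_similarity("dog bites man", "man bites dog"))
```

Because only term weights are compared, two documents that use the same words in a different order receive the maximum score, which illustrates what single-term analysis cannot distinguish.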

Even though this approach is widely used, it proves insufficient because it leaves out phrase analysis. A better method is therefore to combine single-term and phrase analysis.

This brings us to the introduction of a new model, the Document Index Graph (DIG).

The DIG is based on graph theory. It uses graph properties to match sentences of any length from a document against any number of previously seen documents. The time taken by this process is proportional to the number of words in the new document.

2.3 Relating RSS Items (News)

Each node (element) of an RSS tree is a pair e = ⟨η, ζ⟩, where e.η is the element name and e.ζ is its content.

The concept of neighborhood

Neighborhood is used to identify relationships between texts, and consequently between RSS elements. Neighborhoods can be classified as follows:

• Semantic Neighborhood: The semantic neighborhood of a concept Ci is defined as the set of concepts Cj in a given knowledge base KB that are related to Ci via the hyponymy ( ≺ ) or meronymy ( ≺≺ ) semantic relations, directly or via transitivity.

• Global Semantic Neighborhood: The global semantic neighborhood of a concept Ci is the union of its semantic neighborhoods with respect to the synonymy ( ≡ ), hyponymy ( ≺ ) and meronymy ( ≺≺ ) relations taken together.

• Antonym Neighborhood: The antonym neighborhood of a concept Ci is defined as the set of concepts Cj in a given knowledge base KB that are related to Ci via the antonymy relation ( ω ), directly or transitively via synonymy ( ≡ ), hyponymy ( ≺ ) or hypernymy ( ≻ ).
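The idea of a neighborhood reached "directly or via transitivity" can be sketched as a graph traversal. The knowledge base below is a hypothetical toy example: the concept names, the merging of hyponymy and meronymy into one edge list, and the direction in which edges are followed are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical toy knowledge base: each concept maps to the concepts it is
# directly related to via hyponymy or meronymy. Names are illustrative only.
KB = {
    "car": ["vehicle"],
    "vehicle": ["artifact"],
    "wheel": ["car"],
}


def semantic_neighborhood(concept, kb):
    """Return the concepts reachable from `concept` directly or transitively."""
    seen = set()
    queue = deque(kb.get(concept, []))
    while queue:
        current = queue.popleft()
        if current in seen or current == concept:
            continue  # already visited, or a cycle back to the start
        seen.add(current)
        queue.extend(kb.get(current, []))
    return seen


print(sorted(semantic_neighborhood("wheel", KB)))
```

A real knowledge base would keep the relation types separate, so that the global and antonym neighborhoods can follow different combinations of relations.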

Relations and relatedness of RSS elements:

For two simple elements e1 and e2, the Element Relatedness (ER) algorithm returns a pair consisting of the semantic relatedness value SemRel and the relation Relation, both derived from the text relatedness (TR) of the corresponding labels and content values.

Relation is determined by a rule-based method that combines the label and value relationships as follows:

• Elements e1 and e2 are disjoint if either their labels or values are disjoint.

• Element e1 includes e2, if e1.η includes e2.η and e1.ζ includes e2.ζ.

• Two elements e1 and e2 intersect if either their labels or values intersect.

• Two elements e1 and e2 are equal if both their labels and values are equal.

• Two elements e1 and e2 are opposite if both their contents are opposite.
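The rules above can be sketched as a small decision function. The text does not state the order in which the rules are applied, nor what happens when none applies, so the precedence used here (equal, include, disjoint, opposite, intersect) and the None fallback are assumptions of this sketch. "Opposite" is interpreted as the value relation being opposite. Relations are represented as plain strings for brevity.

```python
def element_relation(label_rel, value_rel):
    """Combine the label relation and the value relation of two elements.

    Both arguments are one of: "equal", "include", "intersect",
    "opposite", "disjoint". The rule order is an assumption.
    """
    if label_rel == "equal" and value_rel == "equal":
        return "equal"
    if label_rel == "include" and value_rel == "include":
        return "include"
    if "disjoint" in (label_rel, value_rel):
        return "disjoint"
    if value_rel == "opposite":
        return "opposite"
    if "intersect" in (label_rel, value_rel):
        return "intersect"
    return None  # combination not covered by the stated rules


print(element_relation("equal", "intersect"))
```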

For two RSS items I1 and I2, each containing a group of elements, the Item Relatedness (IR) algorithm returns a pair containing SemRel and Relation.

By combining the relations between their sub-elements, the relation between two items I1 and I2 is identified using the following rule-based method:

• Items I1 and I2 are disjoint if all elements ei of I1 and ej of I2 are disjoint (elements are disjoint if there is no relatedness whatsoever between them, i.e., SemRel(I1, I2) = 0).

• Item I1 includes I2 if all elements ei of I1 include all elements ej of I2.

• Two items I1 and I2 intersect if at least two of their elements intersect.

• Two items I1 and I2 are equal if all elements ei of I1 are equal to all elements ej of I2.

• Two items I1 and I2 are opposite if at least two of their respective elements are opposite.
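As a sketch of how the item-level rules could be applied, the function below takes the list of relations already computed between the elements of two items and returns the item relation. How element pairs are formed, the precedence between the "opposite" and "intersect" rules, and the None result for uncovered cases are assumptions of this example.

```python
def item_relation(element_relations):
    """Derive the relation between two items from their element relations.

    `element_relations` is a list of strings such as "equal", "include",
    "intersect", "opposite" or "disjoint", one per compared element pair.
    """
    rels = list(element_relations)
    if not rels or all(r == "disjoint" for r in rels):
        return "disjoint"
    if all(r == "equal" for r in rels):
        return "equal"
    if all(r == "include" for r in rels):
        return "include"
    if rels.count("opposite") >= 2:
        return "opposite"
    if rels.count("intersect") >= 2:
        return "intersect"
    return None  # combination not covered by the stated rules


print(item_relation(["intersect", "intersect", "disjoint"]))
```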

2.4 Websites and Applications

Many websites and applications provide services related to RSS feed aggregation, but none of them implements an RSS join engine based on semantics. They simply enable users to merge many news feeds into a single feed.

These tools include:

• xFruits (http://www.xfruits.com)

• Flock (http://flock.sourceforge.net/index.html)

• RSSOwl (http://www.rssowl.org)

• BlogBridge (http://www.blogbridge.com)

• Yahoo Pipes (http://pipes.yahoo.com)

• Feedzeo (http://feedzeo.sourceforge.net)

3 Hypothesis

Based on the work presented above, the simplest method is to consider two phrases similar if they have a predefined number of words in common. However, this method has substantial weaknesses because it neglects the semantics of the phrases: sentences that are written differently but convey the same meaning will be deemed dissimilar.
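The weakness of the word-count method can be seen in a short Python sketch. The function name, the default threshold of three words, and the example sentences are assumptions chosen for illustration.

```python
def shares_enough_words(phrase_a, phrase_b, min_common=3):
    """Naive test: similar if at least `min_common` distinct words are shared."""
    words_a = set(phrase_a.lower().split())
    words_b = set(phrase_b.lower().split())
    return len(words_a & words_b) >= min_common


# Same meaning, different wording: no word is shared, so this prints False.
print(shares_enough_words("Stocks fell sharply today",
                          "Share prices dropped steeply this afternoon"))

# Different meaning, same words: every word is shared, so this prints True.
print(shares_enough_words("the dog bit the man", "the man bit the dog"))
```

Both results are wrong from a reader's point of view, which motivates a method that takes word order and phrases into account.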

Thus, the proposed solution consists of implementing a phrase-based document similarity algorithm based on an index graph model to create an RSS join engine. This engine takes up to five RSS feeds as input, places a window on each feed, and then runs the similarity algorithm in order to intersect the feeds.

4 Architecture

1. Feeds: The user can enter up to five RSS feeds.

2. Parser: The application relies on the ROME RSS feed parser, which accepts the URL of the feed as a parameter and returns the list of its items.

3. Windows: On every list of items, we place a window, which contains the five most recent items.

4. Join Engine: The join engine applies the phrase-based document similarity algorithm to the items contained in the windows.
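The windowing step of this architecture can be sketched as follows. This is an illustration in Python rather than the application's own code: feed items are represented as dictionaries with hypothetical "title" and "published" keys, and the window size and feed limit are parameters whose defaults follow the values given above.

```python
from datetime import datetime


def make_window(items, size=5):
    """Return the `size` most recent items of one feed, newest first."""
    ordered = sorted(items, key=lambda item: item["published"], reverse=True)
    return ordered[:size]


def build_windows(feeds, size=5, max_feeds=5):
    """Place one window on each feed; at most `max_feeds` feeds are accepted."""
    if len(feeds) > max_feeds:
        raise ValueError(f"at most {max_feeds} feeds are supported")
    return [make_window(items, size) for items in feeds]


# Hypothetical feed with eight items published on consecutive days.
feed = [{"title": f"item {day}", "published": datetime(2010, 5, day)}
        for day in range(1, 9)]
window = build_windows([feed])[0]
print([item["title"] for item in window])
```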

5 Pseudo-code

CREATE GRAPH:
    FOR EACH feed
        Read feed
        Parse feed
        Create window
        Sort feed items by publish date
        Include the five most recent items in the window
        CALL BUILD GRAPH
    END FOR

BUILD GRAPH:
    Create document
    Fill document with feed items
    FOR EACH sentence in document
        IF first-word of sentence NOT in graph
            Add first-word of sentence to graph
        END IF
        Create list
        FOR EACH word in sentence
            IF (previous-word, word) IS edge in graph
                Extend phrase matches in the list for sentences that continue along (previous-word, word)
                Add the new phrase matches to the list
            ELSE
                Add (previous-word, word) to graph
                Update sentence path in nodes previous-word and word
            END IF
        END FOR
    END FOR
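The graph-building and phrase-matching loop can be sketched in Python as shown below. This is a simplified illustration under stated assumptions, not the application's Java implementation: sentences are tokenized on whitespace, the class and method names are hypothetical, and the edges of a document are added to the graph only after the whole document has been matched, so that a document is never matched against itself.

```python
from collections import defaultdict


class DocumentIndexGraph:
    """Words are nodes; an edge links two consecutive words of a sentence."""

    def __init__(self):
        self.nodes = set()
        # edge (w1, w2) -> list of (doc_id, sentence_no, position of w1)
        self.edges = defaultdict(list)

    def add_document(self, doc_id, sentences):
        """Index a document and return the phrases it shares with earlier ones.

        The result maps each earlier doc_id to a set of maximal matched
        phrases, each given as a tuple of two or more words.
        """
        matches = defaultdict(set)
        new_entries = []
        for s_no, sentence in enumerate(sentences):
            words = sentence.lower().split()
            self.nodes.update(words)
            open_runs = {}  # (doc, sentence, position) -> start index in `words`
            for i in range(1, len(words)):
                edge = (words[i - 1], words[i])
                next_runs = {}
                for other_doc, other_s, pos in self.edges.get(edge, ()):
                    # Extend the run that ended on the previous edge, or start one.
                    start = open_runs.pop((other_doc, other_s, pos - 1), i - 1)
                    next_runs[(other_doc, other_s, pos)] = start
                # Runs that could not be extended are complete phrases.
                for (other_doc, _, _), start in open_runs.items():
                    matches[other_doc].add(tuple(words[start:i]))
                open_runs = next_runs
                new_entries.append((edge, (doc_id, s_no, i - 1)))
            for (other_doc, _, _), start in open_runs.items():
                matches[other_doc].add(tuple(words[start:]))
        for edge, entry in new_entries:
            self.edges[edge].append(entry)
        return dict(matches)


graph = DocumentIndexGraph()
graph.add_document("d1", ["river rafting is fun"])
print(graph.add_document("d2", ["mild river rafting trips"]))
```

Each word of the new document is visited once, and the work per word depends on how many earlier sentences contain the same edge.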

6 Development

The application is a Java web application developed with NetBeans IDE 6.8, using Java EE 6 and GlassFish v3 as the application server. The JavaServer Faces framework is adopted to simplify development.

The application uses several external open-source libraries that are not part of the standard Java distribution:

1. ROME, JDOM: for RSS feed parsing.

2. OpenNLP: for natural language processing.

3. JGraphT: for graph manipulation.

7 Implementation

Every RSS feed is parsed using the ROME and JDOM libraries: the feed URL is given as input, and a SyndFeed is returned as output. The SyndFeed type represents any kind of feed (RSS, Atom). A list of items of type SyndEntry is then obtained from the returned SyndFeed. Next, a window is associated with every feed to hold the fifteen most recent entries from that list.

At the end of this step, there are at most five windows, each containing up to fifteen entries. OpenNLP is then used to split each entry into sentences, which are stored in an array of String. The StringTokenizer class is then used to divide each sentence into an array of words.
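The sentence and word splitting step can be approximated as follows. This Python sketch substitutes simple regular expressions for OpenNLP and StringTokenizer, so its output will differ from theirs on abbreviations, decimal numbers, and similar cases. The function names are hypothetical, and lowercasing and punctuation removal are additions of this sketch.

```python
import re


def split_sentences(text):
    """Rough splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Split a sentence into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


entry = "Markets fell today. Analysts were surprised!"
for sentence in split_sentences(entry):
    print(tokenize(sentence))
```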

After that, JGraphT is used to create an empty directed graph, which forms the basis of the Document Index Graph model. The graph is built as follows:

• For every word, we check if it already exists in the graph; if not, we add it.

• For every two consecutive words, an edge is created in the graph.

Phrase matching and graph building take place simultaneously. Phrase matching proceeds as follows:

• If an edge already exists between two consecutive words, the path in the graph extending from that edge is followed until the last existing edge is reached; a matching phrase is then detected and added to the list of already matched phrases.

• At the end of processing, some of the matching phrases must be eliminated because they do not hold any semantic value.

The remaining matches are then evaluated to assess the degree of similarity between RSS entries. This evaluation takes into account:

• The length of the sentence.

• The weight of the sentence in the entry.

However, phrase-based matching alone can be insufficient. A better approach is to also incorporate single-term similarity. Once the similarity measures between entries are established, a new RSS feed containing the matched entries is created using ROME. This feed is returned as the join result.
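The text does not specify how the phrase-based and single-term measures are combined, or how matched entries are selected. One possible approach, shown here purely as an assumption, is a weighted average followed by a threshold. The weight `alpha`, the threshold, and the function names are all hypothetical.

```python
def combined_similarity(phrase_sim, term_sim, alpha=0.7):
    """Weighted blend of phrase-based and single-term similarity (assumed rule)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * phrase_sim + (1.0 - alpha) * term_sim


def join_entries(scored_pairs, threshold=0.5):
    """Keep the entry pairs whose combined similarity reaches the threshold."""
    return [(a, b) for a, b, score in scored_pairs if score >= threshold]


# Hypothetical scores for two pairs of entries.
pairs = [
    ("entry A1", "entry B3", combined_similarity(0.9, 0.6)),
    ("entry A2", "entry B1", combined_similarity(0.1, 0.3)),
]
print(join_entries(pairs))
```

The pairs that pass the threshold would then be written to the output feed returned as the join result.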

8 Simulation

The application was tested as follows:

1. Launch it.

2. Fill the text fields with RSS feed URLs.

3. Inspect the results.

One example is the following:

Other tests were carried out, and the results are illustrated in the table below:

These results lead to the following observations:

1. Given that the tests are carried out online, the feeds are very likely to change between runs. Therefore, each evaluation may take place on different inputs.

2. The number of matched entries is not directly linked to the number of feeds entered.

9 Consideration

9.1 Advantages

1. Ease of use: simple user interface.

2. High performance: low processing time.

3. Efficiency: minimal use of bandwidth.

9.2 Disadvantages

1. Inputs are limited to five feeds.

2. Not the optimal technique (though it incurs less overhead).

9.3 Possible improvements

1. Increase the maximum number of inputs without sacrificing performance.

2. Improve this technique to include semantic similarity, so that the rate of matched entries increases.

3. Create versions of the application for different platforms, such as mobile phones and desktop applications.

9.4 Other applications

This technique can also be useful in other fields of application. It can be applied to match inputs other than RSS feeds.

In general, it can be used to match any text content, such as speeches or research papers.

10 Conclusion

In this paper, we described a technique for creating an RSS join engine. We discussed how RSS feeds can be joined. We then examined how text documents can be compared, focusing on the Document Index Graph model in the context of phrase-based document similarity. Next, we enumerated the ways in which RSS items can be related, before reviewing existing websites and applications that attempt to join RSS feeds and finding that none of them is well adapted to this task.

By adopting a technique based on the Document Index Graph model, we were able to take advantage of the efficiency and ease of implementation of that model. This technique can be further improved, and can even be applied to fields other than RSS feeds.
