Research Methodology
RSS Join Engine
Maroun Baydoun inf1312, OGL
Marwan Azzam inf1311, OGL
Thursday, 14 May, 2010
Contents
1 Introduction
2 Related Work
2.1 Joining RSS Feeds
2.2 Comparing Text Documents
2.2.1 Document Index Graph Model
2.3 Relating RSS Items (News)
2.4 Websites and Applications
3 Hypothesis
4 Architecture
5 Pseudo-code
6 Development
7 Implementation
8 Simulation
9 Consideration
9.1 Advantages
9.2 Disadvantages
9.3 Possible improvements
9.4 Other applications
10 Conclusion
1 Introduction
RSS (most commonly expanded as "Really Simple Syndication") is a family of web feed formats used
to publish frequently updated works, such as blog entries, news headlines, audio, and video, in a standardized
format.
Internet users often subscribe to many RSS feeds to stay up to date on the latest news. However, this
information is spread out across many sources, which makes it difficult for users to keep track of the most
important headlines. Therefore, it is essential to come up with a tool to bring together news from different
sources and present the user with a single feed containing the intersection of all other feeds.
2 Related Work
2.1 Joining RSS Feeds
To intersect different feeds, we start by defining semantic relatedness between RSS elements/items in order
to determine semantic relations regarding the meaning of terms instead of just their syntactic properties.
XML documents can be compared according to:
• Their structure (structure-based similarity).
• Their content (content-based similarity).
• A combination of both (hybrid similarity).
RSS news items can be related in several ways:
• Inclusion: a news item can be completely included in another news item.
• Intersection: two news items might refer to similar concepts. We say the two news items intersect.
• Opposition: two news items might refer to the same topic but in opposite ways.
2.2 Comparing Text Documents
2.2.1 Document Index Graph Model
In order to relate RSS feeds, it is helpful to use a document clustering technique that segregates
documents into groups so that each group represents a topic different from those represented by the other
groups.
Any clustering technique relies on four concepts: a data representation model, a similarity measure, a cluster
model, and a clustering algorithm that builds the clusters based on the similarity measure and the data model.
Traditional document clustering techniques rely on single-term analysis. They use the Vector Space Model.
The Vector Space Model represents a document as a vector of terms. The vector contains the weights of
the terms (for example, the frequency of each term). Similarity between documents can be measured by applying
a similarity measure (such as cosine similarity) to the vectors. This model uses single-term analysis only (no
word proximity or phrase-based analysis).
Even though this approach is widely used, it proves insufficient because it leaves out phrase analysis. Therefore,
a better method would be to combine single-term and phrase analysis.
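As an illustration of the single-term approach, the following is a minimal Java sketch of cosine similarity over raw term-frequency vectors. The class and method names are hypothetical, and raw counts are only one possible weighting. The two sample documents contain the same words in a different order, so a single-term measure cannot tell them apart, which is the limitation described above.

```java
import java.util.HashMap;
import java.util.Map;

final class VectorSpaceSketch {

    // Builds a term-frequency vector: each distinct lower-cased word mapped to its count.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine of the angle between two term-frequency vectors (0 when either vector is empty).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    // Euclidean length of a term-frequency vector.
    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int w : v.values()) {
            sum += (double) w * w;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Same bag of words, different meaning: word order is invisible to this model.
        Map<String, Integer> d1 = termFrequencies("the dog bit the man");
        Map<String, Integer> d2 = termFrequencies("the man bit the dog");
        System.out.println(cosine(d1, d2));
    }
}
```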
This brings us to a different model, the Document Index Graph (DIG).
The DIG is based on graph theory. It uses graph properties to match a phrase of any length from a new document
against any number of previously seen documents. The time taken by this process is proportional to
the number of words in the new document.
2.3 Relating RSS Items (News)
Each node or element of an RSS tree is a pair e = ⟨η, ζ⟩, where e.η is the element name and e.ζ its
content.
The concept of neighborhood
Neighborhood is used to identify the relationships between texts and is therefore also applied to RSS elements.
Neighborhood can be classified as follows:
• Semantic Neighborhood: The semantic neighborhood of a concept Ci is defined as the set of concepts
Cj in a given knowledge base KB that are related to Ci via the hyponymy ( ≺ ) or meronymy ( ≺≺ ) semantic
relations, directly or via transitivity.
• Global Semantic Neighborhood: The global semantic neighborhood of a concept Ci is the union of its
semantic neighborhoods with respect to the synonymy ( ≡ ), hyponymy ( ≺ ) and meronymy ( ≺≺ ) relations
taken together.
• Antonym Neighborhood: The antonym neighborhood of a concept Ci is defined as the set of concepts
Cj in a given knowledge base KB that are related to Ci via the antonymy relation ( ω ), directly or transitively
via synonymy ( ≡ ), hyponymy ( ≺ ) or hypernymy ( ≻ ).
Relations and relatedness of RSS elements:
For two simple elements e1 and e2, the Element Relatedness (ER) algorithm returns a pair made up of the
semantic relatedness value SemRel and the relation Relation, based on the corresponding TR values for the labels and the contents.
Relation is determined by a rule-based method that combines the label and value relationships as follows:
• Elements e1 and e2 are disjoint if either their labels or values are disjoint.
• Element e1 includes e2, if e1.η includes e2.η and e1.ζ includes e2.ζ.
• Two elements e1 and e2 intersect if either their labels or values intersect.
• Two elements e1 and e2 are equal if both their labels and values are equal.
• Two elements e1 and e2 are opposite if both their contents are opposite.
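The element-level rules above can be sketched as a small Java function. This is only an illustration under stated assumptions: the names are hypothetical, the order in which the rules are checked is not given in the text and is chosen here, an equal label or content is treated as a special case of inclusion, and "opposite" is read as the content relation being opposite.

```java
final class ElementRelationSketch {

    enum Relation { DISJOINT, INCLUDES, INTERSECT, EQUAL, OPPOSITE }

    // Combines the relation found between two labels with the relation found
    // between two contents into a single element-level relation.
    // The precedence of the checks is an assumption, not taken from the text.
    static Relation combine(Relation label, Relation content) {
        // Disjoint if either the labels or the values are disjoint.
        if (label == Relation.DISJOINT || content == Relation.DISJOINT) {
            return Relation.DISJOINT;
        }
        // Opposite if the contents are opposite (assumed reading of the rule).
        if (content == Relation.OPPOSITE) {
            return Relation.OPPOSITE;
        }
        // Equal only if both the labels and the values are equal.
        if (label == Relation.EQUAL && content == Relation.EQUAL) {
            return Relation.EQUAL;
        }
        // Inclusion requires both the label and the value to be included
        // (equality is treated as a special case of inclusion here).
        if (includesOrEqual(label) && includesOrEqual(content)) {
            return Relation.INCLUDES;
        }
        // Everything else is treated as an intersection.
        return Relation.INTERSECT;
    }

    private static boolean includesOrEqual(Relation r) {
        return r == Relation.INCLUDES || r == Relation.EQUAL;
    }

    public static void main(String[] args) {
        System.out.println(combine(Relation.EQUAL, Relation.INCLUDES));
        System.out.println(combine(Relation.EQUAL, Relation.INTERSECT));
    }
}
```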
For two RSS items I1 and I2, each containing a group of elements, the Item Relatedness (IR) Algorithm
returns a pair containing SemRel and Relation.
By combining relations between sub elements, the relation between two items I1 and I2 is identified using
the following rule-based method:
• Items I1 and I2 are disjoint if all elements ei of I1 and ej of I2 are disjoint (elements are disjoint if there is no
relatedness whatsoever between them, i.e., SemRel(I1, I2) = 0).
• Item I1 includes I2 if all elements ei of I1 include all elements ej of I2.
• Two items I1 and I2 intersect if at least two of their elements intersect.
• Two items I1 and I2 are equal if all elements ei of I1 are equal to all elements ej of I2.
• Two items I1 and I2 are opposite if at least two of their respective elements are opposite.
2.4 Websites and Applications
There are many websites and applications that provide services related to RSS feed aggregation, but none
of those solutions implements an RSS join engine based on semantics. They simply enable users to merge
many news feeds into a single feed.
These tools are:
• xFruits (http://www.xfruits.com)
• Flock (http://flock.sourceforge.net/index.html)
• RSSOwl (http://www.rssowl.org)
• BlogBridge (http://www.blogbridge.com)
• Yahoo Pipes (http://pipes.yahoo.com)
• Feedzeo (http://feedzeo.sourceforge.net)
3 Hypothesis
Based on the work presented in the previous sections, the simplest method would be to consider two phrases similar
if they have a predefined number of words in common. However, this method has substantial weaknesses
because it neglects the semantics of the phrases: sentences that are written differently but convey the same meaning
will be deemed dissimilar.
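The weakness of the word-counting baseline can be seen in a few lines of Java. The sketch below uses hypothetical names and sample sentences, and simply counts the distinct words two sentences share; two headlines that report the same event in different words share nothing, while two headlines about different cities share most of their words.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class WordOverlapSketch {

    // Distinct lower-cased words of a sentence.
    static Set<String> words(String sentence) {
        Set<String> set = new HashSet<>(Arrays.asList(sentence.toLowerCase().split("\\W+")));
        set.remove(""); // split() yields an empty token when the text starts with punctuation
        return set;
    }

    // Number of distinct words the two sentences have in common.
    static int commonWords(String a, String b) {
        Set<String> shared = words(a);
        shared.retainAll(words(b));
        return shared.size();
    }

    // Baseline rule: similar when at least `threshold` words are shared.
    static boolean similar(String a, String b, int threshold) {
        return commonWords(a, b) >= threshold;
    }

    public static void main(String[] args) {
        // Same meaning, no shared words.
        System.out.println(commonWords(
                "Stocks fell sharply on Monday",
                "Shares dropped steeply at the start of the week"));
        // Different events, mostly shared words.
        System.out.println(commonWords(
                "The president visited Paris",
                "The president visited Berlin"));
    }
}
```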
Thus, the proposed solution consists of implementing a phrase-based document similarity algorithm based
on an index graph model to create an RSS join engine. This engine takes up to five RSS feeds as input, places a
window on each feed, and then runs the similarity algorithm in order to intersect the feeds.
4 Architecture
1. Feeds: The user can enter up to five RSS feeds.
2. Parser: The application relies on the ROME RSS feed parser, which accepts the URL of the feed as a
parameter and returns the list of its items.
3. Windows: On every list of items, we place a window that contains the five most recent items.
4. Join Engine: The join engine applies the phrase-based document similarity algorithm to the items
contained in the windows.
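The window step (item 3) can be sketched in plain Java as follows. The Item class is a hypothetical stand-in for a parsed feed entry rather than a ROME type, and the window size is passed as a parameter so the sketch does not depend on a particular value.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FeedWindowSketch {

    // Hypothetical stand-in for a parsed feed item; not a ROME type.
    static final class Item {
        final String title;
        final Instant published;

        Item(String title, Instant published) {
            this.title = title;
            this.published = published;
        }
    }

    // Returns the `size` most recent items, newest first, without modifying the input list.
    static List<Item> window(List<Item> items, int size) {
        List<Item> sorted = new ArrayList<>(items);
        sorted.sort((a, b) -> b.published.compareTo(a.published));
        return new ArrayList<>(sorted.subList(0, Math.min(size, sorted.size())));
    }

    public static void main(String[] args) {
        List<Item> items = Arrays.asList(
                new Item("oldest", Instant.parse("2010-05-01T08:00:00Z")),
                new Item("newest", Instant.parse("2010-05-03T08:00:00Z")),
                new Item("middle", Instant.parse("2010-05-02T08:00:00Z")));
        for (Item item : window(items, 2)) {
            System.out.println(item.title);
        }
    }
}
```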
5 Pseudo-code
CREATE GRAPH:
    FOR EACH feed
        Read feed
        Parse feed
        Sort feed items by publish date
        Create window
        Include the five most recent items in the window
        CALL BUILD GRAPH
    END FOR

BUILD GRAPH:
    Create document
    Fill document with feed items
    FOR EACH sentence in document
        IF first-word of sentence NOT in graph
            Add first-word of sentence to graph
        END IF
        Create list
        FOR EACH remaining word in sentence
            IF (previous-word, word) IS edge in graph
                Extend phrase matches in the list for sentences that continue along (previous-word, word)
                Add the new phrase matches to the list
            ELSE
                Add edge (previous-word, word) to graph
            END IF
            Update sentence path in nodes previous-word and word
        END FOR
    END FOR
6 Development
The engine is a Java web application developed with NetBeans IDE 6.8, using Java EE 6 and GlassFish v3
as the application server. The JavaServer Faces framework is adopted to simplify development.
The application uses several external open source libraries that are not part of the standard Java distribution:
1. ROME, JDOM: for RSS feed parsing.
2. OpenNLP: for natural language processing.
3. JGraphT: for graph manipulation.
7 Implementation
Every RSS feed is parsed with the ROME and JDOM libraries: the URL of the feed is given as input
and a SyndFeed is returned as output. The SyndFeed type represents any kind of feed (RSS,
Atom). A list of items is then obtained from the returned SyndFeed. Those items are of type
SyndEntry. Next, a window is associated with every feed in order to contain the fifteen most recent entries
from the generated list.
At the end of this step, there is a maximum of five windows, each containing fifteen entries. OpenNLP
is then used to split each entry into sentences, which are stored in an array of String. The StringTokenizer class is
then used to divide each sentence into an array of words.
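The two splitting steps can be sketched with the Java standard library alone. Note the assumption: java.text.BreakIterator stands in here for the OpenNLP sentence detector (which requires a trained model file), so the sentence boundaries may differ from those of the actual application. The word step uses StringTokenizer as described above; the delimiter set and the lower-casing are illustrative choices, and the class name is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

final class TokenizeSketch {

    // Splits an entry into sentences. BreakIterator is a stand-in for OpenNLP here.
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    // Splits a sentence into lower-cased words, dropping whitespace and common punctuation.
    static List<String> words(String sentence) {
        List<String> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(sentence, " \t\n\r\f.,;:!?\"()");
        while (tokenizer.hasMoreTokens()) {
            out.add(tokenizer.nextToken().toLowerCase(Locale.ENGLISH));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String sentence : sentences("Markets fell today. Oil prices rose.")) {
            System.out.println(words(sentence));
        }
    }
}
```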
After that, JGraphT is used to create an empty directed graph, which constitutes the
basis of the Document Index Graph model. The graph is built as follows:
• For every word, we check if it already exists in the graph; if not, we add it.
• For every two consecutive words, an edge is created in the graph.
Phrase matching and graph building take place simultaneously. Phrase matching proceeds as follows:
• If an edge already exists between two consecutive words, the path in the graph extending that edge is
followed until the last existing edge is reached; a matching phrase is detected and added to the list of
already matched phrases.
• At the end of processing, some of the matching phrases must be eliminated because they do not hold any
semantic value.
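The combined graph building and phrase matching can be sketched as follows. This is a simplified illustration under several assumptions, not the application's code: it uses plain Java maps instead of JGraphT, keeps word nodes implicit by storing only edges keyed by the word pair, records for each edge where it was traversed, and reports shared phrases of two or more words without filtering out phrases that carry no semantic value. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IndexGraphSketch {

    // One traversal of an edge: the document and sentence that used it, and the word position.
    static final class Occurrence {
        final int doc, sentence, position;

        Occurrence(int doc, int sentence, int position) {
            this.doc = doc;
            this.sentence = sentence;
            this.position = position;
        }
    }

    // Edge "w1 w2" mapped to every place where w2 directly followed w1.
    private final Map<String, List<Occurrence>> edges = new HashMap<>();

    // Adds one sentence to the graph and returns the phrases it shares with
    // sentences of previously added documents. A phrase matched in several
    // earlier sentences is reported once per sentence.
    List<String> addSentence(int doc, int sentence, List<String> words) {
        List<String> matches = new ArrayList<>();
        // Open matches: "doc:sentence:position" of the last matched edge in an earlier
        // sentence, mapped to the index where the shared phrase starts in `words`.
        Map<String, Integer> active = new HashMap<>();
        for (int i = 1; i < words.size(); i++) {
            String edge = words.get(i - 1) + " " + words.get(i);
            List<Occurrence> seen = edges.computeIfAbsent(edge, k -> new ArrayList<>());
            Map<String, Integer> next = new HashMap<>();
            for (Occurrence o : seen) {
                if (o.doc == doc) {
                    continue; // only match against other documents
                }
                // Extend a match that ended on the preceding edge, or start a new one.
                Integer start = active.remove(o.doc + ":" + o.sentence + ":" + (o.position - 1));
                next.put(o.doc + ":" + o.sentence + ":" + o.position, start != null ? start : i - 1);
            }
            // Matches that could not be extended are complete: report them.
            for (int start : active.values()) {
                matches.add(String.join(" ", words.subList(start, i)));
            }
            active = next;
            seen.add(new Occurrence(doc, sentence, i)); // update the sentence path
        }
        for (int start : active.values()) {
            matches.add(String.join(" ", words.subList(start, words.size())));
        }
        return matches;
    }
}
```

For example, after adding the sentence "river rafting is fun" for one document, adding "wild river rafting trips" for a second document follows the existing edge between "river" and "rafting" and returns the shared phrase "river rafting".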
The remaining sentences should be evaluated in order to assess the degree of similarity between RSS entries.
This evaluation concerns:
• The length of the sentence.
• The weight of the sentence in the entry.
However, phrase-based matching alone can be insufficient. A better approach is to also incorporate
single-term similarity. Once the inter-entry similarity measures are established, a new RSS feed is created using
ROME to contain the matched entries. This feed is returned as the join result.
8 Simulation
To test the application, the following was done:
1. Launch it.
2. Fill the text fields with RSS feed URLs.
3. Inspect the results.
One example is the following:
Other tests were carried out, and the results are illustrated in the table below:
These results lead to the following observations:
1. Given that the tests are carried out online, there is a high probability that the feeds are constantly
changing. Therefore, each evaluation can take place on different inputs.
2. The number of matched entries is not directly linked to the number of feeds entered.
9 Consideration
9.1 Advantages
1. Ease of use: simple user interface.
2. High performance: low processing time.
3. Efficiency: minimal use of bandwidth.
9.2 Disadvantages
1. Inputs are limited to five feeds.
2. Not the optimal technique (though it incurs less overhead).
9.3 Possible improvements
1. Increase the maximum number of inputs without sacrificing performance.
2. Improve this technique to include semantic similarity so that the rate of matched entries increases.
3. Create different versions of the application for different platforms, such as mobile phones and desktop applications.
9.4 Other applications
This technique can also be useful in other fields of application. It can be applied to match inputs other than
RSS feeds.
In general, it can be used to match any text content, such as speeches or research papers.
10 Conclusion
In this paper, we described a technique for creating an RSS join engine. We discussed how RSS feeds can
be joined. Afterwards, we examined how text documents can be compared, and focused on the Document
Index Graph model in the context of phrase-based document similarity. We then enumerated the ways in which
RSS items can be related, before looking into previously developed websites and applications that attempted
to solve the problem of joining RSS feeds and finding that none of the preexisting solutions is well
adapted to this task.
By suggesting a technique based on the Document Index Graph model, we were able to take advantage of the efficiency
and the ease of implementation of that model. This technique can be improved further, and
can even be applied to fields other than RSS feeds.