
Discarding Similar Data with Autonomic Data Killing Framework based on High-Level Petri Net Rules: an RSS Implementation

Wallace Pinheiro1,3, Marcelino Campos Oliveira Silva1,4, Thiago Rodrigues1, Geraldo Xexéo 1,2, Jano Souza 1,2 1 COPPE/UFRJ, Federal University of Rio de Janeiro

2 DCC-IM, Dept. of Computer Science, Institute of Mathematics 3 IME, Military Engineering Institute

4Chemtech Serviços de Engenharia e Software Rio de Janeiro, Brazil

{awallace, marcelino, thiagosr, xexeo, jano}@cos.ufrj.br

Abstract-This paper describes the evolution of the autonomic Data Killing framework, which was proposed to eliminate undesirable data. The focus now is on discarding similar data. To this end, we propose a modelling method that uses active rules applied through High-level Petri nets. Our method clusters news items into groups by their level of similarity, selects the newest item of each group, and discards the rest. An experiment was carried out to show that the method is viable.

Keywords: Data Killing, High-level Petri nets, Clustering, News Filtering.

I. INTRODUCTION

With technological progress, and especially with the rise of the Internet, people have had to deal with the problem of how to manage data overload. This situation is a result of the huge volume of information1 that is presently available.

RSS feeds are an example that suffers from this ailment. They allow users to receive information from different sources without directly accessing news sites. These users often receive a large number of news items published on a daily basis, while having few resources to filter and classify them; those resources, albeit scarce, are supplied only by proprietary tools.

To solve this information overload problem, we proposed the Autonomic Data Killing framework, which discards irrelevant data through Data Killing patterns based on Petri net models.

The rule and inference modelling functions were created with CPN Tools, which allows the modelling of High-level Petri nets (HLPN). These nets were used in the creation of the Event-Condition-Action (ECA) rules.

In this paper, we present an implementation of our proposal: a Firefox RSS aggregator plugin called Feed Organizer. We added the functionality of eliminating similar news to help reduce the information overload issue.

1 The reader should be aware of the subtle difference between data and information. In this work, we consider data as representation (notation) and information as meaning (denotation), and use these terms interchangeably.

This way, if two or more news items are considered similar, according to the levels chosen by each user, only the most recent will be shown.

To validate this research, an experiment was carried out that compared the opinions of users with the news screened by the tool; the results obtained were extremely satisfactory.

This paper is organized as follows: Section II presents the background to this work, Section III details the method used in the framework, Section IV shows the application architecture, Section V explains the experiment carried out, and Section VI presents the conclusion and future work.

II. LITERATURE REVIEW

Data can be obsolete, inconsistent, similar, irrelevant, or prohibited. Such data can hinder the execution of a system or lead it astray; therefore, it should be removed. The elimination of this undesirable data motivated the creation of the Autonomic Data Killing Framework, which is composed of many patterns and operators, each operator addressing a type of data [1].

In [2], the Allowed Data Operator and the Discard Prohibited Data Operator were tested. These operators allow the user to select allowed data or to discard data considered prohibited, based on keywords and a similarity level. We have now carried out an experiment involving the Discarding Similar Data Operator.

According to [3], one way of obtaining relevant information from a large quantity of data is clustering, an unsupervised classification in which groups, or clusters, are formed. It is central to the branch of data analysis called cluster analysis, which involves the organization of patterns (usually represented as vectors of attributes or points in a multi-dimensional space) into clusters according to some similarity metric. Intuitively, patterns belonging to a cluster should be more "similar" to each other than to patterns belonging to other clusters. This characteristic is what allows similar data to be discarded. In this article, we use the Jaccard coefficient as a first similarity measure, although our framework supports other measures, such as canonical sets. To implement this


approach, we compare the news items word by word, after stemming and removing stopwords.
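To make this concrete, the sketch below shows a word-level Jaccard comparison of the kind described above. It is a minimal illustration, not the Feed Organizer code: the stopword list and the naive suffix-stripping stemmer are simplified stand-ins for the preprocessing the text mentions.

    STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "on", "for"}

    def stem(word):
        # Naive stemmer: strips a few common English suffixes.
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def tokenize(text):
        # Lowercase, strip punctuation, drop stopwords, stem the rest.
        words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
        return {stem(w) for w in words if w and w not in STOPWORDS}

    def jaccard(text_a, text_b):
        # Jaccard coefficient: |intersection| / |union| of the word sets.
        a, b = tokenize(text_a), tokenize(text_b)
        return len(a & b) / len(a | b) if a or b else 1.0

A real implementation would substitute a proper stemmer (e.g. Porter's) and a full stopword list; only the set arithmetic of the Jaccard coefficient matters here.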

It is important to consider that our framework deals with many different patterns and rules. In this context, clustering and similarity measures are just ways to identify the less relevant data, while the Petri nets formalize this behaviour.

A similar work can be found in [4], where a clustering method is used to identify correlated web pages. However, the main focus is different: we intend to formalize a method to apply clustering to discard news.

III. METHOD DESCRIPTION

CPN Tools is used as the autonomic element; it allows the modelling of High-level Petri nets [5][6]. These nets enable the formalization of autonomic computing processes through Event-Condition-Action (ECA) rules. The rules are triggered as a result of events and of data contained in the repositories to be modelled. This way, autonomic systems can enjoy the main advantage of the use of active rules, which is the monitoring of the environment with the least human intervention. Additionally, with the use of Petri nets it is possible to detect infinite loops in the creation of rules [7], a common problem in rule-modelling tools. Besides, it is possible to obtain well-defined semantics that formalize the modelled elements in a non-ambiguous way. The steps below should be followed in the construction of a process pattern with these elements:

1) Create the data flow, taking as a basis the control flow and the repositories involved in the process. The interconnection between the flows should occur through common transitions. The data structures can be modelled as list and record structures, where data occurrences are represented by coloured tokens. It is proposed that only the data attributes needed for identification in the real base are imported. The following meanings are associated with the Petri net elements in the data flow:

• Place – corresponds to a temporary or persistent data storage;

• Transition – has a condition that allows the data to flow between repositories;

• Entry Arc – represents the connection between a place and a transition. Specifies the contents of the entry tokens through variables, pursuant to the functional programming paradigm;

• Exit Arc – represents the connection between a transition and a place. Specifies the contents of the exit tokens through variables (ECA rules), pursuant to the functional programming paradigm.

• Token – a token belongs to a set Token = {token_1, token_2, ..., token_n}, where each token_i, 1 ≤ i ≤ n, represents the occurrence of a data item.
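As a concrete, purely illustrative reading of these elements, the sketch below models a token as a record carrying only the identifying attributes of a news item, and a place as a collection of such tokens; the field names are our assumptions, not the paper's actual schema.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class NewsToken:
        news_id: str    # identifier of the occurrence in the real base
        title: str      # attribute used for similarity comparison
        pub_date: str   # ISO publication date, e.g. "2010-03-07"

    # A place acts as temporary or persistent storage for tokens.
    internal_repository = [
        NewsToken("n1", "Markets rally after Fed decision", "2010-03-07"),
        NewsToken("n2", "Fed decision sparks market rally", "2010-03-08"),
    ]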

Figure 1 gives an example of the modelling proposal for a basic Event-Condition-Action (ECA) control flow structure with two linked rules (DETECT SIMILAR DATA and DISCARD SIMILAR DATA).

Figure 1 - Modelling of the Flow of Events/If-Then Actions

Figure 2 gives an example of the modelling proposal for a data flow and control structure. The DETECT SIMILAR DATA transition only checks for the existence of similar data items in the internal repository, and a reading arc is, therefore, used to connect these elements. The DISCARD SIMILAR DATA transition effectively eliminates the data: two arcs are used, one to obtain all data items from the Internal Repository and another to return the data items not considered similar to the same repository. Additionally, another arc takes the similar data items to the Trash repository.

Figure 2 - Data Flow and Control Modelling

2) Create a place to store the parameters to be considered in data processing. These parameters are only read, through read arcs that connect the created place to all the transitions. The set of parameters for each pattern can vary as a result of its characteristics and of the needs of the users. This gives the patterns some flexibility, as they can be configured for different application scenarios. Parameter examples are: similarity strategy (Jaccard, Dice, keywords, canonical sets, etc.), attributes to be compared (name, Individual Taxpayer Registration Number, etc.), language of the data to be processed (English, Spanish, etc.), and data processing type (stemming, stopwords, etc.). A hypothetical parameter set is sketched below.
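    # Illustrative content of the parameters place (all keys and values
    # are assumptions of ours, mirroring the examples listed above).
    parameters = {
        "similarity_strategy": "jaccard",   # or "dice", "keywords", "canonical_sets"
        "similarity_level": 0.20,           # 20% elimination threshold
        "attributes": ["title"],            # attributes to be compared
        "language": "english",
        "processing": ["stemming", "stopwords"],
    }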

3) Create the functions of the arcs that depend on the parameters. Figure 3 shows two created functions: putSimilarData() and putSub(). The first function selects similar data items, considering the data items existing in the Internal Repository and the Parameters. The second function gathers the data items that are not similar in comparison to those obtained by the first function.
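In CPN Tools these arc functions are written in CPN ML; the Python sketch below, reusing the jaccard and NewsToken sketches above, only illustrates the behaviour we attribute to the two functions and is not the actual inscription code.

    def put_similar_data(repository, parameters):
        # Select the discard candidates: any item for which a more recent
        # item exists whose similarity meets the configured level.
        level = parameters["similarity_level"]
        items = sorted(repository, key=lambda t: t.pub_date)  # oldest first
        similar = set()
        for i, older in enumerate(items):
            for newer in items[i + 1:]:
                if jaccard(older.title, newer.title) >= level:
                    similar.add(older)   # the most recent of the pair survives
                    break
        return similar

    def put_sub(repository, similar):
        # Gather the items that are NOT in the similar set (they stay in
        # the Internal Repository; the similar set goes to Trash).
        return [t for t in repository if t not in similar]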


Figure 3 - Process to Discard Similar Data

Once the process has been modelled, it starts to receive events and data from the systems, monitoring these environments and executing actions according to its behavioural rules. This allows the rules to act in an autonomic manner, reducing the work overload that people are subjected to.

The process modelled in this section will be used in the autonomic data architecture for the disposal of similar news items. It is important to consider that the proposed framework supports many other processes, which can be modelled using the technique shown in this study.

IV. IMPLEMENTATION ARCHITECTURE

The proposed framework allows current applications to be extended with autonomic characteristics. It requires the combination of many different components to provide the desired functionalities. Figure 4 presents the updated architecture with its components.

The orange boxes represent the components already developed; the blue ones represent what will be implemented in future versions, which makes this architecture extensible; and the green boxes correspond to supporting tools that have already been implemented in other works and can currently be downloaded from the Web. The components presented in this architecture are discussed in the context of each layer, as follows:

• Model, Pattern and Rule Layer: responsible for storing, retrieving and updating models, patterns and rules. This layer contains the Data Killing patterns and the corresponding composite event detection models developed in HLPN, and it also allows the development of other patterns, models and data manipulation patterns [8].

• Inference and Processing Layer: responsible for the creation, maintenance and processing of models and patterns. It allows specialists to model the behaviour of the data manipulators. At the same time, this layer allows these models to be processed through tools that can interpret or process models and patterns.

Figure 4 – Implementation architecture


• Communication Layer: responsible for managing the communication between the other modules, providing a higher-level interface.

Currently, the SECONDO database and the CPN Tools Application Program Interface execute three functions: data reading, event reading and action generation. The two readers are responsible for receiving the data and events, respectively, coming from the interfaces created over the databases, applications and agents; they send the data and events to the Inference and Processing Layer. Action generation is responsible for sending the decisions obtained from the Inference and Processing Layer to the Interface Layer.

• Interface Layer: responsible for managing event and data entries, converting them into calls to the other framework components. Web application servers give the necessary support to the developed modules; a good example is the FeedOrganizer.war servlet module. This module discards undesirable news coming from RSS feeds, according to the options defined by the users of the Feed Organizer tool. In addition, this layer anticipates the development of other modules in future work.

• Application Layer: composed of the components that use the framework. Currently, it consists of the Feed Organizer Firefox2 plug-in (a component of the Feed Organizer tool); in the future it will include interfaces that allow the development of DBMSs, agents and autonomic applications. The modules identified in the architecture as future work need teams of specialists from different areas, depending on the job to be executed.

The relations between the different layers can be seen as the processes presented in Figure 5.

Figure 5 - Implementation architecture

Continuous lines indicate data and event flows. The dashed line indicates that the EXECUTE process can influence the MONITOR process. This happens, for instance, when data modified by the EXECUTE process fires an event that is being monitored.

V. DISCARD SIMILAR DATA EXPERIMENT

In our work, we deal with Web news (streams of news), not scientific articles. In this context, the publication date is very important, because the most recent news item reports the most recent facts and may sometimes summarize the previous ones. In our case, the length of a news item is not as important as its date. However, our framework makes it possible to use different criteria to discard news, such as length, similarity between the news and a set of keywords, etc. The user defines the similarity level, and this parameter is used to identify redundant news.

2 www.mozilla.com/firefox/

The algorithm of the Discarding Similar Data (DS) pattern compares the news items, one by one, using the Jaccard measure of similarity. If two news items are considered similar at the chosen level of similarity3, the older news item is eliminated. This can be seen as a clustering of the news items, based on the Jaccard measure, at a certain level of similarity, where only the most recent news item from each cluster remains. As the level of similarity is raised, the number of clusters goes up. For example, at a 0% level of similarity we have only one cluster, as all the news items are at least 0% similar; at a 100% level of similarity, if no identical news items are found, we have as many clusters as news items.
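The following sketch, under the same assumptions as the earlier ones, shows this clustering reading of the DS pattern: items whose pairwise Jaccard similarity reaches the chosen level end up (transitively) in the same cluster, and only the newest item of each cluster is kept.

    def discard_similar(items, level):
        # Union-find over the items: merge any pair at or above the level.
        parent = list(range(len(items)))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]   # path compression
                i = parent[i]
            return i
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                if jaccard(items[i].title, items[j].title) >= level:
                    parent[find(i)] = find(j)
        # Keep only the most recent item of each resulting cluster.
        newest = {}
        for i, item in enumerate(items):
            root = find(i)
            if root not in newest or item.pub_date > newest[root].pub_date:
                newest[root] = item
        return list(newest.values())

At level 0.0 every pair merges, giving a single cluster and a single surviving item, exactly as described in the text.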

To assess the proposed tool and the results obtained with its use, an experiment was carried out with a group of 18 undergraduate students at the Federal University of Rio de Janeiro.

In this experiment, 90 news items were randomly selected from three media channels (New York Times, Herald Tribune, and Washington Post), classified into three different domains: Business, Sports and General US News.

To assess the DS operator, the experiment was split into two stages. We initially asked the participants to determine, for each news item, the other news items they considered similar in each domain. With this, we obtained the result expected by the users. The second stage consisted of running news filters using the DS pattern at the following levels of similarity (in %): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40 and 50, recording the set of news items returned for each level. We noticed that between 1% and 10% the number of returned news items varied greatly, so we paid special attention to this range. This stage determined the result returned by the tool. To visualize the comparison, we clustered the information supplied by the users using similarity matrices and compared it with the results obtained by the tool.
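The second stage can be pictured as a simple sweep over the chosen levels, reusing the discard_similar sketch above (again, an illustration rather than the experiment's actual code):

    LEVELS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50]  # in %

    def sweep(news_items):
        # Record the set of returned news items for each similarity level.
        return {lvl: discard_similar(news_items, lvl / 100.0) for lvl in LEVELS}

    for lvl, kept in sweep(internal_repository).items():
        print(f"level {lvl:>2}%: {len(kept)} news items returned")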

Figure 6 shows the curves for the percentage of discarded news items, that is, the percentage of news items considered similar at each level of similarity.

3 The level of similarity is the minimum similarity between texts for the chosen value, i.e., the elimination threshold for similar texts. Larger values are also captured: for example, when the level of similarity is 5%, text that is 10% similar is also covered.


Figure 6 – Characteristic DS curve

These curves are characteristic of the DS pattern and are in line with the expected results: as the level of similarity goes up, the number of discarded news items goes down, since the comparison criterion between news items becomes stricter.

Our algorithm retains only the most recent news item in each cluster; as a result, lower similarity levels yield fewer returned news items and more discarded ones. This can be seen in the graph of Figure 6.

To compare the result expected by the users with the result returned by the tool, we built three similarity matrices (30 x 30), corresponding to each of the three domains. These matrices were duly normalized, and their weights correspond to the number of times the students considered a pair of news items similar.

To determine the sets of similar news items (clusters) according to the opinions of the users, we chose the nearest neighbour clustering method [9]. We applied this algorithm to each of the domains, and the results were converted into dendrograms, as shown below. Good results are marked by the news items returned by the tool (non-similar news) occupying different clusters (high UGC), with few clusters receiving more than one returned item (low WA).
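As an illustration of this step, the sketch below derives flat clusters from a normalized user similarity matrix with single-linkage (nearest neighbour) agglomeration; SciPy's hierarchical clustering stands in for the algorithm of [9], and the cut threshold is an assumption.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    def user_clusters(similarity, cut):
        # Convert the normalized similarity matrix into distances and
        # cut the single-linkage dendrogram at the given distance.
        distance = 1.0 - similarity
        np.fill_diagonal(distance, 0.0)
        condensed = squareform(distance, checks=False)
        tree = linkage(condensed, method="single")
        return fcluster(tree, t=cut, criterion="distance")  # one label per item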

To determine the best combination between the number of clusters and the level of similarity, we created two support measures, shown in Table 1.

Table 1 – Support Measures

UGC = RCLUSTERS / CLUSTERS
WA = 1 - (RCLUSTERS / NRN)

Where:
• UGC: percentage use of generated clusters.
• RCLUSTERS: number of generated clusters with at least one returned news item.
• CLUSTERS: total number of generated clusters.
• WA: wrong allocation percentage, i.e., percentage of clusters that have more than one news item.
• NRN: number of returned news items for a given similarity level.

Given the measures shown in Table 1, it is clear that we seek high UGC values and low WA values. Thus, the optimal point is the one where (UGC - WA) is maximal. This optimal point should be calculated for each similarity measure considered. The dendrogram depicts the application of the given formulas to the clustering algorithm based on a similarity matrix. This similarity matrix captures the relations among news items, built from the users' opinions.
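In code, the two measures and the optimal-point search read directly off Table 1; here labels holds the cluster label of each news item (from the users' dendrogram) and returned flags the items the tool kept, both assumptions of this sketch.

    def ugc_wa(labels, returned):
        # UGC = RCLUSTERS / CLUSTERS; WA = 1 - (RCLUSTERS / NRN).
        clusters = set(labels)
        r_clusters = {l for l, kept in zip(labels, returned) if kept}
        nrn = sum(returned)
        ugc = len(r_clusters) / len(clusters)
        wa = 1.0 - len(r_clusters) / nrn if nrn else 0.0
        return ugc, wa

    def optimal_level(results):
        # results maps a similarity level to its (labels, returned) pair;
        # the best level maximizes UGC - WA.
        def score(lvl):
            ugc, wa = ugc_wa(*results[lvl])
            return ugc - wa
        return max(results, key=score)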

As an example, Figure 7 displays the dendrogram for the Business domain.

Figure 7 - Business domain dendrogram, with a vertical line indicating the best cluster division for the 20% level of similarity.

The darker vertical line shows the optimal division point for the clusters (23 clusters) at this level of similarity. Bold numbers represent the news items returned by the tool for a 20% level of similarity, 23 news items in total. In this case, the UGC was 96%, as one cluster received no returned item, and the WA was 4%, as one cluster had an undue allocation (more than one news item in the same cluster).

Figure 8 shows the WA and UGC results for 1% to 30% levels of similarity in the Business domain. Analysing the graph, it is possible to see that as the level of similarity increases the results improve, that is, they get closer to the result indicated by the participants.

Figure 8 - Business domain WA and UGC results (UGC and WA percentages plotted against similarity level).


For a 30% level of similarity, for example, the UGC was 100% and the WA was 12%, showing that user opinion and the results obtained by the tool converge.

Figure 9 shows the WA and UGC results for 1% to 30% levels of similarity in the Sports domain. As in the Business domain, it is possible to see that as the level of similarity goes up the results get better. For a 30% level of similarity, for example, the UGC was 93% and the WA was 0%, showing optimal results.

Figure 9 - Sports domain WA and UGC results (UGC and WA percentages plotted against similarity level).

Figure 10 shows the WA and UGC results for 1% to 30% levels of similarity in the General US News domain.

Figure 10 - General US News domain WA and UGC results (UGC and WA percentages plotted against similarity level).

As in the other graphs, one can see that, in general terms, as the level of similarity goes up, the WA tends to stay close to 0% and the UGC close to 100%. For a 30% level of similarity, for example, the UGC was 97% and the WA was 0%, excellent results.

It is important to point out that we use two different clustering methods to compare the results obtained by the tool (Jaccard-based clustering) with user opinion, the latter based on the similarity matrices and dendrograms.

VI. CONCLUSION AND FUTURE WORK

As a result of technological changes and the rise of the Internet, people have had to deal with the problem of data overload. RSS feeds are a very characteristic example of this condition, aggravated by the fact that their users very often have little or no resources to filter and rate the large number of news items published on a daily basis.

This work proposes a modelling method that uses High-level Petri nets and active rules. This strategy offers advantages in the modelling of rules for our Data Killing system: Petri nets allow the detection of infinite loops and provide well-defined, unambiguous semantics. As regards the monitoring of the environment, active rules reduce the interaction of the systems with people, an important requirement of autonomic systems.

To validate this research, an experiment was carried out to measure the efficiency of the DS Operator. The experiment compared the opinions of users with the news items returned by the tool, using a clustering algorithm; the results obtained were extremely satisfactory.

As future work, we plan to test more similarity metrics and include more autonomic characteristics, improving the discarding process based on user preferences. We also intend to carry out another experiment comparing the tool in use during a regular period and during a period when a significant event is happening, to verify whether there are fewer clusters during a big event than when nothing unusual is happening in the world.

VII. REFERENCES

[1] Pinheiro, W. A.; Xexéo, G. and Souza, J. M. "Autonomic Patterns: Modelling Data Killing Patterns Using High-Level Petri Nets". The Fourth International Conference on Autonomic and Autonomous Systems, ICAS '08, 2008, pp. 198-203.

[2] Pinheiro, W.; Silva, Marcelino; Rodrigues, T.; Silva, Marcelo; Silva, Marcio; Xexéo, G. and Souza, J. "Autonomic RSS: Discarding Irrelevant News". The Fifth International Conference on Autonomic and Autonomous Systems, ICAS '09, 2009, pp. 148-153.

[3] Jain, A. K.; Murty, M. N. and Flynn, P. J. "Data clustering: a review". ACM Comput. Surv., v. 31, n. 3, pp. 264-323. doi: 10.1145/331499.331504, 1999.

[4] Strehl, A.; Ghosh, J. and Mooney, R. "Impact of similarity measures on web-page clustering". In Proc. AAAI Workshop on AI for Web Search, Austin, pp. 58-64, July 2000.

[5] Jensen, K.; Kristensen, L. M. and Wells, L. "Coloured Petri Nets and CPN Tools for Modelling and Validation of Concurrent Systems". International Journal on Software Tools for Technology Transfer (STTT), 2007.

[6] Jensen, K. "Coloured Petri Nets". 2nd ed., 234 pages. Springer, 1996.

[7] Aalst, W. and Hofstede, A. "Verification of Workflow Task Structures: A Petri-Net-Based Approach". Information Systems, v. 25, n. 1, pp. 43-69. doi: 10.1016/S0306-4379(00)00008-9, 2000.

[8] Pinheiro, W.; Barros, R.; Rodrigues Nt., J. A.; Xexéo, G. and Souza, J. "Using Active Rules and Petri Nets to Composite Event Detection in Autonomic Systems". In: 3rd Latin American Autonomic Computing Symposium, 2008, pp. 81-84.

[9] Yin, J.; Fan, X.; Chen, Y. and Ren, J. "High-Dimensional Shared Nearest Neighbour Clustering Algorithm". In: Fuzzy Systems and Knowledge Discovery, pp. 494-502. doi: 10.1007/11540007_60, 2005.
