Collaborative Web Data Record Extraction

Gengxin Miao, Firat Kart, L. E. Moser, P. M. Melliar-Smith

Department of Electrical and Computer Engineering

University of California, Santa Barbara

Santa Barbara, CA 93106

{miao, fkart, moser, pmms}@ece.ucsb.edu

Abstract: This paper describes a Web Service that automatically parses and extracts data records from Web pages containing structured data. The Web Service allows multiple users to share and manage a Web data record extraction task to increase its utility. A recommendation system, based on the Probabilistic Latent Semantic Indexing algorithm, enables a user to find potentially interesting content or other users who share the same interests. A distributed computing platform improves the scalability of the Web Service in supporting multiple users by employing multiple server computers. A Web Service interface allows users to access the Web Service, and allows programmers to develop their own applications and, thus, extend the functionality of the Web Service.

Index Terms: collaborative information extraction, data mining, Web Service.

I. INTRODUCTION

On the Web, the amount of structured data is 500 times

greater than the amount of unstructured data. As one of the

major sources of structured data, the deep Web has been

estimated to contain more than 450,000 databases [5] in which

structured data are stored. The structured data in the deep

Web are continually evolving, and might be updated as often

as once every second. Deep Web pages can be dynamically

generated from the data in the deep Web. A single deep Web page typically contains a large number of Web records [12],

[13], i.e., HTML regions, each of which corresponds to an

individual data object. When browsing these deep Web pages,

a user is usually interested in only a small number of data

objects. The diverse data and the evolving characteristics of

the deep Web make it difficult for users to locate the data

objects of interest in a friendly and timely manner.

There have been extensive studies of fully automatic methods

to extract data objects from the Web [1], [7]. A typical

process to extract data objects from a Web page consists of

three steps. The first step is to identify Web records that

represent individual data objects (e.g., products). The second

step is to extract data object attributes (e.g., product names, prices, and images) from the Web records. Corresponding

attributes in different Web records are aligned, resulting in

spreadsheet-like data [17], [18]. The final step is the optional

task (which is very difficult in general) of interpreting aligned

attributes and assigning appropriate labels [16], [19].

In this paper we describe a Web Service for extracting Web

data records from deep Web pages. Attribute alignment is not

necessary. Approaches such as EXALG [1] and RoadRunner

[7] are applicable only when there are multiple deep Web

pages that use the same template, which is not guaranteed

in our case. SRR [18] is intended for extracting data records

from Web pages returned by search engines. Because the

user-specified data source can occur within any domain, we employ

a domain-independent approach for extracting data records

from deep Web pages [13].

The Internet has brought collaboration of individuals to

a whole new level. Not only can colleagues or friends

collaborate with each other, but also individuals can

collaborate without even knowing each other. Wikipedia provides

a collaborative authoring platform that aggregates individual

intelligence by allowing any authorized user to modify an

article on which he/she has knowledge. Facebook allows

different users to collaborate in developing applications and,

hence, provides a useful social network service. Support for

collaborative and social interactions in an information seeking

system [6], [9] improves the utility of the system by allowing

users to share information and tasks.

In this paper we present a Web Service that supports

collaboration among multiple users, who are interested in the

same information, e.g., price of a certain product. Different

users are aware of different data sources that contain relevant information, e.g., different e-commerce Web sites carrying the

product. By collaboration, the users can obtain more complete

and relevant information. Our Web Service allows authorized

users to share the extracted Web data records from the deep

Web and to manage the Web data record extraction task.

Collaborative filtering [10] produces recommendations by

computing the similarities between one user's preferences and

the preferences of other users. Algorithms for collaborative

filtering fall into two categories: rank-based algorithms, e.g.,

RankBoost [8], and probabilistic model-based algorithms, e.g.,

Latent Dirichlet Allocation (LDA) [3] and Probabilistic Latent

Semantic Indexing (PLSI) [11]. RankBoost combines

multiple partial preferences into a unified ranking. PLSI and LDA analyze the co-occurrences of two different types of

data, e.g., documents and words. With the help of a latent

semantic layer, PLSI and LDA estimate the joint probability

of any given pair of documents and words. We employ

PLSI to produce personal recommendations for the users.

The extracted Web data records are ranked based on their

probabilities of co-occurrence for a particular user, and

top-ranked objects are recommended to the user as they align with

the user's preferences.
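
For reference, the PLSI model of [11] factors the joint probability of a co-occurring pair through a latent class. Written for a user u and an extracted record r, the quantity used for ranking would take the form below. The mapping of PLSI's document and word roles onto users and records is our reading of the text above, and the latent variable z and the set Z are the usual PLSI notation rather than symbols defined in this paper.

```latex
P(u, r) = \sum_{z \in Z} P(z)\, P(u \mid z)\, P(r \mid z),
\qquad
P(r \mid u) = \frac{P(u, r)}{\sum_{r'} P(u, r')}
```

For a fixed user u, ranking the records by P(r | u) gives the same order as ranking by P(u, r), because the denominator does not depend on r.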

2009 IEEE International Conference on Web Services

978-0-7695-3709-2/09 $25.00 2009 IEEE

DOI 10.1109/ICWS.2009.109

II. MOTIVATION

To motivate our Web Service for collaborative Web data

record extraction, we present several example scenarios, in

which a filtered summary of Web data records that match the

user's interests is produced from a list on a Web page.

Example 1. Bob is traveling by train across California. Bob

wants to read some interesting stories from his favorite Web site using his cell phone to fill the time (10 hours) spent

on the train. However, the Web site is not friendly to small

hand-held devices. Both the bandwidth and the screen size are

limited. It takes minutes to download a large Web page onto

Bob's cell phone, and he needs to scroll both horizontally and

vertically to locate the interesting content on the Web page.

Bob completely loses interest after this frustrating browsing

experience. If there were a Web Service that parses and

extracts the interesting information from the Web page and

then sends Bob only the interesting content, his experience

would be much more satisfying.

Example 2. Mike is a soccer fan and likes the soccer player

David Beckham very much. However, he happens to be busy on the day when Beckham plays an important game. Mike

cannot watch the entire game, but he does not want to miss the

moment when Beckham scores a goal. He knows a Web site

that broadcasts the video of the game online and another Web

site that provides live news on the game in text. Mike wishes

that there were a Web Service that periodically extracts and

parses the live news and notifies him when Beckham scores,

so that he can catch the replay of the goal.

Example 3. Alice decides to go to Long Beach for Spring

break with her friends. They start looking for a vacation rental

on the local forums on the Web one month before their

vacation. Alice feels exhausted after reading through the vacation

rental advertisements posted on the various forums, most of

which turn out to be irrelevant. It takes Alice and her friends

a lot of precious time reading listings and circulating the

information among themselves. The process would be greatly

facilitated if a Web Service were available that automatically

scans the forums of interest and extracts only the relevant

rental information for Alice and her friends to share.

When a user browses a deep Web page, the user typically

has a particular information need in mind. Only a fraction

of the Web data records found on a deep Web page match

the user's information need. Browsing the entire Web page is

tedious, and can be expensive in time or bandwidth.

Our Web Service for collaborative Web data record

extraction returns only the Web data records of interest to the user

and, thus, increases the efficiency of the deep Web browsing process. In addition, it allows multiple users to share and

manage the Web data record extraction task to enhance the

benefits further.

III. WEB DATA RECORD EXTRACTION

Our Web Service for Web data record extraction focuses on

the deep Web for the following reasons:

The amount of data in the deep Web is much greater than that on the surface Web.

The deep Web is a good source of structured data, which are more suitable for automatic processing than the unstructured data on the surface Web.

The dynamic content found in deep Web pages is generally of greater interest than the static content found in static Web pages on the surface Web.

Fig. 1. An example Web page containing the live news of a soccer game.

Fig. 2. HTML code template used to render the soccer game live news.

Our Web Service employs a Web data record extraction

technique, based on HTML tag path clustering [13], that we

developed. In an automatically generated deep Web page, the

Web records, e.g., live news about a soccer game, are rendered

in visually repeating patterns. First, the Web record extraction

algorithm identifies the visually repeating part in a Web page.

Then, a Web page segmentation algorithm looks for the exact

boundaries of each Web data record.

A. Finding Visually Repeating Patterns

In a Web page (HTML document), the visual information is

conveyed by HTML tag paths. Visually repeating patterns in

a Web page correspond to the repeated occurrence of HTML

tag paths in the Web page. A unique HTML tag path might

have multiple occurrences in a Web page. A set of HTML tag paths that repeatedly occur in the Web page in a similar way

corresponds to a set of Web records. The occurrence positions

are indicated using a binary vector, referred to as a visual

signal vector. Each unique HTML tag path corresponds to a

visual signal vector.
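
To make the notion of a visual signal vector concrete, the following minimal sketch builds one binary vector per unique tag path using Python's standard `html.parser`. It rests on several assumptions that are not taken from the paper: positions are indexed by the order of start tags in the document, a tag path is written as tag names joined by `/`, and the names `TagPathCollector`, `visual_signal_vectors` and `VOID_TAGS` are hypothetical. The actual construction in [13] may differ in these details.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Records the root-to-node tag path of every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags that are currently open
        self.paths = []   # one tag path per start tag encountered

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def visual_signal_vectors(html):
    """Maps each unique tag path to a binary vector of its occurrence positions."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    length = len(collector.paths)
    vectors = {}
    for position, path in enumerate(collector.paths):
        vectors.setdefault(path, [0] * length)[position] = 1
    return vectors


# A made-up page with two records, each rendered as <dl><dt>..</dt><dd>..</dd></dl>.
page = ("<html><body>"
        "<dl><dt>12'</dt><dd>Corner kick</dd></dl>"
        "<dl><dt>33'</dt><dd>Goal</dd></dl>"
        "</body></html>")
vectors = visual_signal_vectors(page)
```

In this toy page the paths `html/body/dl`, `html/body/dl/dt` and `html/body/dl/dd` each occur twice with the same spacing, which is the kind of regularity the clustering step is meant to detect.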

By evaluating the similarity between the visual signal

vectors, we can discover whether two unique HTML tag

paths have similar repeated occurrence patterns. We construct

a pairwise similarity matrix, in which each element is the

Fig. 3. Unique HTML tag paths extracted from the soccer game live news Web page.

similarity measurement of a pair of visual signal vectors. We

then apply a spectral clustering algorithm [15] to the similarity

matrix to discover a set of unique HTML tag paths, i.e.,

visually repeating patterns on the Web page, which correspond

to the Web records.
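
The pairwise similarity matrix can be sketched as follows. The similarity measure used here is a deliberately simple stand-in, not the measure defined in [13]: it scores two visual signal vectors as 1.0 when one path always occurs at a constant offset from the other, and lower as that offset varies. The function names are hypothetical, and the spectral clustering step [15] that would consume the matrix is not implemented.

```python
def occurrence_positions(vector):
    """Indices at which a binary visual signal vector is set."""
    return [i for i, bit in enumerate(vector) if bit]


def offset_similarity(a, b):
    """Stand-in measure: 1.0 when b occurs at a constant offset from a."""
    pos_a, pos_b = occurrence_positions(a), occurrence_positions(b)
    if not pos_a or len(pos_a) != len(pos_b):
        return 0.0
    offsets = [q - p for p, q in zip(pos_a, pos_b)]
    return 1.0 / (1.0 + max(offsets) - min(offsets))


def similarity_matrix(vectors):
    """Returns the sorted tag paths and their pairwise similarity matrix."""
    paths = sorted(vectors)
    matrix = [[offset_similarity(vectors[p], vectors[q]) for q in paths]
              for p in paths]
    return paths, matrix


# Hand-written visual signal vectors for a page with two <dl> records.
signals = {
    "html/body":       [0, 1, 0, 0, 0, 0, 0, 0],
    "html/body/dl":    [0, 0, 1, 0, 0, 1, 0, 0],
    "html/body/dl/dt": [0, 0, 0, 1, 0, 0, 1, 0],
    "html/body/dl/dd": [0, 0, 0, 0, 1, 0, 0, 1],
}
paths, matrix = similarity_matrix(signals)
```

With this stand-in, the three `dl` paths are mutually similar and `html/body` is dissimilar to all of them, so a clustering algorithm applied to the matrix would be expected to group the `dl` paths together.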

For example, Figure 1 is an automatically generated Web

page that contains live news on a soccer game between

Germany and Spain. The soccer game Web page is updated

whenever live news is uploaded. Each live news record is

rendered using the HTML code template shown in Figure 2. The template corresponds to the unique HTML tag paths

17 through 20 in Figure 3, which is one of the clusters

generated by the spectral clustering algorithm. The grouped

unique HTML tag paths are then passed to the Web page

segmentation algorithm to find the exact boundaries of each

live news record.

B. Web Page Segmentation

The Web page segmentation algorithm takes as input a set of

HTML tag paths and examines their occurrences to determine

the exact boundaries of the Web records. Each occurrence

of a unique HTML tag path corresponds to a node in the

DOM tree, a tree structure obtained by parsing the HTML document. If a unique HTML tag path A is a prefix of a

unique HTML tag path B, then the occurrences of A are

ancestors of the occurrences of B in the DOM tree and, hence,

correspond to larger pieces of HTML text. In this case, A is

an ancestor visual signal of B. A set of unique HTML tag

paths corresponds to an HTML template. An occurrence of a

tag path maps to a part of a Web record or an entire Web

record. A larger piece of HTML text is more likely to cover

an entire Web record than a smaller piece of HTML text.
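
The prefix relationship described above can be illustrated with a short sketch. It assumes that a tag path is written as tag names joined by `/` and that the record-level path is the one with no ancestor inside the same cluster; both are simplifying assumptions, and the helper names are hypothetical rather than part of the algorithm in [13].

```python
def is_ancestor_path(a, b):
    """True if tag path a is a proper prefix of tag path b, tag by tag."""
    a_parts, b_parts = a.split("/"), b.split("/")
    return len(a_parts) < len(b_parts) and b_parts[:len(a_parts)] == a_parts


def record_level_paths(cluster):
    """Paths in the cluster that have no ancestor inside the cluster."""
    return [p for p in cluster
            if not any(is_ancestor_path(q, p) for q in cluster)]


cluster = [
    "html/body/dl",
    "html/body/dl/dt",
    "html/body/dl/dd",
    "html/body/dl/dd/b",
]
top = record_level_paths(cluster)
```

Comparing tag by tag, rather than comparing raw strings, avoids treating `html/body/d` as a prefix of `html/body/dl`.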

Figure 4 shows the ancestor and descendant relationships

within a set of unique tag paths in the Web page for the soccer

game live news. An occurrence of an upper level tag path

corresponds to a larger piece of HTML text than a lower level

tag path. In this example, the dl node corresponds to the entire

news record. All occurrences of the dl nodes following the tag

path are extracted; each one is a news record. The results for the soccer game live news Web page are shown in Figure 5.

Fig. 4. Ancestor / descendant relationships within a set of unique HTML tag paths for the soccer game live news Web page.

Fig. 5. Web data record extraction results for the soccer game live news Web page.

The Web data record extraction algorithm has linear time complexity in the length of the Web page. We use this

algorithm in our Web Service to extract information from the

Web pages.

IV. SYSTEM ARCHITECTURE

Our system enables users to submit Web data record

extraction tasks using the Web Service. Different kinds of

applications on different kinds of devices can access the Web

Service, as shown in Figure 6. Having received tasks from

multiple client applications, the Web Service executes the

Web data record extraction tasks in parallel. Depending on

the Web Service call, the results are returned to the client

application or are uploaded to an Atom server to be published

as a syndication feed for subscribed consumers.

Fig. 6. Use of the Web data record extraction Web Service.

Our Web Service for collaborative Web data record

extraction allows multiple users to share results and manage tasks,

as shown in Figure 7. A client is identified using a unique

clientID and is authorized by password verification. Once a

client submits a new task to the Web Service, it becomes the

administrative client for that task. The client can authorize

other clients to access the results returned by the task or to manage the task. Authorized clients access the Web Service

to list tasks that they have permission to manage and URLs to

the results that they can access. They can use either the Web

Service or the Atom server to access the task or the results,

respectively. Our Web Service also provides a recommendation

facility to allow users to find Web data record extraction tasks

of interest to them.

Tasks submitted to the Web Service are executed in parallel

for scalability reasons, as shown in Figure 8. When a new Web

data record extraction task is submitted, the master computer

divides the task up into multiple sub-tasks and puts them

into a task queue. Thus, the master computer maintains the

list of tasks to be executed and distributes them among a set of worker computers for the purposes of load balancing.

The workers retrieve the data resources from the deep Web

using user-specified URLs. The extracted Web records are

filtered based on the user-defined filtering rules. Final results

are gathered at the master computer, and the workers are then

ready to take on new tasks. The master computer determines

whether to pass back the result set, containing the list of Web

records, directly to the calling client or to save it at the Atom

server for the client to consume later.

Fig. 7. Collaborative data extraction.

V. IMPLEMENTATION

Our system provides a Web Service interface that allows

clients to access the Web data record extraction service, a

distributed computing platform that performs the Web record

extraction computations in parallel, and a backend database

that stores information for the multiple collaborative users. The

data from the deep Web sites are aggregated in a database at the Atom server.

A. Distributed Computing Platform

The distributed computing platform is similar to the CILK

system [4] for multithreaded parallel programming, except

that CILK is implemented in C++ whereas our system is

implemented in Java. A job is divided up into a number of

sub-jobs. A job is finished when all of its sub-jobs are executed. In

CILK, there might be dependence relationships between the

sub-jobs. For example, if sub-job A takes the output of sub-job

B as input, then sub-job A can be executed only when

sub-job B has finished. In our system, there are no dependence

relationships between sub-jobs, i.e., a Web record extraction task does not depend on any other tasks. Thus, our problem is

slightly easier than the general problem addressed by CILK.

We employ the concept of work stealing used by CILK to

avoid multiple workers requesting work from the master at

the same time and, hence, avoid the network communication

bottleneck at the master. The master in our system maintains

the list of jobs that are ready to be executed. Each worker

maintains its own job queue. When its job queue length is

less than a threshold, MinJobs, the worker either requests a

Fig. 8. Distributed computing platform for the Web data record extraction Web Service.

new job from the master with probability p, or steals a job

from a randomly picked worker with probability 1 - p. Using this work stealing strategy, the system balances the network

bandwidth usage among all of the workers to avoid a burst of

requests at the master.
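
The refill rule described above can be sketched in a few lines. This is an in-process illustration only, written in Python rather than the Java of the actual system: real workers would communicate over the network, and the class and function names, the choice to fetch one job per call, and the choice to steal from the tail of the victim's queue are all assumptions, not details given in the paper.

```python
import random
from collections import deque


class Worker:
    def __init__(self, name):
        self.name = name
        self.queue = deque()   # this worker's local job queue


def refill(worker, master_jobs, workers, min_jobs, p, rng=random):
    """Fetches at most one job if the local queue is shorter than min_jobs.

    Returns "master", "steal", or None when nothing was fetched.
    """
    if len(worker.queue) >= min_jobs:
        return None
    if rng.random() < p:
        # With probability p, ask the master for a ready job.
        if master_jobs:
            worker.queue.append(master_jobs.popleft())
            return "master"
        return None
    # With probability 1 - p, steal from one randomly picked peer.
    peers = [w for w in workers if w is not worker]
    if not peers:
        return None
    victim = rng.choice(peers)
    if victim.queue:
        worker.queue.append(victim.queue.pop())
        return "steal"
    return None
```

Setting p close to 1 sends most requests to the master, while a smaller p shifts traffic onto worker-to-worker links; the paper does not state the value of p that is used.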

The master keeps track of the jobs that have been assigned

to the workers until all of them are executed successfully. If a

worker fails to respond to the master within a certain amount

of time, the master marks the worker as dead and puts all

of the tasks assigned to that worker back into the job queue

for execution. In this manner, the system is protected against

failures of the workers. The master can also be protected from

failures by means of a backup server.
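
A minimal sketch of this bookkeeping is shown below. It assumes that liveness is judged from the time of a worker's last message and that time is passed in explicitly as a number of seconds; the `Master` class, its method names and the timeout value are hypothetical and are not taken from the actual implementation.

```python
from collections import deque


class Master:
    def __init__(self, timeout):
        self.timeout = timeout    # seconds of silence before a worker is dead
        self.job_queue = deque()  # jobs that are ready to be executed
        self.assigned = {}        # worker id -> jobs not yet completed
        self.last_seen = {}       # worker id -> time of last message

    def assign(self, worker, now):
        """Hands the next ready job to a worker and remembers the assignment."""
        job = self.job_queue.popleft()
        self.assigned.setdefault(worker, []).append(job)
        self.last_seen[worker] = now
        return job

    def heartbeat(self, worker, now):
        self.last_seen[worker] = now

    def complete(self, worker, job, now):
        self.assigned[worker].remove(job)
        self.last_seen[worker] = now

    def reap_dead_workers(self, now):
        """Requeues the jobs of every worker that has been silent too long."""
        dead = [w for w, seen in self.last_seen.items()
                if now - seen > self.timeout]
        for worker in dead:
            self.job_queue.extend(self.assigned.pop(worker, []))
            del self.last_seen[worker]
        return dead
```

Because the sub-jobs are independent of one another, a requeued job can simply be executed again by any other worker.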

B. Backend Database

The backend database stores the user information, the Web

data record extraction task information, and the corresponding

results. The structure of the database is shown in Figure 9.

Fig. 9. Backend database.

The UserAccount table stores the user account

information. Each record corresponds to a user account created by a

client. The attributes include two mandatory fields, Username

and Password, and several optional fields for the user's

profile: Age, Interest, Occupation, etc.

Once a new Web data record extraction task is submitted

to the server, the system creates a new entry in the Task

table. TaskID is a unique identifier for the task; Name is

a user-defined attribute that helps to identify the task; and

Description is an attribute that briefly describes the task.

FilterRule is a logic expression that is used to filter the

extracted records. For example, Keyword1 AND Keyword2

means that the user wants the set of data records that contain

both Keyword1 and Keyword2. The FilterRule attribute is

optional. If the FilterRule field is missing, the system returns

all of the extracted Web records.
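
As an illustration of how such a rule could be evaluated, the sketch below handles the AND form given in the text. Support for OR (binding more loosely than AND), case-insensitive substring matching, and the function names are assumptions added for the example; the paper does not specify the full rule syntax.

```python
def matches(rule, record_text):
    """True if the record satisfies the rule; a missing rule accepts everything."""
    if not rule:
        return True
    text = record_text.lower()
    # Assumed grammar: alternatives separated by " OR ", each alternative
    # a list of keywords separated by " AND ".
    return any(
        all(keyword.strip().lower() in text
            for keyword in alternative.split(" AND "))
        for alternative in rule.split(" OR ")
    )


def apply_filter(rule, records):
    """Keeps only the extracted records that satisfy the rule."""
    return [record for record in records if matches(rule, record)]


records = ["Beckham scores a goal", "Beckham substituted", "Goal by Torres"]
both = apply_filter("Beckham AND goal", records)
everything = apply_filter(None, records)
```

Here `both` holds only the first record, and `everything` holds all three because no rule was supplied.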

The TaskSchedule table stores information related to

which task is to be executed and when. A Web data record

extraction task might need to be executed repeatedly. For

example, in the soccer game live news application, the client

wants the results to be updated every few seconds because

news can be posted at any time. According to the

user-specified starting time, ending time and refresh frequency, the

system creates multiple task schedule entries for the same Web

data record extraction task, which is executed repeatedly.
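
The expansion of one task into multiple schedule entries might look like the following sketch. Treating the ending time as inclusive, representing the refresh frequency as a fixed interval, and the function name are assumptions; the paper does not give these details.

```python
from datetime import datetime, timedelta


def schedule_entries(task_id, start, end, interval):
    """Returns one (task_id, run_time) entry per refresh between start and end."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    entries = []
    run_time = start
    while run_time <= end:
        entries.append((task_id, run_time))
        run_time += interval
    return entries


kickoff = datetime(2009, 7, 6, 20, 0, 0)
entries = schedule_entries(
    task_id=7,
    start=kickoff,
    end=kickoff + timedelta(seconds=30),
    interval=timedelta(seconds=10),
)
```

With a 10-second interval over a 30-second window, this produces four entries, at 0, 10, 20 and 30 seconds after the starting time.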

The User-taskAuthorization table describes the

user-to-task authorization relationships. There are three authorization

levels: 0 means the user can access the results of the task;

1 means the user can manage the task; and 2 means the

user is the administrator of the task. The Username "Public"

is system-reserved. If the Username attribute of a

User-taskAuthorization record is "Public", it indicates that the

corresponding task is publicly available.
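
A permission check over these levels could be sketched as follows. The sketch assumes that a higher level implies the privileges of the lower ones and that grants are held in a dictionary keyed by (username, task id); neither assumption is stated in the paper, and the function names are hypothetical.

```python
ACCESS_RESULTS, MANAGE_TASK, ADMINISTER_TASK = 0, 1, 2
PUBLIC = "Public"   # system-reserved username


def authorization_level(grants, username, task_id):
    """Highest level granted to the user directly or via Public, else None."""
    levels = [grants[key]
              for key in ((username, task_id), (PUBLIC, task_id))
              if key in grants]
    return max(levels) if levels else None


def can_access_results(grants, username, task_id):
    return authorization_level(grants, username, task_id) is not None


def can_manage(grants, username, task_id):
    level = authorization_level(grants, username, task_id)
    return level is not None and level >= MANAGE_TASK


grants = {
    ("alice", 1): ADMINISTER_TASK,
    ("bob", 1): ACCESS_RESULTS,
    (PUBLIC, 2): ACCESS_RESULTS,
}
```

In this example any user, including one with no explicit grant, can read the results of task 2, while only alice can manage task 1.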

The DataResource table stores the user-specified URLs

for the Web pages that contain the target information. A Web

data extraction task can be associated with multiple URLs and,

hence, multiple DataResource records.

The Result table stores the location of the Web records that

are extracted.
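
The tables described in this subsection can be summarized as a schema sketch, here using SQLite through Python's standard `sqlite3` module. Only the table names and the attributes mentioned in the text come from the paper; the column types, keys, the `RunAt`, `Level`, `URL` and `Location` column names, and the use of SQLite itself are assumptions, since Figure 9 is not reproduced here. The hyphen in User-taskAuthorization is dropped to obtain a valid unquoted identifier.

```python
import sqlite3

SCHEMA = """
CREATE TABLE UserAccount (
    Username   TEXT PRIMARY KEY,
    Password   TEXT NOT NULL,      -- a real deployment would store a hash
    Age        INTEGER,
    Interest   TEXT,
    Occupation TEXT
);
CREATE TABLE Task (
    TaskID      INTEGER PRIMARY KEY,
    Name        TEXT,
    Description TEXT,
    FilterRule  TEXT               -- optional; NULL returns all records
);
CREATE TABLE TaskSchedule (
    TaskID INTEGER NOT NULL REFERENCES Task(TaskID),
    RunAt  TEXT NOT NULL           -- assumed: one row per scheduled run
);
CREATE TABLE UserTaskAuthorization (
    Username TEXT NOT NULL,        -- not a foreign key: 'Public' is reserved
    TaskID   INTEGER NOT NULL REFERENCES Task(TaskID),
    Level    INTEGER NOT NULL CHECK (Level IN (0, 1, 2)),
    PRIMARY KEY (Username, TaskID)
);
CREATE TABLE DataResource (
    TaskID INTEGER NOT NULL REFERENCES Task(TaskID),
    URL    TEXT NOT NULL
);
CREATE TABLE Result (
    TaskID   INTEGER NOT NULL REFERENCES Task(TaskID),
    Location TEXT NOT NULL         -- where the extracted records are stored
);
"""


def create_database():
    """Creates an in-memory database with the sketched schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```

A task with several target pages then has one Task row and several DataResource rows that share its TaskID.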

C. Web Service Interface

Our Web Service allows client access in a variety of ways.

The Web Service interface is described using WSDL. A client

accesses the Web Service by sending a SOAP request message

to the Web Service, and the Web Service returns the results

to the client in a SOAP response message. The key operations

provided by the Web Service are the following.

CreateUserAccount: A new client uses this operation to

create a new account by providing a Username, Password,

Interests, etc. in a SOAP request message. The Web

Service creates a new record in the UserAccount table.

The SOAP response message indicates whether or not the

user account is successfully created.

ExtractDataRecords: Using this operation, a client

submits a Web data record extraction task to the Web

Service. In the SOAP request message, the client provides

a Username, Password, URL for the target Web page

that contains the list of Web records, and a filtering

rule. The Web Service creates records in the

User-taskAuthorization table, Task table, TaskResource table

and TaskResult table. The Web records extracted from the

Web page are filtered using the filtering rule and stored

at the Atom server for the user to retrieve at a later time.

Meanwhile, the SOAP response message containing the

Web records is returned to the client.

PeriodicallyExtractDataRecords: The Web Service

supports periodic updating of the Web data record extraction results. In the SOAP request message, the client indicates

the data source, starting time, ending time, and requested

update frequency. The server creates a new record in the

Task table and multiple records in the TaskSchedule table.

The scheduled tasks are executed at pre-defined times,

and the results are stored at the Atom server as Atom

feeds. The client consumes the data by subscribing for

the data at the Atom server. The Atom server performs

an identification check before authorizing a user to access

the Atom feeds. The SOAP response message contains

only the TaskID and the URL to the Atom feeds.

AddConsumer / RemoveConsumer: The administrative

user of a task can use these operations to authorize / deauthorize other users' access to the results stored at

the Atom server. If the user adds / removes the special

username "Public", the Atom feeds containing the task

results will be made publicly available / unavailable. To

use this operation, the user must specify Username,

Password and TaskID. The SOAP response message indicates

whether or not the job is successfully executed.

AddManager / RemoveManager: The administrative user

of a task can use these operations to authorize /

deauthorize another user as manager of the task.

AddResource / RemoveResource: The administrative user

and all authorized users that can manage the task

use these operations to add / remove URLs to target

Web pages that contain Web records. Similar

operations include UpdateName, UpdateDescription,

UpdateFrequency, UpdateEndingtime, etc. The server updates

the corresponding records in the Task table, TaskSchedule

table and TaskResource table.

ListTasks: This operation lists all of the tasks for which

the user is authorized to view the results. The SOAP

request message indicates the Username and Password.

The SOAP response message contains a list of tasks with

descriptions and the URLs to access the resulting Atom

feeds at the Atom server.

RecommendTask / RecommendUser: These operations

help the user to locate the publicly available tasks that

might be of interest to the user, or other users who have

the same interests. We employ PLSI [11] to generate

recommendations. The co-occurrence probability of each

user / task pair is estimated. For each user, we have a

ranked list of tasks that have high probability of

co-occurrence for that user. The top-ranked publicly

accessible tasks are recommended to the user. For recommending

friends to a user, we compare the edit distance of any

pair of user interests. The users with interests closest to

those of a particular user are recommended to the user

as potential friends.
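
The friend recommendation step can be sketched with a standard Levenshtein edit distance. Treating each user's interests as a single free-text string, breaking ties by username, and the function names are assumptions made for the example; the paper states only that the edit distance between user interests is compared.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def recommend_users(username, interests, k=3):
    """Returns up to k other users whose interest text is closest in edit distance."""
    mine = interests[username]
    scored = sorted(
        (edit_distance(mine, text), name)
        for name, text in interests.items()
        if name != username
    )
    return [name for _, name in scored[:k]]


interests = {
    "alice": "soccer",
    "bob": "soccer news",
    "carol": "vacation rentals",
    "dave": "soccer",
}
friends = recommend_users("alice", interests, k=2)
```

For alice, dave (distance 0) is ranked ahead of bob (distance 5), and carol is not among the top two.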

VI. RELATED WORK

Traditional information search and retrieval services for the

Web, such as those provided by the search engines of Google and Yahoo!, consider a Web page as an atomic-level object. A

user is led to a Web page even though he / she is interested

in only a small part of the content of the Web page. On the

other hand, if the information in which the user is interested

is located on multiple Web pages, the search engines will not

aggregate this information, and users have to access all of the

related Web pages manually to obtain a broad view of the

available information.

The Web should be considered as a repository of

information, rather than as a repository of Web pages. The wide use

of information in the Web has driven general-purpose search

engines to perform vertical Web search. In a particular domain,

Web data records in Web pages are extracted and aggregated together to satisfy users' information needs. Google and

Microsoft provide vertical search engines for online shopping,

publications, recruiting advertisements, etc.

However, the Web contains information about so many

different topics; moreover, the information is coupled together

for multi-disciplinary fields. It is difficult to divide up the

information in the Web into a reasonable number of

non-overlapping domains. It is even harder to build a vertical

search engine for each such domain. A domain-independent

Web object retrieval service is proposed in [14]. However, to

identify whether or not a Web page contains a set of data

objects is a non-trivial problem.

We address the information search and retrieval problem in

the case that users know the data source locations, but it is

inconvenient for the users to locate the relevant information

by browsing the Web pages themselves. The reasons are the users' precious time, the limitations in network bandwidth, the

display area on mobile devices, etc. Our Web Service provides

better results for users' queries, by extracting, filtering and / or

aggregating data records from the Web pages in a user-defined

manner. Different from existing vertical search engines, our

Web Service for extraction of Web data records supports

collaboration among users. Data extraction results can be

shared among multiple users, and multiple users can manage

the same Web data record extraction task to provide more

complete and relevant data.

VII. CONCLUSION AND FUTURE WORK

This paper has described a Web Service that automatically parses and extracts data records from Web pages containing

structured data. The Web Service allows multiple users to

share and manage a Web data record extraction task. A

recommendation system, based on the Probabilistic Latent

Semantic Indexing algorithm, enables a user to find potentially

interesting content or other users who have similar interests.

A distributed computing platform improves the scalability of

the Web Service. A Web Service interface allows users to

access the Web Service, and allows programmers to develop

their own applications and, thus, extend the functionality of

the Web Service.

In future work, we plan to do extensive performance

evaluation of the Web Service to determine the query rate that

the system supports, with a single server and with multiple

servers using our distributed computing platform, and also

with multiple databases at multiple Web sites. We also plan

to investigate the use of the Web Service in various example

applications, such as those mentioned earlier in the paper.

REFERENCES

[1] A. Arasu and H. Garcia-Molina. Extracting structured data from Web pages. In Proceedings of the 2003 ACM International Conference on the Management of Data, June 2003, San Diego, CA, pp. 337-348.

[2] M. K. Bergman. The deep Web: Surfacing hidden value. Technical report, BrightPlanet LLC, December 2000.

[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, vol. 3, 2003, pp. 993-1022.

[4] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 1995, pp. 207-216.

[5] K. C. Chang, B. He, C. Li, M. Patel, and Z. Zhang. Structured databases on the Web: Observations and implications. ACM SIGMOD Record, vol. 33, no. 3, 2004, pp. 61-70.

[6] E. H. Chi. Information seeking can be social. Computer, vol. 42, no. 3, March 2009, pp. 42-46.

[7] V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards automatic data extraction from large Web sites. In Proceedings of the 27th International Conference on Very Large Data Bases, September 2001, Rome, Italy, pp. 109-118.

[8] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, vol. 4, 2003, pp. 933-969.

[9] G. Golovchinsky and P. Qvarfordt. Collaborative information seeking. Computer, vol. 42, no. 3, March 2009, pp. 47-51.

[10] J. L. Herlocker, J. A. Konstan, and J. Riedl. Explaining collaborative filtering recommendations. In Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, Philadelphia, PA, December 2000, pp. 241-250.

[11] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM Conference on Research and Development in Information Retrieval, Berkeley, CA, August 1999, pp. 50-57.

[12] B. Liu. Mining data records in Web pages. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, Washington, D.C., August 2003, pp. 601-606.

[13] G. Miao, J. Tatemura, A. Sawires, W. P. Hsiu, and L. E. Moser. Extracting data records from the Web using tag path clustering. In Proceedings of the 18th International World Wide Web Conference, Madrid, Spain, 2009, pp. 981-990.

[14] Z. Nie, Y. Ma, S. Shi, J.-R. Wen, and W.-Y. Ma. Web object retrieval. In Proceedings of the 16th International Conference on the World Wide Web, Banff, Alberta, Canada, May 2007, pp. 81-90.

[15] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, 2000, pp. 888-905.

[16] J. Wang and F. H. Lochovsky. Data extraction and label assignment for Web databases. In Proceedings of the 12th International Conference on the World Wide Web, Budapest, Hungary, May 2003, pp. 187-196.

[17] Y. Zhai and B. Liu. Web data extraction based on partial tree alignment. In Proceedings of the 14th International Conference on the World Wide Web, Chiba, Japan, May 2005, pp. 76-85.

[18] H. Zhao, W. Meng, and C. Yu. Mining templates from search result records of search engines. In Proceedings of the 13th ACM International Conference on Knowledge Discovery and Data Mining, San Jose, CA, August 2007, pp. 884-893.

[19] J. Zhu, Z. Nie, J. Wen, B. Zhang, and W. Ma. Simultaneous record detection and attribute labeling in Web data extraction. In Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, August 2006, pp. 494-503.
