8/14/2019 colaborative webservices
1/7
Collaborative Web Data Record Extraction
Gengxin Miao, Firat Kart, L. E. Moser, P. M. Melliar-Smith
Department of Electrical and Computer Engineering
University of California, Santa Barbara
Santa Barbara, CA 93106
{miao, fkart, moser, pmms}@ece.ucsb.edu
Abstract: This paper describes a Web Service that automatically parses and extracts data records from Web pages containing structured data. The Web Service allows multiple users to share and manage a Web data record extraction task to increase its utility. A recommendation system, based on the Probabilistic Latent Semantic Indexing algorithm, enables a user to find potentially interesting content or other users who share the same interests. A distributed computing platform improves the scalability of the Web Service in supporting multiple users by employing multiple server computers. A Web Service interface allows users to access the Web Service, and allows programmers to develop their own applications and, thus, extend the functionality of the Web Service.
Index Terms: collaborative information extraction, data mining, Web Service.
I. INTRODUCTION
On the Web, the amount of structured data has been estimated to be about 500 times greater than the amount of unstructured data [2]. As one of the major sources of structured data, the deep Web has been estimated to contain more than 450,000 databases [5] in which structured data are stored. The structured data in the deep Web are continually evolving, and might be updated as often as once every second. Deep Web pages can be dynamically generated from the data in the deep Web. A single deep Web page typically contains a large number of Web records [12], [13], i.e., HTML regions, each of which corresponds to an individual data object. When browsing these deep Web pages, a user is usually interested in only a small number of data objects. The diverse data and the evolving characteristics of the deep Web make it difficult for users to locate the data objects of interest in a friendly and timely manner.
There have been extensive studies of fully automatic methods to extract data objects from the Web [1], [7]. A typical process to extract data objects from a Web page consists of three steps. The first step is to identify Web records that represent individual data objects (e.g., products). The second step is to extract data object attributes (e.g., product names, prices, and images) from the Web records. Corresponding attributes in different Web records are aligned, resulting in spreadsheet-like data [17], [18]. The final step is the optional task (which is very difficult in general) of interpreting aligned attributes and assigning appropriate labels [16], [19].
In this paper we describe a Web Service for extracting Web
data records from deep Web pages. Attribute alignment is not
necessary. Approaches such as EXALG [1] and RoadRunner
[7] are applicable only when there are multiple deep Web
pages that use the same template, which is not guaranteed
in our case. SRR [18] is intended for extracting data records
from Web pages returned by search engines. Because the user-
specified data source can occur within any domain, we employ
a domain-independent approach for extracting data records
from deep Web pages [13].
The Internet has brought collaboration of individuals to a whole new level. Not only can colleagues or friends collaborate with each other, but also individuals can collaborate without even knowing each other. Wikipedia provides a collaborative authoring platform that aggregates individual intelligence by allowing any authorized user to modify an article on which he/she has knowledge. Facebook allows different users to collaborate in developing applications and, hence, provides a useful social network service. Support for collaborative and social interactions in an information seeking system [6], [9] improves the utility of the system by allowing users to share information and tasks.
In this paper we present a Web Service that supports collaboration among multiple users who are interested in the same information, e.g., the price of a certain product. Different users are aware of different data sources that contain relevant information, e.g., different e-commerce Web sites carrying the product. By collaboration, the users can obtain more complete and relevant information. Our Web Service allows authorized users to share the extracted Web data records from the deep Web and to manage the Web data record extraction task.
Collaborative filtering [10] produces recommendations by computing the similarities between one user's preferences and the preferences of other users. Algorithms for collaborative filtering fall into two categories: rank-based algorithms, e.g., RankBoost [8], and probabilistic model-based algorithms, e.g., Latent Dirichlet Allocation (LDA) [3] and Probabilistic Latent Semantic Indexing (PLSI) [11]. RankBoost combines multiple partial preferences into a unified ranking. PLSI and LDA analyze the co-occurrences of two different types of data, e.g., documents and words. With the help of a latent semantic layer, PLSI and LDA estimate the joint probability of any given pair of documents and words. We employ PLSI to produce personal recommendations for the users. The extracted Web data records are ranked based on their probabilities of co-occurrence for a particular user, and top-ranked objects are recommended to the user as they align with the user's preferences.
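As a sketch of how such a recommender can be fitted, the following minimal PLSI estimator runs EM over user/task co-occurrence counts and returns the joint probability P(user, task). The count data, the two-topic setting, and all function names here are illustrative assumptions, not the paper's implementation.

```python
import random

def plsi(counts, n_topics, iters=30, seed=0):
    """A minimal PLSI fit via EM on user/task co-occurrence counts.
    counts: {(user, task): count}.  Returns joint(u, t) ~ P(user, task)."""
    rng = random.Random(seed)
    users = sorted({u for u, _ in counts})
    tasks = sorted({t for _, t in counts})
    topics = range(n_topics)

    def norm(d):
        s = sum(d.values())
        return {k: v / s for k, v in d.items()} if s else {k: 1.0 / len(d) for k in d}

    # Random initialization of P(z), P(u|z), P(t|z)
    pz = norm({z: rng.random() for z in topics})
    pu = {z: norm({u: rng.random() for u in users}) for z in topics}
    pt = {z: norm({t: rng.random() for t in tasks}) for z in topics}

    for _ in range(iters):
        # E-step: responsibilities P(z | u, t) for each observed pair
        resp = {pair: norm({z: pz[z] * pu[z][pair[0]] * pt[z][pair[1]] for z in topics})
                for pair in counts}
        # M-step: re-estimate P(z), P(u|z), P(t|z) from weighted counts
        pz_new = {z: 0.0 for z in topics}
        pu_new = {z: {u: 0.0 for u in users} for z in topics}
        pt_new = {z: {t: 0.0 for t in tasks} for z in topics}
        for (u, t), n in counts.items():
            for z in topics:
                w = n * resp[(u, t)][z]
                pz_new[z] += w
                pu_new[z][u] += w
                pt_new[z][t] += w
        pz = norm(pz_new)
        pu = {z: norm(pu_new[z]) for z in topics}
        pt = {z: norm(pt_new[z]) for z in topics}

    return lambda u, t: sum(pz[z] * pu[z][u] * pt[z][t] for z in topics)
```

Ranking the tasks by joint(u, t) for a fixed user u then yields the top-ranked recommendations described above.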
2009 IEEE International Conference on Web Services
978-0-7695-3709-2/09 $25.00 © 2009 IEEE
DOI 10.1109/ICWS.2009.109
Authorized licensed use limited to: MISRIMAL NAVAJEE MUNOTH JAIN ENGG COLLEGE. Downloaded on November 28, 2009 at 21:02 from IEEE Xplore. Restrictions apply.
II. MOTIVATION
To motivate our Web Service for collaborative Web data
record extraction, we present several example scenarios, in
which a filtered summary of the Web data records that match the users' interests is produced from a list on a Web page.
Example 1. Bob is traveling by train across California. Bob wants to read some interesting stories from his favorite Web site using his cell phone to fill the time (10 hours) spent on the train. However, the Web site is not friendly to small hand-held devices. Both the bandwidth and the screen size are limited. It takes minutes to download a large Web page onto Bob's cell phone, and he needs to scroll both horizontally and vertically to locate the interesting content on the Web page. Bob completely loses interest after this frustrating browsing experience. If there were a Web Service that parses and extracts the interesting information from the Web page and then sends Bob only the interesting content, his experience would be much more satisfying.
Example 2. Mike is a soccer fan and likes the soccer player David Beckham very much. However, he happens to be busy on the day when Beckham plays an important game. Mike
cannot watch the entire game, but he does not want to miss the
moment when Beckham scores a goal. He knows a Web site
that broadcasts the video of the game online and another Web
site that provides live news on the game in text. Mike wishes
that there were a Web Service that periodically extracts and
parses the live news and notifies him when Beckham scores,
so that he can catch the replay of the goal.
Example 3. Alice decides to go to Long Beach for Spring
break with her friends. They start looking for a vacation rental
on the local forums on the Web one month before their vaca-
tion. Alice feels exhausted after reading through the vacation
rental advertisements posted on the various forums, most of
which turn out to be irrelevant. It takes Alice and her friends
a lot of precious time reading listings and circulating the
information among themselves. The process would be greatly
facilitated if a Web Service were available that automatically
scans the forums of interest and extracts only the relevant
rental information for Alice and her friends to share.
When a user browses a deep Web page, the user typically has a particular information need in mind. Only a fraction of the Web data records found on a deep Web page match the user's information need. Browsing the entire Web page is tedious, and can be expensive in time or bandwidth.

Our Web Service for collaborative Web data record extraction returns only the Web data records of interest to the user and, thus, increases the efficiency of the deep Web browsing process. In addition, it allows multiple users to share and manage the Web data record extraction task to enhance the benefits further.
III. WEB DATA RECORD EXTRACTION
Our Web Service for Web data record extraction focuses on
the deep Web for the following reasons:
- The amount of data in the deep Web is much greater than that on the surface Web.
- The deep Web is a good source of structured data, which are more suitable for automatic processing than the unstructured data on the surface Web.
- The dynamic content found in deep Web pages is generally of greater interest than the static content found in static Web pages on the surface Web.

Fig. 1. An example Web page containing the live news of a soccer game.

Fig. 2. HTML code template used to render the soccer game live news.
Our Web Service employs a Web data record extraction
technique, based on HTML tag path clustering [13], that we
developed. In an automatically generated deep Web page, the
Web records, e.g., live news about a soccer game, are rendered
in visually repeating patterns. First, the Web record extraction
algorithm identifies the visually repeating part in a Web page.
Then, a Web page segmentation algorithm looks for the exact
boundaries of each Web data record.
A. Finding Visually Repeating Patterns
In a Web page (HTML document), the visual information is conveyed by HTML tag paths. Visually repeating patterns in a Web page correspond to the repeated occurrence of HTML tag paths in the Web page. A unique HTML tag path might have multiple occurrences in a Web page. A set of HTML tag paths that repeatedly occur in the Web page in a similar way corresponds to a set of Web records. The occurrence positions are indicated using a binary vector, referred to as a visual signal vector. Each unique HTML tag path corresponds to a visual signal vector.
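The construction of visual signal vectors can be sketched as follows, using Python's standard HTML parser. The position numbering (one slot per start tag, in document order) is an assumption for illustration; the paper's underlying algorithm is defined in [13].

```python
from html.parser import HTMLParser
from collections import defaultdict

class TagPathCollector(HTMLParser):
    """Records, for each unique HTML tag path, the positions at which it occurs."""
    def __init__(self):
        super().__init__()
        self.stack = []                      # open tags from root to current node
        self.positions = defaultdict(list)   # tag path -> occurrence positions
        self.counter = 0                     # one slot per start tag, in order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.positions["/".join(self.stack)].append(self.counter)
        self.counter += 1

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

def visual_signal_vectors(html):
    """Return {tag_path: binary occurrence vector} over all start-tag positions."""
    c = TagPathCollector()
    c.feed(html)
    vectors = {}
    for path, occurrences in c.positions.items():
        v = [0] * c.counter
        for p in occurrences:
            v[p] = 1
        vectors[path] = v
    return vectors
```

For a list-like page fragment, each repeated item tag path produces a vector with several 1s, while the enclosing container tag paths occur once.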
By evaluating the similarity between the visual signal
vectors, we can discover whether two unique HTML tag
paths have similar repeated occurrence patterns. We construct
a pairwise similarity matrix, in which each element is the
897
Authorized licensed use limited to: MISRIMAL NAVAJEE MUNOTH JAIN ENGG COLLEGE. Downloaded on November 28, 2009 at 21:02 from IEEE Xplore. Restrictions apply.
8/14/2019 colaborative webservices
3/7
Fig. 3. Unique HTML tag paths extracted from the soccer game live news Web page.
similarity measurement of a pair of visual signal vectors. We
then apply a spectral clustering algorithm [15] to the similarity
matrix to discover a set of unique HTML tag paths, i.e.,
visually repeating patterns on the Web page, which correspond
to the Web records.
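The pairwise similarity matrix can be sketched as below. Cosine similarity is used here as a stand-in; the paper's actual similarity measure between visual signal vectors is defined in [13], and the spectral clustering step [15] is applied to the resulting matrix.

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two binary visual signal vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1, n2 = math.sqrt(sum(v1)), math.sqrt(sum(v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def similarity_matrix(vectors):
    """Build the pairwise similarity matrix over all unique tag paths.
    vectors: {tag_path: binary occurrence vector}."""
    paths = sorted(vectors)
    sim = [[cosine(vectors[a], vectors[b]) for b in paths] for a in paths]
    return paths, sim
```

Tag paths whose occurrence patterns coincide score near 1 and end up clustered together; unrelated tag paths score near 0.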
For example, Figure 1 is an automatically generated Web page that contains live news on a soccer game between Germany and Spain. The soccer game Web page is updated whenever live news is uploaded. Each live news record is rendered using the HTML code template shown in Figure 2. The template corresponds to the unique HTML tag paths 17 through 20 in Figure 3, which is one of the clusters generated by the spectral clustering algorithm. The grouped unique HTML tag paths are then passed to the Web page segmentation algorithm to find the exact boundaries of each live news record.
B. Web Page Segmentation
The Web page segmentation algorithm takes as input a set of HTML tag paths and examines their occurrences to determine the exact boundaries of the Web records. Each occurrence of a unique HTML tag path corresponds to a node in the DOM tree, a tree structure obtained by parsing the HTML document. If a unique HTML tag path A is a prefix of a unique HTML tag path B, then the occurrences of A are ancestors of the occurrences of B in the DOM tree and, hence, correspond to larger pieces of HTML text. In this case, A is an ancestor visual signal of B. A set of unique HTML tag paths corresponds to an HTML template. An occurrence of a tag path maps to a part of a Web record or an entire Web record. A larger piece of HTML text is more likely to cover an entire Web record than a smaller piece of HTML text.
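The ancestor-visual-signal test reduces to a prefix check at tag-path component boundaries, which might be sketched as follows (the function name and path encoding are illustrative assumptions):

```python
def is_ancestor_signal(path_a, path_b):
    """Tag path A is an ancestor visual signal of B when A is a proper
    prefix of B, compared component by component (not character by character,
    so that "html/body/d" is not treated as a prefix of "html/body/dl")."""
    a, b = path_a.split("/"), path_b.split("/")
    return len(a) < len(b) and b[:len(a)] == a
```

In the soccer example, the dl tag path is an ancestor visual signal of the dt and dd tag paths below it, so the dl occurrences are chosen as the record boundaries.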
Figure 4 shows the ancestor and descendant relationships within a set of unique tag paths in the Web page for the soccer game live news. An occurrence of an upper level tag path corresponds to a larger piece of HTML text than a lower level tag path. In this example, the dl node corresponds to the entire news record. All occurrences of the dl nodes following the tag path are extracted; each one is a news record. The results for the soccer game live news Web page are shown in Figure 5.
Fig. 4. Ancestor / descendant relationships within a set of unique HTML tag paths for the soccer game live news Web page.

Fig. 5. Web data record extraction results for the soccer game live news Web page.
The Web data record extraction algorithm has linear time complexity in the length of the Web page. We use this algorithm in our Web Service to extract information from the Web pages.
IV. SYSTEM ARCHITECTURE
Our system enables users to submit Web data record
extraction tasks using the Web Service. Different kinds of
applications on different kinds of devices can access the Web
Service, as shown in Figure 6. Having received tasks from
multiple client applications, the Web Service executes the
Web data record extraction tasks in parallel. Depending on
the Web Service call, the results are returned to the client
application or are uploaded to an Atom server to be published
as a syndication feed for subscribed consumers.
Fig. 6. Use of the Web data record extraction Web Service.
Our Web Service for collaborative Web data record extraction allows multiple users to share results and manage tasks, as shown in Figure 7. A client is identified using a unique clientID and is authorized by password verification. Once a client submits a new task to the Web Service, it becomes the administrative client for that task. The client can authorize other clients to access the results returned by the task or to manage the task. Authorized clients access the Web Service to list the tasks that they have permission to manage and the URLs to the results that they can access. They can use either the Web Service or the Atom server to access the task or the results, respectively. Our Web Service also provides a recommendation facility to allow users to find Web data record extraction tasks of interest to them.
Tasks submitted to the Web Service are executed in parallel for scalability reasons, as shown in Figure 8. When a new Web data record extraction task is submitted, the master computer divides the task up into multiple sub-tasks and puts them into a task queue. Thus, the master computer maintains the list of tasks to be executed and distributes them among a set of worker computers for the purposes of load balancing. The workers retrieve the data resources from the deep Web using user-specified URLs. The extracted Web records are filtered based on the user-defined filtering rules. Final results are gathered at the master computer, and the workers are then ready to take on new tasks. The master computer determines whether to pass back the result set, containing the list of Web records, directly to the calling client or to save it at the Atom server for the client to consume later.
Fig. 7. Collaborative data extraction.
V. IMPLEMENTATION
Our system provides a Web Service interface that allows
clients to access the Web data record extraction service, a
distributed computing platform that performs the Web record
extraction computations in parallel, and a backend database
that stores information for the multiple collaborative users. The data from the deep Web sites are aggregated in a database at the Atom server.
A. Distributed Computing Platform
The distributed computing platform is similar to the CILK system [4] for multithreaded parallel programming, except that CILK is implemented in C whereas our system is implemented in Java. A job is divided up into a number of sub-jobs. A job is finished when all of its sub-jobs are executed. In CILK, there might be dependence relationships between the sub-jobs. For example, if sub-job A takes the output of sub-job B as input, then sub-job A can be executed only when sub-job B has finished. In our system, there are no dependence relationships between sub-jobs, i.e., a Web record extraction task does not depend on any other tasks. Thus, our problem is slightly easier than the general problem addressed by CILK.
We employ the concept of work stealing used by CILK to
avoid multiple workers requesting work from the master at
the same time and, hence, avoid the network communication
bottleneck at the master. The master in our system maintains
the list of jobs that are ready to be executed. Each worker
maintains its own job queue. When its job queue length is
less than a threshold, MinJobs, the worker either requests a
Fig. 8. Distributed computing platform for the Web data record extraction Web Service.
new job from the master with probability p, or steals a job from a randomly picked worker with probability 1 - p. Using this work stealing strategy, the system balances the network bandwidth usage among all of the workers to avoid a burst of requests at the master.
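A worker's replenishment decision might be sketched as follows. The queue representation, the fallback behavior when a chosen source is empty, and the parameter names are illustrative assumptions; in the deployed system the master and peers are remote machines rather than in-process queues.

```python
import random
from collections import deque

def replenish(local, master, peers, min_jobs=2, p=0.5,
              rng=random.random, pick=random.choice):
    """Top up a worker's local job queue: while below the MinJobs threshold,
    request a job from the master with probability p, otherwise steal a job
    from a randomly picked non-empty peer."""
    while len(local) < min_jobs:
        candidates = [q for q in peers if q]          # peers with work to steal
        if (rng() < p or not candidates) and master:
            local.append(master.popleft())            # request from the master
        elif candidates:
            local.append(pick(candidates).pop())      # steal from the victim's tail
        else:
            break                                     # no work anywhere
```

Injecting rng and pick keeps the policy testable; in production both default to the pseudo-random choices described above.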
The master keeps track of the jobs that have been assigned
to the workers until all of them are executed successfully. If a
worker fails to respond to the master within a certain amount
of time, the master marks the worker as dead and puts all
of the tasks assigned to that worker back into the job queue
for execution. In this manner, the system is protected against
failures of the workers. The master can also be protected from
failures by means of a backup server.
B. Backend Database
The backend database stores the user information, the Web
data record extraction task information, and the corresponding
results. The structure of the database is shown in Figure 9.
Fig. 9. Backend database.
The UserAccount table stores the user account information. Each record corresponds to a user account created by a client. The attributes include two mandatory fields, Username and Password, and several optional fields for the user's profile: Age, Interest, Occupation, etc.
Once a new Web data record extraction task is submitted
to the server, the system creates a new entry in the Task
table. TaskID is a unique identifier for the task; Name is
a user-defined attribute that helps to identify the task; and
Description is an attribute that briefly describes the task.
FilterRule is a logic expression that is used to filter the extracted records. For example, Keyword1 AND Keyword2 means that the user wants the set of data records that contain both Keyword1 and Keyword2. The FilterRule attribute is optional. If the FilterRule field is missing, the system returns all of the extracted Web records.
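A minimal evaluator for such filter rules might look like the following. Only flat AND/OR combinations of plain keywords are handled here, with OR binding looser than AND; the system's actual rule grammar may be richer.

```python
def matches(record_text, filter_rule):
    """Evaluate a keyword filter rule such as "Keyword1 AND Keyword2" against
    the text of one extracted Web record, case-insensitively.
    "a AND b OR c" is read as (a AND b) OR c."""
    text = record_text.lower()
    for disjunct in filter_rule.split(" OR "):
        if all(kw.strip().lower() in text for kw in disjunct.split(" AND ")):
            return True
    return False
```

Records for which matches(...) returns False are dropped before the result set is returned or published.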
The TaskSchedule table stores information related to
which task is to be executed and when. A Web data record
extraction task might need to be executed repeatedly. For
example, in the soccer game live news application, the client
wants the results to be updated every few seconds because
news can be posted at any time. According to the user-
specified starting time, ending time and refresh frequency, the
system creates multiple task schedule entries for the same Web
data record extraction task, which is executed repeatedly.
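The expansion of one periodic task into its TaskSchedule rows can be sketched as below; the row layout and field names are illustrative assumptions based on the schema described above.

```python
from datetime import datetime, timedelta

def schedule_entries(task_id, start, end, refresh):
    """Expand a (starting time, ending time, refresh frequency) specification
    into the individual TaskSchedule rows, one per scheduled execution."""
    rows, t = [], start
    while t <= end:
        rows.append({"TaskID": task_id, "ScheduledTime": t})
        t += refresh
    return rows
```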
The User-taskAuthorization table describes the user-to-
task authorization relationships. There are three authorization
levels: 0 means the user can access the results of the task; 1 means the user can manage the task; and 2 means the user is the administrator of the task. The username Public is system-reserved. If the Username attribute of a User-taskAuthorization record is Public, it indicates that the corresponding task is publicly available.
The DataResource table stores the user-specified URLs
for the Web pages that contain the target information. A Web
data extraction task can be associated with multiple URLs and,
hence, multiple DataResource records.
The Result table stores the location of the Web records that
are extracted.
C. Web Service Interface
Our Web Service allows client access in a variety of ways.
The Web Service interface is described using WSDL. A client
accesses the Web Service by sending a SOAP request message
to the Web Service, and the Web Service returns the results
to the client in a SOAP response message. The key operations
provided by the Web Service are the following.
CreateUserAccount: A new client uses this operation to
create a new account by providing a Username, Password,
Interests, etc. in a SOAP request message. The Web
Service creates a new record in the UserAccount table.
The SOAP response message indicates whether or not the
user account is successfully created.
ExtractDataRecords: Using this operation, a client submits a Web data record extraction task to the Web Service. In the SOAP request message, the client provides a Username, Password, URL for the target Web page that contains the list of Web records, and a filtering rule. The Web Service creates records in the User-taskAuthorization table, Task table, TaskResource table and TaskResult table. The Web records extracted from the Web page are filtered using the filtering rule and stored at the Atom server for the user to retrieve at a later time. Meanwhile, a SOAP response message containing the Web records is returned to the client.
PeriodicallyExtractDataRecords: The Web Service supports periodic updating of the Web data record extraction results. In the SOAP request message, the client indicates the data source, starting time, ending time, and requested update frequency. The server creates a new record in the Task table and multiple records in the TaskSchedule table. The scheduled tasks are executed at pre-defined times, and the results are stored at the Atom server as Atom feeds. The client consumes the data by subscribing for the data at the Atom server. The Atom server performs an identification check before authorizing a user to access the Atom feeds. The SOAP response message contains only the TaskID and the URL to the Atom feeds.
AddConsumer / RemoveConsumer: The administrative user of a task can use these operations to authorize / deauthorize other users' access to the results stored at the Atom server. If the user adds / removes the special username Public, the Atom feeds containing the task results will be made publicly available / unavailable. To use this operation, the user must specify Username, Password and TaskID. The SOAP response message indicates whether or not the job is successfully executed.
AddManager / RemoveManager: The administrative user of a task can use these operations to authorize / deauthorize another user as manager of the task.
AddResource / RemoveResource: The administrative user
and all authorized users that can manage the task
use these operations to add / remove URLs to target
Web pages that contain Web records. Similar opera-
tions include UpdateName, UpdateDescription, Update-
Frequency, UpdateEndingtime, etc. The server updates
the corresponding records in the Task table, TaskSchedule
table and TaskResource table.
ListTasks: This operation lists all of the tasks for which
the user is authorized to view the results. The SOAP
request message indicates the Username and Password.
The SOAP response message contains a list of tasks with
descriptions and the URLs to access the resulting Atom
feeds at the Atom server.
RecommendTask / RecommendUser: These operations
help the user to locate the publicly available tasks that
might be of interest to the user, or other users who have
the same interests. We employ PLSI [11] to generate
recommendations. The co-occurrence probability of each
user / task pair is estimated. For each user, we have a
ranked list of tasks that have high probability of co-
occurrence for that user. The top-ranked publicly accessi-
ble tasks are recommended to the user. For recommending
friends to a user, we compare the edit distance of any
pair of user interests. The users with interests closest to
those of a particular user are recommended to the user
as potential friends.
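The friend recommendation step can be sketched as follows, using the classic Levenshtein edit distance over interest strings. The user-record shape and function names are illustrative assumptions; the paper does not fix a particular edit-distance variant.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, via dynamic programming
    over one rolling row of the edit matrix."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def recommend_friends(user, others, k=3):
    """Return the k users whose interest strings are closest to this user's."""
    ranked = sorted(others, key=lambda o: edit_distance(user["interest"], o["interest"]))
    return [o["name"] for o in ranked[:k]]
```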
VI. RELATED WORK
Traditional information search and retrieval services for the Web, such as those provided by the search engines of Google and Yahoo!, consider a Web page as an atomic-level object. A user is led to a Web page even though he / she is interested in only a small part of the content of the Web page. On the other hand, if the information in which the user is interested is located on multiple Web pages, the search engines will not aggregate this information, and users have to access all of the related Web pages manually to obtain a broad view of the available information.
The Web should be considered as a repository of information, rather than as a repository of Web pages. The wide use of information in the Web has driven general-purpose search engines to perform vertical Web search. In a particular domain, Web data records in Web pages are extracted and aggregated together to satisfy users' information needs. Google and Microsoft provide vertical search engines for online shopping, publications, recruiting advertisements, etc.
However, the Web contains information about many different topics; moreover, information from multiple disciplines is often coupled together. It is difficult to divide up the information in the Web into a reasonable number of non-overlapping domains. It is even harder to build a vertical search engine for each such domain. A domain-independent
Web object retrieval service is proposed in [14]. However, to
identify whether or not a Web page contains a set of data
objects is a non-trivial problem.
We address the information search and retrieval problem in the case that users know the data source locations, but it is inconvenient for the users to locate the relevant information by browsing the Web pages themselves. The reasons include the users' precious time, the limitations in network bandwidth, the display area on mobile devices, etc. Our Web Service provides better results for users' queries, by extracting, filtering and / or aggregating data records from the Web pages in a user-defined manner. Different from existing vertical search engines, our Web Service for extraction of Web data records supports collaboration among users. Data extraction results can be shared among multiple users, and multiple users can manage the same Web data record extraction task to provide more complete and relevant data.
VII. CONCLUSION AND FUTURE WORK

This paper has described a Web Service that automatically parses and extracts data records from Web pages containing structured data. The Web Service allows multiple users to share and manage a Web data record extraction task. A recommendation system, based on the Probabilistic Latent Semantic Indexing algorithm, enables a user to find potentially interesting content or other users who have similar interests. A distributed computing platform improves the scalability of the Web Service. A Web Service interface allows users to access the Web Service, and allows programmers to develop their own applications and, thus, extend the functionality of the Web Service.
In future work, we plan to do extensive performance eval-
uation of the Web Service to determine the query rate that
the system supports, with a single server and with multiple
servers using our distributed computing platform, and also
with multiple databases at multiple Web sites. We also plan
to investigate the use of the Web Service in various example
applications, such as those mentioned earlier in the paper.
REFERENCES
[1] A. Arasu and H. Garcia-Molina. Extracting structured data from Web pages. In Proceedings of the 2003 ACM International Conference on the Management of Data, San Diego, CA, June 2003, pp. 337-348.
[2] M. K. Bergman. The deep Web: Surfacing hidden value. Technical report, BrightPlanet LLC, December 2000.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, vol. 3, 2003, pp. 993-1022.
[4] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 1995, pp. 207-216.
[5] K. C. Chang, B. He, C. Li, M. Patel, and Z. Zhang. Structured databases on the Web: Observations and implications. ACM SIGMOD Record, vol. 33, no. 3, 2004, pp. 61-70.
[6] E. H. Chi. Information seeking can be social. Computer, vol. 42, no. 3, March 2009, pp. 42-46.
[7] V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards automatic data extraction from large Web sites. In Proceedings of the 27th International Conference on Very Large Data Bases, Rome, Italy, September 2001, pp. 109-118.
[8] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, vol. 4, 2003, pp. 933-969.
[9] G. Golovchinsky and P. Qvarfordt. Collaborative information seeking. Computer, vol. 42, no. 3, March 2009, pp. 47-51.
[10] J. L. Herlocker, J. A. Konstan, and J. Riedl. Explaining collaborative filtering recommendations. In Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, Philadelphia, PA, December 2000, pp. 241-250.
[11] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM Conference on Research and Development in Information Retrieval, Berkeley, CA, August 1999, pp. 50-57.
[12] B. Liu. Mining data records in Web pages. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, Washington, D.C., August 2003, pp. 601-606.
[13] G. Miao, J. Tatemura, A. Sawires, W. P. Hsiung, and L. E. Moser. Extracting data records from the Web using tag path clustering. In Proceedings of the 18th International World Wide Web Conference, Madrid, Spain, 2009, pp. 981-990.
[14] Z. Nie, Y. Ma, S. Shi, J.-R. Wen, and W.-Y. Ma. Web object retrieval. In Proceedings of the 16th International Conference on the World Wide Web, Banff, Alberta, Canada, May 2007, pp. 81-90.
[15] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, 2000, pp. 888-905.
[16] J. Wang and F. H. Lochovsky. Data extraction and label assignment for Web databases. In Proceedings of the 12th International Conference on the World Wide Web, Budapest, Hungary, May 2003, pp. 187-196.
[17] Y. Zhai and B. Liu. Web data extraction based on partial tree alignment. In Proceedings of the 14th International Conference on the World Wide Web, Chiba, Japan, May 2005, pp. 76-85.
[18] H. Zhao, W. Meng, and C. Yu. Mining templates from search result records of search engines. In Proceedings of the 13th ACM International Conference on Knowledge Discovery and Data Mining, San Jose, CA, August 2007, pp. 884-893.
[19] J. Zhu, Z. Nie, J. Wen, B. Zhang, and W. Ma. Simultaneous record detection and attribute labeling in Web data extraction. In Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, August 2006, pp. 494-503.