
    Collaborative Web Data Record Extraction

    Gengxin Miao, Firat Kart, L. E. Moser, P. M. Melliar-Smith

    Department of Electrical and Computer Engineering

    University of California, Santa Barbara

    Santa Barbara, CA, 93106

    {miao, fkart, moser, pmms}@ece.ucsb.edu

    Abstract: This paper describes a Web Service that automatically parses and extracts data records from Web pages containing structured data. The Web Service allows multiple users to share and manage a Web data record extraction task to increase its utility. A recommendation system, based on the Probabilistic Latent Semantic Indexing algorithm, enables a user to find potentially interesting content or other users who share the same interests. A distributed computing platform improves the scalability of the Web Service in supporting multiple users by employing multiple server computers. A Web Service interface allows users to access the Web Service, and allows programmers to develop their own applications and, thus, extend the functionality of the Web Service.

    Index Terms: collaborative information extraction, data mining, Web Service.

    I. INTRODUCTION

    On the Web, the amount of structured data is 500 times

    greater than the amount of unstructured data. As one of the

    major sources of structured data, the deep Web has been

    estimated to contain more than 450,000 databases [5] in which

    structured data are stored. The structured data in the deep

    Web are continually evolving, and might be updated as often

    as once every second. Deep Web pages can be dynamically

    generated from the data in the deep Web. A single deep Web page typically contains a large number of Web records [12],

    [13], i.e., HTML regions, each of which corresponds to an

    individual data object. When browsing these deep Web pages,

    a user is usually interested in only a small number of data

    objects. The diverse data and the evolving characteristics of

    the deep Web make it difficult for users to locate the data

    objects of interest in a friendly and timely manner.

    There have been extensive studies of fully automatic meth-

    ods to extract data objects from the Web [1], [7]. A typical

    process to extract data objects from a Web page consists of

    three steps. The first step is to identify Web records that

    represent individual data objects (e.g., products). The second

    step is to extract data object attributes (e.g., product names, prices, and images) from the Web records. Corresponding

    attributes in different Web records are aligned, resulting in

    spreadsheet-like data [17], [18]. The final step is the optional

    task (which is very difficult in general) of interpreting aligned

    attributes and assigning appropriate labels [16], [19].

    In this paper we describe a Web Service for extracting Web

    data records from deep Web pages. Attribute alignment is not

    necessary. Approaches such as EXALG [1] and RoadRunner

    [7] are applicable only when there are multiple deep Web

    pages that use the same template, which is not guaranteed

    in our case. SRR [18] is intended for extracting data records

    from Web pages returned by search engines. Because the user-

    specified data source can occur within any domain, we employ

    a domain-independent approach for extracting data records

    from deep Web pages [13].

    The Internet has brought collaboration of individuals to

    a whole new level. Not only can colleagues or friends col-

    laborate with each other, but also individuals can collabo-

    rate without even knowing each other. Wikipedia provides

    a collaborative authoring platform that aggregates individual

    intelligence by allowing any authorized user to modify an

    article on which he/she has knowledge. Facebook allows

    different users to collaborate in developing applications and,

    hence, provides a useful social network service. Support for

    collaborative and social interactions in an information seeking

    system [6], [9] improves the utility of the system by allowing

    users to share information and tasks.

    In this paper we present a Web Service that supports

    collaboration among multiple users, who are interested in the

    same information, e.g., price of a certain product. Different

    users are aware of different data sources that contain relevant information, e.g., different e-commerce Web sites carrying the

    product. By collaboration, the users can obtain more complete

    and relevant information. Our Web Service allows authorized

    users to share the extracted Web data records from the deep

    Web and to manage the Web data record extraction task.

    Collaborative filtering [10] produces recommendations by

    computing the similarities between one user's preferences and

    the preferences of other users. Algorithms for collaborative

    filtering fall into two categories: rank-based algorithms, e.g.,

    RankBoost [8], and probabilistic model-based algorithms, e.g.,

    Latent Dirichlet Allocation (LDA) [3] and Probabilistic Latent

    Semantic Indexing (PLSI) [11]. RankBoost combines

    multiple partial preferences into a unified ranking. PLSI and LDA analyze the co-occurrences of two different types of

    data, e.g., documents and words. With the help of a latent

    semantic layer, PLSI and LDA estimate the joint probability

    of any given pair of documents and words. We employ

    PLSI to produce personal recommendations for the users.

    The extracted Web data records are ranked based on their

    probabilities of co-occurrence for a particular user, and top-

    ranked objects are recommended to the user as they align with

    the user's preferences.
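    The ranking idea can be illustrated with a minimal sketch of PLSI fitted by expectation-maximization on a small user-by-record count matrix. It is not the implementation used in the Web Service: the function names, the toy counts, the number of latent classes, the iteration count and the random seed are all assumptions made for illustration. The model estimates P(z), P(u|z) and P(r|z) for latent class z, user u and record r, and ranks the records that a user has not yet seen by the joint probability P(u, r).

```python
import random


def fit_plsi(counts, n_classes=2, n_iter=50, seed=0):
    """Fit P(z), P(u|z), P(r|z) by EM on a user-by-record count matrix."""
    rng = random.Random(seed)
    n_users, n_records = len(counts), len(counts[0])

    def random_distribution(size):
        weights = [rng.random() + 0.1 for _ in range(size)]
        total = sum(weights)
        return [w / total for w in weights]

    p_z = [1.0 / n_classes] * n_classes
    p_u_z = [random_distribution(n_users) for _ in range(n_classes)]
    p_r_z = [random_distribution(n_records) for _ in range(n_classes)]

    for _ in range(n_iter):
        acc_z = [0.0] * n_classes
        acc_u = [[0.0] * n_users for _ in range(n_classes)]
        acc_r = [[0.0] * n_records for _ in range(n_classes)]
        for u in range(n_users):
            for r in range(n_records):
                if counts[u][r] == 0:
                    continue
                # E-step: posterior P(z | u, r) for this observed pair.
                joint = [p_z[z] * p_u_z[z][u] * p_r_z[z][r]
                         for z in range(n_classes)]
                norm = sum(joint)
                if norm == 0.0:
                    continue
                for z in range(n_classes):
                    weight = counts[u][r] * joint[z] / norm
                    acc_z[z] += weight
                    acc_u[z][u] += weight
                    acc_r[z][r] += weight
        total = sum(acc_z)
        if total == 0.0:
            break  # no observations at all; keep the initial model
        # M-step: re-estimate the distributions from the expected counts.
        for z in range(n_classes):
            if acc_z[z] > 0.0:
                p_u_z[z] = [v / acc_z[z] for v in acc_u[z]]
                p_r_z[z] = [v / acc_z[z] for v in acc_r[z]]
            p_z[z] = acc_z[z] / total
    return p_z, p_u_z, p_r_z


def joint_probability(model, u, r):
    """P(u, r) = sum over z of P(z) P(u|z) P(r|z)."""
    p_z, p_u_z, p_r_z = model
    return sum(p_z[z] * p_u_z[z][u] * p_r_z[z][r] for z in range(len(p_z)))


def recommend(model, counts, u, top_k=2):
    """Rank the records user u has not interacted with by P(u, r)."""
    candidates = [r for r in range(len(counts[u])) if counts[u][r] == 0]
    candidates.sort(key=lambda r: joint_probability(model, u, r), reverse=True)
    return candidates[:top_k]
```

    For example, with the hypothetical counts [[3, 2, 0, 0], [2, 3, 1, 0], [0, 0, 3, 2], [0, 1, 2, 3]], calling recommend(fit_plsi(counts), counts, 0) orders records 2 and 3 for the first user by their estimated co-occurrence probability.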

    2009 IEEE International Conference on Web Services

    978-0-7695-3709-2/09 $25.00 2009 IEEE

    DOI 10.1109/ICWS.2009.109


    Authorized licensed use limited to: MISRIMAL NAVAJEE MUNOTH JAIN ENGG COLLEGE. Downloaded on November 28, 2009 at 21:02 from IEEE Xplore. Restrictions apply.


    II. MOTIVATION

    To motivate our Web Service for collaborative Web data

    record extraction, we present several example scenarios, in

    which a filtered summary of Web data records that match the

    user's interests is produced from a list on a Web page.

    Example 1. Bob is traveling by train across California. Bob

    wants to read some interesting stories from his favorite Web site using his cell phone to fill the time (10 hours) spent

    on the train. However, the Web site is not friendly to small

    hand-held devices. Both the bandwidth and the screen size are

    limited. It takes minutes to download a large Web page onto

    Bob's cell phone, and he needs to scroll both horizontally and

    vertically to locate the interesting content on the Web page.

    Bob completely loses interest after this frustrating browsing

    experience. If there were a Web Service that parses and

    extracts the interesting information from the Web page and

    then sends Bob only the interesting content, his experience

    would be much more satisfying.

    Example 2. Mike is a soccer fan and likes the soccer player

    David Beckham very much. However, he happens to be busy on the day when Beckham plays an important game. Mike

    cannot watch the entire game, but he does not want to miss the

    moment when Beckham scores a goal. He knows a Web site

    that broadcasts the video of the game online and another Web

    site that provides live news on the game in text. Mike wishes

    that there were a Web Service that periodically extracts and

    parses the live news and notifies him when Beckham scores,

    so that he can catch the replay of the goal.

    Example 3. Alice decides to go to Long Beach for Spring

    break with her friends. They start looking for a vacation rental

    on the local forums on the Web one month before their vaca-

    tion. Alice feels exhausted after reading through the vacation

    rental advertisements posted on the various forums, most of

    which turn out to be irrelevant. It takes Alice and her friends

    a lot of precious time reading listings and circulating the

    information among themselves. The process would be greatly

    facilitated if a Web Service were available that automatically

    scans the forums of interest and extracts only the relevant

    rental information for Alice and her friends to share.

    When a user browses a deep Web page, the user typically

    has a particular information need in mind. Only a fraction

    of the Web data records found on a deep Web page match

    the user's information need. Browsing the entire Web page is

    tedious, and can be expensive in time or bandwidth.

    Our Web Service for collaborative Web data record extrac-

    tion returns only the Web data records of interest to the user

    and, thus, increases the efficiency of the deep Web browsing process. In addition, it allows multiple users to share and

    manage the Web data record extraction task to enhance the

    benefits further.

    III. WEB DATA RECORD EXTRACTION

    Our Web Service for Web data record extraction focuses on

    the deep Web for the following reasons:

    • The amount of data in the deep Web is much greater than that on the surface Web.

    • The deep Web is a good source of structured data, which are more suitable for automatic processing than the unstructured data on the surface Web.

    • The dynamic content found in deep Web pages is generally of greater interest than the static content found in static Web pages on the surface Web.

    Fig. 1. An example Web page containing the live news of a soccer game.

    Fig. 2. HTML code template used to render the soccer game live news.

    Our Web Service employs a Web data record extraction

    technique, based on HTML tag path clustering [13], that we

    developed. In an automatically generated deep Web page, the

    Web records, e.g., live news about a soccer game, are rendered

    in visually repeating patterns. First, the Web record extraction

    algorithm identifies the visually repeating part in a Web page.

    Then, a Web page segmentation algorithm looks for the exact

    boundaries of each Web data record.

    A. Finding Visually Repeating Patterns

    In a Web page (HTML document), the visual information is

    conveyed by HTML tag paths. Visually repeating patterns in

    a Web page correspond to the repeated occurrence of HTML

    tag paths in the Web page. A unique HTML tag path might

    have multiple occurrences in a Web page. A set of HTML tag paths that repeatedly occur in the Web page in a similar way

    corresponds to a set of Web records. The occurrence positions

    are indicated using a binary vector, referred to as a visual

    signal vector. Each unique HTML tag path corresponds to a

    visual signal vector.
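    To make the notion of a visual signal vector concrete, the following sketch collects the tag path of every start tag in a page and builds one binary vector per unique tag path. It is an illustration under stated assumptions, not the code of [13]: the position of an occurrence is taken to be the index of its start tag in document order, and the names TagPathCollector and visual_signal_vectors are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Records the root-to-node tag path of every start tag, in order."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.occurrences = []

    def handle_starttag(self, tag, attrs):
        self.occurrences.append("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate unclosed children by popping back to the matching tag.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def visual_signal_vectors(html):
    """Map each unique tag path to a binary vector of its occurrence positions."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    n = len(collector.occurrences)
    vectors = {}
    for position, path in enumerate(collector.occurrences):
        vectors.setdefault(path, [0] * n)[position] = 1
    return vectors
```

    On a page with two dl records, each holding a dt and a dd element, the vectors for html/body/dl, html/body/dl/dt and html/body/dl/dd each contain two ones at regularly spaced positions, which is the kind of repetition the clustering step looks for.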

    By evaluating the similarity between the visual signal

    vectors, we can discover whether two unique HTML tag

    paths have similar repeated occurrence patterns. We construct

    a pairwise similarity matrix, in which each element is the


    Fig. 3. Unique HTML tag paths extracted from the soccer game live news Web page.

    similarity measurement of a pair of visual signal vectors. We

    then apply a spectral clustering algorithm [15] to the similarity

    matrix to discover a set of unique HTML tag paths, i.e.,

    visually repeating patterns on the Web page, which correspond

    to the Web records.
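    The construction of the pairwise similarity matrix can be sketched as follows. Two substitutions are made for brevity and should be read as assumptions: the similarity function below (occurrence-count ratio divided by one plus the mean offset between corresponding occurrences) is an illustrative stand-in, not the measure defined in [13], and a simple threshold grouping by connected components replaces the spectral clustering algorithm [15]. The threshold value and the minimum occurrence count are likewise hypothetical.

```python
def positions(vector):
    """Indices at which a binary visual signal vector is set."""
    return [i for i, bit in enumerate(vector) if bit]


def similarity(a, b):
    """Illustrative similarity: same occurrence count and nearby positions score high."""
    pa, pb = positions(a), positions(b)
    if not pa or not pb:
        return 0.0
    shorter = min(len(pa), len(pb))
    count_ratio = shorter / max(len(pa), len(pb))
    mean_offset = sum(abs(x - y) for x, y in zip(pa, pb)) / shorter
    return count_ratio / (1.0 + mean_offset)


def similarity_matrix(vectors, min_occurrences=2):
    """Pairwise similarities between tag paths that occur repeatedly."""
    paths = sorted(p for p, v in vectors.items() if sum(v) >= min_occurrences)
    matrix = [[similarity(vectors[p], vectors[q]) for q in paths] for p in paths]
    return paths, matrix


def group_paths(paths, matrix, threshold=0.3):
    """Connected components of the graph whose edges have similarity >= threshold."""
    seen, groups = set(), []
    for start in range(len(paths)):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            i = stack.pop()
            group.append(paths[i])
            for j in range(len(paths)):
                if j not in seen and matrix[i][j] >= threshold:
                    seen.add(j)
                    stack.append(j)
        groups.append(sorted(group))
    return groups
```

    Each resulting group is a candidate set of tag paths that repeat together; in the full method, the clusters produced by spectral clustering play this role.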

    For example, Figure 1 is an automatically generated Web

    page that contains live news on a soccer game between

    Germany and Spain. The soccer game Web page is updated

    whenever live news is uploaded. Each live news record is

    rendered using the HTML code template shown in Figure 2. The template corresponds to the unique HTML tag paths

    17 through 20 in Figure 3, which is one of the clusters

    generated by the spectral clustering algorithm. The grouped

    unique HTML tag paths are then passed to the Web page

    segmentation algorithm to find the exact boundaries of each

    live news record.

    B. Web Page Segmentation

    The Web page segmentation algorithm takes as input a set of

    HTML tag paths and examines their occurrences to determine

    the exact boundaries of the Web records. Each occurrence

    of a unique HTML tag path corresponds to a node in the

    DOM tree, a tree structure obtained by parsing the HTML document. If a unique HTML tag path A is a prefix of a

    unique HTML tag path B, then the occurrences of A are

    ancestors of the occurrences of B in the DOM tree and, hence,

    correspond to larger pieces of HTML text. In this case, A is

    an ancestor visual signal of B. A set of unique HTML tag

    paths corresponds to an HTML template. An occurrence of a

    tag path maps to a part of a Web record or an entire Web

    record. A larger piece of HTML text is more likely to cover

    an entire Web record than a smaller piece of HTML text.

    Figure 4 shows the ancestor and descendant relationships

    within a set of unique tag paths in the Web page for the soccer

    game live news. An occurrence of an upper level tag path

    corresponds to a larger piece of HTML text than a lower level

    tag path. In this example, the dl node corresponds to the entire

    news record. All occurrences of the dl nodes following the tag

    path are extracted; each one is a news record. The results for the soccer game live news Web page are shown in Figure 5.
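    The prefix rule described above can be sketched in a few lines: within a cluster of tag paths, the path that is a prefix of all the others is the ancestor visual signal, and each of its occurrences is taken as one record. This is a simplified illustration, not the segmentation algorithm of [13]; it assumes well-formed markup so that the standard xml.etree.ElementTree parser can stand in for an HTML DOM parser, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_ancestor_path(a, b):
    """True if tag path a is a proper prefix of tag path b, component by component."""
    pa, pb = a.split("/"), b.split("/")
    return len(pa) < len(pb) and pb[:len(pa)] == pa


def record_root_path(cluster):
    """Return the path that is an ancestor of every other path in the cluster, if any."""
    for candidate in sorted(cluster, key=lambda p: p.count("/")):
        if all(candidate == other or is_ancestor_path(candidate, other)
               for other in cluster):
            return candidate
    return None


def extract_records(markup, cluster):
    """Return the text of every node that occurs at the cluster's root path."""
    root = ET.fromstring(markup)
    root_path = record_root_path(cluster)
    if root_path is None:
        return []
    parts = root_path.split("/")
    if parts[0] != root.tag:
        return []
    nodes = [root] if len(parts) == 1 else root.findall("/".join(parts[1:]))
    return [" ".join(t.strip() for t in node.itertext() if t.strip())
            for node in nodes]
```

    For a cluster containing html/body/dl, html/body/dl/dt and html/body/dl/dd, the root path is html/body/dl, so each dl node is returned as one record, in line with the soccer game example. Real deep Web pages are rarely well-formed XML, so a tolerant HTML parser would be needed in practice.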

    Fig. 4. Ancestor / descendant relationships within a set of unique HTML tag paths for the soccer game live news Web page.

    Fig. 5. Web data record extraction results for the soccer game live news Web page.

    The Web data record extraction algorithm has linear time complexity in the length of the Web page. We use this

    algorithm in our Web Service to extract information from the

    Web pages.

    IV. SYSTEM ARCHITECTURE

    Our system enables users to submit Web data record

    extraction tasks using the Web Service. Different kinds of

    applications on different kinds of devices can access the Web

    Service, as shown in Figure 6. Having received tasks from


    multiple client applications, the Web Service executes the

    Web data record extraction tasks in parallel. Depending on

    the Web Service call, the results are returned to the client

    application or are uploaded to an Atom server to be published

    as a syndication feed for subscribed consumers.

    Fig. 6. Use of the Web data record extraction Web Service.

    Our Web Service for collaborative Web data record extrac-

    tion allows multiple users to share results and manage tasks,

    as shown in Figure 7. A client is identified using a unique

    clientID and is authorized by password verification. Once a

    client submits a new task to the Web Service, it becomes the

    administrative client for that task. The client can authorize

    other clients to access the results returned by the task or to manage the task. Authorized clients access the Web Service

    to list tasks that they have permission to manage and URLs to

    the results that they can access. They can use either the Web

    Service or the Atom server to access the task or the results,

    respectively. Our Web Service also provides a recommendation

    facility to allow users to find Web data record extraction tasks

    of interest to them.

    Tasks submitted to the Web Service are executed in parallel

    for scalability reasons, as shown in Figure 8. When a new Web

    data record extraction task is submitted, the master computer

    divides the task up into multiple sub-tasks and puts them

    into a task queue. Thus, the master computer maintains the

    list of tasks to be executed and distributes them among a set of worker computers for the purposes of load balancing.

    The workers retrieve the data resources from the deep Web

    using user-specified URLs. The extracted Web records are

    filtered based on the user-defined filtering rules. Final results

    are gathered at the master computer, and the workers are then

    ready to take on new tasks. The master computer determines

    whether to pass back the result set, containing the list of Web

    records, directly to the calling client or to save it at the Atom

    server for the client to consume later.

    Fig. 7. Collaborative data extraction.

    V. IMPLEMENTATION

    Our system provides a Web Service interface that allows

    clients to access the Web data record extraction service, a

    distributed computing platform that performs the Web record

    extraction computations in parallel, and a backend database

    that stores information for the multiple collaborative users. The

    data from the deep Web sites are aggregated in a database at the Atom server.

    A. Distributed Computing Platform

    The distributed computing platform is similar to the CILK

    system [4] for multithreaded parallel programming, except

    that CILK is implemented in C++ whereas our system is

    implemented in Java. A job is divided up into a number of sub-jobs.

    A job is finished when all of its sub-jobs are executed. In

    CILK, there might be dependence relationships between the

    sub-jobs. For example, if sub-job A takes the output of sub-job

    B as input, then sub-job A can be executed only when sub-

    job B has finished. In our system, there are no dependence

    relationships between sub-jobs, i.e., a Web record extraction task does not depend on any other tasks. Thus, our problem is

    slightly easier than the general problem addressed by CILK.

    We employ the concept of work stealing used by CILK to

    avoid multiple workers requesting work from the master at

    the same time and, hence, avoid the network communication

    bottleneck at the master. The master in our system maintains

    the list of jobs that are ready to be executed. Each worker

    maintains its own job queue. When its job queue length is

    less than a threshold, MinJobs, the worker either requests a


    Fig. 8. Distributed computing platform for the Web data record extraction Web Service.

    new job from the master with probability p, or steals a job

    from a randomly picked worker with probability 1 - p. Using this work stealing strategy, the system balances the network

    bandwidth usage among all of the workers to avoid a burst of

    requests at the master.
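    The refill decision made by a worker can be sketched as below. The paper's system is implemented in Java; this Python fragment only illustrates the rule, and several details are assumptions: the names Worker and refill, stealing from the tail of the victim's queue, and returning without a job when the chosen source is empty.

```python
import random
from collections import deque


class Worker:
    def __init__(self, name):
        self.name = name
        self.queue = deque()


def refill(worker, master_queue, workers, min_jobs, p, rng):
    """Try once to top up worker.queue; return 'master', 'steal' or None."""
    if len(worker.queue) >= min_jobs:
        return None  # queue is long enough, nothing to do
    if rng.random() < p:
        # With probability p, request a new job from the master.
        if master_queue:
            worker.queue.append(master_queue.popleft())
            return "master"
        return None
    # With probability 1 - p, steal from a randomly picked other worker.
    victims = [w for w in workers if w is not worker]
    if not victims:
        return None
    victim = rng.choice(victims)
    if victim.queue:
        worker.queue.append(victim.queue.pop())
        return "steal"
    return None
```

    Setting p closer to 1 sends more requests to the master, whereas a smaller p spreads the traffic among the workers at the cost of occasionally probing a worker that has nothing to give.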

    The master keeps track of the jobs that have been assigned

    to the workers until all of them are executed successfully. If a

    worker fails to respond to the master within a certain amount

    of time, the master marks the worker as dead and puts all

    of the tasks assigned to that worker back into the job queue

    for execution. In this manner, the system is protected against

    failures of the workers. The master can also be protected from

    failures by means of a backup server.
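    A minimal sketch of this bookkeeping is given below, assuming that each worker reports to the master with a timestamp and that a single timeout value applies to all workers. The class and method names are hypothetical, and timestamps are passed in explicitly so that the behavior is easy to follow.

```python
class Master:
    """Tracks assigned jobs and re-queues those held by unresponsive workers."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.job_queue = []
        self.assigned = {}   # worker -> jobs currently assigned to it
        self.last_seen = {}  # worker -> time of its last message

    def submit(self, job):
        self.job_queue.append(job)

    def assign(self, worker, now):
        if not self.job_queue:
            return None
        job = self.job_queue.pop(0)
        self.assigned.setdefault(worker, []).append(job)
        self.last_seen[worker] = now
        return job

    def heartbeat(self, worker, now):
        self.last_seen[worker] = now

    def complete(self, worker, job, now):
        self.assigned[worker].remove(job)
        self.last_seen[worker] = now

    def requeue_dead_workers(self, now):
        """Mark silent workers as dead and put their jobs back into the queue."""
        dead = [w for w, seen in self.last_seen.items()
                if now - seen > self.timeout]
        for worker in dead:
            self.job_queue.extend(self.assigned.pop(worker, []))
            del self.last_seen[worker]
        return dead
```

    Because the sub-jobs are independent of one another, re-executing a job that a failed worker had partially processed does not affect the other jobs.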

    B. Backend Database

    The backend database stores the user information, the Web

    data record extraction task information, and the corresponding

    results. The structure of the database is shown in Figure 9.

    Fig. 9. Backend database.

    The UserAccount table stores the user account information.

    Each record corresponds to a user account created by a

    client. The attributes include two mandatory fields, Username

    and Password, and several optional fields for the user's

    profile, e.g., Age, Interest, and Occupation.

    Once a new Web data record extraction task is submitted

    to the server, the system creates a new entry in the Task

    table. TaskID is a unique identifier for the task; Name is

    a user-defined attribute that helps to identify the task; and

    Description is an attribute that briefly describes the task.

    FilterRule is a logical expression that is used to filter the

    extracted records. For example, "Keyword1 AND Keyword2"

    means that the user wants the set of data records that contain

    both Keyword1 and Keyword2. The FilterRule attribute is

    optional. If the FilterRule field is missing, the system returns

    all of the extracted Web records.
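    As an illustration of how such a rule could be applied, the sketch below evaluates a FilterRule against the text of each record. The full rule grammar is not specified here, so the sketch assumes a hypothetical grammar: keywords joined by AND and OR, with AND binding more tightly, no parentheses, and case-insensitive substring matching.

```python
def matches(rule, text):
    """Return True if the record text satisfies the rule (or if no rule is given)."""
    if not rule or not rule.strip():
        return True
    lowered = text.lower()
    # The rule is read as an OR of AND-groups.
    for group in rule.split(" OR "):
        keywords = [k.strip().lower() for k in group.split(" AND ")]
        if all(k and k in lowered for k in keywords):
            return True
    return False


def filter_records(records, rule=None):
    """Keep only the records that satisfy the rule."""
    return [record for record in records if matches(rule, record)]
```

    With the hypothetical records "Beckham scores a goal", "Yellow card for Ramos" and "Beckham substituted", the rule "Beckham AND goal" keeps only the first record, and a missing rule keeps all three.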

    The TaskSchedule table stores information related to

    which task is to be executed and when. A Web data record

    extraction task might need to be executed repeatedly. For

    example, in the soccer game live news application, the client

    wants the results to be updated every few seconds because

    news can be posted at any time. According to the user-

    specified starting time, ending time and refresh frequency, the

    system creates multiple task schedule entries for the same Web

    data record extraction task, which is executed repeatedly.
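    The expansion of one periodic task into several schedule entries can be sketched as follows. The column name RunAt and the choice to include an entry that falls exactly on the ending time are assumptions for illustration; only TaskID is named in the table description above.

```python
from datetime import datetime, timedelta


def schedule_entries(task_id, start, end, interval):
    """Create one schedule entry per execution time between start and end."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    entries = []
    run_at = start
    while run_at <= end:
        entries.append({"TaskID": task_id, "RunAt": run_at})
        run_at += interval
    return entries
```

    For instance, a task that starts at 20:00:00, ends at 20:01:00 and refreshes every 15 seconds yields five entries under these assumptions.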

    The User-taskAuthorization table describes the user-to-

    task authorization relationships. There are three authorization

    levels: 0 means the user can access the results of the task;

    1 means the user can manage the task; and 2 means the

    user is the administrator of the task. The Username "Public"

    is system-reserved. If the Username attribute of a User-taskAuthorization

    record is "Public", it indicates that the

    corresponding task is publicly available.
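    A permission check over this table might look like the sketch below. It assumes that the levels are cumulative (a manager can also access results, and an administrator can also manage), which the description above suggests but does not state; the row layout (Username, TaskID, Level) and the function names are hypothetical.

```python
ACCESS_RESULTS, MANAGE_TASK, ADMINISTER_TASK = 0, 1, 2
PUBLIC = "Public"  # system-reserved username


def authorization_level(rows, username, task_id):
    """Highest level granted to the user for the task, directly or via Public."""
    levels = [level for (user, task, level) in rows
              if task == task_id and user in (username, PUBLIC)]
    return max(levels) if levels else None


def can_access_results(rows, username, task_id):
    return authorization_level(rows, username, task_id) is not None


def can_manage(rows, username, task_id):
    level = authorization_level(rows, username, task_id)
    return level is not None and level >= MANAGE_TASK


def is_administrator(rows, username, task_id):
    return authorization_level(rows, username, task_id) == ADMINISTER_TASK
```

    With rows ("alice", 7, 2), ("bob", 7, 0) and ("Public", 8, 0), any user can access the results of task 8, whereas only alice can manage task 7.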


    The DataResource table stores the user-specified URLs

    for the Web pages that contain the target information. A Web

    data extraction task can be associated with multiple URLs and,

    hence, multiple DataResource records.

    The Result table stores the location of the Web records that

    are extracted.

    C. Web Service Interface

    Our Web Service allows client access in a variety of ways.

    The Web Service interface is described using WSDL. A client

    accesses the Web Service by sending a SOAP request message

    to the Web Service, and the Web Service returns the results

    to the client in a SOAP response message. The key operations

    provided by the Web Service are the following.

    CreateUserAccount: A new client uses this operation to

    create a new account by providing a Username, Password,

    Interests, etc. in a SOAP request message. The Web

    Service creates a new record in the UserAccount table.

    The SOAP response message indicates whether or not the

    user account is successfully created.

    ExtractDataRecords: Using this operation, a client submits

    a Web data record extraction task to the Web

    Service. In the SOAP request message, the client provides

    a Username, Password, URL for the target Web page

    that contains the list of Web records, and a filtering

    rule. The Web Service creates records in the User-

    taskAuthorization table, Task table, TaskResource table

    and TaskResult table. The Web records extracted from the

    Web page are filtered using the filtering rule and stored

    at the Atom server for the user to retrieve at a later time.

    Meanwhile, the SOAP response message containing the

    Web record is returned to the client.

    PeriodicallyExtractDataRecords: The Web Service supports

    periodic updating of the Web data record extraction results. In the SOAP request message, the client indicates

    the data source, starting time, ending time, and requested

    update frequency. The server creates a new record in the

    Task table and multiple records in the TaskSchedule table.

    The scheduled tasks are executed at pre-defined times,

    and the results are stored at the Atom server as Atom

    feeds. The client consumes the data by subscribing for

    the data at the Atom server. The Atom server performs

    an identification check before authorizing a user to access

    the Atom feeds. The SOAP response message contains

    only the TaskID and the URL to the Atom feeds.

    AddConsumer / RemoveConsumer: The administrative

    user of a task can use these operations to authorize / deauthorize other users' access to the results stored at

    the Atom server. If the user adds / removes the special

    username "Public", the Atom feeds containing the task

    results will be made publicly available / unavailable. To

    use this operation, the user must specify Username, Pass-

    word and TaskID. The SOAP response message indicates

    whether or not the job is successfully executed.

    AddManager / RemoveManager: The administrative user

    of a task can use these operations to authorize / deautho-

    rize another user as manager of the task.

    AddResource / RemoveResource: The administrative user

    and all authorized users that can manage the task

    use these operations to add / remove URLs to target

    Web pages that contain Web records. Similar opera-

    tions include UpdateName, UpdateDescription, Update-

    Frequency, UpdateEndingtime, etc. The server updates

    the corresponding records in the Task table, TaskSchedule

    table and TaskResource table.

    ListTasks: This operation lists all of the tasks for which

    the user is authorized to view the results. The SOAP

    request message indicates the Username and Password.

    The SOAP response message contains a list of tasks with

    descriptions and the URLs to access the resulting Atom

    feeds at the Atom server.

    RecommendTask / RecommendUser: These operations

    help the user to locate the publicly available tasks that

    might be of interest to the user, or other users who have

    the same interests. We employ PLSI [11] to generate

    recommendations. The co-occurrence probability of each

    user / task pair is estimated. For each user, we have a

    ranked list of tasks that have high probability of co-

    occurrence for that user. The top-ranked publicly accessi-

    ble tasks are recommended to the user. For recommending

    friends to a user, we compare the edit distance of any

    pair of user interests. The users with interests closest to

    those of a particular user are recommended to the user

    as potential friends.
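    The friend recommendation step can be illustrated with a standard Levenshtein edit distance over the users' interest strings. How the interest field is encoded is not specified above, so the sketch assumes one free-text string per user; the names edit_distance and recommend_users and the sample data are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, by dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def recommend_users(interests, username, top_k=3):
    """Return the users whose interest strings are closest to the given user's."""
    own = interests[username]
    others = [name for name in interests if name != username]
    others.sort(key=lambda name: (edit_distance(own, interests[name]), name))
    return others[:top_k]
```

    For example, given the interests {"ann": "soccer news", "ben": "soccer newz", "cat": "vacation rentals"}, the closest user to ann is ben, whose interest string differs by a single character.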

    VI. RELATED WORK

    Traditional information search and retrieval services for the

    Web, such as those provided by the search engines of Google and Yahoo!, consider a Web page as an atomic-level object. A

    user is led to a Web page even though he / she is interested

    in only a small part of the content of the Web page. On the

    other hand, if the information in which the user is interested

    is located on multiple Web pages, the search engines will not

    aggregate this information, and users have to access all of the

    related Web pages manually to obtain a broad view of the

    available information.

    The Web should be considered as a repository of informa-

    tion, rather than as a repository of Web pages. The wide use

    of information in the Web has driven general-purpose search

    engines to perform vertical Web search. In a particular domain,

    Web data records in Web pages are extracted and aggregated together to satisfy users' information needs. Google and Microsoft

    provide vertical search engines for online shopping,

    publications, recruiting advertisements, etc.

    However, the Web contains information about a great many

    different topics; moreover, the information is coupled together

    for multi-disciplinary fields. It is difficult to divide up the

    information in the Web into a reasonable number of non-

    overlapping domains. It is even harder to build a vertical

    search engine for each such domain. A domain-independent


    Web object retrieval service is proposed in [14]. However, to

    identify whether or not a Web page contains a set of data

    objects is a non-trivial problem.

    We address the information search and retrieval problem in

    the case that users know the data source locations, but it is

    inconvenient for the users to locate the relevant information

    by browsing the Web pages themselves. The reasons are the users' precious time, the limitations in network bandwidth, the

    display area on mobile devices, etc. Our Web Service provides

    better results for users' queries, by extracting, filtering and / or

    aggregating data records from the Web pages in a user-defined

    manner. Different from existing vertical search engines, our

    Web Service for extraction of Web data records supports

    collaboration among users. Data extraction results can be

    shared among multiple users, and multiple users can manage

    the same Web data record extraction task to provide more

    complete and relevant data.

    VII. CONCLUSION AND FUTURE WORK

    This paper has described a Web Service that automatically parses and extracts data records from Web pages containing

    structured data. The Web Service allows multiple users to

    share and manage a Web data record extraction task. A

    recommendation system, based on the Probabilistic Latent

    Semantic Indexing algorithm, enables a user to find potentially

    interesting content or other users who have similar interests.

    A distributed computing platform improves the scalability of

    the Web Service. A Web Service interface allows users to

    access the Web Service, and allows programmers to develop

    their own applications and, thus, extend the functionality of

    the Web Service.

    In future work, we plan to do extensive performance eval-

    uation of the Web Service to determine the query rate that

    the system supports, with a single server and with multiple

    servers using our distributed computing platform, and also

    with multiple databases at multiple Web sites. We also plan

    to investigate the use of the Web Service in various example

    applications, such as those mentioned earlier in the paper.

    REFERENCES

    [1] A. Arasu and H. Garcia-Molina. Extracting structured data from Web pages. In Proceedings of the 2003 ACM International Conference on the Management of Data, June 2003, San Diego, CA, pp. 337-348.

    [2] M. K. Bergman. The deep Web: Surfacing hidden value. Technical report, BrightPlanet LLC, December 2000.

    [3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, vol. 3, 2003, pp. 993-1022.

    [4] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 1995, pp. 207-216.

    [5] K. C. Chang, B. He, C. Li, M. Patel, and Z. Zhang. Structured databases on the Web: Observations and implications. ACM SIGMOD Record, vol. 33, no. 3, 2004, pp. 61-70.

    [6] E. H. Chi. Information seeking can be social. Computer, vol. 42, no. 3, March 2009, pp. 42-46.

    [7] V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards automatic data extraction from large Web sites. In Proceedings of the 27th International Conference on Very Large Data Bases, September 2001, Rome, Italy, pp. 109-118.

    [8] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, vol. 4, 2003, pp. 933-969.

    [9] G. Golovchinsky and P. Qvarfordt. Collaborative information seeking. Computer, vol. 42, no. 3, March 2009, pp. 47-51.

    [10] J. L. Herlocker, J. A. Konstan, and J. Riedl. Explaining collaborative filtering recommendations. In Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, Philadelphia, PA, December 2000, pp. 241-250.

    [11] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM Conference on Research and Development in Information Retrieval, Berkeley, CA, August 1999, pp. 50-57.

    [12] B. Liu. Mining data records in Web pages. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, Washington, D.C., August 2003, pp. 601-606.

    [13] G. Miao, J. Tatemura, A. Sawires, W. P. Hsiu, and L. E. Moser. Extracting data records from the Web using tag path clustering. In Proceedings of the 18th International World Wide Web Conference, Madrid, Spain, 2009, pp. 981-990.

    [14] Z. Nie, Y. Ma, S. Shi, J.-R. Wen, and W.-Y. Ma. Web object retrieval. In Proceedings of the 16th International Conference on the World Wide Web, Banff, Alberta, Canada, May 2007, pp. 81-90.

    [15] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, 2000, pp. 888-905.

    [16] J. Wang and F. H. Lochovsky. Data extraction and label assignment for Web databases. In Proceedings of the 12th International Conference on the World Wide Web, Budapest, Hungary, May 2003, pp. 187-196.

    [17] Y. Zhai and B. Liu. Web data extraction based on partial tree alignment. In Proceedings of the 14th International Conference on the World Wide Web, Chiba, Japan, May 2005, pp. 76-85.

    [18] H. Zhao, W. Meng, and C. Yu. Mining templates from search result records of search engines. In Proceedings of the 13th ACM International Conference on Knowledge Discovery and Data Mining, San Jose, CA, August 2007, pp. 884-893.

    [19] J. Zhu, Z. Nie, J. Wen, B. Zhang, and W. Ma. Simultaneous record detection and attribute labeling in Web data extraction. In Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, August 2006, pp. 494-503.
