Searching the Web Using Screenshots

Tom Yeh, Brandyn White, Larry Davis
University of Maryland
College Park, MD, USA

{tomyeh, brandyn, lsd}@umd.edu

Boris Katz
MIT

Cambridge, MA, USA

[email protected]

ABSTRACT

Many online articles contain useful know-how knowledge about GUI applications. Even though these articles tend to be richly illustrated by screenshots, no system has been designed to take advantage of these screenshots to visually search know-how articles effectively. In this paper, we present a novel system to index and search software know-how articles that leverages the visual correspondences between screenshots. To retrieve articles about an application, users can take a screenshot of the application to query the system and retrieve a list of articles containing a matching screenshot. Useful snippets such as captions, references, and nearby text are automatically extracted from the retrieved articles and shown alongside the thumbnails of the matching screenshots as excerpts for relevancy judgement. Retrieved articles are ranked by a comprehensive set of visual, textual, and site features, whose weights are learned by RankSVM. Our prototype system currently contains 150k articles that are classified into walkthrough, book, gallery, and general categories. We demonstrated the system's ability to retrieve matching screenshots for a wide variety of programs, across language boundaries, and to provide subjectively more useful results than keyword-based web and image search engines.

Categories and Subject Descriptors

H.2.8 [Database Applications]: Image Databases; I.4 [Computing Methodologies]: Image Processing and Computer Vision

General Terms

Measurement, Human Factors

Keywords

Image Search; Help

1. INTRODUCTION

Many online articles contain useful know-how knowledge about GUI applications. Users can learn how to perform a wide variety of tasks, such as how to set up a home network, back up files, or change the speed of the mouse cursor. These resources are often created and made available online by software developers who wish to maintain an online version of the documentation in order to keep the content up-to-date (e.g., support.microsoft.com). These resources are also created by unofficial, third-party experts, for example, by sites offering tutorials and tips on various software applications (e.g., osxfaq.com), by general-purpose how-to sites (e.g., eHow.com) featuring software tutorials as one of the topics, and by computer book publishers who wish to make their books accessible online via subscription (e.g., safaribooksonline.com). Even more resources can be found in user-generated content, for example, in blogs where bloggers share their experiences and tips using a software application, in discussion boards where people can discuss and learn from each other about software, and in QA communities such as Yahoo Answers where members can raise questions and get answers back from other members.

However, for some users, searching this valuable resource effectively can be challenging. For example, suppose a user opens up the network properties dialog window and wishes to find out how to change the IP address. To use a search engine, this user may first type “change IP address” to describe the task he wishes to learn. He may soon realize it is also necessary to indicate which dialog window he wishes to perform the task with. To do so, he may enter additional search terms to describe the operating system, the title of the dialog window, and any other information needed to identify the dialog window. Not only is it cumbersome to enter many keywords, but these keywords are also prone to ambiguity since it is hard to distinguish the keywords describing the program from those describing the task. As the user browses the links in the result, he may find it difficult to judge relevancy based only on the text excerpts. Moreover, with the emphasis on second language education and the availability of free and powerful machine translation technology (e.g., Google Translate), many Web users have begun to practice bilingual search in order to access more information beyond the confines of their native language. However, technical articles are often inaccessible to bilingual search. For example, a Chinese-speaking person fluent in English may be able to retrieve useful articles on dogs in both languages using the English or Chinese word for dog as the search term. But when it comes to technical terms such as system preferences, the same person may be limited to only Chinese articles as it may not be obvious how to translate these terms into English.

In this paper, we present a novel system for indexing and searching online know-how articles about computer applications (Figure 1). These articles tend to be richly illustrated by screenshots of the applications discussed in the text. Our proposed system leverages this unique property and adopts a multi-modal approach to indexing and searching these articles based on both visual and textual relevance. It enables computer users to submit a screenshot of an application as the search term and to retrieve a list of relevant articles containing the screenshots of the same application. In addition, it allows users to optionally specify a few keywords to search within these visually relevant articles for a subset regarding a specific topic.

Figure 1: A typical result page returned by our proposed search engine that uses screenshots to search for articles about computer programs (please visit authors' homepage to see more examples)

Our approach offers several advantages. In terms of usability, this approach is direct and intuitive. Users only need to capture a screenshot of an application to form a query, which is less cumbersome than entering several keywords to describe the application. In terms of the separation between context and topic, this approach is unambiguous since each search modality serves a well-defined role (i.e., screenshot → context, keywords → topic). In terms of relevancy judgement, this approach is cognitively simple, reducing the judgement task from reading text snippets to viewing images. Finally, in terms of internationalization, this approach is language independent since it speaks the universal language of images and does not require users to translate technical terms.

This paper reports on the two major contributions made by the current research:

• Proposed a multi-modal search engine for computer know-how articles.

• Built and evaluated a prototype system with more than 150K unique pages.

In the rest of the paper, we review related work (Section 2), describe how we built the database for our proposed system (Section 3), explain how the database can be searched (Section 4), and describe how the system was evaluated (Section 5).

2. RELATED WORK

There has been growing research interest in the problem of indexing, searching, and organizing images on the web. Many classic problems originating from text search have been revisited in the new context of images. For example, Jing and Baluja [8] tackled the ranking problem in the context of searching product images using PageRank. Kennedy and Naaman [10] and Van Leuken et al. [21] both examined the problem of result diversification in the context of image search; the former considered the specialized case of landmark images, whereas the latter considered a more general case involving a wider range of topics such as animals and cars. Lempel and Soffer [12] dealt with the problem of authority identification for web images by analyzing the link structures of the source pages. Mehta et al. [16] presented a solution to the problem of spam detection for web images; they analyzed visual features and looked for duplicate images in suspiciously large quantities as potential spam. Li et al. [13] considered the problem of relevance judgement and demonstrated the ability of image excerpts (dominant image + snippets) to help users judge results faster. Crandall et al. [2] took on the challenge of organizing a large dataset in the setting of tens of millions of geotagged images, taking a comprehensive approach involving visual, textual, and temporal features. Similar to these existing works, the current work relies on several state-of-the-art computer algorithms in order to index and search screenshots effectively. But unlike them, the current work finds similar images only as an intermediate step toward the ultimate goal of returning relevant text to the users.

In response to accelerated globalization, there have been research efforts dedicated to making the Web more accessible to international users independent of the languages they speak. For example, Scholl et al. [19] proposed a method to search the Web by genres such as blog and forum in a language-independent manner. Ni et al. [18] explored ways to analyze and organize Web information written in different languages by mining multilingual topics from Wikipedia. Tanaka-Ishii and Nakagawa [20] developed a tool for language learners to perform multilingual search to find usage of foreign languages. In relation to these research efforts, the current work offers yet another interesting and promising possibility: searching based on the universal visual language of screenshots.

In terms of the application domain, the work closest to ours is that of Medhi et al. [15], who examined the optimal way to present audio-visual know-how knowledge to novice computer users. In terms of search modality, there have been several previous works aimed at providing users with multiple search modalities instead of just keywords. For example, Narayan et al. [17] developed a multi-modal mobile interface combining speech and text for accessing web information through a personalized dialog. Dowman et al. [5] dealt with the problem of indexing and searching radio and television news using both speech and text. In the current work, we explore the potential of the particular modality pair of image and text for searching computer-related articles.

In the current work, we used crowd-sourcing to train and evaluate the proposed system. We used the crowd-sourcing services brokered by Amazon Mechanical Turk (AMT)1, a vibrant micro-task market where people can be recruited to perform simple tasks for small monetary awards. AMT has been applied in various research contexts for the purpose of training and evaluation. Kittur et al. [11] reported success in recruiting workers from AMT to rate Wikipedia articles and achieved quality comparable to that of expert raters. Liu et al. [14] used AMT to evaluate their algorithm for predicting satisfaction with answers in online QA communities. Vijayanarasimhan and Grauman [22] used AMT to label images for learning visual classifiers. AMT offers cost-effectiveness and the ability to cover wider demographics than lab-based studies. However, as identified by Kittur et al. [11], there are certain risks associated with this method, such as getting dishonest answers, which can be prevented by formulating the problem in such a way that the optimal strategy for workers is to perform the task in an honest manner. Consequently, in all of our applications of AMT, we built in validation mechanisms to control quality.

3. BUILDING THE DATABASE

This section describes the process of building the database for our proposed search engine. This process involves collecting images to populate the database (Section 3.1), indexing the collected images by visual features (Section 3.2), extracting useful snippets from the articles containing these images (Section 3.3), and classifying articles into categories (Section 3.4).

1 www.mturk.com

3.1 Collecting images

We used three methods to collect screenshot images to populate our database. Currently, our prototype system contains more than 150k images in its index.

First, we submitted computer-related keywords to Bing Images2 to collect screenshot images of interactive programs. To increase the likelihood of obtaining the desired images, we sampled keywords from title bars of the dialog windows of various computer programs. Some examples of these keywords are properties, preferences, option, settings, wizard, custom, installation, network, sound, keyboard, etc. We turned on the filter feature to keep only illustrations and graphics, rejecting obviously non-screenshot images such as images of faces and natural scenes. Using this method, we collected approximately 100k images.

Second, we used TinEye3, a reverse image search engine that can take an image as the query and return a list of URLs to nearly identical copies of the image found on the Web; it is designed primarily for copyright infringement detection. We manually captured screenshot images of more than 300 interactive windows of popular programs across three of the most popular OS platforms (XP, Vista, and Mac OS). These images were submitted to TinEye to obtain about 5,000 images.

Third, we collected a library of 102 electronic books about popular software programs. We extracted all the image figures embedded in the electronic files (i.e., PDF documents). This method produced about 50k images.
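The paper does not name the tool used to pull figures out of the PDF books; as one illustration, a minimal sketch using the PyMuPDF library (an assumption on our part, not the authors' pipeline) could look like this:

import os
import fitz  # PyMuPDF; pip install pymupdf

def extract_pdf_figures(pdf_path, out_dir):
    """Save every embedded image in a PDF book to out_dir; return the count."""
    os.makedirs(out_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    count = 0
    for page_index in range(len(doc)):
        for img in doc[page_index].get_images(full=True):
            xref = img[0]                           # cross-reference id of the image object
            info = doc.extract_image(xref)          # raw bytes plus original file extension
            name = f"p{page_index:04d}_{xref}.{info['ext']}"
            with open(os.path.join(out_dir, name), "wb") as f:
                f.write(info["image"])
            count += 1
    return count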

Each method has its own pros and cons. While Bing Image Search provides the best variety of images, many of them are not visually relevant to any program at all. TinEye is able to provide visually relevant images, but these images are ranked only by visual relevancy; the page containing the highest ranked image may not necessarily contain any useful information. Computer books are professionally edited and thus contain the highest quality text; however, they cover a relatively limited range of applications and their content is not as current as the Web. By using all of these methods, we hope to create a rich repository of technical information that is both visually and textually relevant to, and accessible by, general computer users.

3.2 Indexing images

Our image search task can be categorized as a content-based image retrieval (CBIR) problem; we are given a screenshot of an application the user would like information about, and the desired result is a ranked list of visually similar images. The general CBIR task is ill-defined as each problem domain imposes requirements on what aspects of the image are relevant for retrieval. For example, when searching for photos of a particular person, we may largely ignore the scene while focusing on detecting and recognizing faces; however, if our goal is to collect photos taken at national landmarks, we would place our focus on the background (more information on CBIR can be found in a recent survey by Datta et al. [3]).

2 www.bing.com/images
3 www.tineye.com

In the domain of common GUI screenshots, we have several desired properties for a CBIR system. As book and web screenshots are often post-processed by authors to conserve space and draw attention to relevant parts of the interface, the image search solution should be invariant to image manipulations including cropping, scaling, rotation, color variations, and illumination changes. Moreover, visual annotations are commonplace, as are compositions of different screenshots, potentially describing a sequence of user actions. The GUI elements themselves often vary visually as interface elements may be moved, scaled, occluded, and themed. As a primary goal of the proposed system is to facilitate cross-lingual search, the system should handle variations in application text. Tutorials and documentation for different software versions are often largely applicable even though the interface may have new visual elements, so an ideal CBIR system would be robust to minor GUI modifications. To serve as a useful method of locating help documentation, the system must have real-time performance on comprehensive databases of book and web figures.

Our solution to these problems is to model visually descriptive regions of the image individually. This allows for spatial variations and substantial performance gains, as only visually relevant regions of the image are considered. The images are converted to grayscale to reduce common color theme variations and the differences between gray and color images often found in PC help books and websites, respectively. To achieve scale, rotation, and illumination invariance, we use the SURF feature point detector and descriptor developed by Bay et al. [1] to find and describe regions of interest. The SURF descriptor produces a 64-dimensional feature vector for each feature point found and, for our purposes, the spatial coordinates and scale of the features are neglected.
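As an illustration of this step, here is a minimal sketch assuming the SURF implementation shipped with opencv-contrib-python (the paper does not say which SURF implementation was used; SURF is patented and only available in the contrib build):

import cv2

def surf_descriptors(image_path, hessian_threshold=400):
    """Return an (N, 64) array of SURF descriptors for one screenshot."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)      # drop color, as described above
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=hessian_threshold,
                                       extended=False)        # 64-dim descriptors
    keypoints, descriptors = surf.detectAndCompute(gray, None)
    return descriptors                                         # locations and scales are ignored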

Given a list of feature vectors for each image in our database, our aim is to produce a fast image comparison system. A possible solution is to store all of the feature vectors for all images and, given a query image, cast a vote for each database image that has a feature within some Euclidean distance of a query image feature; however, this method would require a substantial amount of memory to allow for fast comparison. Moreover, every query feature would be compared to every database feature, preventing real-time performance on even modestly sized databases. To combat this problem, we use the method developed by Jegou et al. [7]. Provided a database of book and web images, we compute SURF features. The features are clustered using K-Means, and each individual feature is assigned to its nearest cluster. For each cluster, we compute the median value of each of the 64 dimensions in the SURF descriptor. For each feature in the cluster, we compare the cluster's median values to the feature vector, producing a 64-dimension bit-string encoding each dimension's proximity to the median value. This process results in what is referred to as a Hamming Embedding of the feature vector, which requires 8 bytes of memory per feature (compared to 256 bytes for the floating point vector) and allows for fast distance computation by simply taking the Hamming distance between two bit-strings belonging to the same cluster.
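A small numpy/scikit-learn sketch of this embedding, with function and variable names that are illustrative rather than taken from the paper's code:

import numpy as np
from sklearn.cluster import KMeans

def build_hamming_index(descriptors, k=1000, seed=0):
    """Cluster 64-dim SURF descriptors and compute per-cluster median thresholds."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(descriptors)
    medians = np.zeros((k, descriptors.shape[1]))
    for c in range(k):
        members = descriptors[km.labels_ == c]
        if len(members):
            medians[c] = np.median(members, axis=0)   # one threshold per dimension
    return km, medians

def hamming_embed(descriptor, cluster_id, medians):
    """Encode one descriptor as 64 bits relative to its cluster's medians."""
    bits = (descriptor > medians[cluster_id]).astype(np.uint8)
    return np.packbits(bits)                           # 8 bytes instead of 256 bytes of float32

def hamming_distance(a, b):
    """Number of differing bits between two packed 8-byte codes."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())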

The offline database construction is implemented using Hadoop, an implementation of the map/reduce model described by Dean and Ghemawat [4], to allow the computation to scale with the database size. The entire feature computation and indexing procedure for our 150,000 image database takes under two hours on a cluster of six High-CPU Amazon EC2 instances.4

4 aws.amazon.com/ec2/
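Conceptually, the offline job can be framed as a mapper that assigns each feature's Hamming code to its cluster and a reducer that assembles one inverted file per cluster. A framework-agnostic sketch, reusing the hypothetical build_hamming_index and hamming_embed helpers above (this is not the authors' Hadoop code):

def index_mapper(image_id, descriptors, kmeans, medians):
    """Map step: emit one (cluster_id, (image_id, code)) pair per SURF feature."""
    for d in descriptors:
        c = int(kmeans.predict(d.reshape(1, -1))[0])
        yield c, (image_id, hamming_embed(d, c, medians).tobytes())  # raw 8-byte code

def index_reducer(cluster_id, values):
    """Reduce step: gather all (image_id, code) pairs of one cluster into its inverted file."""
    return cluster_id, list(values)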

3.3 Extracting Snippets

After collecting and indexing a large number of screenshots, we need to extract relevant text from their source articles. The objective of text extraction is to identify snippets in an article that can highlight and/or give hints about the kinds of knowledge users can expect to obtain if they read the article. For example, for an article containing a screenshot of the network configuration program, a good snippet would be "set IP address" or "enable wireless network" since it clearly indicates to users that the article is likely to explain how to perform these tasks. On the other hand, a snippet such as "network configuration" that describes the screenshot's content is less useful to users since they already see the content in the screenshot and learn nothing new from reading the content again in the snippet. Note that this is in sharp contrast with typical image search engines, which actually favor snippets describing image contents in order to index and search images. Below we describe six kinds of informative snippets that we extract for the purpose of our search application:

Title/Heading An article's title and headings are obvious choices for snippets since they often highlight the most important points covered by the article. Titles can be extracted from the <title> tag for web pages. Headings are especially useful when there are multiple screenshots contained in the same article. For web articles, we extract the heading tag (e.g., h1, h2) closest to and preceding each screenshot. For book articles, we extract the chapter and section headings from the outline metadata stored in the PDF file whenever such metadata is available.

Alt Text An image's ALT tag provides information redundancy in cases where the image fails to be displayed. The text attribute of an ALT tag is often a useful feature for a keyword-based image search engine to index and search images. Based on our observation, the Alt text of the screenshot of a program (e.g., Alt="system preferences") tends to be the title of the program (e.g., System Preferences). Thus, such text may not provide any additional information other than what users can already read directly on their computer screen. Yet, we expect Alt text can still be helpful in strengthening users' confidence in a search result, by letting them check whether both the program's screenshot and title appear in the result's excerpt.

Caption Many screenshot images are accompanied by caption text that briefly describes the idea meant to be illustrated by the images. Especially in books, almost all the figures, whether or not they are screenshots, have captions that can be reliably extracted by looking for string patterns that begin with the word Figure. Even in articles on the web, captions can sometimes be found below or above images. These captions are harder to detect because often they are not explicitly labeled by the word Figure. Thus, we identify potential caption text based on two simple heuristics: (1) is it immediately above or below the image and (2) does it have a different style from the surrounding paragraphs.

References Sometimes references to images can be found in sentences relatively far away from the images. Some references are based on explicit labels, such as the phrase as shown in Figure X. Other references are based on layout relationships, such as the phrase see below. Sentences containing such references are likely to be relevant to the images they refer to. Presented in the excerpt, these sentences may indicate the types of knowledge users can expect to learn if they read the whole article. To resolve a label-based reference, we look for the image whose caption contains the matching label. For web articles, the image is always on the same page, whereas in book articles, the referred image can appear a few pages away due to formatting limitations. To resolve a layout-based reference, we scan in the direction indicated by the reference (e.g., scan backward for see below) until an image is found. After resolving the reference, a link between the referred image and the referring sentence is established. Whenever an image is retrieved, sentences referring to it can also be retrieved for inclusion in the excerpt.

Action Phrases Since one of the anticipated uses of our system is to learn how to perform interactive actions using a particular program, phrases containing certain action keywords can be useful. For example, phrases containing keywords such as configure, click, and open may indicate to users that the article is likely to explain what can be configured, what should be clicked, and what must be opened using the program. Thus, we compile a list of common action keywords and extract action phrases from each image's surrounding text (see the sketch after this list).

Nearby Text If none of the above useful excerpt phrases can be extracted, we simply extract the sentences near the image and use them as the excerpt.
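As an illustration of the action-phrase heuristic described above, a minimal sketch; the keyword list and sentence splitting are assumptions, since the paper does not list its exact action vocabulary:

import re

# Assumed vocabulary of interactive action verbs; the paper names configure,
# click, and open as examples.
ACTION_KEYWORDS = {"click", "open", "configure", "select", "enter", "type",
                   "choose", "press", "check", "uncheck", "drag", "install"}

def extract_action_phrases(nearby_text, max_phrases=3):
    """Return up to max_phrases sentences around an image that contain action verbs."""
    sentences = re.split(r"(?<=[.!?])\s+", nearby_text)
    phrases = [s.strip() for s in sentences
               if ACTION_KEYWORDS & {w.lower().strip(".,;:()") for w in s.split()}]
    return phrases[:max_phrases]

# Example:
# extract_action_phrases("Open Network Preferences. Click the TCP/IP tab and enter the new IP address.")
# -> ['Open Network Preferences.', 'Click the TCP/IP tab and enter the new IP address.']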

3.4 Classifying articles into categories

In addition to extracting useful text from articles, we can also classify them into distinct categories based on their origins, structures, and contents. In the current work, we consider four categories: book, walkthrough, gallery, and general. The book category consists of those articles extracted from PDF books. The walkthrough category includes articles giving step-by-step instructions on how to perform certain computing tasks such as backing up files and changing display resolution. Figure 2 shows some examples of walkthrough articles. These walkthrough articles tend to possess three features that can help us identify them. First, they tend to contain an ordered or unordered list of instructions in short sentences interspersed with screenshot illustrations. Second, these sentences tend to begin with common interactive action verbs such as click, enter, type, and open. Third, walkthrough articles tend to include certain indicative terms such as walkthrough, step, and guide. The gallery category consists of articles with many images but a relatively small amount of text. The general category covers all of the unclassified articles. Table 1 summarizes the distribution of articles over the categories. A sketch of a simple rule-based walkthrough check built on these three cues is given below.
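This is a minimal, hypothetical sketch; the verb list, indicative terms, and thresholds are illustrative assumptions, not the paper's actual classifier:

ACTION_VERBS = {"click", "enter", "type", "open", "select", "choose", "press"}
INDICATIVE_TERMS = {"walkthrough", "step", "guide", "how to", "tutorial"}

def looks_like_walkthrough(list_items, full_text, min_items=3, min_verb_ratio=0.5):
    """Heuristic walkthrough test: enough list items, most starting with action verbs,
    plus at least one indicative term somewhere in the article text."""
    if len(list_items) < min_items:
        return False
    starts_with_verb = sum(1 for item in list_items
                           if item.split() and item.split()[0].lower() in ACTION_VERBS)
    verb_ratio = starts_with_verb / len(list_items)
    has_indicative = any(term in full_text.lower() for term in INDICATIVE_TERMS)
    return verb_ratio >= min_verb_ratio and has_indicative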

Figure 2: Examples of articles providing step-by-step walkthroughs

Table 1: Number of articles in different categories

  General       51,743  (33.1%)
  Walkthrough   45,249  (28.9%)
  Gallery        4,464   (2.9%)
  Book          55,244  (35.1%)
  Total        156,700

4. SEARCHING THE DATABASE

This section describes the process of searching the database. This process involves specifying queries (Section 4.1), retrieving articles with similar images (Section 4.2), ranking retrieved articles (Section 4.3), and presenting them to the users (Section 4.4).

4.1 Specifying queries

Our proposed search engine supports mixed-modality queries in order to optimize the effectiveness of searching for technical articles about interactive programs. A typical query consists of a screenshot of a program and, optionally, a set of keywords to specify what aspects of the program the retrieved articles are supposed to cover.

Screenshot queries can be specified in two ways. First, users can run a cross-platform, Java-based client interface we developed to capture the screenshot of a selected window or an arbitrary screen region. The client interface then submits the screenshot as the image query to our search engine and displays the search results in the default web browser. Alternatively, users can use any existing screen capture utility, such as the Snipping Tool on Windows Vista or the Command-Shift-4 hotkey on Mac OS, to capture screenshots. Users can then use a Web interface to submit the screenshots to our search engine and view the results directly in a Web browser.

Keyword queries are optional and can be specified in three ways: they can be entered together with the screenshot query on the client interface, entered on the Web interface, or entered while users are browsing the results in order to filter and/or refine them.

4.2 Finding similar images

Given a query screenshot, we need to find other similar screenshots in the database in order to retrieve visually relevant articles. To do so, we extract visual features from the query image and compute the Hamming Embedding for each feature as was done for the database images. The distance between each query bit-string is computed against all database bit-strings in the cluster nearest to the query feature vector. All features within a certain Hamming distance cast a vote for their corresponding database image. This look-up process uses an inverted file for each cluster, which contains lines of bit-strings and their corresponding database image IDs. The result is a histogram of image votes, from which we produce a ranked list of database images as compared to the query image. The average retrieval time for a query image when k = 1,000, where k is the number of clusters, is 0.53 seconds on a single-CPU 2.4 GHz machine.
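A sketch of this query-time voting, reusing the hypothetical hamming_embed and hamming_distance helpers from Section 3.2 and assuming the per-cluster inverted files have been loaded into a dict keyed by cluster id (the Hamming threshold below is an illustrative value, not the paper's):

import numpy as np
from collections import Counter

def query_votes(query_descriptors, kmeans, medians, inverted_files, max_hamming=24):
    """Vote for database images whose codes are close to the query codes.

    inverted_files: dict mapping cluster_id -> list of (image_id, raw 8-byte code).
    Returns image ids ranked by vote count."""
    votes = Counter()
    for d in query_descriptors:
        c = int(kmeans.predict(d.reshape(1, -1))[0])
        code = hamming_embed(d, c, medians)
        for image_id, raw_code in inverted_files.get(c, []):
            db_code = np.frombuffer(raw_code, dtype=np.uint8)   # codes were stored as raw bytes
            if hamming_distance(code, db_code) <= max_hamming:
                votes[image_id] += 1
    return [image_id for image_id, _ in votes.most_common()]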

4.3 Ranking retrieved articles

After retrieving a list of articles ranked by visual similarity, we need to re-rank them by taking textual relevancy into account. If ranking is based only on visual similarity, as in the case of a typical content-based image search engine, even though the highest ranked article may contain the right screenshot, there is no guarantee that the article also contains any useful text. Similarly, if ranking is based only on text, as in the case of a typical keyword search engine, despite having all the keywords, an article may still be useless if it contains the wrong screenshot. Therefore, it is necessary to adopt a ranking scheme based on a comprehensive set of multimodal features. In the current work, we consider three types of features: visual, text, and site, which are explained in detail next.

4.3.1 Visual Features

Similarity In our system, visual similarity is the most dominant feature. To users, an article with a visually dissimilar screenshot is a sure sign that the article is about the wrong program and should be ranked lower. However, images with lower visual similarity scores may simply be a cropped version of the same image or a version superimposed on another image. In some cases users may still find them useful; there is no reason to reject them completely. Hence, we derive a visual similarity feature by normalizing the score by the highest score in the result set.

Resize ratio When a screenshot image of a program is captured and included in an article, it is often resized to dimensions that are most suitable for reading. The ratio by which an image is resized provides a cue as to the image's role. For example, an image subject to a high resize ratio may suggest that it is displayed as a thumbnail to provide just enough detail for identification, whereas a moderately resized image may suggest a desire to preserve more details for users to actually read its content. The resize ratio can be computed as the ratio of the captured size (off the desktop screen) to the embedded size (in a Web or book page). The captured size can be derived simply from the size of the query image. For images extracted from PDF books, the size information is often readily available. For Web images, the embedded size is often contained in the width and height fields of the <img> tag. In the absence of these fields, the size can be extracted from the header of the image file. We sampled a number of websites with good screenshots and computed the average resize ratio as the standard. Then, given an arbitrary screenshot, we can calculate a normalized score based on how close its resize ratio is to the standard.

Position If an image occupies a prominent position in an article, it is likely to be important. As a result, the article may dedicate more text to the image and should be ranked higher. We consider three positions to be prominent: first, center, and last. We can compute three distinct position scores as the normalized distances to each of the three prominent positions. A toy sketch of these three visual scores is given below, after Table 2.

Table 2: Top 10 sites with the most images

  microsoft.com      320    qweas.com           110
  ehow.com           183    softpedia.com       110
  brothersoft.com    163    bestshareware.net   108
  techrepublic.com   121    gateway.com         106
  versiontracker     113    askdavetaylor.com   106
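A toy sketch of how the three visual feature scores might be computed; the formulas, the standard ratio value, and the position convention are plausible readings of the description above, not the paper's exact definitions:

def similarity_score(vote_count, max_vote_count):
    """Visual similarity normalized by the best score in the result set."""
    return vote_count / max_vote_count if max_vote_count else 0.0

def resize_ratio_score(embedded_w, embedded_h, captured_w, captured_h, standard_ratio=0.6):
    """Closeness of the observed resize ratio to a standard ratio sampled from good sites.
    standard_ratio is an illustrative value, not the paper's measured standard."""
    ratio = (embedded_w * embedded_h) ** 0.5 / (captured_w * captured_h) ** 0.5
    return max(0.0, 1.0 - abs(ratio - standard_ratio))

def position_scores(image_index, n_images):
    """Scores derived from normalized distances to the first, center, and last positions."""
    if n_images <= 1:
        return 1.0, 1.0, 1.0
    pos = image_index / (n_images - 1)
    return 1.0 - pos, 1.0 - abs(pos - 0.5) * 2, pos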

4.3.2 Text Features

Snippet If an article contains image-related snippets such as captions, references, and nearby text, it is likely to be more relevant to the image. We derive a binary feature for each type of snippet to indicate whether that snippet can be identified in the article.

Category When it comes to technical information, articles in some categories may be preferred over others and should be ranked higher accordingly. Currently, we consider four categories: walkthrough, gallery, general, and book, and derive a binary feature for each category.

Search terms If an article contains more search terms, it is likely to be more relevant and should be ranked higher. We check whether the search terms can be found in the title and in each image-related snippet, and then whether they can be found elsewhere in the same page. The score is computed as the ratio between the number of search terms found and the total number of search terms. For search terms matched on the page, this score is inversely weighted by their distance to the image. Also, each term is inversely weighted by its global frequency in order to favor less common terms over very common terms.
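A rough sketch of such a keyword score follows; the exact weighting scheme is not fully specified in the paper, so the distance and frequency weights below are assumptions:

def keyword_score(search_terms, page_tokens, image_pos, term_doc_freq, n_docs):
    """Score a page for a keyword query relative to one embedded image.

    page_tokens: lowercased tokens in page order.
    image_pos: token index where the image appears.
    term_doc_freq: dict term -> number of documents containing it."""
    if not search_terms:
        return 0.0
    score = 0.0
    for term in search_terms:
        positions = [i for i, tok in enumerate(page_tokens) if tok == term.lower()]
        if not positions:
            continue
        distance = min(abs(i - image_pos) for i in positions)
        idf = 1.0 / (1.0 + term_doc_freq.get(term.lower(), 0) / n_docs)  # down-weight common terms
        score += idf / (1.0 + distance)                                  # down-weight far matches
    return score / len(search_terms)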

4.3.3 Site Features

Authority Some sites may be trusted by users more for their technical content. We identified a list of sites users may find authoritative, including sites hosted by software vendors such as Microsoft and Apple and those dedicated to know-how knowledge such as eHow and wikiHow. We derived a binary feature for each site.

Quantity Some sites may contain relatively more screenshots, which may suggest a stronger dedication to technical content involving screenshots. We counted the number of unique screenshots collected from each domain and ranked all the domains by this number. We then used the percentile in this ranking to compute a numerical feature between 0 and 1. Table 2 lists the 10 sites in our database with the most images.


4.3.4 Setting feature weights

Since not all features are equally important, it is necessary to set the weights appropriately in order to reflect the features' relative importance. In developing the prototype of our system, we set the weights in three stages. Table 3 lists all the features and indicates the stages at which their weights are set. First (stage 1), we apply RankSVM to learn feature weights based on training data collected using Amazon Mechanical Turk. RankSVM was originally proposed by Joachims [9] to learn how to weight features from a set of subjective ordering constraints inferred from user click-through data in order to improve ranking. At the development stage, we did not have any click-through data from which to derive ordering constraints for RankSVM. Thus, we recruited workers from Amazon Mechanical Turk to provide us with explicit article ratings from which to infer ordering constraints. We chose five query images known to retrieve a large number of articles, which at that point were ranked only by visual similarity. We shuffled the order of these articles and presented them in a list. Each worker was asked to rate each article in terms of usefulness on a 5-point scale. For quality control, we included two junk results and two duplicate results. To pass the quality test, workers had to mark the junk results as useless and give consistent ratings to the duplicate results. Each pair of ratings then provided us with an ordering constraint for training RankSVM. Note that since there was no easy way to ask the workers to provide meaningful keywords, this procedure only applied to the subset of features that do not depend on keywords.

Next (stage 2), we set the weights of keyword-related features based on the weights of their corresponding non-keyword features. For example, if the weight of the has title feature is high, the weight of has keywords in title is likely to be high as well. Thus, we set the weight of each keyword-related feature to 1.5 times the weight of its corresponding feature learned at the previous stage. Finally (stage 3), with all the weights set at a reasonable level, we can deploy the system and continue to refine the weights using RankSVM based on actual user click-through data.
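A minimal sketch of learning weights from such pairwise ordering constraints, using the standard pairwise-difference reduction of RankSVM to a linear SVM; scikit-learn's LinearSVC is a stand-in here, as the paper does not say which solver was used:

import numpy as np
from sklearn.svm import LinearSVC

def rank_svm_weights(feature_vectors, ratings, C=1.0):
    """Learn feature weights from graded relevance judgements.

    feature_vectors: (n_articles, n_features) array for one query.
    ratings: usefulness ratings; article i should rank above j when ratings[i] > ratings[j]."""
    diffs, labels = [], []
    for i in range(len(ratings)):
        for j in range(len(ratings)):
            if ratings[i] > ratings[j]:
                diffs.append(feature_vectors[i] - feature_vectors[j])
                labels.append(1)
                diffs.append(feature_vectors[j] - feature_vectors[i])  # mirrored pair
                labels.append(-1)
    clf = LinearSVC(C=C, fit_intercept=False).fit(np.array(diffs), np.array(labels))
    return clf.coef_.ravel()  # one weight per feature; score an article by w . x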

4.4 Presenting results

To display a list of search results, we adopt a presentation scheme modeled after that of a typical search engine in order to leverage users' existing familiarity with such a scheme. Figure 3 shows a typical presentation of a single result. This presentation consists of three main elements (title, excerpt, and source) that are common to all result types. In addition, there are several minor elements, such as tags and actions, that depend on the type of result.

For the title element, we display the HTML title tags for web articles and the book titles for book pages. Markers are inserted at the beginning of the title to identify the categories (i.e., walkthrough, book, or gallery). Users can click on the title to visit the website to read the article online or to preview the content of the book pages (assuming the copyright issues have been resolved).

For the excerpt element, we display the thumbnail of the matched screenshot as well as a composition of relevant text snippets extracted using the methods described in Section 3.3. We insert a tiny inline thumbnail (15 by 15 pixels) into each snippet in a way that reflects the snippet's type. For a caption snippet, the tiny thumbnail is displayed above the snippet and centered. For a reference snippet, any explicit figure label is replaced by the tiny thumbnail of the corresponding screenshot (i.e., as shown in Figure 2 → as shown in [thumbnail]). For a nearby text snippet, the tiny thumbnail is displayed above or below depending on the image's relative position. If keywords are specified, their occurrences in the excerpt are highlighted.

Table 3: Features for ranking retrieved articles. Checks indicate the stages when weights are set. Features with highest weights are shown in bold.

  Type   Name                          1   2   3
  Image  visually similar              √
         common resize ratio           √       √
         position to first             √       √
         position to center            √       √
         position to last              √       √
  Text   has title                     √       √
         has caption                   √       √
         has reference                 √       √
         has nearby text               √       √
         is walkthrough                √       √
         is book                       √       √
         is gallery                    √       √
         is general                    √       √
         has keywords in title             √   √
         has keywords in caption           √   √
         has keywords in reference         √   √
         has keywords in nearby text       √   √
         keywords distance to image        √   √
  Site   site is authority             √       √
         site has many images          √       √

For the source element, we display the abbreviated URL for web results and chapter/section headings for book results.

4.4.1 Faceted search

To help users browse search results more effectively, we provide a set of faceted search capabilities based on the Exhibit framework developed by Huynh et al. [6]. Exhibit is a popular framework for creating dynamic exhibits of data collections and provides a rich set of faceted search functionalities such as filter and timeline controls. Figure 1 shows the two faceted filtering controls displayed in the left-side panel that allow users to filter the results based on categories and sites.

We also display a tag cloud based on the snippets extracted from all the retrieved articles. Figure 4 shows two examples of tag clouds. The size of each tag indicates the number of articles in which the tag can be found. Stop words and words with low article counts are removed. The tag cloud can help users in three ways. First, it provides users with a quick overview of the range of topics covered by the retrieved articles. Second, since the application name also appears in the cloud as a prominent keyword, users can be more confident in the relevancy of the results. Third, by clicking on a tag, users can filter the results down to only those containing the tag.

Figure 3: Elements of result presentation

Figure 4: Examples of tag clouds generated from the results of two screenshot queries

5. EVALUATION

We conducted three sets of experiments to evaluate the proposed system in terms of image matching performance, subjectively rated usefulness compared to other systems, and bilingual search feasibility.

5.1 Screenshot matching performance

In this experiment, we evaluated the core functionality of our system: the ability to search for similar screenshot images. We created a test set by capturing the screenshots of 352 unique application windows. This test set covers three major operating systems (Windows XP, Windows Vista, and Mac OS X) and several popular programs such as Microsoft Office, Firefox, and Skype. Figure 5 shows some examples of the screenshots in the test set. We retrieved the top 10 matches for each test screenshot. The correctness of these matches was evaluated by workers recruited from Amazon Mechanical Turk. For quality control, we inserted into the list of matches one known good match (the test image itself), two known bad matches (other test images), and two duplicate matches, and shuffled the list. To pass the quality test, a worker had to correctly identify all the known good and bad matches as well as give consistent answers to the duplicate matches.

Out of 352 screenshots, 194 (55%) retrieved at least one and 123 (35%) retrieved at least three good matches among the top 10 matches. Figure 6 shows examples of screenshots that did not retrieve any good match. They tend to be windows less frequently accessed; the articles about them may not have been collected and added into our database. We expect that as we continue to scale up the database, these percentages will rise.

Figure 7: Examples of results returned by the three systems in comparison

Figure 5: Examples of query images used in screenshot matching experiments

Figure 6: Examples of query images that did not retrieve matching screenshots from our database

5.2 Comparing usefulness with baselines

In this experiment, we compared the proposed search system to two strong baseline systems, Bing Web and Bing Images, in terms of the perceived usefulness of the above-the-fold results. The comparison among the three systems was based on the results returned for 20 frequently used application windows (10 for XP and 10 for Mac). Figure 7 shows an example result for each system. For our system, the query is the screenshot of the application window. For the baseline systems, the query is the set of keywords based on the title of the window plus the name of the operating system (e.g., system preferences mac). We merged the top four results (above the fold) returned by each system into a single list and randomized the ordering of the list. To rate the usefulness of the results in the list, we recruited workers from Amazon Mechanical Turk. The rating was done on a 5-point Likert scale (1: completely useless, 5: very useful). For quality control, we inserted into the list at random positions four junk results based on the results returned by Bing using only one of the keywords. To pass the quality test, these junk results had to be marked as less useful than the other results.

Table 4: Subjective ratings of top results returned by three systems on a 5-point Likert scale

  Our Approach   Bing Web      Bing Images
  4.08 ± 1.28    3.31 ± 1.54   3.27 ± 1.48

Table 4 summarizes our findings. After rejecting disqualified ratings, we obtained 1,677 ratings from 33 unique workers. As can be seen, results returned by our system received higher ratings (4.08) than did results returned by Bing Web (3.31) and by Bing Images (3.28).

5.3 Bilingual search feasibility

In this experiment, we examined to what extent our system can support bilingual search when screenshots of programs in different languages are used as queries. We took screenshots of 15 representative programs on Mac in five languages (Spanish, French, German, Chinese, and Korean) and obtained a total of 75 test images. We used these query images to retrieve the top 10 matches and counted the number of correct matches. Table 5 summarizes our findings. As can be seen, some screenshots continued to retrieve matching images even though they are in a different language; these screenshots tend to contain visually rich patterns such as logos. On the other hand, some cases failed to retrieve any match; they tend to be dominated by text and can look very different across languages, especially in block-based languages (i.e., Chinese and Korean) whose printed words often occupy smaller areas and appear denser than words in Western alphabet-based languages. These findings point to the potential for our system to support bilingual search, albeit in a limited scope, when screenshots are dominated by visual features.

6. SUMMARY

We described a novel system for indexing and searching know-how articles about software applications. This system can take a screenshot of an application window as the search term and retrieve a list of articles illustrated by screenshots of the same application. This screenshot-based approach offers at least four advantages: (1) a screenshot query is easier to specify than keywords, (2) the context and topic are unambiguously identified by the screenshot and keywords respectively, (3) judging relevancy by viewing images is faster than by reading text, and (4) searching across language barriers is possible. We built a prototype system with more than 150k images and evaluated its performance with three experiments. We demonstrated our system's ability to match a wide variety of screenshots, to provide more useful results, and to match screenshots in other languages.

Table 5: Number of correct top 10 matches for 15 application windows in five non-English languages: Spanish (S), French (F), German (G), Chinese (C) and Korean (K)

7. REFERENCES

[1] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (SURF). Comput. Vis. Image Underst., 110(3):346–359, 2008.

[2] David J. Crandall, Lars Backstrom, Daniel Huttenlocher, and Jon Kleinberg. Mapping the world's photos. In WWW '09: Proceedings of the 18th International Conference on World Wide Web, pages 761–770, New York, NY, USA, 2009. ACM.

[3] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z. Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Comput. Surv., 40(2):1–60, 2008.

[4] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[5] Mike Dowman, Valentin Tablan, Hamish Cunningham, and Borislav Popov. Web-assisted annotation, semantic indexing and search of television and radio news. In WWW '05: Proceedings of the 14th International Conference on World Wide Web, pages 225–234, New York, NY, USA, 2005. ACM.

[6] David F. Huynh, David R. Karger, and Robert C. Miller. Exhibit: Lightweight structured data publishing. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, pages 737–746, New York, NY, USA, 2007. ACM.

[7] Herve Jegou, Matthijs Douze, and Cordelia Schmid. Hamming embedding and weak geometric consistency for large scale image search. In ECCV '08: Proceedings of the 10th European Conference on Computer Vision, pages 304–317, Berlin, Heidelberg, 2008. Springer-Verlag.

[8] Yushi Jing and Shumeet Baluja. PageRank for product image search. In WWW '08: Proceedings of the 17th International Conference on World Wide Web, pages 307–316, New York, NY, USA, 2008. ACM.

[9] Thorsten Joachims. Optimizing search engines using clickthrough data. In KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142, New York, NY, USA, 2002. ACM.

[10] Lyndon S. Kennedy and Mor Naaman. Generating diverse and representative image search results for landmarks. In WWW '08: Proceedings of the 17th International Conference on World Wide Web, pages 297–306, New York, NY, USA, 2008. ACM.

[11] Aniket Kittur, Ed H. Chi, and Bongwon Suh. Crowdsourcing user studies with Mechanical Turk. In CHI '08: Proceedings of the Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems, pages 453–456, New York, NY, USA, 2008. ACM.

[12] Ronny Lempel and Aya Soffer. PicASHOW: Pictorial authority search by hyperlinks on the web. In WWW '01: Proceedings of the 10th International Conference on World Wide Web, pages 438–448, New York, NY, USA, 2001. ACM.

[13] Zhiwei Li, Shuming Shi, and Lei Zhang. Improving relevance judgment of web search results with image excerpts. In WWW '08: Proceedings of the 17th International Conference on World Wide Web, pages 21–30, New York, NY, USA, 2008. ACM.

[14] Yandong Liu, Jiang Bian, and Eugene Agichtein. Predicting information seeker satisfaction in community question answering. In SIGIR '08: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 483–490, New York, NY, USA, 2008. ACM.

[15] Indrani Medhi, Archana Prasad, and Kentaro Toyama. Optimal audio-visual representations for illiterate users of computers. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, pages 873–882, New York, NY, USA, 2007. ACM.

[16] Bhaskar Mehta, Saurabh Nangia, Manish Gupta, and Wolfgang Nejdl. Detecting image spam using visual features and near duplicate detection. In WWW '08: Proceedings of the 17th International Conference on World Wide Web, pages 497–506, New York, NY, USA, 2008. ACM.

[17] Michael Narayan, Christopher Williams, Saverio Perugini, and Naren Ramakrishnan. Staging transformations for multimodal web interaction management. In WWW '04: Proceedings of the 13th International Conference on World Wide Web, pages 212–223, New York, NY, USA, 2004. ACM.

[18] Xiaochuan Ni, Jian-Tao Sun, Jian Hu, and Zheng Chen. Mining multilingual topics from Wikipedia. In WWW '09: Proceedings of the 18th International Conference on World Wide Web, pages 1155–1156, New York, NY, USA, 2009. ACM.

[19] Philipp Scholl, Renato Domínguez García, Doreen Böhnstedt, Christoph Rensing, and Ralf Steinmetz. Towards language-independent web genre detection. In WWW '09: Proceedings of the 18th International Conference on World Wide Web, pages 1157–1158, New York, NY, USA, 2009. ACM.

[20] Kumiko Tanaka-Ishii and Hiroshi Nakagawa. A multilingual usage consultation tool based on internet searching: More than a search engine, less than QA. In WWW '05: Proceedings of the 14th International Conference on World Wide Web, pages 363–371, New York, NY, USA, 2005. ACM.

[21] Reinier H. van Leuken, Lluis Garcia, Ximena Olivares, and Roelof van Zwol. Visual diversification of image search results. In WWW '09: Proceedings of the 18th International Conference on World Wide Web, pages 341–350, New York, NY, USA, 2009. ACM.

[22] Sudheendra Vijayanarasimhan and Kristen Grauman. What's it going to cost you?: Predicting effort vs. informativeness for multi-label image annotations. In CVPR '09: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2262–2269, 2009.