Search Engines in eCommerce
Web-Based Information Architectures
MSEC 20-760, Mini II
Jaime Carbonell
General Topic: Applying IR to eCommerce
• High-level review of homework 1 and 2
• The search-engine business
• Getting search engines to work for you
• Some web-site design principles
• Other IR-related eCommerce business ideas
Building a Search Engine (1)
Assemble the Collection
• Acquire a document data base
• Or, spider the Web to collect the DB
• Or, spider your own site/company
Building a Search Engine (2)
Index the Collection (HW1)
• Build a dictionary from collection C
Find all unique words & optionally stem them
Filter out stop words
Optionally generate phrases as words
Σ is resulting word list
• For each $w_i$ in Σ
Calculate & store $\log_2 \mathrm{IDF}$ for $w_i$
Find all $D_j$ where $w_i$ occurs
Store ID($D_j$) and $w_i$ positions in $D_j$
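The indexing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the HW1 solution: the stop list, the tokenizer, and the choice of log2(N / df) as the stored IDF value are assumptions, and stemming and phrase generation are omitted.

```python
import math
import re
from collections import defaultdict

# Hypothetical stop list; a real system would use a much longer one.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def tokenize(text):
    """Lowercase, split on non-letters, drop stop words (no stemming)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def build_index(collection):
    """collection: dict mapping document ID -> text.
    Returns (postings, idf): postings[w][doc_id] lists the positions of w
    in that document (counted after stop-word removal), and
    idf[w] = log2(N / number of documents containing w)."""
    postings = defaultdict(dict)
    for doc_id, text in collection.items():
        for pos, word in enumerate(tokenize(text)):
            postings[word].setdefault(doc_id, []).append(pos)
    n_docs = len(collection)
    idf = {w: math.log2(n_docs / len(docs)) for w, docs in postings.items()}
    return dict(postings), idf
```

The keys of `postings` play the role of Σ, and each entry is the inverted list for one word.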
Building a Search Engine (3)
Match Queries to Collection (HW2)
• Filter out query words not in Σ
• Compute $\operatorname{Argmax}^{k}_{D_j \in C}\,[\mathrm{Sim}(Q, D_j)]$
Use dot-product or cosine similarity
Use inverted index for computation
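A hedged sketch of the matching step, assuming tf-idf weights and cosine similarity. The index layout (term to per-document term frequency) and the helper names are illustrative, not the HW2 interface. Only documents that share a term with the query are touched, which is the point of using the inverted index.

```python
import heapq
import math
from collections import defaultdict

def document_norms(index, idf):
    """Euclidean length of each document's tf-idf vector."""
    squares = defaultdict(float)
    for term, docs in index.items():
        for doc_id, tf in docs.items():
            squares[doc_id] += (tf * idf[term]) ** 2
    return {d: math.sqrt(s) for d, s in squares.items()}

def top_k(query_terms, index, idf, doc_norms, k=3):
    """Return the k best (doc_id, cosine score) pairs.
    index: term -> {doc_id: term frequency}; idf: term -> weight."""
    q_weights = defaultdict(float)
    for t in query_terms:
        if t in index:                      # filter out query words not in Σ
            q_weights[t] += idf[t]
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    if q_norm == 0:
        return []
    dots = defaultdict(float)
    for t, qw in q_weights.items():         # walk only the relevant postings
        for doc_id, tf in index[t].items():
            dots[doc_id] += qw * tf * idf[t]
    scores = {d: dot / (q_norm * doc_norms[d])
              for d, dot in dots.items() if doc_norms[d] > 0}
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
```

Dropping the division by the norms turns the score into the plain dot product.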
The Search Engine Business (1)
Services Provided
• Locating (most) useful web pages
• Two-step process: "Query & Find"
Then click-through based on summary
The Search Engine Business (2)
Revenue Model
• Maximizing traffic => advertisements, etc.
Lycos, Google, AltaVista, Excite, Metacrawler...
• Installing intranet searching for a fee or providing search technology to others
Inktomi, Verity, Google, Condor...
• Boosting glory/value of parent corporation
Infoseek => Disney
The Search Engine Business (3)
Hybrid Models
• Universal locators (people, locations, ...)
Metacrawler/GO2Net, Lycos...
• Hierarchical Content-based Browser
Yahoo clear first, later Lycos & others...
• Together with News, Stock-quotes, Chat-rooms, ....
Yahoo clear leader, now many others...
New Technologies (1)
Better Search Technologies
• Metasearch (combine output of multiple engines)
e.g. Metacrawler, Vivisimo
• Marrying IR with hand-built taxonomies
e.g. Yahoo originally, later most others
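One simple way to combine the output of multiple engines is reciprocal rank fusion. The slides do not say how Metacrawler or Vivisimo merge results, so this is a swapped-in technique shown only to illustrate the metasearch idea; the constant k = 60 is a conventional default, not a value from the course.

```python
from collections import defaultdict

def fuse_rankings(rankings, k=60):
    """Merge ranked URL lists from several engines.
    Each URL scores sum(1 / (k + rank)) over the lists that return it;
    ties are broken alphabetically so the output is deterministic."""
    scores = defaultdict(float)
    for ranked_urls in rankings:
        for rank, url in enumerate(ranked_urls, start=1):
            scores[url] += 1.0 / (k + rank)
    return sorted(scores, key=lambda url: (-scores[url], url))
```

A URL returned near the top by several engines rises above one returned by a single engine.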
New Technologies (2)
Better Search Technologies
• Ranking web sites by in-link density
e.g. Google,
Authorities = high in-link degree
Hubs = high out-link degree
$\mathrm{Rank} = \operatorname{Argmax}^{k}_{d_j \in D_{rel}} \left[ \sum_i \log_i(\mathrm{inlink}_i(d_j))\, a_i \right]$
• Marrying IR with Translation
e.g. AltaVista/Babel Fish, Google, …
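The in-link ranking idea on this slide can be illustrated with plain degree counts. This sketch assumes the link graph is available as (source, target) pairs and ranks the relevant pages by raw in-degree; it is not Google's actual algorithm, and the function names are hypothetical.

```python
from collections import Counter

def link_degrees(links):
    """links: iterable of (source_page, target_page) pairs.
    Returns (in_degree, out_degree) Counters; duplicate links count once."""
    unique = set(links)
    in_degree = Counter(target for _, target in unique)
    out_degree = Counter(source for source, _ in unique)
    return in_degree, out_degree

def rank_by_inlinks(relevant_pages, in_degree, k=10):
    """Top-k relevant pages by in-link count (ties broken alphabetically)."""
    return sorted(relevant_pages, key=lambda p: (-in_degree[p], p))[:k]
```

Pages with a high `in_degree` behave like authorities and pages with a high `out_degree` like hubs.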
New Technologies (3)
Better Mousetraps on the Drawing Board
• True Web-Based Translingual IR
• High-powered, more accurate search for a fee
(MMR, probabilistic IR search, quality filters,...)
• WebSearch + Summarization & Fusion
• Multimedia search for a fee
• Automatically generated Yahoo-like hierarchies
• Search part of the hidden-web (distributed IR)
New Technologies (4)
Better Mousetraps on the Drawing Board
• More comprehensive Web Crawlers
AltaVista indexes < 30% of web
Google indexes 2.0 Billion URLs < 50% of web
All others index much less...
• Generate answers to questions (not just ‘hits’)
[AskJeeves.com does not work well]
FAQs, helpdesks, networking to humans, ...
Optimizing WebSites for Searching (1)
Objectives
• Want your eCommerce site found easily by all potential customers
• Want your site to rank above the competition in web searches
• Want customers to stay within your eCommerce web site, once they find it
Optimizing WebSites for Searching (2)
Content Strategy
1. Build your first-pass web site
2. Generate alphabetized union of terms in your web site and in those of the primary competition.
e.g. "...amazing" "antelope" "antiques" "auction" ... "cars" "catalog" ...
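Step 2 is easy to automate. A minimal sketch, assuming each site is given as a list of page texts that have already been fetched and stripped of markup; the function names are hypothetical.

```python
import re

def site_terms(pages):
    """Set of lowercase words across one site's page texts."""
    return {w for text in pages for w in re.findall(r"[a-z]+", text.lower())}

def alphabetized_union(own_pages, competitor_sites):
    """Sorted union of the terms on your site and on each competitor site."""
    terms = site_terms(own_pages)
    for pages in competitor_sites:
        terms |= site_terms(pages)
    return sorted(terms)
```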
Optimizing WebSites for Searching (3)
Content Strategy
3. Filter out all terms not directly relevant to your business.
e.g. "auction" "antiques" "catalog"...
4. Expand the filtered list with synonyms or highly-related terms (dual of q-expansion)
e.g. "antique" => "antique, vintage, classic"
5. Where to put such terms? Edit your site to include the terms that fit naturally. For others…
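Steps 3 and 4 can be sketched as a filter followed by an expansion. Both the set of business-relevant terms and the synonym table are assumed to be built by hand; nothing here decides relevance automatically.

```python
def filter_and_expand(terms, relevant, synonyms):
    """Keep only terms in the hand-picked `relevant` set, then follow each
    kept term with its synonyms (hypothetical, hand-built table).
    Order is preserved and duplicates are dropped."""
    expanded = []
    for term in terms:
        if term not in relevant:
            continue
        for candidate in [term, *synonyms.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded
```

For example, with `relevant = {"antique", "auction"}` and `synonyms = {"antique": ["vintage", "classic"]}`, the term "antelope" is dropped and "antique" brings "vintage" and "classic" along with it.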
Optimizing WebSites for Searching (4)
Content Strategy
6. Include the rest of the terms "invisibly"
– Meta-tags for indexing
– Minuscule font for word lists (illegible text appears as background pattern)
– Text color = background color
– Minimize all extraneous text on portal page(s) (e.g. move text to other linked pages).
Optimizing WebSites for Searching Part II (1)
Find Key Competition
1. Complete first-pass web site (last slides)
2. Register with all search engines
3. Contract 20-to-50 potential "clients"
Optimizing WebSites for Searching Part II (2)
Find Key Competition
4. Have clients generate multiple queries for your eProduct or eService without knowing what’s in your web site. Try these queries on multiple search engines (except Authority and Frequency-biased ones like Google)
5. Find web sites that consistently rank higher in search (if any) via one or more engines
Optimizing WebSites for Searching Part II (3)
Analyze Key Competition
6. Find terms in competition web sites that match spontaneous queries (looking carefully at meta-tags, invisible fonts, etc.)
7. Add such terms to your web pages invisibly
8. Optionally remove more extraneous text from portal page(s)
9. Re-register with search engines, and iterate until your web site is near the top for most of the reasonable queries in most of the engines.
Optimizing WebSites for Searching Part II (4)
OPTIMIZE Your Site for Search engines
10. Remove maximal amount of non-key-word text (e.g. put it in linked pages, or as .gif files). Recall the denominator in cosine-similarity function.
11. Subdivide general entry pages into topically-specific ones (increase info-density wrt query).
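A small worked example of the denominator effect, assuming raw term counts as weights (no IDF) and whitespace tokenization. The page texts are made up for illustration.

```python
import math
from collections import Counter

def cosine(query, page):
    """Cosine similarity between two texts using raw term counts."""
    q, p = Counter(query.split()), Counter(page.split())
    dot = sum(q[w] * p[w] for w in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in p.values())))
    return dot / norm if norm else 0.0

query = "antique auction"
# Same two key words, plus eight extraneous words that lengthen the vector.
cluttered = "antique auction welcome to our site please enjoy your visit"
trimmed = "antique auction"
```

Both pages contain the same matching terms, so the dot product is identical. The cluttered page has a longer vector, so its score is 2/√20 (about 0.45) against roughly 1.0 for the trimmed page.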
Optimizing WebSites for Searching Part III (1)
Connectivity Strategy
• Make your term-laden pages attractive entry portals
• Link these search-engine entry pages strongly to home/entry page(s) if these are different
• Provide intra-site searching capability if your site has > 30 pages, where only on-site or associated text and pages are searched.
Optimizing WebSites for Searching Part III (2)
Connectivity Strategy
• Possibly hand off to general search engine upon failure of local search.
• Maximize the number of in-page links to entry portals from anywhere and everywhere else (internal and external).
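The local-search-then-hand-off pattern might look like the sketch below. It assumes the site's pages are held in memory as a URL-to-text map and that `fallback` is any callable wrapping a general search engine; both are hypothetical, and the all-words match stands in for a real ranked search.

```python
def site_search(query, site_pages, fallback=None):
    """site_pages: dict url -> page text.
    Returns ("local", urls) when on-site pages contain every query word,
    otherwise ("fallback", fallback(query)) if a fallback is supplied."""
    words = query.lower().split()
    hits = [url for url, text in site_pages.items()
            if all(w in text.lower().split() for w in words)]
    if hits or fallback is None:
        return "local", sorted(hits)
    return "fallback", fallback(query)
```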
IR-Related eCommerce Business Ideas (1)
eCLIP: Adaptable Electronic Clipping Service
• Goal: Personalized eNewspaper (weekly, daily, hourly)
• User sets interest profile
YES: "finance>eCommerce>technology" "science>astronomy"
NO: "sports" "politics>scandals"
KEY-TERMS: "ecommerce" "search engine" "IPO" "Hubble"
IR-Related eCommerce Business Ideas (2)
• Multiple newsfeeds are categorized on entry
...and filtered by user profiles
• Maximally-relevant & novel news is included
Next most relevant or less novel is summarized
Rest is ignored.
• User feedback automatically adjusts profile
(e.g. thumbs-down on more Amazon.com news, thumbs-up on Google, a new search engine)
• Revenue models: subscription, advertisement, ...
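The profile-filtering step of eCLIP can be sketched as follows, assuming each incoming item already carries a category path such as "science>astronomy". Novelty detection, summarization, and feedback-driven profile updates are left out, and key terms are matched as simple substrings.

```python
def matches(category, patterns):
    """True if the category path equals or falls under any pattern path."""
    return any(category == p or category.startswith(p + ">") for p in patterns)

def select_news(items, yes, no, key_terms):
    """items: list of (category_path, headline) pairs.
    Keep an item if its category is under a YES path or its headline
    contains a key term, unless its category is under a NO path."""
    kept = []
    for category, headline in items:
        if matches(category, no):
            continue                      # NO categories always win
        text = headline.lower()
        if matches(category, yes) or any(t in text for t in key_terms):
            kept.append(headline)
    return kept
```

Letting NO override KEY-TERMS is a design choice made here for simplicity; the slides do not specify the precedence.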
IR-Related eCommerce Business Ideas Part II (1)
ePUB: Customized Publishing
• Goal: Offer customized books (texts, trade, etc.)
• Index all offerings by chapter & section
• Permit user to search & browse
(using MMR, summarization, etc.)
IR-Related eCommerce Business Ideas Part II (2)
ePUB: Customized Publishing
• Assemble for user a customized bundle
(e.g. Ch 3-7 of "Intro to IR" + Ch 5-6 of "Web IR" + Ch 2 of "Applied Linear Algebra")
• Print, bind and ship 50+ copies
...or ship single copy electronically (e.g. via PDF)
IR-Related eCommerce Business Ideas Part II (3)
eFACT: Universal Q/A Database
• Goal: Answer any question over web
• Create large FAQ incrementally, categorized by subject areas
• Have humans answer questions over web
Pay for answers with free subscription? $$?
• If new question matches, give answer, else send to humans and resort to metasearch for relevant web-pages (not an answer, but best one can do for now), and email answer later.
• Essentially do AskJeeves the right way
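The match-or-escalate logic of eFACT, as a minimal sketch. Word-overlap (Jaccard) similarity and the 0.6 threshold are stand-ins chosen for brevity, not something the slides prescribe; a real system would more likely reuse the tf-idf cosine matcher from the search-engine slides.

```python
def jaccard(a, b):
    """Word-set overlap between two questions, in [0, 1]."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def answer(question, faq, threshold=0.6):
    """faq: dict mapping stored question -> answer text.
    Returns ("answer", text) on a close match, else ("escalate", question),
    which signals: route to a human, run metasearch, email the answer later."""
    best = max(faq, key=lambda q: jaccard(question, q), default=None)
    if best is not None and jaccard(question, best) >= threshold:
        return "answer", faq[best]
    return "escalate", question
```

Each escalated question that a human answers is added to `faq`, which is how the database grows incrementally.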
IR-Related eCommerce Business Ideas Part III (1)
iSELL: Meta Auction eSite
• Goal: The Metacrawler of Web Auction Sites
• User describes product she wants to sell
• iSELL finds best match to auction site(s) that sell(s) such products (similarity between description and auction offerings past and present)
• ...or auction site that gets best prices
• iSELL’s metaform automatically connects and lists product in one or several auction sites and de-lists when sold.
• iSELL gets a cut of the selling price + ad revenues
IR-Related eCommerce Business Ideas Part III (2)
WebRATE: Rating Service for eSites
• Goal: Nielsen’s or CU or USN&WR of the Web
• Find similarity to other sites, ...
• Sites pay to be rated by content, style, traffic, etc.