AlltheWeb
Torbjørn Kanestrøm
January 30th, 2003
Agenda
• Who is FAST?
• What do we do?
• Libraries: relevant projects we have done
• What is AlltheWeb?
• Under the Hood: Phrasing & Lemmatization
• Take a tour of AlltheWeb
– Simple searches (Web, News, Multimedia, FTP)
– Advanced Web Search
– Results Page
• Q & A
Who is FAST?
Office locations (map): San Francisco, Tokyo, Boston, Norway, Munich, Rome, London, Paris
• Fast Search & Transfer (FAST)
– Founded 1997
– Public company (Oslo Stock Exchange, June 2001)
– One of the fastest-growing companies in Europe
– Profitable
– 200 employees
– 40+ PhDs
– 12 offices worldwide
What we do…
Understand the Intention of a query
Understanding content
//TECHNOLOGY
Common Technology Platform
FAST Solutions
Enterprise
Portals
Partners
//BACKGROUND
FAST Customers & Partners
FAST is the creator of the real-time integrated search and filter technology solutions that are behind the scenes at some of the world's best-known companies with the world's most demanding search problems.
A few selected projects we have done: relevant to every librarian
Questia
Questia – the online library
Nordic Web Archive
The Nordic Web Archive is a cooperation between the Nordic national libraries (Finland, Sweden, Denmark, Norway, Iceland). The project started in 2000, with a datacenter built deep inside a mountain in northern Norway.
Collecting and archiving web documents of national interest and importance:
• Everything published in the national domains (.NO, .DK, .FI, etc.)
• Everything written on the web in the respective languages
• Everything referring to one of the countries (city, company, person, etc.)
Continuous project designed to scale indefinitely
Available to the research community, not a public site.
Elsevier Engineering Information
Compendex® is the most comprehensive interdisciplinary engineering database in the world with almost seven million records referencing 5,000 engineering journals and conference materials dating from 1970. The database is updated weekly.
• Combining scientific classification of the “deep web” and proprietary publications
“FAST’s core search technology has enabled us to provide the best scientific search results, period” - John Regazzi - Managing Director, Elsevier Science
Web Server
XML
//BUSINESS CASES
• 120M web pages
• 17M Elsevier Science publications
• Scientific classification
• Grouping and identification of related articles
• Leading science index
• Understanding content
• Scientific navigation
Scirus.com – the web’s Science search
What is AlltheWeb?
• Showcase for FAST technology
– Test new search features with a real live audience
– Several million queries per day
– 40% North America, 30% Europe, and 30% rest of world
• Integrated interface for searching
– 2.1+ billion web pages, PDF docs, MS Word docs, and Flash objects
– Continuously refreshed news from 5000+ global/local news sources
– 150 million images and videos
– 130 million FTP files
– 2 million MP3 files
• Targeted at advanced searchers
What makes AlltheWeb different?
• Versatility
– Searching in 49 languages
– Six separate catalogues (Web, News, Pictures, Videos, MP3, FTP)
– Fully customizable front-end (only major search site that is XHTML/CSS compliant)
• Solid index
– 2.5 billion web objects (pages, pictures, videos, MP3s, etc.)
– One of the fastest refresh cycles (every 7–14 days)
• Advanced search features
– Boolean search
– Embedded content selectors
– Domain and IP filtering
– File format and size filtering
– Much more...
Under the Hood: Phrasing & Lemmatization
Under the Hood: Phrasing/Anti-Phrasing
• Phrasing: Known phrases are matched as a phrase
– New York → “New York”
– Based on common phrases, names, movie names, geographic names, etc.
– Can detect multiple phrases within the same query
• Anti-Phrasing: Remove words irrelevant to the query
– Who is…
– What is…
• Combines to create a better query
– Who is George Bush → “George Bush”
– What is the age of the earth → “the age of the earth”
– How do I get to train station in New York → “get to” “train station” in “New York”
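The phrasing and anti-phrasing steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not FAST's implementation: the two word lists and the `rewrite_query` function are hypothetical, and a production system would draw on large dictionaries of names, titles, and places.

```python
import re

# Hypothetical, tiny dictionaries for illustration only; the slides do not
# list FAST's actual phrase or anti-phrase data.
KNOWN_PHRASES = ["george bush", "new york", "train station", "get to"]
ANTI_PHRASES = ["who is", "what is", "how do i"]

# One alternation, longest phrases first, so a longer phrase wins when two
# candidates start at the same position. A single substitution pass means
# matches never overlap.
_PHRASE_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(p) for p in sorted(KNOWN_PHRASES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def rewrite_query(query: str) -> str:
    """Drop a leading anti-phrase, then quote every known phrase."""
    text = " ".join(query.split())  # normalize whitespace

    # Anti-phrasing: remove a question prefix that adds nothing to the search.
    for prefix in ANTI_PHRASES:
        if text.lower().startswith(prefix + " "):
            text = text[len(prefix) + 1:]
            break

    # Phrasing: wrap each known phrase in quotes, keeping the user's casing.
    return _PHRASE_RE.sub(lambda m: f'"{m.group(0)}"', text)


print(rewrite_query("Who is George Bush"))
print(rewrite_query("How do I get to train station in New York"))
```

Words that match no known phrase, such as "in" in the second query, pass through unchanged.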
Under the Hood: Lemmatization
• Lemmatization improves recall
– Literal matching only finds a fraction of candidates for a query
• Ratio between base and full forms
– English: 2
– German, French, Spanish: 5–10
– Russian, Polish: 40+
• Typical cases: singular/plural variation, case marking, etc.
• Stemming vs. lemmatization
– Traditional stemming
  • Term is stemmed according to rules, e.g. walking → walk
  • Can easily result in “false” stemmings, e.g. Bobby Browning → Bobby Brown
– Lemmatization
  • Rewriting of terms is controlled by language-sensitive dictionaries
  • Very comprehensive dictionaries; about 20 “man years”
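The difference between the two approaches can be shown with a minimal sketch. Both functions and the miniature dictionary below are hypothetical stand-ins: the stemmer applies a single suffix rule, and the lemmatizer looks terms up in a word list, which is the general idea the slide describes rather than FAST's actual dictionaries.

```python
# Hypothetical miniature dictionary; real lemmatization dictionaries are
# language-specific and far larger.
LEMMA_DICTIONARY = {"walking": "walk", "walked": "walk", "walks": "walk"}


def naive_stem(term: str) -> str:
    """Rule-based stemming: strip a trailing 'ing' with no vocabulary check."""
    lowered = term.lower()
    if lowered.endswith("ing") and len(lowered) > 5:
        return lowered[:-3]
    return lowered


def lemmatize(term: str) -> str:
    """Dictionary lookup: rewrite only the terms the dictionary knows."""
    lowered = term.lower()
    return LEMMA_DICTIONARY.get(lowered, lowered)


for word in ("walking", "Browning"):
    print(word, "| stem:", naive_stem(word), "| lemma:", lemmatize(word))
```

The rule-based stemmer turns the surname "Browning" into "brown", the false stemming mentioned above, while the dictionary lookup leaves it alone because it is not a listed word form.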
Take a Tour
AlltheWeb Home Page
Simple Search (Web/News)
• Web- and News Search
• Picture-, Video- and MP3 Search
• FTP Search
Language detection
Your query
Match exact phrase: similar to using quotes around your query
“WebSearch University”
Simple Search (Rich Media)
Simple Search (FTP)
Query/Expression
Select between 13 different matching algorithms
Advanced Web Search
Select Search Type: All the words (AND), Any of the words (OR), The exact phrase, Boolean expression
Language/Charset: 49 languages; most-used character sets
Term / Phrase: Only one phrase/word per filter. Add more filters if necessary.
Where To Match: Body text, page title, URL, hostname, links on the page
Embedded Content: Exclude or include pages based on embedded content on these pages
Specific Date Range and Document Depth
File Type: Limits results to PDF, MS Word, and Macromedia Flash files
Save Settings: Keep knobs in the same position when you return
Page Depth/Type: Filter based on depth of URL and whether ~ occurs in URL
IP-address Filter: For the especially interested. Supports most common IP-address and IP-range syntaxes
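The first three search types in the list above (all the words, any of the words, the exact phrase) differ only in how query words are compared with a document. A rough sketch of those semantics, assuming simple whitespace tokenization; the `matches` function and its `search_type` values are hypothetical names, not AlltheWeb's query syntax:

```python
def matches(document: str, query: str, search_type: str) -> bool:
    """Check a document against a query under one of three search types."""
    doc_words = document.lower().split()
    query_words = query.lower().split()

    if search_type == "all":  # All the words (AND)
        return all(word in doc_words for word in query_words)
    if search_type == "any":  # Any of the words (OR)
        return any(word in doc_words for word in query_words)
    if search_type == "phrase":  # The exact phrase: words adjacent and in order
        n = len(query_words)
        return any(
            doc_words[i:i + n] == query_words
            for i in range(len(doc_words) - n + 1)
        )
    raise ValueError(f"unknown search type: {search_type}")


doc = "FAST search and transfer technology"
print(matches(doc, "search technology", "all"))
print(matches(doc, "search technology", "phrase"))
```

Both query words occur in the sample document, so the AND search succeeds, but they are not adjacent, so the exact-phrase search does not.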
Advanced Web Search (cont.)
Domain Filters: Only include and/or exclude results from a domain
Region Filter: Limit results to different regions
Document Size: Specify size of document. Supports exact, less than, or more than
Presentation: How many search results to list per page
Offensive Content: Whether or not to filter out/reduce results with offensive content
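As a sketch of how include/exclude domain filters of this kind can behave, the snippet below keeps a result only if its hostname falls inside an included domain (when any are given) and outside every excluded one. The function names and sample URLs are hypothetical; only `urllib.parse.urlsplit` from the Python standard library is assumed.

```python
from urllib.parse import urlsplit


def host_in_domain(url: str, domain: str) -> bool:
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    domain = domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


def apply_domain_filters(urls, include=(), exclude=()):
    """Keep URLs inside any included domain and outside all excluded ones."""
    kept = []
    for url in urls:
        if include and not any(host_in_domain(url, d) for d in include):
            continue  # an include list was given and this host is not in it
        if any(host_in_domain(url, d) for d in exclude):
            continue  # host falls under an excluded domain
        kept.append(url)
    return kept


results = [
    "http://www.alltheweb.com/help",
    "http://news.example.no/article",
    "http://example.com/",
]
print(apply_domain_filters(results, include=["no"]))
print(apply_domain_filters(results, exclude=["example.com"]))
```

Matching on a leading dot boundary keeps a filter for `example.com` from also catching an unrelated host such as `notexample.com`.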
The Result Page
Search Bar: Click tabs to send query to other catalogs
News Result: Flashed in results from real-time News Search catalog
Web Results: Search results from web pages, PDF and MS Word files, and Macromedia Flash files
Site Collapsing: Show the other hits from this site
Query Rewriting: Did we rewrite your query? Gives you full control!
Paid Content: Revenue funds new features at AlltheWeb
Related Queries
Multimedia Results: Results from other catalogues
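Site collapsing, as listed above, can be pictured as grouping a ranked result list by hostname and showing only the best hit per site, with the rest available on request. A minimal sketch under that assumption; `collapse_by_site` is a hypothetical helper, not AlltheWeb's implementation:

```python
from urllib.parse import urlsplit


def collapse_by_site(ranked_urls):
    """Keep the first (best-ranked) hit per host and count the hidden ones."""
    collapsed = {}  # host -> [top_url, hidden_count]; dicts keep insertion order
    for url in ranked_urls:
        host = (urlsplit(url).hostname or "").lower()
        if host in collapsed:
            collapsed[host][1] += 1  # another hit from an already-shown site
        else:
            collapsed[host] = [url, 0]
    return [(top, hidden) for top, hidden in collapsed.values()]


ranked = [
    "http://a.example/1",
    "http://b.example/1",
    "http://a.example/2",
    "http://a.example/3",
]
for top, hidden in collapse_by_site(ranked):
    print(top, f"(+{hidden} more from this site)")
```

The hidden count is what a "show the other hits from this site" link would expand.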
www.AllTheWeb.com
Has all the advanced search features and functions that you can find on all other major web search engines, combined...
And we innovate at a faster pace and invest more in R&D than ever before.
AlltheWeb Q&A