This slideset presents the Nutch search engine (http://lucene.apache.org/nutch). A high-level architecture is described, as well as some challenges common in web-crawling and solutions implemented in Nutch. The presentation closes with a brief look into the Nutch future.
Nutch - ApacheCon US 09
Web-scale search engine toolkit
Today and tomorrow
Andrzej Białecki, ab@apache.org
Apache

Agenda
About the project
Web crawling in general
Nutch architecture overview
Nutch workflow:
  • Setup
  • Crawling
  • Searching
Challenges (and some solutions)
Nutch present and future
Questions and answers

Apache Nutch project
Founded in 2003 by Doug Cutting, the Lucene creator, and Mike Cafarella
Apache project since 2004 (sub-project of Lucene)
Spin-offs:
  • Map-Reduce and distributed FS → Hadoop
  • Content type detection and parsing → Tika
Many installations in operation, mostly vertical search
Collections typically 1 mln - 200 mln documents

Web as a directed graph
Nodes (vertices): URL-s as unique identifiers
Edges (links): hyperlinks like <a href="targetUrl">
Edge labels: <a href="">anchor text</a>
Often represented as adjacency (neighbor) lists
Traversal: follow the edges, breadth-first, depth-first, random
[Figure: directed graph with nodes 1-9]
Adjacency lists:
  1 → 2, 3, 4, 5, 6
  5 → 6, 9
  7 → 3, 4, 8, 9

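The adjacency-list representation and a breadth-first traversal can be sketched in a few lines of Python. The graph literal copies the three adjacency lists from the slide; the name bfs_order is made up for this illustration.

```python
from collections import deque

# Adjacency lists from the slide; nodes without an entry have no known outlinks.
WEB_GRAPH = {
    1: [2, 3, 4, 5, 6],
    5: [6, 9],
    7: [3, 4, 8, 9],
}

def bfs_order(graph, start):
    """Return the nodes reachable from `start` in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for target in graph.get(node, []):
            if target not in seen:      # visit every node at most once
                seen.add(target)
                queue.append(target)
    return order
```

Starting from node 1, nodes 7 and 8 are never reached: a traversal only sees what is linked from its starting points.
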
What's in a search engine?
… a few things that may surprise you!

Search engine building blocks
[Diagram: Injector, Scheduler, Crawler, Parser, Updater, Indexer, Searcher; Web graph (page info, in/out links); Content repository; Crawling frontier controls]

Nutch features at a glance
Page database and link database (web graph)
Plugin-based, highly modular
  - Most behavior can be changed via plugins
Multi-protocol, multi-threaded, distributed crawler
Plugin-based content processing (parsing, filtering)
Robust crawling frontier controls
Scalable data processing framework
  - Map-reduce processing
Full-text indexer & search engine
  - Using Lucene or Solr
  - Support for distributed search
Robust API and integration options

Hadoop foundation
File system abstraction
  • Local FS, or
  • Distributed FS
    - also Amazon S3, Kosmos and other FS impl.
Map-reduce processing
  • Currently central to Nutch algorithms
  • Processing tasks are executed as one or more map-reduce jobs
  • Data maintained as Hadoop MapFile-s / SequenceFile-s
  • Massive updates very efficient
  • Small updates costly
    - Hadoop data is immutable, once created
    - Updates are really merge & replace

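Because stored data is immutable, an "update" amounts to merging the old sorted data with a batch of changes and writing a new copy. A minimal Python sketch of that idea follows. It uses in-memory lists of (key, value) pairs as a stand-in for Hadoop's file formats, and merge_and_replace is a hypothetical name, not a Hadoop or Nutch API.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_and_replace(old_records, updates):
    """Merge two key-sorted lists of (key, value); on equal keys the update wins.

    Returns a new sorted list. Neither input is modified, mirroring how an
    immutable data file is replaced rather than edited in place.
    """
    # Tag records with their origin so that, per key, updates sort last.
    tagged_old = ((key, 0, value) for key, value in old_records)
    tagged_new = ((key, 1, value) for key, value in updates)
    merged = heapq.merge(tagged_old, tagged_new, key=itemgetter(0, 1))
    result = []
    for key, group in groupby(merged, key=itemgetter(0)):
        *_, last = group                 # last record per key is the newest
        result.append((key, last[2]))
    return result
```

Even a one-record update walks and rewrites the whole data set, which is why small updates are costly and large batches are efficient.
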
Nutch building blocks
[Diagram: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; data stores CrawlDB, LinkDB, Shards (segments); plugins: URL filters & normalizers, parsing/indexing filters, scoring plugins]

Nutch data
[Diagram: Nutch building blocks, as above]
CrawlDB maintains info on all known URL-s:
  • Fetch schedule
  • Fetch status
  • Page signature
  • Metadata

Nutch data
[Diagram: Nutch building blocks, as above]
LinkDB: for each target URL, keeps info on incoming links, i.e. the list of source URL-s and their associated anchor text

Nutch data
[Diagram: Nutch building blocks, as above]
Shards ("segments") keep:
  • Raw page content
  • Parsed content + discovered metadata + outlinks
  • Plain text for indexing and snippets
  • Optional per-shard indexes

Nutch shards (a.k.a. "segments")
Unit of work (batch) - easier to process massive datasets
Convenience placeholder, using predefined directory names
Unit of deployment to the search infrastructure
May include per-shard Lucene indexes
Once completed they are basically unmodifiable
  - No in-place updates of content, or replacing of obsolete content
Periodically phased-out
[Figure: three shard directories (e.g. 2009043012345), each containing crawl_generate, crawl_fetch, content, crawl_parse, parse_data, parse_text, and in one case indexes; labels: Generator, Fetcher, Parser, Indexer, snippets, "cached" view]

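Since a shard is a directory with predefined sub-directory names, a completeness check is easy to sketch in Python. The directory names are the ones listed on the slide; the function name and the idea of using it as a pre-indexing check are assumptions made for this illustration.

```python
from pathlib import Path

# Sub-directory names listed on the slide; "indexes" is optional.
REQUIRED_PARTS = ("crawl_generate", "crawl_fetch", "content",
                  "crawl_parse", "parse_data", "parse_text")

def missing_shard_parts(shard_dir):
    """Return the required sub-directories that are absent from `shard_dir`."""
    shard = Path(shard_dir)
    return [name for name in REQUIRED_PARTS if not (shard / name).is_dir()]
```

An empty result means all listed parts are present; a shard that is still being processed would report the parts it has not produced yet.
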
URL life-cycle and shards
A time-lapse view of reality
Goals:
  • Fresh index → sync with the real rate of changes
  • Minimize re-fetching → don't fetch unmodified pages
Each page may need its own crawl schedule
Shard management:
  • Page may be present in many shards, but only the most recent record is valid
  • Inconvenient to update in-place, just mark as deleted
  • Phase-out old shards and force re-fetch of the remaining pages
[Figure: timeline of a URL across shards S1, S2, S3: injected / discovered, scheduled for fetching, fetched / up-to-date, repeated in each shard; real page changes vs. observed changes along the time axis]

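One way to give each page its own crawl schedule is a simple adaptive rule: shorten the re-fetch interval when the page was found modified, lengthen it when it was not. The Python sketch below illustrates the idea only; the rates and bounds are illustrative assumptions, not values taken from Nutch.

```python
def next_fetch_interval(interval, modified, dec_rate=0.2, inc_rate=0.4,
                        min_interval=60.0, max_interval=30 * 24 * 3600.0):
    """Return the new re-fetch interval in seconds."""
    if modified:
        interval *= (1.0 - dec_rate)    # page changes often: come back sooner
    else:
        interval *= (1.0 + inc_rate)    # page unchanged: back off
    # Keep the interval within sane bounds.
    return max(min_interval, min(max_interval, interval))
```

Applied after every fetch, the interval drifts toward the page's observed rate of change, which serves both goals: a fresh index and fewer fetches of unmodified pages.
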
Crawling frontier
No authoritative catalog of web pages
Search engines discover their view of web universe
  - Start from "seed list"
  - Follow (walk) some (useful, interesting) outlinks
Many dangers of simply wandering around
  - explosion or collapse of the frontier
  - collecting unwanted content (spam, junk, offensive)
[Cartoon caption: "I could use some guidance"]

Controlling the crawling frontier
URL filter plugins
  - White-list, black-list, regex
  - May use external resources (DB-s, services ...)
URL normalizer plugins
  - Resolving relative path elements
  - "Equivalent" URLs
Additional controls using scoring plugins
  - priority, metadata select/block
Crawler traps - difficult!
  - Domain / host / path-level stats
[Figure: frontier expanding from the seed over iterations i = 1, i = 2, i = 3]

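The filter and normalizer steps can be sketched in Python as follows. This is an illustration under stated assumptions, not Nutch plugin code: the rules, the example.org domain and the function names are hypothetical, and the first-match-wins rule order is a design choice made here.

```python
import posixpath
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical rules: the first matching pattern decides; '+' accepts, '-' rejects.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$", re.IGNORECASE)),
    ("-", re.compile(r"[?*!@=]")),           # skip URLs that look like queries
    ("+", re.compile(r"^https?://([a-z0-9-]+\.)*example\.org/")),
]

def normalize(url):
    """Lower-case scheme and host, resolve '.' and '..', drop the fragment."""
    parts = urlsplit(url)
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and path != "/":
        path += "/"                          # normpath strips the trailing slash
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

def accept(url):
    """Return the normalized URL if a '+' rule matches first, else None."""
    url = normalize(url)
    for sign, pattern in RULES:
        if pattern.search(url):
            return url if sign == "+" else None
    return None                              # no rule matched: reject
```

Normalizing before filtering means "equivalent" URLs are judged, and later stored, in one canonical form.
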
Wide vs. focused crawling
Differences:
  • Little technical difference in configuration
  • Big difference in operations, maintenance and quality
Wide crawling:
  - (Almost) Unlimited crawling frontier
  - High risk of spamming and junk content
  - "Politeness" a very important limiting factor
  - Bandwidth & DNS considerations
Focused (vertical or enterprise) crawling:
  - Limited crawling frontier
  - Bandwidth or politeness is often not an issue
  - Low risk of spamming and junk content

Vertical & enterprise search
Vertical search
  • Range of selected "reference" sites
  • Robust control of the crawling frontier
  • Extensive content post-processing
  • Business-driven decisions about ranking
Enterprise search
  • Variety of data sources and data formats
  • Well-defined and limited crawling frontier
  • Integration with in-house data sources
  • Little danger of spam
  • PageRank-like scoring usually works poorly

Face to face with Nutch

Simple set-up and running
You already have Java 5+, right?
Get a 1.0 release or a nightly build (pretty stable)
Simple search setup uses Tomcat for search web app
Command-line bash script: bin/nutch
  • Windows users: get Cygwin
  • Early version of a web-based UI console
  • Hadoop web-based monitoring

Configuration files
Edit configuration in conf/nutch-site.xml
  • Check nutch-default.xml for defaults and docs
  • You MUST at least fill the name of your agent
Active plugins configuration
Per-plugin properties
External configuration files
  • regex-urlfilter.txt
  • regex-normalize.xml
  • parse-plugins.xml: mapping of MIME type to plugin

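The one mandatory setting is the agent name. A minimal Python sketch of generating such an override file follows. It assumes the Hadoop-style configuration / property / name / value layout and assumes that the agent-name property is called http.agent.name; both should be checked against nutch-default.xml. The helper name and the example value are made up.

```python
import xml.etree.ElementTree as ET

def build_site_config(properties):
    """Return configuration XML that overrides the given properties."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

# Assumed property name for the agent; the value is a made-up example.
xml_text = build_site_config({"http.agent.name": "MyTestCrawler"})
```

Only the properties that differ from the defaults need to appear in the site file.
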
Nutch plugins
Plugin-based extensions for:
  • Crawl scheduling
  • URL filtering and normalization
  • Protocol for getting the content
  • Content parsing
  • Text analysis (tokenization)
  • Page signature (to detect near-duplicates)
  • Indexing filters (index fields & metadata)
  • Snippet generation and highlighting
  • Scoring and ranking
  • Query translation and expansion (user → Lucene)

Main Nutch workflow
Inject: initial creation of CrawlDB
  • Insert seed URLs
  • Initial LinkDB is empty
Generate new shard's fetchlist
Fetch raw content
Parse content (discovers outlinks)
Update CrawlDB from shards
Update LinkDB from shards
Index shards
(repeat)
Command-line: bin/nutch
  inject
  generate
  fetch
  parse
  updatedb
  invertlinks
  index / solrindex

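The cycle can be made concrete with a toy in-memory simulation in Python. Everything here is a stand-in: TOY_WEB replaces real fetching and parsing, the dictionaries replace CrawlDB and LinkDB, and the function name is made up. It shows only the order of the steps and what each one updates.

```python
# Toy "web": page -> outlinks. Stands in for real fetching and parsing.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["a", "e"],
    "e": [],
}

def crawl(seeds, rounds, top_n):
    crawldb = {url: "unfetched" for url in seeds}           # inject
    linkdb = {}                                             # initially empty
    for _ in range(rounds):
        fetchlist = sorted(u for u, status in crawldb.items()
                           if status == "unfetched")[:top_n]   # generate
        if not fetchlist:
            break
        # fetch + parse: the new shard holds each page's outlinks
        shard = {url: TOY_WEB.get(url, []) for url in fetchlist}
        for url, outlinks in shard.items():
            crawldb[url] = "fetched"                        # update CrawlDB
            for target in outlinks:
                crawldb.setdefault(target, "unfetched")     # newly discovered
                linkdb.setdefault(target, set()).add(url)   # update LinkDB
    return crawldb, linkdb
```

After two rounds from seed "a", pages a, b and c are fetched and d is known but not yet fetched: the frontier grows by one link distance per cycle.
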
Injecting new URL-s
[Diagram: Nutch building blocks, as above]

Generating fetchlists
[Diagram: Nutch building blocks, as above]

Workflow: generating fetchlists
What to fetch next?
  • Breadth-first - important due to "politeness" limits
  • Expired (longer than fetchTime + fetchInterval)
  • Highly ranking (PageRank)
  • Newly added
Fetchlist generation
  • "topN" - select best candidates
    - Priority based on many factors, pluggable
Adaptive fetch schedule
  • Detect rate of changes & time of change
  • Detect unmodified content
    - Hmm, and how to recognize this? → approximate page signatures (near-duplicate detection)

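The "expired" test and the "topN" selection can be sketched in Python as follows. The record fields mirror the names on the slide (fetchTime, fetchInterval, a score); the class and function names are hypothetical, and using the score alone as the priority is a simplification of the pluggable priority mentioned above.

```python
import heapq
from dataclasses import dataclass

@dataclass
class CrawlRecord:
    url: str
    fetch_time: float       # when the page was last fetched (epoch seconds)
    fetch_interval: float   # seconds until the page counts as expired
    score: float            # query-independent importance

def generate_fetchlist(records, now, top_n):
    """Pick the `top_n` highest-scoring records that are due for fetching."""
    due = [r for r in records if now >= r.fetch_time + r.fetch_interval]
    return heapq.nlargest(top_n, due, key=lambda r: r.score)
```

A newly added URL can be given a fetch time of zero so that it is always due; whether it makes the cut then depends only on its score.
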
Fetching content
[Diagram: Nutch building blocks, as above]

Workflow: fetching
Multi-protocol: HTTP(s), FTP, file, etc.
Coordinates multiple threads accessing the same host
  • "politeness" issues vs. crawling efficiency
  • Host: an IP or DNS name?
  • Redirections: should follow immediately? What kind?
Other netiquette issues:
  • robots.txt: disallowed paths, Crawl-Delay
Preparation for parsing:
  • Content type detection issues
  • Parsing usually executed as a separate step (resource hungry and sometimes unstable)

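The robots.txt rules a fetcher has to honor can be illustrated with Python's standard urllib.robotparser module. This shows the checks themselves, not how Nutch implements them; the robots.txt content, the agent name and the URLs are made-up examples.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content, as if it had already been downloaded.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def load_rules(robots_txt):
    """Parse robots.txt content into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

rules = load_rules(ROBOTS_TXT)
```

With these rules, rules.can_fetch("MyTestCrawler", url) is false for anything under /private/, and rules.crawl_delay("MyTestCrawler") gives the number of seconds to wait between requests to that host.
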
Content processing
[Diagram: Nutch building blocks, as above]

Workflow: content processing
Protocol plugins retrieve content as plain bytes + protocol-level metadata (e.g. HTTP headers)
Parse plugins
  • Content is parsed by MIME-type specific parsers
  • Content is parsed into parseData (title, outlinks, other metadata) and parseText (plain text content)
Nutch supports many popular file formats

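For HTML, the split into parseData and parseText can be sketched with Python's standard html.parser module. This is a simplified illustration, not a Nutch parse plugin: the class and function names are made up, and real parsers also handle encodings, scripts, malformed markup and much more.

```python
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Collect the title, outlinks with anchor text, and plain text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.outlinks = []        # list of (href, anchor text)
        self.text_parts = []
        self._in_title = False
        self._href = None         # href of the <a> element being read
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.outlinks.append((self._href, " ".join(self._anchor)))
            self._href = None

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self._in_title:
            self.title = data
        else:
            self.text_parts.append(data)
            if self._href is not None:
                self._anchor.append(data)

def parse_html(content, encoding="utf-8"):
    """Turn raw bytes into (parse_data, parse_text)."""
    parser = PageParser()
    parser.feed(content.decode(encoding, errors="replace"))
    parser.close()
    parse_data = {"title": parser.title, "outlinks": parser.outlinks}
    parse_text = " ".join(parser.text_parts)
    return parse_data, parse_text
```

The outlinks feed the CrawlDB and LinkDB updates, while the plain text is what gets indexed and used for snippets.
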
Link inversion
[Diagram: Nutch building blocks, as above]

Workflow: link inversion
Pages have outgoing links (outlinks)
  … I know where I'm pointing to
Question: who points to me?
  … I don't know, there is no catalog of pages
  … NOBODY knows for sure either!
In-degree indicates importance of the page
Anchor text provides important semantic info
Partial answer: invert the outlinks that I know about, and group by target
[Figure: outlinks from a source page (src 1) to target pages (tgt 2 ... tgt 5), and the inverted view in which a target page (tgt 1) lists its source pages (src 2 ... src 5)]

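Inverting and grouping by target fits the map-reduce pattern naturally. A minimal single-process Python sketch follows; the function name and the input shape are assumptions made for this illustration.

```python
from collections import defaultdict

def invert_links(outlinks_by_source):
    """Turn {source: [(target, anchor), ...]} into {target: [(source, anchor), ...]}."""
    inlinks = defaultdict(list)
    for source, outlinks in outlinks_by_source.items():
        for target, anchor in outlinks:
            inlinks[target].append((source, anchor))   # "map": emit keyed by target
    # "reduce": one sorted list of incoming links per target
    return {target: sorted(links) for target, links in inlinks.items()}
```

The length of each list is the known in-degree of the target, and the collected anchor texts describe the target in other pages' words.
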
Page importance - scoring
[Diagram: Nutch building blocks, as above]

Workflow: page scoring
Query-independent importance factor
  • Affects search ranking
  • Affects crawl prioritization
May include arbitrary decision of the operator
Currently two systems (plugins + tools)
OPIC scoring
  • Doesn't require explicit steps - "online"
  • Difficult to stabilize
PageRank scoring with loop detection
  • Periodically run to update CrawlDb scores
  • Computationally intensive, esp. loop detection

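As a reference point for link-based scoring, here is a textbook power-iteration PageRank in Python. It is a sketch of the general algorithm only: it has no loop detection, is not Nutch's implementation, and the damping factor and iteration count are conventional illustrative values.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over {node: [outlinks]}."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for source, targets in graph.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

Every iteration touches every known link, which hints at why running this periodically over a large web graph is computationally intensive.
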
Indexing
[Diagram: Nutch building blocks, as above]

Workflow: indexing
Indexing plugins
  • Create full-text search index from segment data
  • Apply adjustments to score per document
  • May further post-process the parsed content (e.g. language identification) to facilitate advanced search
Indexers
  • Lucene indexer - builds indexes to be served by Nutch search servers
  • Solr indexer - submits documents to a Solr instance

De-duplication
[Diagram: Nutch building blocks, as above]

Workflow: de-duplication
The same page may be present in many shards
  • Obsolete versions in older shards
  • Mirrored pages, or equivalent (a.com → www.a.com)
Many other pages are almost identical
  • Template-related differences (banners, current date)
  • Font / layout changes, minor re-wording
Hmm … what is a significant change?
  • Tricky issue! Hint: what is the page content?
Near-duplicate detection and removal
  • Nutch uses approximate page signatures (fingerprints)
  • Duplicates are only marked as deleted

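One possible way to build an approximate signature is to hash a coarse term-frequency profile of the text instead of its exact bytes, so that small differences do not change the result. The Python sketch below illustrates that general idea; it is not presented as the algorithm Nutch uses, and the function name and parameters are made up.

```python
import hashlib
import re
from collections import Counter

def approximate_signature(text, quant=2, min_token_len=2):
    """Hash a coarse term-frequency profile instead of the exact bytes."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if len(t) >= min_token_len)
    profile = []
    for token, freq in counts.items():
        rounded = (freq // quant) * quant    # quantize; rare tokens drop to 0
        if rounded > 0:
            profile.append((rounded, token))
    # Canonical order: most frequent first, ties broken alphabetically.
    profile.sort(key=lambda item: (-item[0], item[1]))
    canonical = "\n".join(f"{token} {freq}" for freq, token in profile)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

Two pages that differ only in a rarely occurring word get the same signature. The price is false positives: very short pages whose tokens are all rare collapse to the same empty profile, so the quantization step has to be tuned.
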
Cycles may overlap
[Diagram: Nutch building blocks, as above]

Nutch distributed search
Multiple search servers
Search front-end dispatches queries and collects search results
  • multiple access methods (API, OpenSearch XML, JSON)
Page servers build snippets
Fault-tolerant (with degraded quality)
[Diagram: Search front-end(s) (Web app, API, XML, JSON, NutchBean) dispatching to search servers (Nutch or Solr), each holding indexed shards]

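The dispatch-and-collect step, including the "fault-tolerant with degraded quality" behavior, can be sketched in Python as follows. The servers are modeled as plain callables and queried one after another for brevity; a real front-end would query them in parallel. All names are hypothetical.

```python
import heapq

def search_all(servers, query, top_k):
    """Query every server, skip the ones that fail, merge hits by score.

    `servers` is a list of callables: query -> list of (score, doc_id),
    each list already sorted by descending score.
    Returns (hits, failed_count).
    """
    partial_results = []
    failed = 0
    for server in servers:
        try:
            partial_results.append(server(query))
        except Exception:
            failed += 1        # degrade quality instead of failing the query
    # Merge the sorted partial lists lazily and keep only the best top_k.
    merged = heapq.merge(*partial_results, key=lambda hit: -hit[0])
    return [hit for _, hit in zip(range(top_k), merged)], failed
```

When a server is down, the query still returns, but the hits from that server's shards are missing from the merged list.
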
Search configuration
Nutch syntax is limited - on purpose!
  • Some queries are costly, e.g. leading wildcard, very long
  • Some queries may need implicit (hidden) expansion
Query plugins
  • From user query to Lucene/Solr query
User query: web search
Expanded query:
  +(url:web^4.0 anchor:web^2.0 content:web title:web^1.5 host:web^2.0)
  +(url:search^4.0 anchor:search^2.0 content:search title:search^1.5 host:search^2.0)
  url:"web search"~10^4.0 anchor:"web search"~4^2.0 content:"web search"~10 title:"web search"~10^1.5 host:"web search"~10^2.0
Search server configuration
  • Single search vs. distributed
  • Using Nutch searcher, or Solr, or a mix
  • Currently no global IDF calculation

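The expansion of the example query can be reproduced with a short Python function. The field names, boosts and phrase slops are read off the "web search" example on the slide, written in Lucene field:term^boost syntax; the function itself is an illustrative sketch, not the Nutch query plugin code.

```python
# Fields with their boosts and phrase slops, read off the example query.
FIELDS = [
    # (field, boost, phrase slop)
    ("url", 4.0, 10),
    ("anchor", 2.0, 4),
    ("content", 1.0, 10),
    ("title", 1.5, 10),
    ("host", 2.0, 10),
]

def _boost(boost):
    """Render a boost suffix; a boost of 1.0 is left implicit."""
    return "" if boost == 1.0 else f"^{boost}"

def expand_query(user_query):
    terms = user_query.split()
    clauses = []
    for term in terms:          # every term is required in at least one field
        per_field = " ".join(f"{f}:{term}{_boost(b)}" for f, b, _ in FIELDS)
        clauses.append(f"+({per_field})")
    if len(terms) > 1:          # optional phrase clauses reward adjacent terms
        phrase = " ".join(terms)
        clauses.append(" ".join(f'{f}:"{phrase}"~{s}{_boost(b)}'
                                for f, b, s in FIELDS))
    return " ".join(clauses)
```

A two-word user query turns into fifteen field clauses, which is the "implicit (hidden) expansion" and one reason the user-facing syntax is kept limited.
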
Deployment: single-server
The easiest but the most limited deployment
Centralized storage, centralized processing
Hadoop LocalFS and LocalJobTracker
Drop nutch.war in Tomcat/webapps and point to shards
[Diagram: Search front-end, Fetcher, Indexer and DB mgmt, all on LocalFS storage]

Deployment: distributed search
Local storage on search servers preferred (perf)
[Diagram: Search front-end dispatching to several search servers; Fetcher, Indexer and DB mgmt on LocalFS storage]

Deployment: map-reduce processing & distrib. search
Fully distributed storage and processing
Local storage on search servers preferred (perf)
[Diagram: Search front-end and several search servers; DFS Name Node and Job Tracker running Fetcher, Indexer and DB mgmt; several DFS Data Nodes, each with a Task Tracker]

Nutch on a Hadoop cluster
Assumes an up & running Hadoop cluster
  … which is covered elsewhere
Build nutch.job jar and use it as usual:
  bin/hadoop jar nutch.job <className> <args>
Note: Nutch configuration is inside nutch.job
  - When you change it you need to rebuild job jar
Searching is not a map-reduce job - often on a separate group of machines

Nutch index
… and press Search

Conclusions
(This overview is a tip of the iceberg)
Nutch:
  • Implements all core search engine components
  • Scales well
  • Extremely configurable and modular
  • It's a complete solution - and a toolkit

Future of Nutch
Avoid code duplication
Parsing → Tika
  • Almost total overlap
  • Still some missing functionality in Tika → contribute
Plugins → OSGI
  • Home-grown plugin system has some deficiencies
  • Initial port available
Indexing & Search → Solr, Zookeeper
  • Distributed and replicated search is difficult
  • Initial integration needs significant improvement
  • Shard management - Katta?
Web-graph & page repository → HBase
  • Combine CrawlDB, LinkDB and shard storage
  • Avoid tedious shard management
  • Initial port available

Future of Nutch (2)
What's left then?
  • Crawling frontier management, discovery
  • Re-crawl algorithms
  • Spider trap handling
  • Fetcher
  • Ranking: enterprise-specific, user-feedback
  • Duplicate detection, URL aliasing (mirror detection)
  • Template detection and cleanup, pagelet-level crawling
  • Spam & junk control
Share code with other crawler projects → crawler-commons
Vision: à la carte toolkit, scalable from 1-1000s nodes

Summary
Q&A
Further information:
  • http://lucene.apache.org/nutch
2
Nut
ch ndash
Apa
cheC
on U
S 0
9
AgendaAbout the projectWeb crawling in generalNutch architecture overviewNutch workflow
bull Setupbull Crawlingbull SearchingChallenges (and some solutions)Nutch present and futureQuestions and answers
3
Nut
ch ndash
Apa
cheC
on U
S 0
9
Apache Nutch projectFounded in 2003 by Doug Cutting the Lucene creator and Mike CafarellaApache project since 2004 (sub-project of Lucene)Spin-offs
bull Map-Reduce and distributed FS rarr Hadoopbull Content type detection and parsing rarr TikaMany installations in operation mostly vertical searchCollections typically 1 mln - 200 mln documents
4
Nut
ch ndash
Apa
cheC
on U
S 0
9
5
Nut
ch ndash
Apa
cheC
on U
S 0
9
Web as a directed graphNodes (vertices) URL-s as unique identifiersEdges (links) hyperlinks like lta href=rdquotargetUrlrdquogtEdge labels lta href=rdquordquogtanchor textltagtOften represented as adjacency (neighbor) listsTraversal follow the edges breadth-first depth-first random
1
3
4
56
2
7
8
9
1 rarr 2 3 4 5 65 rarr 6 97 rarr 3 4 8 9
6
Nut
ch ndash
Apa
cheC
on U
S 0
9
Whats in a search engine
hellip a few things that may surprise you
7
Nut
ch ndash
Apa
cheC
on U
S 0
9
Search engine building blocks
Web graph-page info-links (inout)
Crawler
Parser
Searcher
IndexerUpdater
Scheduler
Contentrepository
Injector
Crawling frontier controls
8
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch features at a glancePage database and link database (web graph)Plugin-based highly modular
minus Most behavior can be changed via pluginsMulti-protocol multi-threaded distributed crawlerPlugin-based content processing (parsing filtering)Robust crawling frontier controlsScalable data processing framework
minus Map-reduce processingFull-text indexer amp search engine
minus Using Lucene or Solrminus Support for distributed search
Robust API and integration options
9
Nut
ch ndash
Apa
cheC
on U
S 0
9
Hadoop foundationFile system abstraction
bull Local FS orbull Distributed FS
minus also Amazon S3 Kosmos and other FS implMap-reduce processing
bull Currently central to Nutch algorithmsbull Processing tasks are executed as one or more map-
reduce jobsbull Data maintained as Hadoop MapFile-s SequenceFile-sbull Massive updates very efficientbull Small updates costly
minus Hadoop data is immutable once createdminus Updates are really merge amp replace
10
Nut
ch ndash
Apa
cheC
on U
S 0
9
Search engine building blocks
Web graph-page info-links (inout)
Crawler
Parser
Searcher
IndexerUpdater
Scheduler
Contentrepository
Injector
Crawling frontier controls
11
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch building blocks
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
12
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch data
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
Maintains info on all known URL-s Fetch schedule Fetch status Page signature Metadata
13
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch data
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
For each target URL keeps info on incoming links ie list of source URL-s and their associated anchor text
14
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch data
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes
15
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable
minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out
2009043012345
crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text
2009043012345
crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text
200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes
Generator
Fetcher
Parser
Indexersnippets
ldquocachedrdquo view
16
Nut
ch ndash
Apa
cheC
on U
S 0
9
URL life-cycle and shardsinjected
discoveredscheduledfor fetching
S1
fetched up-to-date
scheduledfor fetching
fetched up-to-date
A time-lapse view of realityGoals
bull Fresh index rarr sync with the real rate of changes
bull Minimize re-fetching rarr dont fetch unmodified pages
Each page may need its own crawl scheduleShard management
bull Page may be present in many shards but only the most recent record is valid
bull Inconvenient to update in-place just mark as deleted
bull Phase-out old shards and force re-fetch of the remaining pages
scheduledfor fetching
fetched up-to-date
time
S2
S3
real page changesobserved changes
17
Nut
ch ndash
Apa
cheC
on U
S 0
9
Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe
minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks
Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)
I could usesome guidance
18
Nut
ch ndash
Apa
cheC
on U
S 0
9
Controlling the crawling frontierURL filter plugins
minus White-list black-list regexminus May use external resources
(DB-s services )
URL normalizer pluginsminus Resolving relative path
elementsminus ldquoEquivalentrdquo URLs
Additional controls using scoring plugins
minus priority metadata selectblockCrawler traps ndash difficult
minus Domain host path-level stats
seed
i = 1
i = 2i = 3
19
Nut
ch ndash
Apa
cheC
on U
S 0
9
Wide vs focused crawlingDifferences
bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling
minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations
Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content
20
Nut
ch ndash
Apa
cheC
on U
S 0
9
Vertical amp enterprise searchVertical search
bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search
bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly
21
Nut
ch ndash
Apa
cheC
on U
S 0
9
Face to face with Nutch
22
Nut
ch ndash
Apa
cheC
on U
S 0
9
Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch
bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring
23
Nut
ch ndash
Apa
cheC
on U
S 0
9
Configuration filesEdit configuration in confnutch-sitexml
bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files
bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin
24
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch pluginsPlugin-based extensions for
bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)
25
Nut
ch ndash
Apa
cheC
on U
S 0
9
Main Nutch workflow
Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty
Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards
(repeat)
Command-line binnutchinject
generatefetchparseupdatedbupdatelinkdbindex solrindex
26
Nut
ch ndash
Apa
cheC
on U
S 0
9
Injecting new URL-s
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
27
Nut
ch ndash
Apa
cheC
on U
S 0
9
Generating fetchlists
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
28
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow generating fetchlistsWhat to fetch next
bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation
bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable
Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content
minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)
29
Nut
ch ndash
Apa
cheC
on U
S 0
9
Fetching content
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
30
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow fetching
Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind
Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step
(resource hungry and sometimes unstable)
31
Nut
ch ndash
Apa
cheC
on U
S 0
9
Content processing
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
32
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other
metadata) and parseText (plain text content)Nutch supports many popular file formats
33
Nut
ch ndash
Apa
cheC
on U
S 0
9
Link inversion
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
34
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow link inversionPages have outgoing links (outlinks)
hellip I know where Im pointing to
Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either
In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target
src 1
tgt 2
tgt 3
tgt 4tgt 5
tgt 1
tgt 1
src 2
src 3
src 4src 5
src 1
35
Nut
ch ndash
Apa
cheC
on U
S 0
9
Page importance - scoring
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
36
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow page scoringQuery-independent importance factor
bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring
bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection
bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection
37
Nut
ch ndash
Apa
cheC
on U
S 0
9
Indexing
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
38
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow indexingIndexing plugins
bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg
language identification) to facilitate advanced searchIndexers
bull Lucene indexer ndash builds indexes to be served by Nutch search servers
bull Solr indexer ndash submits documents to a Solr instance
39
Nut
ch ndash
Apa
cheC
on U
S 0
9
De-duplication
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
40
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow de-duplicationThe same page may be present in many shards
bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical
bull Template-related differences (banners current date)bull Font layout changes minor re-wording
Hmm hellip what is a significant change bull Tricky issue Hint what is the page content
Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted
Cycles may overlap
[Figure: Nutch building blocks diagram]
Nutch distributed search
Multiple search servers
Search front-end dispatches queries and collects search results
  • Multiple access methods (API, OpenSearch XML, JSON)
Page servers build snippets
Fault-tolerant, with degraded quality
[Figure: search front-end(s) (web app, API, XML, JSON, all via NutchBean) in front of search servers (Nutch or Solr), each holding indexed shards]
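
The dispatch-and-collect step can be sketched as a merge of per-server top-k lists. This is an illustration only: the servers are modeled as plain Python callables returning `(score, url)` pairs, and `search_all` is a hypothetical name, not part of NutchBean.

```python
import heapq
import itertools


def search_all(servers, query, k=10):
    """Ask every search server for its top hits and merge them into a global top-k."""
    partial = []
    for search in servers:
        try:
            hits = sorted(search(query), reverse=True)[:k]
        except ConnectionError:
            continue  # a dead server degrades the results instead of failing the query
        partial.append(hits)
    merged = heapq.merge(*partial, reverse=True)  # inputs are sorted best-first
    return list(itertools.islice(merged, k))


def server_a(query):
    return [(0.9, "http://a.example/1"), (0.4, "http://a.example/2")]


def server_b(query):
    return [(0.7, "http://b.example/1")]


def server_down(query):
    raise ConnectionError("search server unreachable")


top = search_all([server_a, server_down, server_b], "web search", k=2)
```

Merging raw scores like this is only meaningful if the servers score documents comparably, which is where a missing global IDF becomes noticeable.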
Search configuration
Nutch syntax is limited - on purpose!
  • Some queries are costly, e.g. leading wildcard, very long
  • Some queries may need implicit (hidden) expansion
Query plugins
  • From user query to Lucene/Solr query
User query: web search
  +(url:web^4.0 anchor:web^2.0 content:web title:web^1.5 host:web^2.0)
  +(url:search^4.0 anchor:search^2.0 content:search title:search^1.5 host:search^2.0)
  url:"web search"~10^4.0 anchor:"web search"~4^2.0 content:"web search"~10 title:"web search"~10^1.5 host:"web search"~10^2.0
Search server configuration
  • Single search vs. distributed
  • Using Nutch searcher, or Solr, or a mix
  • Currently no global IDF calculation
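
The hidden expansion can be sketched as a small string-building function. The field boosts and phrase slop values below are copied from the expanded query shown on the slide; the function itself (`expand_query`) is a hypothetical illustration, not the query plugin code.

```python
# Boosts and slops as they appear in the slide's expanded query.
FIELD_BOOSTS = {"url": 4.0, "anchor": 2.0, "content": 1.0, "title": 1.5, "host": 2.0}
PHRASE_SLOP = {"anchor": 4}   # every other field uses DEFAULT_SLOP
DEFAULT_SLOP = 10


def _clause(field, text, boost):
    suffix = "" if boost == 1.0 else f"^{boost}"
    return f"{field}:{text}{suffix}"


def expand_query(user_query):
    """Turn a plain user query into a boosted multi-field Lucene-style query string."""
    terms = user_query.lower().split()
    parts = ["+(" + " ".join(_clause(f, t, b) for f, b in FIELD_BOOSTS.items()) + ")"
             for t in terms]
    if len(terms) > 1:
        phrase = '"' + " ".join(terms) + '"'
        # Optional sloppy-phrase clauses reward documents where the terms are close.
        parts.append(" ".join(
            _clause(f, f"{phrase}~{PHRASE_SLOP.get(f, DEFAULT_SLOP)}", b)
            for f, b in FIELD_BOOSTS.items()))
    return " ".join(parts)


expanded = expand_query("web search")
```

Each term becomes a required clause over all fields, and the whole query is added once more as an optional phrase clause.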
Deployment: single-server
The easiest but the most limited deployment
Centralized storage, centralized processing
Hadoop LocalFS and LocalJobTracker
Drop nutch.war in Tomcat/webapps and point to shards
[Figure: one machine hosting the search front-end, LocalFS storage, fetcher, indexer and DB mgmt]
Deployment: distributed search
Local storage on search servers preferred (perf.)
[Figure: search front-end in front of several search servers; LocalFS storage, fetcher, indexer and DB mgmt]
Deployment: map-reduce processing & distrib. search
Fully distributed storage and processing
Local storage on search servers preferred (perf.)
[Figure: search front-end and search servers; DFS Name Node (with Job Tracker) and DFS Data Nodes running Task Trackers for fetcher, indexer and DB mgmt]
Nutch on a Hadoop cluster
Assumes an up & running Hadoop cluster … which is covered elsewhere
Build nutch.job jar and use it as usual:
    bin/hadoop jar nutch.job <className> <args>
Note: Nutch configuration is inside nutch.job
  - When you change it you need to rebuild the job jar
Searching is not a map-reduce job - often on a separate group of machines
Nutch index
… and press Search
Conclusions
(This overview is just the tip of the iceberg)
Nutch
  • Implements all core search engine components
  • Scales well
  • Extremely configurable and modular
  • It's a complete solution - and a toolkit
Future of Nutch
Avoid code duplication
Parsing → Tika
  • Almost total overlap
  • Still some missing functionality in Tika → contribute
Plugins → OSGi
  • Home-grown plugin system has some deficiencies
  • Initial port available
Indexing & Search → Solr, Zookeeper
  • Distributed and replicated search is difficult
  • Initial integration needs significant improvement
  • Shard management - Katta
Web-graph & page repository → HBase
  • Combine CrawlDB, LinkDB and shard storage
  • Avoid tedious shard management
  • Initial port available
Future of Nutch (2)
What's left then?
  • Crawling frontier management, discovery
  • Re-crawl algorithms
  • Spider trap handling
  • Fetcher
  • Ranking: enterprise-specific, user-feedback
  • Duplicate detection, URL aliasing (mirror detection)
  • Template detection and cleanup, pagelet-level crawling
  • Spam & junk control
Share code with other crawler projects → crawler-commons
Vision: à la carte toolkit, scalable from 1 to 1000s of nodes
Summary
Q&A
Further information
  • http://lucene.apache.org/nutch
Nutch features at a glance
Page database and link database (web graph)
Plugin-based, highly modular
  - Most behavior can be changed via plugins
Multi-protocol, multi-threaded, distributed crawler
Plugin-based content processing (parsing, filtering)
Robust crawling frontier controls
Scalable data processing framework
  - Map-reduce processing
Full-text indexer & search engine
  - Using Lucene or Solr
  - Support for distributed search
Robust API and integration options
Hadoop foundation
File system abstraction
  • Local FS, or
  • Distributed FS
    - also Amazon S3, Kosmos and other FS impl.
Map-reduce processing
  • Currently central to Nutch algorithms
  • Processing tasks are executed as one or more map-reduce jobs
  • Data maintained as Hadoop MapFile-s / SequenceFile-s
  • Massive updates very efficient
  • Small updates costly
    - Hadoop data is immutable, once created
    - Updates are really merge & replace
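
The "merge & replace" point can be illustrated without Hadoop at all. In this sketch the immutable, key-sorted files are stood in for by sorted Python lists of `(key, value)` pairs; the name `merge_and_replace` and the sample records are assumptions.

```python
import heapq


def merge_and_replace(old_records, updates):
    """Build a new sorted dataset from two key-sorted inputs; inputs stay untouched."""
    merged = {}
    # heapq.merge streams both inputs in key order; for equal keys the item from
    # the later argument (updates) arrives last and therefore wins.
    for key, value in heapq.merge(old_records, updates, key=lambda kv: kv[0]):
        merged[key] = value
    return list(merged.items())


old = [("http://a.example/", "fetched"), ("http://b.example/", "unfetched")]
new = [("http://b.example/", "fetched"), ("http://c.example/", "unfetched")]
current = merge_and_replace(old, new)
```

The whole dataset is rewritten even when a single record changes, which is why massive updates are efficient and small ones are costly.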
Nutch building blocks
[Figure: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer and Searcher operating on CrawlDB, LinkDB and shards (segments); URL filters & normalizers, parsing/indexing filters and scoring plugins apply along the way]
Nutch data: CrawlDB
Maintains info on all known URLs
  • Fetch schedule
  • Fetch status
  • Page signature
  • Metadata
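
As a rough picture of what one such per-URL record might hold, here is a small data-class sketch. The class and field names (`CrawlRecord`, `last_fetch_time`, and so on) are hypothetical and chosen to mirror the bullet points; they are not the names used in the Nutch code base.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CrawlRecord:
    """Hypothetical per-URL record: status, schedule, signature and metadata."""
    url: str
    status: str = "unfetched"                   # fetch status
    last_fetch_time: float = 0.0                # seconds since epoch
    fetch_interval: float = 30 * 24 * 3600.0    # fetch schedule, in seconds
    signature: Optional[str] = None             # page signature of the last fetch
    score: float = 1.0
    metadata: dict = field(default_factory=dict)

    def is_due(self, now: float) -> bool:
        """A URL is due if never fetched or if its interval has expired."""
        return (self.status == "unfetched"
                or now >= self.last_fetch_time + self.fetch_interval)


fresh = CrawlRecord("http://example.org/")
seen = CrawlRecord("http://example.org/a", status="fetched",
                   last_fetch_time=1000.0, fetch_interval=500.0)
```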
Nutch data: LinkDB
For each target URL, keeps info on incoming links, i.e. the list of source URLs and their associated anchor text
Nutch data: shards
Shards ("segments") keep:
  • Raw page content
  • Parsed content + discovered metadata + outlinks
  • Plain text for indexing and snippets
  • Optional per-shard indexes
Nutch shards (a.k.a. "segments")
Unit of work (batch) - easier to process massive datasets
Convenience placeholder, using predefined directory names
Unit of deployment to the search infrastructure
May include per-shard Lucene indexes
Once completed they are basically unmodifiable
  - No in-place updates of content, or replacing of obsolete content
Periodically phased-out
[Figure: three shard directories (2009043012345, 2009043012345, 200904301234), each with the parts crawl_generate, crawl_fetch, content, crawl_parse, parse_data and parse_text, the last one also with indexes; labels: Generator, Fetcher, Parser, Indexer, snippets, "cached" view]
URL life-cycle and shards
A time-lapse view of reality
Goals
  • Fresh index → sync with the real rate of changes
  • Minimize re-fetching → don't fetch unmodified pages
Each page may need its own crawl schedule
Shard management
  • Page may be present in many shards, but only the most recent record is valid
  • Inconvenient to update in-place, just mark as deleted
  • Phase-out old shards and force re-fetch of the remaining pages
[Figure: timeline of one URL across shards S1, S2, S3: injected / discovered, then repeatedly scheduled for fetching and fetched / up-to-date; real page changes shown against observed changes]
Crawling frontier
No authoritative catalog of web pages
Search engines discover their view of web universe
  - Start from "seed list"
  - Follow (walk) some (useful, interesting) outlinks
Many dangers of simply wandering around
  - explosion or collapse of the frontier
  - collecting unwanted content (spam, junk, offensive)
I could use some guidance
Controlling the crawling frontier
URL filter plugins
  - White-list, black-list, regex
  - May use external resources (DB-s, services ...)
URL normalizer plugins
  - Resolving relative path elements
  - "Equivalent" URLs
Additional controls using scoring plugins
  - priority, metadata select/block
Crawler traps - difficult!
  - Domain / host / path-level stats
[Figure: crawling frontier expanding from the seed over iterations i = 1, i = 2, i = 3]
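
A normalizer followed by an ordered list of accept/reject regex rules can be sketched as follows. The rules, the `example.org` domain and the function names are made up for illustration; real deployments configure this through the URL filter and normalizer plugins rather than in code.

```python
import posixpath
import re
from urllib.parse import urlsplit, urlunsplit

# Ordered rules: the first matching pattern decides ("+" accept, "-" reject).
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$", re.IGNORECASE)),
    ("-", re.compile(r"[?*!@=]")),   # skip URLs that look like dynamic queries
    ("+", re.compile(r"^https?://([a-z0-9-]+\.)*example\.org/")),
]


def normalize(url):
    """Lower-case scheme and host, resolve ./ and ../ elements, drop the fragment."""
    parts = urlsplit(url)
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and path != "/":
        path += "/"   # normpath strips the trailing slash; keep it
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path,
                       parts.query, ""))


def accept(url):
    """Return True if the first matching rule is an accept rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False   # no rule matched: stay inside the frontier, reject


clean = normalize("HTTP://Example.ORG/a/b/../c/./d.html#frag")
```

Normalizing before filtering makes "equivalent" URLs collapse to one entry, so the filter rules and the page database see a single form of each URL.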
Wide vs. focused crawling
Differences
  • Little technical difference in configuration
  • Big difference in operations, maintenance and quality
Wide crawling
  - (Almost) Unlimited crawling frontier
  - High risk of spamming and junk content
  - "Politeness" a very important limiting factor
  - Bandwidth & DNS considerations
Focused (vertical or enterprise) crawling
  - Limited crawling frontier
  - Bandwidth or politeness is often not an issue
  - Low risk of spamming and junk content
Vertical & enterprise search
Vertical search
  • Range of selected "reference" sites
  • Robust control of the crawling frontier
  • Extensive content post-processing
  • Business-driven decisions about ranking
Enterprise search
  • Variety of data sources and data formats
  • Well-defined and limited crawling frontier
  • Integration with in-house data sources
  • Little danger of spam
  • PageRank-like scoring usually works poorly
Face to face with Nutch
Simple set-up and running
You already have Java 5+, right?
Get a 1.0 release or a nightly build (pretty stable)
Simple search setup uses Tomcat for search web app
Command-line bash script: bin/nutch
  • Windows users: get Cygwin
  • Early version of a web-based UI console
  • Hadoop web-based monitoring
Configuration files
Edit configuration in conf/nutch-site.xml
  • Check nutch-default.xml for defaults and docs
  • You MUST at least fill the name of your agent
Active plugins configuration
Per-plugin properties
External configuration files
  • regex-urlfilter.txt
  • regex-normalize.xml
  • parse-plugins.xml: mapping of MIME type to plugin
Nutch plugins
Plugin-based extensions for:
  • Crawl scheduling
  • URL filtering and normalization
  • Protocol for getting the content
  • Content parsing
  • Text analysis (tokenization)
  • Page signature (to detect near-duplicates)
  • Indexing filters (index fields & metadata)
  • Snippet generation and highlighting
  • Scoring and ranking
  • Query translation and expansion (user → Lucene)
Main Nutch workflow
Inject: initial creation of CrawlDB
  • Insert seed URLs
  • Initial LinkDB is empty
Generate new shard's fetchlist
Fetch raw content
Parse content (discovers outlinks)
Update CrawlDB from shards
Update LinkDB from shards
Index shards
(repeat)

Command-line: bin/nutch
  inject
  generate
  fetch
  parse
  updatedb
  updatelinkdb
  index / solrindex
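
The cycle above can be mimicked in memory to see how the databases grow from round to round. Everything in this sketch is a stand-in: `TOY_WEB` is a hypothetical four-page web, fetching and parsing are a dictionary lookup, and the indexing step is left out.

```python
# Hypothetical pages and their outlinks, standing in for the real web.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": ["http://a.example/", "http://d.example/"],
    "http://d.example/": [],
}


def crawl(seeds, rounds, top_n=2):
    """Run inject once, then repeat generate / fetch+parse / update for some rounds."""
    crawldb = {url: "unfetched" for url in seeds}       # inject
    linkdb = {}                                         # initially empty
    for _ in range(rounds):
        # generate: pick up to top_n URLs that still need fetching
        fetchlist = [u for u, s in crawldb.items() if s == "unfetched"][:top_n]
        # fetch + parse: one shard per round, holding the discovered outlinks
        shard = {url: TOY_WEB.get(url, []) for url in fetchlist}
        for url, outlinks in shard.items():
            crawldb[url] = "fetched"                    # update CrawlDB
            for target in outlinks:
                crawldb.setdefault(target, "unfetched")     # newly discovered URL
                linkdb.setdefault(target, set()).add(url)   # update LinkDB
    return crawldb, linkdb


crawldb, linkdb = crawl(["http://a.example/"], rounds=3)
```

Starting from a single seed, three rounds are enough here to discover and fetch all four pages.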
Workflow: generating fetchlists
What to fetch next?
  • Breadth-first - important due to "politeness" limits
  • Expired (longer than fetchTime + fetchInterval)
  • Highly ranking (PageRank)
  • Newly added
Fetchlist generation
  • "topN" - select best candidates
    - Priority based on many factors, pluggable
Adaptive fetch schedule
  • Detect rate of changes & time of change
  • Detect unmodified content
    - Hmm, and how to recognize this? → approximate page signatures (near-duplicate detection)
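
A minimal sketch of "topN" selection and an adaptive schedule, under stated assumptions: records are plain dictionaries with `score`, `fetch_time` and `fetch_interval` keys, a per-host cap stands in for politeness, and the interval is simply scaled by a fixed factor. None of these names or numbers come from Nutch.

```python
from urllib.parse import urlsplit


def generate_fetchlist(crawldb, now, top_n, max_per_host=2):
    """Pick the best-scoring expired URLs, capped per host."""
    due = [(rec["score"], url) for url, rec in crawldb.items()
           if now >= rec["fetch_time"] + rec["fetch_interval"]]
    per_host, fetchlist = {}, []
    for _score, url in sorted(due, reverse=True):
        if len(fetchlist) >= top_n:
            break
        host = urlsplit(url).hostname
        if per_host.get(host, 0) < max_per_host:
            per_host[host] = per_host.get(host, 0) + 1
            fetchlist.append(url)
    return fetchlist


def adapt_interval(interval, changed, factor=0.5, shortest=3600.0, longest=90 * 86400.0):
    """Re-fetch sooner if the page changed, later if it did not."""
    new = interval * (1 - factor) if changed else interval * (1 + factor)
    return min(max(new, shortest), longest)


db = {
    "http://a.example/1": {"score": 3.0, "fetch_time": 0, "fetch_interval": 10},
    "http://a.example/2": {"score": 2.5, "fetch_time": 0, "fetch_interval": 10},
    "http://a.example/3": {"score": 2.0, "fetch_time": 0, "fetch_interval": 10},
    "http://b.example/": {"score": 1.0, "fetch_time": 0, "fetch_interval": 10},
    "http://c.example/": {"score": 5.0, "fetch_time": 95, "fetch_interval": 10},
}
batch = generate_fetchlist(db, now=100, top_n=3)
```

In this example the highest-scoring URL is skipped because it has not expired yet, and the third URL on `a.example` is skipped because of the per-host cap.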
Workflow: fetching
Multi-protocol: HTTP(s), FTP, file, etc.
Coordinates multiple threads accessing the same host
  • "politeness" issues vs. crawling efficiency
  • Host: an IP or DNS name?
  • Redirections: should follow immediately? What kind?
Other netiquette issues
  • robots.txt: disallowed paths, Crawl-Delay
Preparation for parsing
  • Content type detection issues
  • Parsing usually executed as a separate step (resource hungry and sometimes unstable)
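
The robots.txt part of politeness can be tried out with Python's standard `urllib.robotparser`. The robots.txt content, the agent name and the `polite_plan` helper are assumptions for this sketch; a real fetcher would download robots.txt per host and handle many more cases.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of the host being crawled.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def polite_plan(urls, agent="my-test-crawler"):
    """Drop disallowed URLs and space the rest by the host's crawl delay."""
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT.splitlines())
    delay = robots.crawl_delay(agent) or 1   # fall back to 1 second if unspecified
    allowed = [u for u in urls if robots.can_fetch(agent, u)]
    # (seconds from now, url): one request to this host every `delay` seconds
    return [(i * delay, u) for i, u in enumerate(allowed)]


plan = polite_plan([
    "http://example.org/a",
    "http://example.org/private/b",
    "http://example.org/c",
])
```

Spacing requests this way is what limits throughput per host, so a fetcher gains speed by working on many hosts in parallel rather than by hitting one host harder.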
Workflow: content processing
Protocol plugins retrieve content as plain bytes + protocol-level metadata (e.g. HTTP headers)
Parse plugins
  • Content is parsed by MIME-type specific parsers
  • Content is parsed into parseData (title, outlinks, other metadata) and parseText (plain text content)
Nutch supports many popular file formats
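
The split into parse data and parse text can be sketched for HTML with Python's standard `html.parser`. The `PARSERS` table, the `parse` dispatcher and the dictionary layout are assumptions for illustration, not the Nutch parse plugin API.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _PageParser(HTMLParser):
    """Collects title, outlinks (with anchor text) and visible text."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url, self.title, self.outlinks, self.text = base_url, "", [], []
        self._in_title, self._anchor = False, None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a" and dict(attrs).get("href"):
            self._anchor = [urljoin(self.base_url, dict(attrs)["href"]), ""]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._anchor:
            self.outlinks.append(tuple(self._anchor))
            self._anchor = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
            return
        if self._anchor:
            self._anchor[1] += data
        if data.strip():
            self.text.append(data.strip())


def parse_html(base_url, raw_bytes):
    parser = _PageParser(base_url)
    parser.feed(raw_bytes.decode("utf-8", errors="replace"))
    parser.close()
    parse_data = {"title": parser.title.strip(), "outlinks": parser.outlinks}
    return parse_data, " ".join(parser.text)


PARSERS = {"text/html": parse_html}   # MIME type -> parser, as in parse-plugins


def parse(mime_type, base_url, raw_bytes):
    return PARSERS[mime_type](base_url, raw_bytes)


html = b'<html><head><title>Hi</title></head><body><p>Hello <a href="/next">next page</a></p></body></html>'
data, text = parse("text/html", "http://example.org/", html)
```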
Workflow: link inversion
Pages have outgoing links (outlinks)
  … I know where I'm pointing to
Question: who points to me?
  … I don't know, there is no catalog of pages
  … NOBODY knows for sure either!
In-degree indicates importance of the page
Anchor text provides important semantic info
Partial answer: invert the outlinks that I know about, and group by target
[Figure: a source page (src 1) with outlinks to targets tgt 2 to tgt 5, shown next to the inverted view of a target page (tgt 1) with inlinks from sources src 2 to src 5]
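
The inversion itself is a group-by on the target URL. A minimal in-memory sketch, assuming outlinks are stored as `(target, anchor text)` pairs per source page (the function name and sample links are illustrative):

```python
from collections import defaultdict


def invert_links(outlinks):
    """Turn source -> [(target, anchor)] into target -> [(source, anchor)]."""
    inlinks = defaultdict(list)
    for source, links in outlinks.items():
        for target, anchor in links:
            inlinks[target].append((source, anchor))
    return dict(inlinks)


known_outlinks = {
    "http://a.example/": [("http://c.example/", "search toolkit")],
    "http://b.example/": [("http://c.example/", "Nutch"),
                          ("http://a.example/", "home")],
}
inlinks = invert_links(known_outlinks)
in_degree = {url: len(sources) for url, sources in inlinks.items()}
```

The result only reflects pages that have been crawled so far, which is why it is a partial answer: links from unseen pages are simply missing.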
4
Nut
ch ndash
Apa
cheC
on U
S 0
9
5
Nut
ch ndash
Apa
cheC
on U
S 0
9
Web as a directed graphNodes (vertices) URL-s as unique identifiersEdges (links) hyperlinks like lta href=rdquotargetUrlrdquogtEdge labels lta href=rdquordquogtanchor textltagtOften represented as adjacency (neighbor) listsTraversal follow the edges breadth-first depth-first random
1
3
4
56
2
7
8
9
1 rarr 2 3 4 5 65 rarr 6 97 rarr 3 4 8 9
6
Nut
ch ndash
Apa
cheC
on U
S 0
9
Whats in a search engine
hellip a few things that may surprise you
7
Nut
ch ndash
Apa
cheC
on U
S 0
9
Search engine building blocks
Web graph-page info-links (inout)
Crawler
Parser
Searcher
IndexerUpdater
Scheduler
Contentrepository
Injector
Crawling frontier controls
8
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch features at a glancePage database and link database (web graph)Plugin-based highly modular
minus Most behavior can be changed via pluginsMulti-protocol multi-threaded distributed crawlerPlugin-based content processing (parsing filtering)Robust crawling frontier controlsScalable data processing framework
minus Map-reduce processingFull-text indexer amp search engine
minus Using Lucene or Solrminus Support for distributed search
Robust API and integration options
9
Nut
ch ndash
Apa
cheC
on U
S 0
9
Hadoop foundationFile system abstraction
bull Local FS orbull Distributed FS
minus also Amazon S3 Kosmos and other FS implMap-reduce processing
bull Currently central to Nutch algorithmsbull Processing tasks are executed as one or more map-
reduce jobsbull Data maintained as Hadoop MapFile-s SequenceFile-sbull Massive updates very efficientbull Small updates costly
minus Hadoop data is immutable once createdminus Updates are really merge amp replace
10
Nut
ch ndash
Apa
cheC
on U
S 0
9
Search engine building blocks
Web graph-page info-links (inout)
Crawler
Parser
Searcher
IndexerUpdater
Scheduler
Contentrepository
Injector
Crawling frontier controls
11
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch building blocks
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
12
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch data
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
Maintains info on all known URL-s Fetch schedule Fetch status Page signature Metadata
13
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch data
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
For each target URL keeps info on incoming links ie list of source URL-s and their associated anchor text
14
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch data
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes
15
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable
minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out
2009043012345
crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text
2009043012345
crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text
200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes
Generator
Fetcher
Parser
Indexersnippets
ldquocachedrdquo view
16
Nut
ch ndash
Apa
cheC
on U
S 0
9
URL life-cycle and shardsinjected
discoveredscheduledfor fetching
S1
fetched up-to-date
scheduledfor fetching
fetched up-to-date
A time-lapse view of realityGoals
bull Fresh index rarr sync with the real rate of changes
bull Minimize re-fetching rarr dont fetch unmodified pages
Each page may need its own crawl scheduleShard management
bull Page may be present in many shards but only the most recent record is valid
bull Inconvenient to update in-place just mark as deleted
bull Phase-out old shards and force re-fetch of the remaining pages
scheduledfor fetching
fetched up-to-date
time
S2
S3
real page changesobserved changes
17
Nut
ch ndash
Apa
cheC
on U
S 0
9
Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe
minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks
Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)
I could usesome guidance
18
Nut
ch ndash
Apa
cheC
on U
S 0
9
Controlling the crawling frontierURL filter plugins
minus White-list black-list regexminus May use external resources
(DB-s services )
URL normalizer pluginsminus Resolving relative path
elementsminus ldquoEquivalentrdquo URLs
Additional controls using scoring plugins
minus priority metadata selectblockCrawler traps ndash difficult
minus Domain host path-level stats
seed
i = 1
i = 2i = 3
19
Nut
ch ndash
Apa
cheC
on U
S 0
9
Wide vs focused crawlingDifferences
bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling
minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations
Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content
20
Nut
ch ndash
Apa
cheC
on U
S 0
9
Vertical amp enterprise searchVertical search
bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search
bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly
21
Nut
ch ndash
Apa
cheC
on U
S 0
9
Face to face with Nutch
22
Nut
ch ndash
Apa
cheC
on U
S 0
9
Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch
bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring
23
Nut
ch ndash
Apa
cheC
on U
S 0
9
Configuration filesEdit configuration in confnutch-sitexml
bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files
bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin
24
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch pluginsPlugin-based extensions for
bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)
25
Nut
ch ndash
Apa
cheC
on U
S 0
9
Main Nutch workflow
Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty
Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards
(repeat)
Command-line binnutchinject
generatefetchparseupdatedbupdatelinkdbindex solrindex
26
Nut
ch ndash
Apa
cheC
on U
S 0
9
Injecting new URL-s
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
27
Nut
ch ndash
Apa
cheC
on U
S 0
9
Generating fetchlists
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
28
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow generating fetchlistsWhat to fetch next
bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation
bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable
Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content
minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)
29
Nut
ch ndash
Apa
cheC
on U
S 0
9
Fetching content
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
30
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow fetching
Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind
Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step
(resource hungry and sometimes unstable)
Content processing
[Figure: Nutch building-blocks diagram]

Workflow: content processing
Protocol plugins retrieve content as plain bytes + protocol-level metadata (e.g. HTTP headers)
Parse plugins
• Content is parsed by MIME-type specific parsers
• Content is parsed into parseData (title, outlinks, other metadata) and parseText (plain text content)
Nutch supports many popular file formats
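The dispatch by MIME type and the split into parseData and parseText might look roughly like this. It is a sketch using only the Python standard library; the `PARSERS` registry and the function names are assumptions made for illustration and do not mirror Nutch's parse plugin API.

```python
from html.parser import HTMLParser


class _HtmlExtractor(HTMLParser):
    """Collects the title, the outlinks and the visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title, self.outlinks, self.text = "", [], []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.outlinks.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text.append(data.strip())


def parse_html(raw):
    extractor = _HtmlExtractor()
    extractor.feed(raw.decode("utf-8", errors="replace"))
    extractor.close()
    parse_data = {"title": extractor.title, "outlinks": extractor.outlinks}
    return parse_data, " ".join(extractor.text)


def parse_plain(raw):
    return {"title": "", "outlinks": []}, raw.decode("utf-8", errors="replace")


# Hypothetical registry: MIME type -> parser function.
PARSERS = {"text/html": parse_html, "text/plain": parse_plain}


def parse(content_type, raw):
    """Pick a parser from the Content-Type header; return (parse_data, parse_text)."""
    mime = content_type.split(";")[0].strip().lower()
    return PARSERS[mime](raw)
```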
Link inversion
[Figure: Nutch building-blocks diagram]

Workflow: link inversion
Pages have outgoing links (outlinks)
… I know where I'm pointing to
Question: who points to me?
… I don't know, there is no catalog of pages
… NOBODY knows for sure either!
In-degree indicates importance of the page
Anchor text provides important semantic info
Partial answer: invert the outlinks that I know about, and group by target
[Figure: outlink lists (src → tgt) inverted into inlink lists grouped by target (tgt ← src)]
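The inversion itself is a group-by on the link target. A small in-memory sketch follows; in Nutch this runs as a map-reduce job, and the data layout here is an assumption chosen for brevity.

```python
from collections import defaultdict


def invert_links(outlinks):
    """Turn {source: [(target, anchor), ...]} into {target: [(source, anchor), ...]}."""
    inlinks = defaultdict(list)
    for src, links in outlinks.items():
        for tgt, anchor in links:
            # Group by target, keeping the anchor text supplied by each source.
            inlinks[tgt].append((src, anchor))
    return dict(inlinks)
```

The length of each inlink list is the known in-degree of that page, and the collected anchors are what gets indexed as anchor text.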
Page importance - scoring
[Figure: Nutch building-blocks diagram]

Workflow: page scoring
Query-independent importance factor
• Affects search ranking
• Affects crawl prioritization
May include arbitrary decision of the operator
Currently two systems (plugins + tools)
OPIC scoring
• Doesn't require explicit steps - "online"
• Difficult to stabilize
PageRank scoring with loop detection
• Periodically run to update CrawlDb scores
• Computationally intensive, esp. loop detection
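To show why OPIC-style scoring is "online", here is a heavily simplified sketch of one update step: when a page is processed, its accumulated "cash" is recorded and handed out in equal shares to its outlinks. The function and variable names are hypothetical, and this is not the code of Nutch's scoring plugin.

```python
def opic_step(cash, history, page, outlinks):
    """Distribute the current cash of `page` equally among its outlinks."""
    amount = cash.get(page, 0.0)
    # Remember how much cash has flowed through this page so far.
    history[page] = history.get(page, 0.0) + amount
    cash[page] = 0.0
    if outlinks:
        share = amount / len(outlinks)
        for tgt in outlinks:
            cash[tgt] = cash.get(tgt, 0.0) + share
    # Simplification: with no outlinks the cash is dropped here, whereas the
    # published OPIC algorithm recycles it through a virtual page.
```

Each step touches only one page and its outlinks, so no separate global computation is needed, unlike a periodic PageRank run.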
Indexing
[Figure: Nutch building-blocks diagram]

Workflow: indexing
Indexing plugins
• Create full-text search index from segment data
• Apply adjustments to score per document
• May further post-process the parsed content (e.g. language identification) to facilitate advanced search
Indexers
• Lucene indexer: builds indexes to be served by Nutch search servers
• Solr indexer: submits documents to a Solr instance
De-duplication
[Figure: Nutch building-blocks diagram]

Workflow: de-duplication
The same page may be present in many shards
• Obsolete versions in older shards
• Mirrored pages, or equivalent (a.com → www.a.com)
Many other pages are almost identical
• Template-related differences (banners, current date)
• Font / layout changes, minor re-wording
Hmm … what is a significant change?
• Tricky issue! Hint: what is the page content?
Near-duplicate detection and removal
• Nutch uses approximate page signatures (fingerprints)
• Duplicates are only marked as deleted
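One way to build an approximate signature is to hash a coarse word-frequency profile instead of the raw bytes, so that dates, counters and one-off words do not change the fingerprint. The sketch below illustrates that idea only; the tokenization, the `quant` parameter and the choice of hash are assumptions, not the algorithm of any particular Nutch signature plugin.

```python
import hashlib
import re
from collections import Counter


def approx_signature(text, quant=2):
    """Hash a quantized word-frequency profile of the text."""
    # Letters only: numbers and dates are ignored on purpose.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    # Round counts down to a multiple of quant and drop words rarer than quant.
    profile = sorted(
        (tok, (n // quant) * quant) for tok, n in counts.items() if n >= quant
    )
    blob = "\n".join(f"{tok} {n}" for tok, n in profile)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

Two pages with equal signatures are treated as near-duplicates; a larger `quant` makes the comparison more tolerant, at the price of more false matches.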
Cycles may overlap
[Figure: Nutch building-blocks diagram]

Nutch distributed search
Multiple search servers
Search front-end dispatches queries and collects search results
• Multiple access methods (API, OpenSearch XML, JSON)
Page servers build snippets
Fault-tolerant (with degraded quality)
[Figure: search front-end(s) (web app, API, XML, JSON via NutchBean) querying search servers (Nutch or Solr), each serving indexed shards]
Search configuration
Nutch syntax is limited, on purpose!
• Some queries are costly, e.g. leading wildcard, very long
• Some queries may need implicit (hidden) expansion
Query plugins
• From user query to Lucene/Solr query
User query: web search
Expanded query:
  +(url:web^4.0 anchor:web^2.0 content:web title:web^1.5 host:web^2.0)
  +(url:search^4.0 anchor:search^2.0 content:search title:search^1.5 host:search^2.0)
  url:"web search"~10^4.0 anchor:"web search"~4^2.0 content:"web search"~10 title:"web search"~10^1.5 host:"web search"~10^2.0
Search server configuration
• Single search vs. distributed
• Using Nutch searcher, or Solr, or a mix
• Currently no global IDF calculation
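The translation from a user query into a fielded, boosted query can be sketched as plain string building. The boosts and phrase slops below are the ones shown in the slide's example; the function names and the two tables are hypothetical, and real query plugins build Lucene query objects rather than strings.

```python
# Field boosts and phrase slops taken from the example on the slide.
FIELD_BOOSTS = {"url": 4.0, "anchor": 2.0, "content": 1.0, "title": 1.5, "host": 2.0}
PHRASE_SLOP = {"url": 10, "anchor": 4, "content": 10, "title": 10, "host": 10}


def _boosted(clause, boost):
    """Append ^boost unless the boost is the neutral 1.0."""
    return clause if boost == 1.0 else f"{clause}^{boost}"


def expand(user_query):
    """Expand e.g. 'web search' into required per-term clauses plus phrase clauses."""
    terms = user_query.split()
    parts = []
    for term in terms:
        # Every term is required, but it may match in any of the fields.
        alts = " ".join(_boosted(f"{f}:{term}", b) for f, b in FIELD_BOOSTS.items())
        parts.append(f"+({alts})")
    if len(terms) > 1:
        # Optional sloppy phrase clauses reward terms that occur close together.
        phrase = " ".join(terms)
        for f, b in FIELD_BOOSTS.items():
            parts.append(_boosted(f'{f}:"{phrase}"~{PHRASE_SLOP[f]}', b))
    return " ".join(parts)
```

Keeping the user-facing syntax small leaves the plugin free to add such hidden clauses without exposing costly constructs to users.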
Deployment: single-server
The easiest but the most limited deployment
Centralized storage, centralized processing
Hadoop LocalFS and LocalJobTracker
Drop nutch.war in Tomcat/webapps and point to shards
[Figure: search front-end, fetcher, indexer and DB management, all on LocalFS storage]
Deployment: distributed search
Local storage on search servers preferred (perf.)
[Figure: search front-end dispatching to several search servers; fetcher, indexer and DB management on LocalFS storage]
Deployment: map-reduce processing & distributed search
Fully distributed storage and processing
Local storage on search servers preferred (perf.)
[Figure: search front-end and several search servers; DFS Name Node and DFS Data Nodes; Job Tracker and Task Trackers running fetcher, indexer and DB management]
Nutch on a Hadoop cluster
Assumes an up & running Hadoop cluster … which is covered elsewhere
Build nutch.job jar and use it as usual:
  bin/hadoop jar nutch.job <className> <args>
Note: Nutch configuration is inside nutch.job
  − When you change it you need to rebuild the job jar
Searching is not a map-reduce job: it often runs on a separate group of machines
Nutch index

… and press Search
Conclusions
(This overview is just the tip of the iceberg)
Nutch
Implements all core search engine components
Scales well
Extremely configurable and modular
It's a complete solution, and a toolkit
Future of Nutch
Avoid code duplication
Parsing → Tika
• Almost total overlap
• Still some missing functionality in Tika → contribute
Plugins → OSGi
• Home-grown plugin system has some deficiencies
• Initial port available
Indexing & Search → Solr, Zookeeper
• Distributed and replicated search is difficult
• Initial integration needs significant improvement
• Shard management - Katta
Web-graph & page repository → HBase
• Combine CrawlDB, LinkDB and shard storage
• Avoid tedious shard management
• Initial port available
Future of Nutch (2)
What's left then?
• Crawling frontier management, discovery
• Re-crawl algorithms
• Spider trap handling
• Fetcher
• Ranking: enterprise-specific, user-feedback
• Duplicate detection, URL aliasing (mirror detection)
• Template detection and cleanup, pagelet-level crawling
• Spam & junk control
Share code with other crawler projects → crawler-commons
Vision: à la carte toolkit, scalable from 1 to 1000s of nodes
Summary
Q&A
Further information:
• http://lucene.apache.org/nutch
Hadoop foundation
File system abstraction
• Local FS, or
• Distributed FS
  − also Amazon S3, Kosmos and other FS implementations
Map-reduce processing
• Currently central to Nutch algorithms
• Processing tasks are executed as one or more map-reduce jobs
• Data maintained as Hadoop MapFile-s / SequenceFile-s
• Massive updates very efficient
• Small updates costly
  − Hadoop data is immutable, once created
  − Updates are really merge & replace
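"Merge & replace" means that an update never modifies stored data in place: the old sorted data and a sorted batch of updates are merged into a brand-new copy. The following in-memory sketch illustrates the principle under that assumption; it does not use Hadoop, and the function name is hypothetical.

```python
def merge_and_replace(old_sorted, updates_sorted):
    """Merge two key-sorted lists of (key, value); on equal keys the update wins.

    Neither input is modified; a new list is returned, in the same way that an
    immutable data file is replaced by a newly written one.
    """
    out, i, j = [], 0, 0
    while i < len(old_sorted) and j < len(updates_sorted):
        old_key, new_key = old_sorted[i][0], updates_sorted[j][0]
        if old_key < new_key:
            out.append(old_sorted[i])
            i += 1
        elif old_key > new_key:
            out.append(updates_sorted[j])
            j += 1
        else:
            out.append(updates_sorted[j])  # the newer record replaces the old one
            i += 1
            j += 1
    out.extend(old_sorted[i:])
    out.extend(updates_sorted[j:])
    return out
```

The whole old data set is read and rewritten even when only one record changes, which is why massive updates are efficient and small ones are costly.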
Search engine building blocks
[Figure: generic search engine diagram: Injector, Scheduler, Crawler, Parser, Updater, Indexer, Searcher; web graph (page info, in/out links); content repository; crawling frontier controls]

Nutch building blocks
[Figure: Nutch building-blocks diagram: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; CrawlDB, LinkDB, shards (segments); URL filters & normalizers, parsing/indexing filters, scoring plugins]
Nutch data
[Figure: Nutch building-blocks diagram]
CrawlDB maintains info on all known URL-s:
• Fetch schedule
• Fetch status
• Page signature
• Metadata
LinkDB: for each target URL keeps info on incoming links, i.e. list of source URL-s and their associated anchor text
Shards ("segments") keep:
• Raw page content
• Parsed content + discovered metadata + outlinks
• Plain text for indexing and snippets
• Optional per-shard indexes
Nutch shards (a.k.a. "segments")
Unit of work (batch): easier to process massive datasets
Convenience placeholder, using predefined directory names
Unit of deployment to the search infrastructure
May include per-shard Lucene indexes
Once completed they are basically unmodifiable
  − No in-place updates of content, or replacing of obsolete content
Periodically phased-out
[Figure: shard directories (2009043012345, 2009043012345, 200904301234), each containing crawl_generate, crawl_fetch, content, crawl_parse, parse_data, parse_text and, in one case, indexes; the parts are written by the Generator, Fetcher, Parser and Indexer and are used for snippets and the "cached" view]
URL life-cycle and shards
A time-lapse view of reality
Goals:
• Fresh index → sync with the real rate of changes
• Minimize re-fetching → don't fetch unmodified pages
Each page may need its own crawl schedule
Shard management:
• Page may be present in many shards, but only the most recent record is valid
• Inconvenient to update in-place, just mark as deleted
• Phase-out old shards and force re-fetch of the remaining pages
[Figure: timeline of one URL across shards S1, S2, S3: injected / discovered → scheduled for fetching → fetched / up-to-date, repeated over time; real page changes vs. observed changes]
Crawling frontier
No authoritative catalog of web pages
Search engines discover their view of web universe
  − Start from "seed list"
  − Follow (walk) some (useful? interesting?) outlinks
Many dangers of simply wandering around
  − explosion or collapse of the frontier
  − collecting unwanted content (spam, junk, offensive)
"I could use some guidance"
Controlling the crawling frontier
URL filter plugins
  − White-list, black-list, regex
  − May use external resources (DB-s, services ...)
URL normalizer plugins
  − Resolving relative path elements
  − "Equivalent" URLs
Additional controls using scoring plugins
  − priority, metadata select/block
Crawler traps: difficult!
  − Domain / host / path-level stats
[Figure: crawling frontier expanding from the seed over iterations i = 1, 2, 3]
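Both frontier controls can be sketched with the standard library: a normalizer that resolves relative path elements and lower-cases the host, and a regex filter where the first matching rule decides. The rules, the `example.org` domain and the function names are made up for illustration and are not Nutch's shipped configuration.

```python
import re
from urllib.parse import urljoin, urlsplit, urlunsplit

# Ordered rules: "+" accepts, "-" rejects; the first pattern that matches wins.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$", re.IGNORECASE)),
    ("+", re.compile(r"^https?://([a-z0-9-]+\.)*example\.org/")),
    ("-", re.compile(r".")),  # reject everything else
]


def normalize(base, link):
    """Resolve link against base, lower-case the host and drop the fragment."""
    parts = urlsplit(urljoin(base, link))
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def accept(url):
    """Return True if the first matching rule is an accepting one."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False
```

Normalizing before filtering matters: "equivalent" URLs then collapse to one form, so a single rule covers all their spellings.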
Wide vs. focused crawling
Differences:
• Little technical difference in configuration
• Big difference in operations, maintenance and quality
Wide crawling:
  − (Almost) unlimited crawling frontier
  − High risk of spamming and junk content
  − "Politeness" a very important limiting factor
  − Bandwidth & DNS considerations
Focused (vertical or enterprise) crawling:
  − Limited crawling frontier
  − Bandwidth or politeness is often not an issue
  − Low risk of spamming and junk content
Vertical & enterprise search
Vertical search
• Range of selected "reference" sites
• Robust control of the crawling frontier
• Extensive content post-processing
• Business-driven decisions about ranking
Enterprise search
• Variety of data sources and data formats
• Well-defined and limited crawling frontier
• Integration with in-house data sources
• Little danger of spam
• PageRank-like scoring usually works poorly
Face to face with Nutch

Simple set-up and running
You already have Java 5+, right?
Get a 1.0 release or a nightly build (pretty stable)
Simple search setup uses Tomcat for search web app
Command-line bash script: bin/nutch
• Windows users: get Cygwin
• Early version of a web-based UI console
• Hadoop web-based monitoring
Configuration files
Edit configuration in conf/nutch-site.xml
• Check nutch-default.xml for defaults and docs
• You MUST at least fill in the name of your agent
Active plugins configuration
Per-plugin properties
External configuration files
• regex-urlfilter.txt
• regex-normalize.xml
• parse-plugins.xml: mapping of MIME type to plugin
Nutch plugins
Plugin-based extensions for:
• Crawl scheduling
• URL filtering and normalization
• Protocol for getting the content
• Content parsing
• Text analysis (tokenization)
• Page signature (to detect near-duplicates)
• Indexing filters (index fields & metadata)
• Snippet generation and highlighting
• Scoring and ranking
• Query translation and expansion (user → Lucene)
Main Nutch workflow
Inject: initial creation of CrawlDB
• Insert seed URLs
• Initial LinkDB is empty
Generate new shard's fetchlist
Fetch raw content
Parse content (discovers outlinks)
Update CrawlDB from shards
Update LinkDB from shards
Index shards
(repeat)
Command-line: bin/nutch
inject, generate, fetch, parse, updatedb, updatelinkdb, index / solrindex
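The repeat-until-done shape of this workflow can be shown with a toy, in-memory crawl in which a dictionary stands in for the web. It is only a sketch of the control flow: the function, the status strings and the data layout are hypothetical, and none of it reflects how the bin/nutch commands store their data.

```python
def crawl(seeds, web, rounds, top_n):
    """Run a toy inject/generate/fetch/parse/update cycle.

    web maps each URL to its list of outlinks and stands in for the network.
    Returns (crawldb, linkdb).
    """
    crawldb = {url: "unfetched" for url in seeds}  # inject
    linkdb = {}                                    # initial LinkDB is empty
    for _ in range(rounds):
        # generate: pick up to top_n URLs that still need fetching
        fetchlist = sorted(u for u, s in crawldb.items() if s == "unfetched")[:top_n]
        if not fetchlist:
            break
        # fetch + parse: one new shard holding each page's outlinks
        shard = {url: web.get(url, []) for url in fetchlist}
        for url, outlinks in shard.items():
            crawldb[url] = "fetched"                    # update CrawlDB
            for tgt in outlinks:
                crawldb.setdefault(tgt, "unfetched")    # newly discovered URL
                linkdb.setdefault(tgt, set()).add(url)  # update LinkDB
    return crawldb, linkdb
```

Pages that nothing reachable links to are never discovered, which is one reason the choice of seed list matters.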
Injecting new URL-s
[Figure: Nutch building-blocks diagram]
6
Nut
ch ndash
Apa
cheC
on U
S 0
9
Whats in a search engine
hellip a few things that may surprise you
7
Nut
ch ndash
Apa
cheC
on U
S 0
9
Search engine building blocks
Web graph-page info-links (inout)
Crawler
Parser
Searcher
IndexerUpdater
Scheduler
Contentrepository
Injector
Crawling frontier controls
8
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch features at a glancePage database and link database (web graph)Plugin-based highly modular
minus Most behavior can be changed via pluginsMulti-protocol multi-threaded distributed crawlerPlugin-based content processing (parsing filtering)Robust crawling frontier controlsScalable data processing framework
minus Map-reduce processingFull-text indexer amp search engine
minus Using Lucene or Solrminus Support for distributed search
Robust API and integration options
9
Nut
ch ndash
Apa
cheC
on U
S 0
9
Hadoop foundationFile system abstraction
bull Local FS orbull Distributed FS
minus also Amazon S3 Kosmos and other FS implMap-reduce processing
bull Currently central to Nutch algorithmsbull Processing tasks are executed as one or more map-
reduce jobsbull Data maintained as Hadoop MapFile-s SequenceFile-sbull Massive updates very efficientbull Small updates costly
minus Hadoop data is immutable once createdminus Updates are really merge amp replace
10
Nut
ch ndash
Apa
cheC
on U
S 0
9
Search engine building blocks
Web graph-page info-links (inout)
Crawler
Parser
Searcher
IndexerUpdater
Scheduler
Contentrepository
Injector
Crawling frontier controls
11
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch building blocks
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
12
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch data
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
Maintains info on all known URL-s Fetch schedule Fetch status Page signature Metadata
13
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch data
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
For each target URL keeps info on incoming links ie list of source URL-s and their associated anchor text
14
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch data
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes
15
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable
minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out
2009043012345
crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text
2009043012345
crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text
200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes
Generator
Fetcher
Parser
Indexersnippets
ldquocachedrdquo view
16
Nut
ch ndash
Apa
cheC
on U
S 0
9
URL life-cycle and shardsinjected
discoveredscheduledfor fetching
S1
fetched up-to-date
scheduledfor fetching
fetched up-to-date
A time-lapse view of realityGoals
bull Fresh index rarr sync with the real rate of changes
bull Minimize re-fetching rarr dont fetch unmodified pages
Each page may need its own crawl scheduleShard management
bull Page may be present in many shards but only the most recent record is valid
bull Inconvenient to update in-place just mark as deleted
bull Phase-out old shards and force re-fetch of the remaining pages
scheduledfor fetching
fetched up-to-date
time
S2
S3
real page changesobserved changes
17
Nut
ch ndash
Apa
cheC
on U
S 0
9
Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe
minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks
Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)
I could usesome guidance
18
Nut
ch ndash
Apa
cheC
on U
S 0
9
Controlling the crawling frontierURL filter plugins
minus White-list black-list regexminus May use external resources
(DB-s services )
URL normalizer pluginsminus Resolving relative path
elementsminus ldquoEquivalentrdquo URLs
Additional controls using scoring plugins
minus priority metadata selectblockCrawler traps ndash difficult
minus Domain host path-level stats
seed
i = 1
i = 2i = 3
19
Nut
ch ndash
Apa
cheC
on U
S 0
9
Wide vs focused crawlingDifferences
bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling
minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations
Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content
20
Nut
ch ndash
Apa
cheC
on U
S 0
9
Vertical amp enterprise searchVertical search
bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search
bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly
21
Nut
ch ndash
Apa
cheC
on U
S 0
9
Face to face with Nutch
22
Nut
ch ndash
Apa
cheC
on U
S 0
9
Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch
bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring
23
Nut
ch ndash
Apa
cheC
on U
S 0
9
Configuration filesEdit configuration in confnutch-sitexml
bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files
bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin
24
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch pluginsPlugin-based extensions for
bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)
25
Nut
ch ndash
Apa
cheC
on U
S 0
9
Main Nutch workflow
Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty
Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards
(repeat)
Command-line binnutchinject
generatefetchparseupdatedbupdatelinkdbindex solrindex
Injecting new URL-s
[Diagram: Nutch building blocks]
Generating fetchlists
[Diagram: Nutch building blocks]
Workflow: generating fetchlists
What to fetch next?
  • Breadth-first – important due to "politeness" limits
  • Expired (longer than fetchTime + fetchInterval)
  • Highly ranking (PageRank)
  • Newly added
Fetchlist generation
  • "topN" - select best candidates
    − Priority based on many factors, pluggable
Adaptive fetch schedule
  • Detect rate of changes & time of change
  • Detect unmodified content
    − Hmm, and how to recognize this? → approximate page signatures (near-duplicate detection)
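A minimal sketch of the two ideas on this slide, under stated assumptions: the record layout (CrawlDatum fields), the single-score priority, and the increase/decrease rates are all illustrative choices, not Nutch's actual implementation. select_fetchlist keeps new or expired pages and takes the topN by score; next_interval shortens the re-fetch interval for pages that changed and lengthens it for pages that did not.

```python
import heapq
from dataclasses import dataclass


@dataclass
class CrawlDatum:
    url: str
    fetch_time: float      # time of last fetch in seconds; 0 means never fetched
    fetch_interval: float  # seconds to wait before the page counts as expired
    score: float           # query-independent priority


def select_fetchlist(crawldb, now, top_n):
    """Keep newly added or expired entries, then take the top_n by score."""
    due = [
        d for d in crawldb
        if d.fetch_time == 0 or now > d.fetch_time + d.fetch_interval
    ]
    return heapq.nlargest(top_n, due, key=lambda d: d.score)


def next_interval(interval, modified, inc=0.4, dec=0.2,
                  shortest=60.0, longest=30 * 86400.0):
    """Shrink the interval if the page changed, grow it otherwise (clamped)."""
    interval = interval * (1.0 - dec) if modified else interval * (1.0 + inc)
    return max(shortest, min(longest, interval))


crawldb = [
    CrawlDatum("a", fetch_time=0, fetch_interval=100, score=0.5),
    CrawlDatum("b", fetch_time=1000, fetch_interval=100, score=2.0),
    CrawlDatum("c", fetch_time=1000, fetch_interval=10000, score=9.0),
    CrawlDatum("d", fetch_time=500, fetch_interval=100, score=1.0),
]
fetchlist = select_fetchlist(crawldb, now=1200, top_n=2)
print([d.url for d in fetchlist])
```

Page "c" has the best score but is not yet due, so it is left out; whether a page counts as modified would come from comparing page signatures between fetches.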
Fetching content
[Diagram: Nutch building blocks]
Workflow: fetching
Multi-protocol: HTTP(s), FTP, file, etc.
Coordinates multiple threads accessing the same host
  • "politeness" issues vs. crawling efficiency
  • Host: an IP or DNS name?
  • Redirections: should follow immediately? What kind?
Other netiquette issues
  • robots.txt: disallowed paths, Crawl-Delay
Preparation for parsing
  • Content type detection issues
  • Parsing usually executed as a separate step (resource hungry and sometimes unstable)
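The politeness rules can be sketched with Python's standard urllib.robotparser. This is a single-threaded planning sketch, not a fetcher: the agent name, the default delay and the plan_fetches helper are assumptions made for the example, and "host" here means the DNS name taken from the URL.

```python
from collections import defaultdict
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

AGENT = "MyTestCrawler"  # hypothetical agent name


def plan_fetches(urls, robots_by_host, default_delay=5.0):
    """Return sorted (start_time, url) pairs honouring robots.txt and per-host delays."""
    next_free = defaultdict(float)  # host -> earliest time of its next request
    plan = []
    for url in urls:
        host = urlsplit(url).hostname
        robots = robots_by_host.get(host)
        if robots is not None and not robots.can_fetch(AGENT, url):
            continue  # disallowed path: skip it
        delay = default_delay
        if robots is not None and robots.crawl_delay(AGENT) is not None:
            delay = float(robots.crawl_delay(AGENT))
        start = next_free[host]
        plan.append((start, url))
        next_free[host] = start + delay
    return sorted(plan)


# robots.txt content supplied inline instead of being downloaded.
example_robots = RobotFileParser()
example_robots.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
])

plan = plan_fetches(
    [
        "http://example.com/a",
        "http://example.com/private/x",
        "http://example.com/b",
        "http://other.org/",
    ],
    {"example.com": example_robots},
)
for start, url in plan:
    print(start, url)
```

Requests to different hosts can start at the same time, while requests to the same host are spaced out; that spacing is the trade-off between politeness and crawling efficiency.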
Content processing
[Diagram: Nutch building blocks]
Workflow: content processing
Protocol plugins retrieve content as plain bytes + protocol-level metadata (e.g. HTTP headers)
Parse plugins
  • Content is parsed by MIME-type specific parsers
  • Content is parsed into parseData (title, outlinks, other metadata) and parseText (plain text content)
Nutch supports many popular file formats
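To make the parseData / parseText split concrete, here is a small sketch for HTML only, using Python's standard html.parser. The class and function names are hypothetical and the output layout is a simplification; real parsers handle many formats and far messier markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects the title, outlinks (with anchor text) and visible text."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.outlinks = []    # (absolute target URL, anchor text)
        self.text_parts = []
        self._in_title = False
        self._href = None
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._anchor = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href:
            self.outlinks.append((self._href, " ".join(self._anchor)))
            self._href = None

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self._in_title:
            self.title = data
        else:
            self.text_parts.append(data)
            if self._href:
                self._anchor.append(data)


def parse_html(raw_bytes, base_url, charset="utf-8"):
    """Turn raw bytes into (parse_data, parse_text)."""
    parser = PageParser(base_url)
    parser.feed(raw_bytes.decode(charset, errors="replace"))
    parser.close()
    parse_data = {"title": parser.title, "outlinks": parser.outlinks}
    parse_text = " ".join(parser.text_parts)
    return parse_data, parse_text


raw = (b"<html><head><title>Home</title></head>"
       b'<body><p>Hello <a href="/about">About us</a></p></body></html>')
parse_data, parse_text = parse_html(raw, "http://example.com/")
print(parse_data, parse_text)
```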
Link inversion
[Diagram: Nutch building blocks]
Workflow: link inversion
Pages have outgoing links (outlinks)
  … I know where I'm pointing to
Question: who points to me?
  … I don't know, there is no catalog of pages
  … NOBODY knows for sure either
In-degree indicates importance of the page
Anchor text provides important semantic info
Partial answer: invert the outlinks that I know about, and group by target
[Diagram: outlinks (src → tgt) inverted and grouped by target]
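Inverting and grouping by target is a small operation once the outlinks are known. The sketch below assumes outlinks are stored per source page as (target, anchor text) pairs; the function name and sample URLs are made up for the example.

```python
from collections import defaultdict


def invert_links(outlinks_by_source):
    """Turn source -> [(target, anchor)] into target -> [(source, anchor)]."""
    inlinks = defaultdict(list)
    for source, outlinks in outlinks_by_source.items():
        for target, anchor in outlinks:
            inlinks[target].append((source, anchor))
    return dict(inlinks)


outlinks = {
    "page1": [("page2", "search engine"), ("page3", "crawler")],
    "page2": [("page3", "web crawler")],
}
inlinks = invert_links(outlinks)
print(inlinks)
```

The in-degree of a page is then the length of its inlink list, and the collected anchor texts describe the page in other people's words.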
Page importance - scoring
[Diagram: Nutch building blocks]
Workflow: page scoring
Query-independent importance factor
  • Affects search ranking
  • Affects crawl prioritization
May include arbitrary decision of the operator
Currently two systems (plugins + tools)
OPIC scoring
  • Doesn't require explicit steps - "online"
  • Difficult to stabilize
PageRank scoring with loop detection
  • Periodically run to update CrawlDb scores
  • Computationally intensive, esp. loop detection
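For readers who have not seen it, this is a textbook PageRank power iteration on a tiny graph. It is an illustration of the idea only: it has no loop detection, runs in memory rather than as map-reduce jobs, and the damping factor and iteration count are conventional defaults rather than values taken from Nutch.

```python
def pagerank(outlinks, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict: page -> list of target pages."""
    pages = set(outlinks) | {t for targets in outlinks.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = outlinks.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks.items(), key=lambda item: -item[1]))
```

Every iteration touches the whole graph, which is why this kind of scoring is run periodically as a batch, while OPIC-style scoring updates scores as pages are processed.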
Indexing
[Diagram: Nutch building blocks]
Workflow: indexing
Indexing plugins
  • Create full-text search index from segment data
  • Apply adjustments to score per document
  • May further post-process the parsed content (e.g. language identification) to facilitate advanced search
Indexers
  • Lucene indexer – builds indexes to be served by Nutch search servers
  • Solr indexer – submits documents to a Solr instance
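The indexing-plugin idea can be pictured as a chain of small functions, each adding fields to the document or adjusting its boost before it goes to the indexer. The sketch below is an assumption-laden illustration: the field names, the page record layout and the three filter functions are invented for the example and do not mirror the Nutch plugin API.

```python
def basic_filter(doc, page):
    """Copy the core fields from the parsed page."""
    doc["url"] = page["url"]
    doc["title"] = page["title"]
    doc["content"] = page["text"]
    return doc


def anchor_filter(doc, page):
    """Add anchor texts of incoming links as an extra searchable field."""
    doc["anchor"] = [anchor for _source, anchor in page.get("inlinks", [])]
    return doc


def score_filter(doc, page):
    """Fold the query-independent page score into the document boost."""
    doc["boost"] = doc.get("boost", 1.0) * page.get("score", 1.0)
    return doc


def build_document(page, filters):
    """Run the filter chain; a filter may return None to drop the page."""
    doc = {}
    for index_filter in filters:
        doc = index_filter(doc, page)
        if doc is None:
            return None
    return doc


page = {
    "url": "http://example.com/",
    "title": "Home",
    "text": "Hello About us",
    "inlinks": [("http://other.org/", "example site")],
    "score": 2.0,
}
doc = build_document(page, [basic_filter, anchor_filter, score_filter])
print(doc)
```

The finished document would then be handed to whichever indexer is in use, Lucene or Solr.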
De-duplication
[Diagram: Nutch building blocks]
Workflow: de-duplication
The same page may be present in many shards
  • Obsolete versions in older shards
  • Mirrored pages or equivalent (a.com → www.a.com)
Many other pages are almost identical
  • Template-related differences (banners, current date)
  • Font, layout changes, minor re-wording
Hmm … what is a significant change?
  • Tricky issue. Hint: what is the page content?
Near-duplicate detection and removal
  • Nutch uses approximate page signatures (fingerprints)
  • Duplicates are only marked as deleted
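One way to build an approximate signature is to hash a coarse word-frequency profile instead of the raw bytes, so that rare words (a date, a banner) do not change the fingerprint. The sketch below follows that idea; the tokenization, the quantization step of 2 and the use of MD5 are assumptions made for the example, not a description of the signature plugins shipped with Nutch.

```python
import hashlib
import re
from collections import Counter


def text_profile_signature(text, quant=2, min_token_len=2):
    """Hash the quantized frequencies of the more common tokens in the text."""
    tokens = [
        token for token in re.findall(r"[a-z0-9]+", text.lower())
        if len(token) >= min_token_len
    ]
    profile = []
    for token, freq in Counter(tokens).items():
        quantized = (freq // quant) * quant  # round frequency down to a multiple of quant
        if quantized >= quant:               # rare tokens drop out of the profile
            profile.append((quantized, token))
    profile.sort(key=lambda item: (-item[0], item[1]))
    digest = hashlib.md5()
    for quantized, token in profile:
        digest.update(f"{token} {quantized}\n".encode("utf-8"))
    return digest.hexdigest()


monday = "nutch crawl nutch crawl search search engine updated monday"
friday = "nutch crawl nutch crawl search search engine updated friday"
other = "hadoop hadoop tika tika"
print(text_profile_signature(monday) == text_profile_signature(friday))
```

The two versions that differ only in the day name get the same signature, while unrelated text does not. The same comparison can tell the fetch schedule that a page is effectively unmodified.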
Cycles may overlap
[Diagram: Nutch building blocks]
Nutch distributed search
Multiple search servers
Search front-end dispatches queries and collects search results
  • multiple access methods (API, OpenSearch XML, JSON)
Page servers build snippets
Fault-tolerant (with degraded quality)
[Diagram: search front-end(s) (web app, API, XML, JSON, NutchBean) → search servers (Nutch or Solr) → indexed shards]
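The dispatch-and-collect pattern, including the "fault-tolerant with degraded quality" behaviour, can be sketched as follows. Search servers are modelled as plain Python callables returning (score, document id) pairs; that interface and all names are assumptions for the example, and the calls are made sequentially rather than in parallel.

```python
import heapq


def distributed_search(query, servers, top_k=3):
    """Ask every server for its best hits, merge them, and tolerate failures."""
    hits, failed = [], 0
    for server in servers:
        try:
            hits.extend(server(query, top_k))
        except Exception:
            failed += 1  # results from this shard are simply missing
    best = heapq.nlargest(top_k, hits, key=lambda hit: hit[0])
    return best, failed


def shard_one(query, top_k):
    return [(0.9, "doc-a"), (0.4, "doc-b")][:top_k]


def shard_two(query, top_k):
    return [(0.7, "doc-c")][:top_k]


def broken_shard(query, top_k):
    raise ConnectionError("search server unreachable")


results, failed = distributed_search(
    "web search", [shard_one, broken_shard, shard_two], top_k=2
)
print(results, "failed servers:", failed)
```

A failed server costs the documents of its shards but not the whole answer. Merging by raw score assumes scores from different servers are comparable, which is only approximately true when each server computes its own term statistics.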
Search configuration
Nutch syntax is limited – on purpose!
  • Some queries are costly, e.g. leading wildcard, very long
  • Some queries may need implicit (hidden) expansion
Query plugins
  • From user query to Lucene/Solr query
User query: web search
+(url:web^4.0 anchor:web^2.0 content:web title:web^1.5 host:web^2.0) +(url:search^4.0 anchor:search^2.0 content:search title:search^1.5 host:search^2.0) url:"web search"~10^4.0 anchor:"web search"~4^2.0 content:"web search"~10 title:"web search"~10^1.5 host:"web search"~10^2.0
Search server configuration
  • Single search vs. distributed
  • Using Nutch searcher, or Solr, or a mix
  • Currently no global IDF calculation
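The hidden expansion of the two-word query "web search" follows a regular pattern: every term is required in at least one field, and the whole phrase is added as an optional sloppy match per field. The sketch below generates a query of that shape. The field list, boosts and slop values are read off the example on this slide; the function names are hypothetical and this is not the query-plugin code itself.

```python
# (field name, boost, phrase slop), as they appear in the slide's example.
FIELDS = [
    ("url", 4.0, 10),
    ("anchor", 2.0, 4),
    ("content", 1.0, 10),
    ("title", 1.5, 10),
    ("host", 2.0, 10),
]


def boost_suffix(boost):
    """A boost of 1.0 is the default and is left out of the query string."""
    return "" if boost == 1.0 else f"^{boost}"


def expand_query(user_query):
    """Expand a plain user query into a fielded, boosted Lucene-style query."""
    terms = user_query.split()
    clauses = []
    for term in terms:
        per_field = " ".join(
            f"{name}:{term}{boost_suffix(boost)}" for name, boost, _slop in FIELDS
        )
        clauses.append(f"+({per_field})")  # each term must match somewhere
    if len(terms) > 1:
        phrase = " ".join(terms)
        for name, boost, slop in FIELDS:
            clauses.append(f'{name}:"{phrase}"~{slop}{boost_suffix(boost)}')
    return " ".join(clauses)


expanded = expand_query("web search")
print(expanded)
```

Keeping the user-facing syntax small is what makes this kind of controlled rewriting possible.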
Deployment: single-server
The easiest but the most limited deployment
Centralized storage, centralized processing
Hadoop LocalFS and LocalJobTracker
Drop nutch.war in Tomcat/webapps and point to shards
[Diagram: search front-end, LocalFS storage, Fetcher, Indexer, DB mgmt]
Deployment: distributed search
Local storage on search servers preferred (perf.)
[Diagram: search front-end, search servers, LocalFS storage, Fetcher, Indexer, DB mgmt]
Deployment: map-reduce processing & distrib. search
Fully distributed storage and processing
Local storage on search servers preferred (perf.)
[Diagram: search front-end, search servers; DFS Name Node, DFS Data Nodes; Job Tracker, Task Trackers running Fetcher, Indexer, DB mgmt]
Nutch on a Hadoop cluster
Assumes an up & running Hadoop cluster
  … which is covered elsewhere
Build the nutch.job jar and use it as usual:
  bin/hadoop jar nutch.job <className> <args>
Note: Nutch configuration is inside nutch.job
  − When you change it you need to rebuild the job jar
Searching is not a map-reduce job – often on a separate group of machines
Nutch index
… and press Search
Conclusions
(This overview is just the tip of the iceberg)
Nutch
  • Implements all core search engine components
  • Scales well
  • Extremely configurable and modular
  • It's a complete solution – and a toolkit
Future of Nutch
Avoid code duplication
Parsing → Tika
  • Almost total overlap
  • Still some missing functionality in Tika → contribute
Plugins → OSGI
  • Home-grown plugin system has some deficiencies
  • Initial port available
Indexing & Search → Solr, Zookeeper
  • Distributed and replicated search is difficult
  • Initial integration needs significant improvement
  • Shard management - Katta
Web-graph & page repository → HBase
  • Combine CrawlDB, LinkDB and shard storage
  • Avoid tedious shard management
  • Initial port available
Future of Nutch (2)
What's left then?
  • Crawling frontier management, discovery
  • Re-crawl algorithms
  • Spider trap handling
  • Fetcher
  • Ranking: enterprise-specific, user-feedback
  • Duplicate detection, URL aliasing (mirror detection)
  • Template detection and cleanup, pagelet-level crawling
  • Spam & junk control
Share code with other crawler projects → crawler-commons
Vision: à la carte toolkit, scalable from 1-1000s nodes
Summary
Q&A
Further information
  • http://lucene.apache.org/nutch
8
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch features at a glancePage database and link database (web graph)Plugin-based highly modular
minus Most behavior can be changed via pluginsMulti-protocol multi-threaded distributed crawlerPlugin-based content processing (parsing filtering)Robust crawling frontier controlsScalable data processing framework
minus Map-reduce processingFull-text indexer amp search engine
minus Using Lucene or Solrminus Support for distributed search
Robust API and integration options
9
Nut
ch ndash
Apa
cheC
on U
S 0
9
Hadoop foundationFile system abstraction
bull Local FS orbull Distributed FS
minus also Amazon S3 Kosmos and other FS implMap-reduce processing
bull Currently central to Nutch algorithmsbull Processing tasks are executed as one or more map-
reduce jobsbull Data maintained as Hadoop MapFile-s SequenceFile-sbull Massive updates very efficientbull Small updates costly
minus Hadoop data is immutable once createdminus Updates are really merge amp replace
10
Nut
ch ndash
Apa
cheC
on U
S 0
9
Search engine building blocks
Web graph-page info-links (inout)
Crawler
Parser
Searcher
IndexerUpdater
Scheduler
Contentrepository
Injector
Crawling frontier controls
11
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch building blocks
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
12
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch data
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
Maintains info on all known URL-s Fetch schedule Fetch status Page signature Metadata
13
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch data
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
For each target URL keeps info on incoming links ie list of source URL-s and their associated anchor text
14
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch data
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes
15
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable
minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out
2009043012345
crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text
2009043012345
crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text
200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes
Generator
Fetcher
Parser
Indexersnippets
ldquocachedrdquo view
16
Nut
ch ndash
Apa
cheC
on U
S 0
9
URL life-cycle and shardsinjected
discoveredscheduledfor fetching
S1
fetched up-to-date
scheduledfor fetching
fetched up-to-date
A time-lapse view of realityGoals
bull Fresh index rarr sync with the real rate of changes
bull Minimize re-fetching rarr dont fetch unmodified pages
Each page may need its own crawl scheduleShard management
bull Page may be present in many shards but only the most recent record is valid
bull Inconvenient to update in-place just mark as deleted
bull Phase-out old shards and force re-fetch of the remaining pages
scheduledfor fetching
fetched up-to-date
time
S2
S3
real page changesobserved changes
17
Nut
ch ndash
Apa
cheC
on U
S 0
9
Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe
minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks
Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)
I could usesome guidance
18
Nut
ch ndash
Apa
cheC
on U
S 0
9
Controlling the crawling frontierURL filter plugins
minus White-list black-list regexminus May use external resources
(DB-s services )
URL normalizer pluginsminus Resolving relative path
elementsminus ldquoEquivalentrdquo URLs
Additional controls using scoring plugins
minus priority metadata selectblockCrawler traps ndash difficult
minus Domain host path-level stats
seed
i = 1
i = 2i = 3
19
Nut
ch ndash
Apa
cheC
on U
S 0
9
Wide vs focused crawlingDifferences
bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling
minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations
Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content
20
Nut
ch ndash
Apa
cheC
on U
S 0
9
Vertical amp enterprise searchVertical search
bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search
bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly
21
Nut
ch ndash
Apa
cheC
on U
S 0
9
Face to face with Nutch
22
Nut
ch ndash
Apa
cheC
on U
S 0
9
Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch
bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

Configuration files
Edit configuration in conf/nutch-site.xml
  • Check nutch-default.xml for defaults and docs
  • You MUST at least fill the name of your agent
Active plugins configuration
Per-plugin properties
External configuration files
  • regex-urlfilter.txt
  • regex-normalize.xml
  • parse-plugins.xml: mapping of MIME type to plugin

Nutch plugins
Plugin-based extensions for:
  • Crawl scheduling
  • URL filtering and normalization
  • Protocol for getting the content
  • Content parsing
  • Text analysis (tokenization)
  • Page signature (to detect near-duplicates)
  • Indexing filters (index fields & metadata)
  • Snippet generation and highlighting
  • Scoring and ranking
  • Query translation and expansion (user → Lucene)

Main Nutch workflow
Inject: initial creation of CrawlDB
  • Insert seed URLs
  • Initial LinkDB is empty
Generate new shard's fetchlist
Fetch raw content
Parse content (discovers outlinks)
Update CrawlDB from shards
Update LinkDB from shards
Index shards
(repeat)
Command-line: bin/nutch
  inject
  generate
  fetch
  parse
  updatedb
  updatelinkdb
  index / solrindex
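
The cycle above can be sketched as a toy in-memory simulation. Everything here is an assumption made for illustration: the four-page WEB graph stands in for the network, plain dicts stand in for CrawlDB and LinkDB, and the function names only echo the workflow steps, not Nutch classes.

```python
# Toy web: page -> outlinks (hypothetical data standing in for real fetching).
WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def inject(crawldb, seeds):
    """Create CrawlDB entries for the seed URLs."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")


def generate(crawldb, top_n):
    """Select up to top_n unfetched URLs as the next shard's fetchlist."""
    return sorted(u for u, status in crawldb.items() if status == "unfetched")[:top_n]


def fetch_and_parse(fetchlist):
    """Build a 'shard': fetched URL -> outlinks discovered by parsing."""
    return {url: WEB.get(url, []) for url in fetchlist}


def update_crawldb(crawldb, shard):
    """Mark fetched pages and add newly discovered URLs."""
    for url, outlinks in shard.items():
        crawldb[url] = "fetched"
        for target in outlinks:
            crawldb.setdefault(target, "unfetched")


def update_linkdb(linkdb, shard):
    """Record, for each target, the pages that link to it."""
    for url, outlinks in shard.items():
        for target in outlinks:
            linkdb.setdefault(target, set()).add(url)


def crawl(seeds, rounds, top_n):
    crawldb, linkdb = {}, {}
    inject(crawldb, seeds)
    for _ in range(rounds):
        shard = fetch_and_parse(generate(crawldb, top_n))
        if not shard:
            break  # nothing left to fetch
        update_crawldb(crawldb, shard)
        update_linkdb(linkdb, shard)
    return crawldb, linkdb
```

Each round produces one shard, and the frontier grows only through outlinks discovered in the previous round, which is why the number of rounds bounds the crawl depth.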

Injecting new URL-s
[Diagram: Nutch building blocks - Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; CrawlDB, LinkDB, shards (segments); URL filters & normalizers, parsing/indexing filters, scoring plugins]

Generating fetchlists
[Diagram: Nutch building blocks - Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; CrawlDB, LinkDB, shards (segments); URL filters & normalizers, parsing/indexing filters, scoring plugins]

Workflow: generating fetchlists
What to fetch next?
  • Breadth-first - important due to “politeness” limits
  • Expired (longer than fetchTime + fetchInterval)
  • Highly ranking (PageRank)
  • Newly added
Fetchlist generation
  • “topN” - select best candidates
    - Priority based on many factors, pluggable
Adaptive fetch schedule
  • Detect rate of changes & time of change
  • Detect unmodified content
    - Hmm, and how to recognize this? → approximate page signatures (near-duplicate detection)
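
A rough sketch of “topN” selection and of an adaptive interval, under stated assumptions: the CrawlDatum fields, the single score used as priority, and the rate and bound values are all hypothetical, chosen only to show the mechanics.

```python
from dataclasses import dataclass


@dataclass
class CrawlDatum:
    url: str
    fetch_time: float      # time of the last fetch in seconds (0 = never fetched)
    fetch_interval: float  # seconds until the page is due again
    score: float           # priority; in practice this would be pluggable


def generate_fetchlist(crawldb, now, top_n):
    """Pick the top_n best-scoring entries that are due for (re-)fetching."""
    due = [d for d in crawldb if now >= d.fetch_time + d.fetch_interval]
    due.sort(key=lambda d: d.score, reverse=True)
    return [d.url for d in due[:top_n]]


def adapt_interval(interval, modified, inc_rate=0.2, dec_rate=0.2,
                   min_interval=60.0, max_interval=30 * 86400.0):
    """Shorten the interval for pages that changed, lengthen it for unmodified ones."""
    if modified:
        interval *= 1.0 - dec_rate
    else:
        interval *= 1.0 + inc_rate
    return max(min_interval, min(max_interval, interval))
```

Whether a page counts as modified would come from comparing page signatures between fetches, which is where near-duplicate detection enters.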

Fetching content
[Diagram: Nutch building blocks - Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; CrawlDB, LinkDB, shards (segments); URL filters & normalizers, parsing/indexing filters, scoring plugins]

Workflow: fetching
Multi-protocol: HTTP(s), FTP, file, etc.
Coordinates multiple threads accessing the same host
  • “politeness” issues vs. crawling efficiency
  • Host: an IP or DNS name?
  • Redirections: should follow immediately? What kind?
Other netiquette issues
  • robots.txt: disallowed paths, Crawl-Delay
Preparation for parsing
  • Content type detection issues
  • Parsing usually executed as a separate step (resource hungry and sometimes unstable)
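
One way to picture the politeness constraint is a queue per host plus a “not before” timestamp. This is a single-threaded sketch with a simulated clock; the class name, the fixed delay and the choice of the DNS name as the host key are assumptions, and robots.txt handling is left out entirely.

```python
from collections import deque
from urllib.parse import urlsplit


class PoliteQueue:
    """Hands out URLs so that one host is never hit more often than every `delay` seconds."""

    def __init__(self, delay):
        self.delay = delay
        self.queues = {}   # host -> deque of pending URLs
        self.next_ok = {}  # host -> earliest time of the next request

    def add(self, url):
        host = urlsplit(url).hostname  # here a host is identified by its DNS name
        self.queues.setdefault(host, deque()).append(url)
        self.next_ok.setdefault(host, 0.0)

    def next_url(self, now):
        """Return a URL whose host may be contacted at time `now`, or None."""
        for host, queue in self.queues.items():
            if queue and now >= self.next_ok[host]:
                self.next_ok[host] = now + self.delay
                return queue.popleft()
        return None
```

With few distinct hosts in the fetchlist the queue often returns None, which is the efficiency cost of politeness mentioned above.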

Content processing
[Diagram: Nutch building blocks - Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; CrawlDB, LinkDB, shards (segments); URL filters & normalizers, parsing/indexing filters, scoring plugins]

Workflow: content processing
Protocol plugins retrieve content as plain bytes + protocol-level metadata (e.g. HTTP headers)
Parse plugins
  • Content is parsed by MIME-type specific parsers
  • Content is parsed into parseData (title, outlinks, other metadata) and parseText (plain text content)
Nutch supports many popular file formats
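
A small sketch of MIME-type dispatch, assuming a hypothetical PARSERS table with just two entries. It only mimics the split into parse data (title, outlinks) and parse text; it says nothing about how the actual parse plugins are implemented.

```python
from html.parser import HTMLParser


class _Extractor(HTMLParser):
    """Collects the title, the href of every <a>, and the visible text."""

    def __init__(self):
        super().__init__()
        self.title, self.outlinks, self.text = "", [], []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.outlinks.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text.append(data.strip())


def parse_html(content):
    ex = _Extractor()
    ex.feed(content.decode("utf-8", errors="replace"))
    return {"title": ex.title, "outlinks": ex.outlinks}, " ".join(ex.text)


def parse_plain(content):
    return {"title": "", "outlinks": []}, content.decode("utf-8", errors="replace")


# Hypothetical mapping of MIME type to parser function.
PARSERS = {"text/html": parse_html, "text/plain": parse_plain}


def parse(content, headers):
    """Return (parse_data, parse_text) for raw bytes plus protocol-level headers."""
    mime = headers.get("Content-Type", "").split(";")[0].strip().lower()
    parser = PARSERS.get(mime)
    if parser is None:
        raise ValueError("no parser for MIME type %r" % mime)
    return parser(content)
```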

Link inversion
[Diagram: Nutch building blocks - Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; CrawlDB, LinkDB, shards (segments); URL filters & normalizers, parsing/indexing filters, scoring plugins]

Workflow: link inversion
Pages have outgoing links (outlinks)
  … I know where I'm pointing to
Question: who points to me?
  … I don't know, there is no catalog of pages
  … NOBODY knows for sure either
In-degree indicates importance of the page
Anchor text provides important semantic info
Partial answer: invert the outlinks that I know about, and group by target
[Figure: outlinks listed per source (src → tgt) inverted into inlinks grouped per target (tgt ← src)]
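
The inversion itself is a group-by on the target. A minimal sketch, assuming outlinks are held in a plain dict (the function name and data layout are illustrative only):

```python
from collections import defaultdict


def invert_links(outlinks):
    """outlinks: source URL -> list of (target URL, anchor text).

    Returns target URL -> list of (source URL, anchor text).
    """
    inlinks = defaultdict(list)
    for src, links in outlinks.items():
        for tgt, anchor in links:
            inlinks[tgt].append((src, anchor))
    return dict(inlinks)
```

The length of each inlink list is the known in-degree of that target, and the collected anchor texts describe the target in other pages' words. Both are only as complete as the set of pages crawled so far.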

Page importance - scoring
[Diagram: Nutch building blocks - Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; CrawlDB, LinkDB, shards (segments); URL filters & normalizers, parsing/indexing filters, scoring plugins]

Workflow: page scoring
Query-independent importance factor
  • Affects search ranking
  • Affects crawl prioritization
May include arbitrary decision of the operator
Currently two systems (plugins + tools)
OPIC scoring
  • Doesn't require explicit steps - “online”
  • Difficult to stabilize
PageRank scoring with loop detection
  • Periodically run to update CrawlDb scores
  • Computationally intensive, esp. loop detection
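
To show why OPIC needs no separate computation step, here is a simplified sketch of the published OPIC idea (On-line Page Importance Computation): every page holds some “cash”, and whenever a page is fetched its cash is recorded in its history and handed on to its outlinks. This is not the code of Nutch's OPIC scoring plugin; the function name and data structures are assumptions.

```python
def opic_step(cash, history, outlinks, page):
    """Process one fetched page: bank its cash, then split it among its outlinks."""
    amount = cash.get(page, 0.0)
    history[page] = history.get(page, 0.0) + amount
    cash[page] = 0.0
    targets = outlinks.get(page, [])
    if targets:
        share = amount / len(targets)
        for target in targets:
            cash[target] = cash.get(target, 0.0) + share
    # Note: in this simplified version a page without outlinks just loses its cash.
```

Accumulated history serves as the importance estimate, and it is updated as a side effect of crawling. Because the estimate depends on fetch order and keeps moving, it is hard to stabilize.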

Indexing
[Diagram: Nutch building blocks - Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; CrawlDB, LinkDB, shards (segments); URL filters & normalizers, parsing/indexing filters, scoring plugins]

Workflow: indexing
Indexing plugins
  • Create full-text search index from segment data
  • Apply adjustments to score per document
  • May further post-process the parsed content (e.g. language identification) to facilitate advanced search
Indexers
  • Lucene indexer - builds indexes to be served by Nutch search servers
  • Solr indexer - submits documents to a Solr instance

De-duplication
[Diagram: Nutch building blocks - Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; CrawlDB, LinkDB, shards (segments); URL filters & normalizers, parsing/indexing filters, scoring plugins]

Workflow: de-duplication
The same page may be present in many shards
  • Obsolete versions in older shards
  • Mirrored pages, or equivalent (a.com → www.a.com)
Many other pages are almost identical
  • Template-related differences (banners, current date)
  • Font / layout changes, minor re-wording
Hmm … what is a significant change?
  • Tricky issue. Hint: what is the page content?
Near-duplicate detection and removal
  • Nutch uses approximate page signatures (fingerprints)
  • Duplicates are only marked as deleted
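
One possible shape of an approximate signature, as a sketch only: count the words, round the counts down to a multiple of a quantum so that rare words (a changed date, a different banner) fall out, and hash what remains. The quantum of 2, the SHA-1 hash and the keep-the-newest policy are assumptions for illustration, not a description of Nutch's signature plugins.

```python
import hashlib
import re
from collections import Counter


def profile_signature(text, quant=2):
    """Hash a quantized word-frequency profile; words rarer than `quant` are ignored."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    profile = sorted(
        (token, (n // quant) * quant)
        for token, n in counts.items()
        if n >= quant
    )
    encoded = "\n".join("%s %d" % item for item in profile)
    return hashlib.sha1(encoded.encode("utf-8")).hexdigest()


def mark_duplicates(docs):
    """docs: list of (url, fetch_time, text). Returns URLs to mark as deleted,
    keeping the most recently fetched page for each signature."""
    seen = set()
    deleted = []
    for url, _, text in sorted(docs, key=lambda d: d[1], reverse=True):
        sig = profile_signature(text)
        if sig in seen:
            deleted.append(url)
        else:
            seen.add(sig)
    return deleted
```

The quantum decides what counts as a significant change: a larger value tolerates more re-wording but also merges pages that really differ.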

Cycles may overlap
[Diagram: Nutch building blocks - Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; CrawlDB, LinkDB, shards (segments); URL filters & normalizers, parsing/indexing filters, scoring plugins]

Nutch distributed search
Multiple search servers
Search front-end dispatches queries and collects search results
  • multiple access methods (API, OpenSearch XML, JSON)
Page servers build snippets
Fault-tolerant (with degraded quality)
[Diagram: search front-end(s) (web app, API, XML, JSON, via NutchBean) querying search servers (Nutch or Solr), each serving indexed shards]
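
The dispatch-and-collect role of the front-end, including the “degraded quality” case, can be sketched as follows. Servers are modelled as plain callables, and a failure as a ConnectionError. Both are assumptions; real search servers are remote and would be queried in parallel.

```python
import heapq


def search_all(servers, query, k):
    """Query every search server and merge the per-shard hits into a global top k.

    servers: callables taking (query, k) and returning [(score, doc_id), ...];
    a server that is down raises ConnectionError.
    Returns (top_hits, number_of_failed_servers).
    """
    hits, failed = [], 0
    for server in servers:
        try:
            hits.extend(server(query, k))
        except ConnectionError:
            failed += 1  # degrade: this shard's documents are simply missing
    return heapq.nlargest(k, hits), failed
```

Merging by raw score assumes that scores from different servers are comparable, which is only approximately true when each server computes term statistics over its own shards.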

Search configuration
Nutch syntax is limited - on purpose
  • Some queries are costly, e.g. leading wildcard, very long
  • Some queries may need implicit (hidden) expansion
Query plugins
  • From user query to Lucene/Solr query
    User query: web search
    +(url:web^4.0 anchor:web^2.0 content:web title:web^1.5 host:web^2.0)
    +(url:search^4.0 anchor:search^2.0 content:search title:search^1.5 host:search^2.0)
    url:"web search"~10^4.0 anchor:"web search"~4^2.0 content:"web search"~10 title:"web search"~10^1.5 host:"web search"~10^2.0
Search server configuration
  • Single search vs. distributed
  • Using Nutch searcher, or Solr, or a mix
  • Currently no global IDF calculation
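
The hidden expansion of a two-word user query can be sketched as below. The field names, boosts and slop values are read off the slide's example, while the function itself is an illustration rather than the query plugin code, and it does no escaping of special characters.

```python
# Field boosts and phrase slops taken from the example query; hypothetical as configuration.
FIELD_BOOSTS = [("url", 4.0), ("anchor", 2.0), ("content", 1.0),
                ("title", 1.5), ("host", 2.0)]
PHRASE_SLOP = {"anchor": 4}  # every other field uses DEFAULT_SLOP
DEFAULT_SLOP = 10


def _boost(boost):
    """Render '^boost', omitting the default boost of 1.0."""
    return "" if boost == 1.0 else "^%.1f" % boost


def expand(user_query):
    """Turn a plain user query into a Lucene-syntax query string."""
    terms = user_query.split()
    clauses = []
    # Every term is required, and may match in any of the fields.
    for term in terms:
        parts = ["%s:%s%s" % (field, term, _boost(b)) for field, b in FIELD_BOOSTS]
        clauses.append("+(" + " ".join(parts) + ")")
    # Optional sloppy-phrase clauses reward documents where the terms occur close together.
    if len(terms) > 1:
        phrase = " ".join(terms)
        for field, b in FIELD_BOOSTS:
            slop = PHRASE_SLOP.get(field, DEFAULT_SLOP)
            clauses.append('%s:"%s"~%d%s' % (field, phrase, slop, _boost(b)))
    return " ".join(clauses)
```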

Deployment: single-server
The easiest but the most limited deployment
Centralized storage, centralized processing
Hadoop LocalFS and LocalJobTracker
Drop nutch.war in Tomcat/webapps and point to shards
[Diagram: search front-end, fetcher, indexer and DB mgmt on top of LocalFS storage]

Deployment: distributed search
Local storage on search servers preferred (perf.)
[Diagram: search front-end in front of several search servers; fetcher, indexer and DB mgmt on LocalFS storage]

Deployment: map-reduce processing & distrib. search
Fully distributed storage and processing
Local storage on search servers preferred (perf.)
[Diagram: search front-end and search servers; DFS name node (job tracker) running fetcher, indexer and DB mgmt; DFS data nodes with task trackers]

Nutch on a Hadoop cluster
Assumes an up & running Hadoop cluster … which is covered elsewhere
Build nutch.job jar and use it as usual:
  bin/hadoop jar nutch.job <className> <args>
Note: Nutch configuration is inside nutch.job
  - When you change it you need to rebuild job jar
Searching is not a map-reduce job - often on a separate group of machines

Nutch index

… and press Search

Conclusions
(This overview is a tip of the iceberg)
Nutch
  • Implements all core search engine components
  • Scales well
  • Extremely configurable and modular
  • It's a complete solution - and a toolkit

Future of Nutch
Avoid code duplication
Parsing → Tika
  • Almost total overlap
  • Still some missing functionality in Tika → contribute
Plugins → OSGI
  • Home-grown plugin system has some deficiencies
  • Initial port available
Indexing & Search → Solr, Zookeeper
  • Distributed and replicated search is difficult
  • Initial integration needs significant improvement
  • Shard management - Katta
Web-graph & page repository → HBase
  • Combine CrawlDB, LinkDB and shard storage
  • Avoid tedious shard management
  • Initial port available

Future of Nutch (2)
What's left then?
  • Crawling frontier management, discovery
  • Re-crawl algorithms
  • Spider trap handling
  • Fetcher
  • Ranking: enterprise-specific, user-feedback
  • Duplicate detection, URL aliasing (mirror detection)
  • Template detection and cleanup, pagelet-level crawling
  • Spam & junk control
Share code with other crawler projects → crawler-commons
Vision: à la carte toolkit, scalable from 1 to 1000s of nodes

Summary
Q&A
Further information:
  • http://lucene.apache.org/nutch

Hadoop foundation
File system abstraction
  • Local FS, or
  • Distributed FS
    - also Amazon S3, Kosmos and other FS impl.
Map-reduce processing
  • Currently central to Nutch algorithms
  • Processing tasks are executed as one or more map-reduce jobs
  • Data maintained as Hadoop MapFile-s / SequenceFile-s
  • Massive updates very efficient
  • Small updates costly
    - Hadoop data is immutable, once created
    - Updates are really merge & replace
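
The cost profile of “merge & replace” can be sketched with two sorted record lists standing in for immutable, key-sorted files (an assumption; no Hadoop API is used here). An update never touches the old data. It streams old and new records together and writes a complete replacement.

```python
import heapq


def merge_and_replace(old, updates):
    """old, updates: lists of (key, value) sorted by key.

    Returns a new sorted list in which an update replaces the old record
    with the same key; neither input is modified.
    """
    merged = []
    # Tag records with their source so that, on equal keys, the update comes last.
    tagged = heapq.merge(((k, 0, v) for k, v in old),
                         ((k, 1, v) for k, v in updates))
    for key, _, value in tagged:
        if merged and merged[-1][0] == key:
            merged[-1] = (key, value)
        else:
            merged.append((key, value))
    return merged
```

The whole of `old` is read and rewritten even for a single changed record, which is why massive updates are efficient and small ones are costly.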

Search engine building blocks
[Diagram: generic search engine building blocks - Injector, Scheduler, Crawler, Parser, Updater, Indexer, Searcher; web graph (page info, in/out links), content repository; crawling frontier controls]

Nutch building blocks
[Diagram: Nutch building blocks - Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; CrawlDB, LinkDB, shards (segments); URL filters & normalizers, parsing/indexing filters, scoring plugins]

Nutch data
[Diagram: Nutch building blocks, annotated with the role of each data store]
CrawlDB maintains info on all known URL-s:
  • Fetch schedule
  • Fetch status
  • Page signature
  • Metadata
LinkDB: for each target URL keeps info on incoming links, i.e. list of source URL-s and their associated anchor text
Shards (“segments”) keep:
  • Raw page content
  • Parsed content + discovered metadata + outlinks
  • Plain text for indexing and snippets
  • Optional per-shard indexes

Nutch shards (a.k.a. “segments”)
Unit of work (batch) - easier to process massive datasets
Convenience placeholder, using predefined directory names
Unit of deployment to the search infrastructure
May include per-shard Lucene indexes
Once completed they are basically unmodifiable
  - No in-place updates of content, or replacing of obsolete content
Periodically phased-out
[Figure: shard directories named by timestamp (e.g. 2009043012345), each containing crawl_generate, crawl_fetch, content, crawl_parse, parse_data, parse_text and optionally indexes; written by Generator, Fetcher, Parser and Indexer; used for snippets and the “cached” view]

URL life-cycle and shards
A time-lapse view of reality
Goals:
  • Fresh index → sync with the real rate of changes
  • Minimize re-fetching → don't fetch unmodified pages
Each page may need its own crawl schedule
Shard management:
  • Page may be present in many shards, but only the most recent record is valid
  • Inconvenient to update in-place, just mark as deleted
  • Phase-out old shards and force re-fetch of the remaining pages
[Figure: timeline of one URL across shards S1, S2, S3 - injected / discovered, scheduled for fetching, fetched / up-to-date, repeated per shard - comparing real page changes with observed changes]
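
The shard-management rules can be sketched with shards modelled as dicts keyed by a timestamp-like name (hypothetical data layout and function names). The newest shard holding a URL wins, and phasing out a shard reveals which pages would disappear unless they are re-fetched.

```python
def latest_records(shards):
    """shards: {shard_name: {url: content}}, with names that sort chronologically.

    Returns url -> (shard_name, content) for the most recent record of each URL.
    """
    latest = {}
    for name in sorted(shards):
        for url, content in shards[name].items():
            latest[url] = (name, content)  # a newer shard overrides an older one
    return latest


def phase_out(shards, name):
    """Drop one shard; return (remaining shards, URLs that now need a re-fetch)."""
    remaining = {n: s for n, s in shards.items() if n != name}
    still_present = set()
    for shard in remaining.values():
        still_present.update(shard)
    refetch = sorted(set(shards[name]) - still_present)
    return remaining, refetch
```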

Crawling frontier
No authoritative catalog of web pages
Search engines discover their view of web universe
  - Start from “seed list”
  - Follow (walk) some (useful? interesting?) outlinks
Many dangers of simply wandering around
  - explosion or collapse of the frontier
  - collecting unwanted content (spam, junk, offensive)
“I could use some guidance”
18
Nut
ch ndash
Apa
cheC
on U
S 0
9
Controlling the crawling frontierURL filter plugins
minus White-list black-list regexminus May use external resources
(DB-s services )
URL normalizer pluginsminus Resolving relative path
elementsminus ldquoEquivalentrdquo URLs
Additional controls using scoring plugins
minus priority metadata selectblockCrawler traps ndash difficult
minus Domain host path-level stats
seed
i = 1
i = 2i = 3
19
Nut
ch ndash
Apa
cheC
on U
S 0
9
Wide vs focused crawlingDifferences
bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling
minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations
Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content
20
Nut
ch ndash
Apa
cheC
on U
S 0
9
Vertical amp enterprise searchVertical search
bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search
bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly
21
Nut
ch ndash
Apa
cheC
on U
S 0
9
Face to face with Nutch
22
Nut
ch ndash
Apa
cheC
on U
S 0
9
Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch
bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring
23
Nut
ch ndash
Apa
cheC
on U
S 0
9
Configuration filesEdit configuration in confnutch-sitexml
bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files
bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin
24
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch pluginsPlugin-based extensions for
bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)
25
Nut
ch ndash
Apa
cheC
on U
S 0
9
Main Nutch workflow
Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty
Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards
(repeat)
Command-line binnutchinject
generatefetchparseupdatedbupdatelinkdbindex solrindex
26
Nut
ch ndash
Apa
cheC
on U
S 0
9
Injecting new URL-s
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
27
Nut
ch ndash
Apa
cheC
on U
S 0
9
Generating fetchlists
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
28
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow generating fetchlistsWhat to fetch next
bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation
bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable
Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content
minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)
29
Nut
ch ndash
Apa
cheC
on U
S 0
9
Fetching content
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
30
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow fetching
Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind
Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step
(resource hungry and sometimes unstable)
31
Nut
ch ndash
Apa
cheC
on U
S 0
9
Content processing
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
32
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other
metadata) and parseText (plain text content)Nutch supports many popular file formats
33
Nut
ch ndash
Apa
cheC
on U
S 0
9
Link inversion
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
34
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow link inversionPages have outgoing links (outlinks)
hellip I know where Im pointing to
Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either
In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target
src 1
tgt 2
tgt 3
tgt 4tgt 5
tgt 1
tgt 1
src 2
src 3
src 4src 5
src 1
35
Nut
ch ndash
Apa
cheC
on U
S 0
9
Page importance - scoring
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
36
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow page scoringQuery-independent importance factor
bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring
bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection
bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection
37
Nut
ch ndash
Apa
cheC
on U
S 0
9
Indexing
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
38
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow indexingIndexing plugins
bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg
language identification) to facilitate advanced searchIndexers
bull Lucene indexer ndash builds indexes to be served by Nutch search servers
bull Solr indexer ndash submits documents to a Solr instance
39
Nut
ch ndash
Apa
cheC
on U
S 0
9
De-duplication
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
40
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow de-duplicationThe same page may be present in many shards
bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical
bull Template-related differences (banners current date)bull Font layout changes minor re-wording
Hmm hellip what is a significant change bull Tricky issue Hint what is the page content
Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted
41
Nut
ch ndash
Apa
cheC
on U
S 0
9
Cycles may overlap
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
42
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch distributed search
Multiple search serversSearch front-end dispatches queries and collects search results
bull multiple access methods (API OpenSearch XML JSON)
Page servers build snippetsFault-tolerant
with degraded quality
Search front-end(s)Web app
APIXML
NutchBean
JSON
Search server(Nutch or Solr)
indexed shards
43
Nut
ch ndash
Apa
cheC
on U
S 0
9
Search configurationNutch syntax is limited ndash on purpose
bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins
bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20
Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation
44
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards
Searchfront-endSearch
front-end
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
45
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment distributed searchLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
46
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment map-reduce processing amp distrib search
Fully distributed storage and processingLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
DFS Data NodeDFS Data Node
DFS Data Node
DFS Name NodeDFS Name Node(Job Tracker)
FetcherIndexer
DB mgmt
(Job Tracker)FetcherIndexer
DB mgmt
Task TrackerTask Tracker
Task Tracker
47
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch on a Hadoop cluster
Assumes an up amp running Hadoop clusterhellip which is covered elsewhere
Build nutchjob jar and use it as usual
binhadoop jar nutchjob ltclassNamegt ltargsgt
Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar
Searching is not a map-reduce job ndash often on a separate group of machines
48
Nut
ch ndash
Apa
cheC
on U
S 0
9
49
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch index
50
Nut
ch ndash
Apa
cheC
on U
S 0
9
hellip and press Search
51
Nut
ch ndash
Apa
cheC
on U
S 0
9
Conclusions(This overview is a tip of the iceberg)
NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit
52
Nut
ch ndash
Apa
cheC
on U
S 0
9
Future of NutchAvoid code duplicationParsing rarr Tika
bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI
bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper
bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase
bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available
53
Nut
ch ndash
Apa
cheC
on U
S 0
9
Future of Nutch (2)Whats left then
bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons
Vision aacute la carte toolkit scalable from1-1000s nodes
54
Nut
ch ndash
Apa
cheC
on U
S 0
9
Summary
QampA
Further informationbull httpluceneapacheorgnutch
10
Nut
ch ndash
Apa
cheC
on U
S 0
9
Search engine building blocks
Web graph-page info-links (inout)
Crawler
Parser
Searcher
IndexerUpdater
Scheduler
Contentrepository
Injector
Crawling frontier controls
11
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch building blocks
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
12
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch data
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
Maintains info on all known URL-s Fetch schedule Fetch status Page signature Metadata
13
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch data
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
For each target URL keeps info on incoming links ie list of source URL-s and their associated anchor text
14
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch data
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes
15
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable
minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out
2009043012345
crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text
2009043012345
crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text
200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes
Generator
Fetcher
Parser
Indexersnippets
ldquocachedrdquo view
16
Nut
ch ndash
Apa
cheC
on U
S 0
9
URL life-cycle and shardsinjected
discoveredscheduledfor fetching
S1
fetched up-to-date
scheduledfor fetching
fetched up-to-date
A time-lapse view of realityGoals
bull Fresh index rarr sync with the real rate of changes
bull Minimize re-fetching rarr dont fetch unmodified pages
Each page may need its own crawl scheduleShard management
bull Page may be present in many shards but only the most recent record is valid
bull Inconvenient to update in-place just mark as deleted
bull Phase-out old shards and force re-fetch of the remaining pages
scheduledfor fetching
fetched up-to-date
time
S2
S3
real page changesobserved changes
17
Nut
ch ndash
Apa
cheC
on U
S 0
9
Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe
minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks
Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)
I could usesome guidance
18
Nut
ch ndash
Apa
cheC
on U
S 0
9
Controlling the crawling frontierURL filter plugins
minus White-list black-list regexminus May use external resources
(DB-s services )
URL normalizer pluginsminus Resolving relative path
elementsminus ldquoEquivalentrdquo URLs
Additional controls using scoring plugins
minus priority metadata selectblockCrawler traps ndash difficult
minus Domain host path-level stats
seed
i = 1
i = 2i = 3
19
Nut
ch ndash
Apa
cheC
on U
S 0
9
Wide vs focused crawlingDifferences
bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling
minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations
Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content
20
Nut
ch ndash
Apa
cheC
on U
S 0
9
Vertical amp enterprise searchVertical search
bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search
bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly
21
Nut
ch ndash
Apa
cheC
on U
S 0
9
Face to face with Nutch
22
Nut
ch ndash
Apa
cheC
on U
S 0
9
Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch
bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring
23
Nut
ch ndash
Apa
cheC
on U
S 0
9
Configuration filesEdit configuration in confnutch-sitexml
bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files
bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin
Nutch plugins
Plugin-based extensions for:
• Crawl scheduling
• URL filtering and normalization
• Protocol for getting the content
• Content parsing
• Text analysis (tokenization)
• Page signature (to detect near-duplicates)
• Indexing filters (index fields & metadata)
• Snippet generation and highlighting
• Scoring and ranking
• Query translation and expansion (user → Lucene)
Main Nutch workflow
Inject: initial creation of CrawlDB
• Insert seed URLs
• Initial LinkDB is empty
Generate new shard's fetchlist
Fetch raw content
Parse content (discovers outlinks)
Update CrawlDB from shards
Update LinkDB from shards
Index shards
(repeat)
Command-line: bin/nutch
• inject
• generate
• fetch
• parse
• updatedb
• updatelinkdb
• index / solrindex
Injecting new URL-s
[Figure: Nutch building-blocks diagram (Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; CrawlDB, LinkDB, shards/segments; URL filters & normalizers, parsing/indexing filters, scoring plugins)]
Generating fetchlists
[Figure: Nutch building-blocks diagram]
Workflow: generating fetchlists
What to fetch next?
• Breadth-first – important due to "politeness" limits
• Expired (longer than fetchTime + fetchInterval)
• Highly ranking (PageRank)
• Newly added
Fetchlist generation
• "topN" - select best candidates
  − Priority based on many factors, pluggable
Adaptive fetch schedule
• Detect rate of changes & time of change
• Detect unmodified content
  − Hmm, and how to recognize this? → approximate page signatures (near-duplicate detection)
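The "topN" selection can be illustrated with a minimal Python sketch. It is not Nutch's Generator: the `CrawlEntry` record, the `max_per_host` cap and the single numeric `score` are assumptions made for illustration. It keeps only entries whose fetchTime + fetchInterval has passed, ranks them by score, and limits how many URLs one host may contribute, a crude stand-in for "politeness".

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class CrawlEntry:
    """Hypothetical CrawlDB record; field names are illustrative only."""
    url: str
    fetch_time: float      # time of the last fetch, in seconds
    fetch_interval: float  # seconds after which the page counts as expired
    score: float           # query-independent importance


def generate_fetchlist(entries, now, top_n, max_per_host):
    """Return up to top_n due URLs, best score first, capped per host."""
    due = [e for e in entries if now >= e.fetch_time + e.fetch_interval]
    due.sort(key=lambda e: e.score, reverse=True)
    per_host = {}
    fetchlist = []
    for entry in due:
        if len(fetchlist) >= top_n:
            break
        host = urlsplit(entry.url).hostname
        if per_host.get(host, 0) >= max_per_host:
            continue  # this host already has its share of the fetchlist
        per_host[host] = per_host.get(host, 0) + 1
        fetchlist.append(entry.url)
    return fetchlist
```

In a real crawler the priority would come from pluggable scoring rather than one stored number, as the slide notes.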
Fetching content
[Figure: Nutch building-blocks diagram]
Workflow: fetching
Multi-protocol: HTTP(s), FTP, file, etc.
Coordinates multiple threads accessing the same host
• "politeness" issues vs. crawling efficiency
• Host: an IP or DNS name?
• Redirections: should follow immediately? What kind?
Other netiquette issues
• robots.txt: disallowed paths, Crawl-Delay
Preparation for parsing
• Content type detection issues
• Parsing usually executed as a separate step (resource hungry and sometimes unstable)
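The robots.txt rules mentioned above (disallowed paths and Crawl-Delay) can be illustrated with Python's standard `urllib.robotparser`. This is only a sketch of the idea, not Nutch's fetcher: the agent name `my-crawler`, the sample robots.txt and the `default_delay` fallback are assumptions.

```python
import urllib.robotparser

# Sample robots.txt content, made up for this sketch.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def load_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def next_allowed_fetch(last_fetch_time, rules, agent, default_delay=1.0):
    """Earliest time the same host may be contacted again."""
    delay = rules.crawl_delay(agent)
    if delay is None:  # the site did not ask for a specific delay
        delay = default_delay
    return last_fetch_time + delay
```

A fetcher thread would call `rules.can_fetch(agent, url)` before each request and keep one "next allowed" timestamp per host (or per IP, which is the open question the slide raises).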
Content processing
[Figure: Nutch building-blocks diagram]
Workflow: content processing
Protocol plugins retrieve content as plain bytes + protocol-level metadata (e.g. HTTP headers)
Parse plugins
• Content is parsed by MIME-type specific parsers
• Content is parsed into parseData (title, outlinks, other metadata) and parseText (plain text content)
Nutch supports many popular file formats
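As a rough illustration of MIME-type dispatch and the parseData / parseText split, here is a small Python sketch built on the standard `html.parser` module. The `PARSERS` table, the dictionary layout of `parse_data` and the UTF-8 decoding are assumptions made for the example; Nutch's parse plugins are Java classes and are considerably more careful.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SimpleHtmlParser(HTMLParser):
    """Collects the title, outlinks with anchor text, and plain text."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.outlinks = []    # (absolute target URL, anchor text)
        self.text_parts = []
        self._in_title = False
        self._href = None
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._anchor = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.outlinks.append((self._href, " ".join(self._anchor)))
            self._href = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
            return
        self.text_parts.append(text)
        if self._href is not None:
            self._anchor.append(text)


def parse_html(raw, base_url):
    parser = SimpleHtmlParser(base_url)
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    parse_data = {"title": parser.title, "outlinks": parser.outlinks}
    parse_text = " ".join(parser.text_parts)
    return parse_data, parse_text


# Hypothetical MIME type -> parser mapping, in the spirit of parse-plugins.xml.
PARSERS = {"text/html": parse_html}


def parse(content_type, raw, base_url):
    parser = PARSERS.get(content_type)
    if parser is None:
        raise ValueError(f"no parser registered for {content_type}")
    return parser(raw, base_url)
```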
Link inversion
[Figure: Nutch building-blocks diagram]
Workflow: link inversion
Pages have outgoing links (outlinks)
… I know where I'm pointing to
Question: who points to me?
… I don't know, there is no catalog of pages
… NOBODY knows for sure either!
In-degree indicates importance of the page
Anchor text provides important semantic info
Partial answer: invert the outlinks that I know about, and group by target
[Figure: outlink records (src → tgt) inverted into inlink records grouped by target (tgt ← src)]
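The inversion step itself is a simple group-by, sketched below in Python. The in-memory dictionaries are an assumption for illustration; Nutch runs the equivalent operation as a distributed job over shard data.

```python
from collections import defaultdict


def invert_links(outlinks):
    """Turn {source: [(target, anchor), ...]} into {target: [(source, anchor), ...]}."""
    inlinks = defaultdict(list)
    for source, links in outlinks.items():
        for target, anchor in links:
            inlinks[target].append((source, anchor))
    return dict(inlinks)
```

The length of each inlink list is the known in-degree of the target, and the collected anchor texts are what later gets indexed alongside the page.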
Page importance - scoring
[Figure: Nutch building-blocks diagram]
Workflow: page scoring
Query-independent importance factor
• Affects search ranking
• Affects crawl prioritization
May include arbitrary decision of the operator
Currently two systems (plugins + tools)
OPIC scoring
• Doesn't require explicit steps - "online"
• Difficult to stabilize
PageRank scoring with loop detection
• Periodically run to update CrawlDb scores
• Computationally intensive, esp. loop detection
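To show why OPIC needs no separate computation step, here is a deliberately simplified Python sketch of its cash-distribution idea: when a page is processed, its accumulated "cash" is recorded and split equally among its outlinks. This is an illustration under assumptions, not the code of Nutch's OPIC scoring plugin; the `cash` and `history` dictionaries are hypothetical, and pages without outlinks simply lose their cash here.

```python
def opic_step(page, cash, history, outlinks):
    """Process one fetched page: bank its cash, then pass it on to its targets."""
    amount = cash.get(page, 0.0)
    history[page] = history.get(page, 0.0) + amount
    cash[page] = 0.0
    targets = outlinks.get(page, [])
    if targets:
        share = amount / len(targets)
        for target in targets:
            cash[target] = cash.get(target, 0.0) + share
```

Because the update happens while pages are being processed, scores keep drifting as the crawl proceeds, which is one way to read the "difficult to stabilize" remark.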
Indexing
[Figure: Nutch building-blocks diagram]
Workflow: indexing
Indexing plugins
• Create full-text search index from segment data
• Apply adjustments to score per document
• May further post-process the parsed content (e.g. language identification) to facilitate advanced search
Indexers
• Lucene indexer – builds indexes to be served by Nutch search servers
• Solr indexer – submits documents to a Solr instance
De-duplication
[Figure: Nutch building-blocks diagram]
Workflow: de-duplication
The same page may be present in many shards
• Obsolete versions in older shards
• Mirrored pages, or equivalent (a.com → www.a.com)
Many other pages are almost identical
• Template-related differences (banners, current date)
• Font / layout changes, minor re-wording
Hmm … what is a significant change?
• Tricky issue! Hint: what is the page content?
Near-duplicate detection and removal
• Nutch uses approximate page signatures (fingerprints)
• Duplicates are only marked as deleted
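One way to build an approximate signature is to hash a coarse word-frequency profile instead of the raw bytes, so that rare words (a date, a small re-wording) do not change the result. The Python sketch below follows that idea; the token rules, the `quant_rate` default and the use of MD5 are assumptions for illustration and are not claimed to match Nutch's signature plugins.

```python
import hashlib
import re
from collections import Counter


def page_signature(text, min_token_len=2, quant_rate=0.01):
    """Hash the quantized profile of frequent tokens in the text."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    max_freq = max(counts.values(), default=0)
    # Quantization step: tokens rarer than this are ignored.
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1
    profile = []
    for token, freq in counts.items():
        rounded = (freq // quant) * quant
        if rounded >= quant:
            profile.append((token, rounded))
    profile.sort(key=lambda item: (-item[1], item[0]))
    digest_input = "\n".join(f"{token} {freq}" for token, freq in profile)
    return hashlib.md5(digest_input.encode("utf-8")).hexdigest()
```

Two pages with equal signatures are treated as near-duplicates; how aggressive that is depends entirely on the thresholds, which is the "what is a significant change?" question from the slide.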
Cycles may overlap
[Figure: Nutch building-blocks diagram]
Nutch distributed search
Multiple search servers
Search front-end dispatches queries and collects search results
• multiple access methods (API, OpenSearch XML, JSON)
Page servers build snippets
Fault-tolerant (with degraded quality)
[Figure: search front-end(s) (web app, API / NutchBean, XML, JSON) dispatching to search servers (Nutch or Solr), each serving indexed shards]
Search configuration
Nutch syntax is limited – on purpose!
• Some queries are costly, e.g. leading wildcard, very long
• Some queries may need implicit (hidden) expansion
Query plugins
• From user query to Lucene/Solr query
User query: web search
+(url:web^4.0 anchor:web^2.0 content:web title:web^1.5 host:web^2.0) +(url:search^4.0 anchor:search^2.0 content:search title:search^1.5 host:search^2.0) url:"web search"~10^4.0 anchor:"web search"~4^2.0 content:"web search"~10 title:"web search"~10^1.5 host:"web search"~10^2.0
Search server configuration
• Single search vs. distributed
• Using Nutch searcher, or Solr, or a mix
• Currently no global IDF calculation
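The expansion shown above can be reproduced with a few lines of Python. This is only a sketch of the transformation: the `FIELDS` table copies the boosts and phrase slops from the example, the function name is made up, and no escaping of special query characters is attempted.

```python
# (field, boost, phrase slop), taken from the expanded query in the example.
FIELDS = [
    ("url", 4.0, 10),
    ("anchor", 2.0, 4),
    ("content", 1.0, 10),
    ("title", 1.5, 10),
    ("host", 2.0, 10),
]


def _boost(value):
    """Render a boost suffix, omitting the neutral boost of 1.0."""
    return "" if value == 1.0 else f"^{value}"


def expand_query(user_query):
    """Expand a plain user query into a multi-field Lucene-style query string."""
    terms = user_query.lower().split()
    clauses = []
    # Every term is required in at least one of the fields.
    for term in terms:
        per_field = " ".join(f"{f}:{term}{_boost(b)}" for f, b, _ in FIELDS)
        clauses.append(f"+({per_field})")
    # Optional sloppy-phrase clauses reward terms that appear close together.
    if len(terms) > 1:
        phrase = " ".join(terms)
        for field, boost, slop in FIELDS:
            clauses.append(f'{field}:"{phrase}"~{slop}{_boost(boost)}')
    return " ".join(clauses)
```

Calling `expand_query("web search")` yields the expanded query string shown in the example.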
Deployment: single-server
The easiest but the most limited deployment
Centralized storage, centralized processing
Hadoop LocalFS and LocalJobTracker
Drop nutch.war in Tomcat/webapps and point to shards
[Figure: a single machine hosting the search front-end, LocalFS storage, fetcher, indexer and DB management]
Deployment: distributed search
Local storage on search servers preferred (perf.)
[Figure: search front-end and several search servers, alongside a machine with LocalFS storage running the fetcher, indexer and DB management]
Deployment: map-reduce processing & distrib. search
Fully distributed storage and processing
Local storage on search servers preferred (perf.)
[Figure: search front-end and search servers; DFS Name Node (Job Tracker) coordinating several DFS Data Nodes / Task Trackers that run the fetcher, indexer and DB management]
Nutch on a Hadoop cluster
Assumes an up & running Hadoop cluster … which is covered elsewhere
Build nutch.job jar and use it as usual:
bin/hadoop jar nutch.job <className> <args>
Note: Nutch configuration is inside nutch.job
− When you change it you need to rebuild job jar
Searching is not a map-reduce job – often on a separate group of machines
Nutch index
… and press Search
Conclusions
(This overview is a tip of the iceberg)
Nutch
• Implements all core search engine components
• Scales well
• Extremely configurable and modular
• It's a complete solution – and a toolkit
Future of Nutch
Avoid code duplication
Parsing → Tika
• Almost total overlap
• Still some missing functionality in Tika → contribute
Plugins → OSGI
• Home-grown plugin system has some deficiencies
• Initial port available
Indexing & Search → Solr, Zookeeper
• Distributed and replicated search is difficult
• Initial integration needs significant improvement
• Shard management - Katta?
Web-graph & page repository → HBase
• Combine CrawlDB, LinkDB and shard storage
• Avoid tedious shard management
• Initial port available
Future of Nutch (2)
What's left then?
• Crawling frontier management, discovery
• Re-crawl algorithms
• Spider trap handling
• Fetcher
• Ranking: enterprise-specific, user-feedback
• Duplicate detection, URL aliasing (mirror detection)
• Template detection and cleanup, pagelet-level crawling
• Spam & junk control
Share code with other crawler projects → crawler-commons
Vision: à la carte toolkit, scalable from 1 to 1000s of nodes
Summary
Q&A
Further information:
• http://lucene.apache.org/nutch
Nutch building blocks
[Figure: Nutch building-blocks diagram (Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; CrawlDB, LinkDB, shards/segments; URL filters & normalizers, parsing/indexing filters, scoring plugins)]
Nutch data
[Figure: Nutch building-blocks diagram]
CrawlDB: maintains info on all known URL-s
• Fetch schedule
• Fetch status
• Page signature
• Metadata
Nutch data
[Figure: Nutch building-blocks diagram]
LinkDB: for each target URL, keeps info on incoming links, i.e. the list of source URL-s and their associated anchor text
Nutch data
[Figure: Nutch building-blocks diagram]
Shards ("segments") keep:
• Raw page content
• Parsed content + discovered metadata + outlinks
• Plain text for indexing and snippets
• Optional per-shard indexes
Nutch shards (a.k.a. "segments")
Unit of work (batch) – easier to process massive datasets
Convenience placeholder, using predefined directory names
Unit of deployment to the search infrastructure
May include per-shard Lucene indexes
Once completed they are basically unmodifiable
− No in-place updates of content, or replacing of obsolete content
Periodically phased-out
[Figure: three timestamp-named shard directories (e.g. 200904301234), each containing crawl_generate, crawl_fetch, content, crawl_parse, parse_data, parse_text and, in one case, indexes; annotated with the Generator, Fetcher, Parser and Indexer that write them and with the snippets and "cached" view that read them]
URL life-cycle and shards
A time-lapse view of reality
Goals:
• Fresh index → sync with the real rate of changes
• Minimize re-fetching → don't fetch unmodified pages
Each page may need its own crawl schedule
Shard management:
• Page may be present in many shards, but only the most recent record is valid
• Inconvenient to update in-place, just mark as deleted
• Phase-out old shards and force re-fetch of the remaining pages
[Figure: timeline of one URL across shards S1, S2 and S3 (injected / discovered → scheduled for fetching → fetched / up-to-date, repeated in each shard), comparing real page changes with observed changes over time]
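A per-page schedule can be approximated by shrinking the re-fetch interval when a page turns out to have changed and stretching it when it has not. The Python sketch below shows that rule; the rates and the interval bounds are illustrative assumptions, not values taken from the slides.

```python
def next_fetch_interval(interval, modified, inc_rate=0.4, dec_rate=0.2,
                        min_interval=60.0, max_interval=30 * 24 * 3600.0):
    """Return the new re-fetch interval in seconds for one page."""
    if modified:
        interval *= 1.0 - dec_rate   # page changed: come back sooner
    else:
        interval *= 1.0 + inc_rate   # page unchanged: come back later
    # Keep the interval within sane bounds.
    return min(max(interval, min_interval), max_interval)
```

Whether a page counts as "modified" is decided by comparing page signatures between fetches, which ties this schedule to near-duplicate detection.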
Crawling frontier
No authoritative catalog of web pages
Search engines discover their view of web universe
− Start from "seed list"
− Follow (walk) some (useful? interesting?) outlinks
Many dangers of simply wandering around
− explosion or collapse of the frontier
− collecting unwanted content (spam, junk, offensive)
[Figure caption: "I could use some guidance"]
Controlling the crawling frontier
URL filter plugins
− White-list, black-list, regex
− May use external resources (DB-s, services ...)
URL normalizer plugins
− Resolving relative path elements
− "Equivalent" URLs
Additional controls using scoring plugins
− priority, metadata select/block
Crawler traps – difficult
− Domain / host / path-level stats
[Figure: crawling frontier growing outward from the seed over iterations i = 1, 2, 3]
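The two kinds of plugins can be illustrated with a short Python sketch based on the standard `urllib.parse` module. Both functions are assumptions made for the example: `normalize_url` applies a handful of common equivalence rules (relative paths, host case, default ports, fragments), and `accept_url` uses an ordered list of signed regular expressions where the first match decides and unmatched URLs are rejected.

```python
import re
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(base, link):
    """Resolve a link against its page and reduce it to a canonical form."""
    absolute = urljoin(base, link)  # also resolves "./" and "../" segments
    parts = urlsplit(absolute)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"  # keep only non-default ports
    path = parts.path or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))  # drop the fragment


def accept_url(url, rules):
    """rules: ordered (sign, regex) pairs; '+' accepts, '-' rejects."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # nothing matched: stay inside the known frontier
```

Normalizing before filtering matters: otherwise "equivalent" spellings of one URL would each be stored and fetched as a separate page.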
12
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch data
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
Maintains info on all known URL-s Fetch schedule Fetch status Page signature Metadata
13
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch data
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
For each target URL keeps info on incoming links ie list of source URL-s and their associated anchor text
14
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch data
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes
15
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable
minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out
2009043012345
crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text
2009043012345
crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text
200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes
Generator
Fetcher
Parser
Indexersnippets
ldquocachedrdquo view
16
Nut
ch ndash
Apa
cheC
on U
S 0
9
URL life-cycle and shardsinjected
discoveredscheduledfor fetching
S1
fetched up-to-date
scheduledfor fetching
fetched up-to-date
A time-lapse view of realityGoals
bull Fresh index rarr sync with the real rate of changes
bull Minimize re-fetching rarr dont fetch unmodified pages
Each page may need its own crawl scheduleShard management
bull Page may be present in many shards but only the most recent record is valid
bull Inconvenient to update in-place just mark as deleted
bull Phase-out old shards and force re-fetch of the remaining pages
scheduledfor fetching
fetched up-to-date
time
S2
S3
real page changesobserved changes
17
Nut
ch ndash
Apa
cheC
on U
S 0
9
Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe
minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks
Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)
I could usesome guidance
18
Nut
ch ndash
Apa
cheC
on U
S 0
9
Controlling the crawling frontierURL filter plugins
minus White-list black-list regexminus May use external resources
(DB-s services )
URL normalizer pluginsminus Resolving relative path
elementsminus ldquoEquivalentrdquo URLs
Additional controls using scoring plugins
minus priority metadata selectblockCrawler traps ndash difficult
minus Domain host path-level stats
seed
i = 1
i = 2i = 3
19
Nut
ch ndash
Apa
cheC
on U
S 0
9
Wide vs focused crawlingDifferences
bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling
minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations
Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content
20
Nut
ch ndash
Apa
cheC
on U
S 0
9
Vertical amp enterprise searchVertical search
bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search
bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly
21
Nut
ch ndash
Apa
cheC
on U
S 0
9
Face to face with Nutch
22
Nut
ch ndash
Apa
cheC
on U
S 0
9
Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch
bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring
23
Nut
ch ndash
Apa
cheC
on U
S 0
9
Configuration filesEdit configuration in confnutch-sitexml
bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files
bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin
24
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch pluginsPlugin-based extensions for
bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)
25
Nut
ch ndash
Apa
cheC
on U
S 0
9
Main Nutch workflow
Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty
Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards
(repeat)
Command-line binnutchinject
generatefetchparseupdatedbupdatelinkdbindex solrindex
26
Nut
ch ndash
Apa
cheC
on U
S 0
9
Injecting new URL-s
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins

Generating fetchlists
[Diagram: Nutch architecture overview]

Workflow: generating fetchlists
What to fetch next?
• Breadth-first - important due to "politeness" limits
• Expired (longer than fetchTime + fetchInterval)
• Highly ranking (PageRank)
• Newly added
Fetchlist generation
• "topN" - select best candidates
  − Priority based on many factors, pluggable
Adaptive fetch schedule
• Detect rate of changes & time of change
• Detect unmodified content
  − Hmm, and how to recognize this? → approximate page signatures (near-duplicate detection)
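
A minimal sketch of the "topN" selection described above, assuming a single numeric score per URL and the "expired" rule from the slide (now ≥ fetchTime + fetchInterval). The `CrawlDatum` fields and the function names are illustrative, not the Nutch classes of the same name.

```python
import heapq
from dataclasses import dataclass

@dataclass
class CrawlDatum:
    url: str
    fetch_time: float      # time of last fetch in seconds; 0 means never fetched
    fetch_interval: float  # seconds to wait before re-fetching
    score: float           # priority, e.g. from a scoring plugin

def is_due(datum, now):
    # Newly added pages are always due; others once their interval expired.
    return datum.fetch_time == 0 or now >= datum.fetch_time + datum.fetch_interval

def generate_fetchlist(crawldb, now, top_n):
    # Keep only due entries, then pick the top_n by descending score.
    due = [d for d in crawldb if is_due(d, now)]
    best = heapq.nlargest(top_n, due, key=lambda d: d.score)
    return [d.url for d in best]

crawldb = [
    CrawlDatum("http://a.example/", 0, 100, 0.5),    # never fetched
    CrawlDatum("http://b.example/", 950, 100, 9.0),  # fetched recently
    CrawlDatum("http://c.example/", 500, 100, 2.0),  # expired
    CrawlDatum("http://d.example/", 100, 100, 1.0),  # expired
]
fetchlist = generate_fetchlist(crawldb, now=1000, top_n=2)
```

Note that the highest-scoring page (`b.example`) is skipped because it is not due yet; priority only ranks the candidates that passed the schedule check.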

Fetching content
[Diagram: Nutch architecture overview]

Workflow: fetching
Multi-protocol: HTTP(s), FTP, file, etc.
Coordinates multiple threads accessing the same host
• "Politeness" issues vs. crawling efficiency
• Host: an IP or DNS name?
• Redirections: should follow immediately? What kind?
Other netiquette issues
• robots.txt: disallowed paths, Crawl-Delay
Preparation for parsing
• Content type detection issues
• Parsing usually executed as a separate step (resource hungry and sometimes unstable)
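
The robots.txt checks mentioned above can be illustrated with Python's standard `urllib.robotparser`. This is a sketch of the netiquette rules only, not of the Nutch fetcher; the agent name, the sample robots.txt and the `next_allowed_fetch` helper are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for one host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

AGENT = "my-test-crawler"  # hypothetical agent name

def load_rules(robots_txt):
    # Parse robots.txt text that was already downloaded.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def next_allowed_fetch(last_fetch, delay, default_delay=1.0):
    # Earliest time the same host may be contacted again.
    return last_fetch + (delay if delay is not None else default_delay)

rules = load_rules(ROBOTS_TXT)
allowed = rules.can_fetch(AGENT, "http://example.com/index.html")
blocked = rules.can_fetch(AGENT, "http://example.com/private/report.html")
delay = rules.crawl_delay(AGENT)
```

A multi-threaded fetcher would keep one "next allowed" timestamp per host and let threads pick work only from hosts whose timestamp has passed.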

Content processing
[Diagram: Nutch architecture overview]

Workflow: content processing
Protocol plugins retrieve content as plain bytes + protocol-level metadata (e.g. HTTP headers)
Parse plugins
• Content is parsed by MIME-type specific parsers
• Content is parsed into parseData (title, outlinks, other metadata) and parseText (plain text content)
Nutch supports many popular file formats

Link inversion
[Diagram: Nutch architecture overview]

Workflow: link inversion
Pages have outgoing links (outlinks)
… I know where I'm pointing to
Question: who points to me?
… I don't know, there is no catalog of pages
… NOBODY knows for sure either!
In-degree indicates importance of the page
Anchor text provides important semantic info
Partial answer: invert the outlinks that I know about, and group by target
[Diagram: outlinks listed per source (src → tgt) inverted into inlinks listed per target (tgt ← src)]
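
Inverting the known outlinks and grouping them by target, as described above, takes only a few lines. This is a minimal sketch under the assumption that outlinks are available as (target, anchor text) pairs per source page; the function name and sample URLs are hypothetical.

```python
from collections import defaultdict

def invert_links(outlinks):
    """outlinks: {source_url: [(target_url, anchor_text), ...]}"""
    inlinks = defaultdict(list)
    for source, links in outlinks.items():
        for target, anchor in links:
            # Group by target: remember who points here and with what text.
            inlinks[target].append((source, anchor))
    return dict(inlinks)

outlinks = {
    "http://a.example/": [("http://c.example/", "crawler docs")],
    "http://b.example/": [("http://c.example/", "Nutch crawler"),
                          ("http://a.example/", "home")],
}
inlinks = invert_links(outlinks)
in_degree = {url: len(sources) for url, sources in inlinks.items()}
```

The result is only a partial answer, as the slide says: it covers the links seen so far, so in-degree is a lower bound.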

Page importance - scoring
[Diagram: Nutch architecture overview]

Workflow: page scoring
Query-independent importance factor
• Affects search ranking
• Affects crawl prioritization
May include arbitrary decisions of the operator
Currently two systems (plugins + tools)
OPIC scoring
• Doesn't require explicit steps - "online"
• Difficult to stabilize
PageRank scoring with loop detection
• Periodically run to update CrawlDb scores
• Computationally intensive, esp. loop detection
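
For readers unfamiliar with link-based scoring, here is a textbook PageRank power iteration. It is a sketch of the general technique only: it has no loop detection, it is not the Nutch scoring tool, and the damping factor and iteration count are conventional illustrative values.

```python
def pagerank(outlinks, damping=0.85, iterations=50):
    """outlinks: {page: [target_page, ...]}; returns {page: score}."""
    nodes = set(outlinks)
    for targets in outlinks.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every page gets a small base score, then shares from its inlinks.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = outlinks.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in nodes:
                    new_rank[t] += damping * rank[u] / n
        rank = new_rank
    return rank

graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
```

Every iteration touches the whole graph, which hints at why the slide calls this approach computationally intensive compared with an "online" scheme such as OPIC.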

Indexing
[Diagram: Nutch architecture overview]

Workflow: indexing
Indexing plugins
• Create full-text search index from segment data
• Apply adjustments to score per document
• May further post-process the parsed content (e.g. language identification) to facilitate advanced search
Indexers
• Lucene indexer - builds indexes to be served by Nutch search servers
• Solr indexer - submits documents to a Solr instance

De-duplication
[Diagram: Nutch architecture overview]

Workflow: de-duplication
The same page may be present in many shards
• Obsolete versions in older shards
• Mirrored pages or equivalent (a.com → www.a.com)
Many other pages are almost identical
• Template-related differences (banners, current date)
• Font / layout changes, minor re-wording
Hmm … what is a significant change?
• Tricky issue! Hint: what is the page content?
Near-duplicate detection and removal
• Nutch uses approximate page signatures (fingerprints)
• Duplicates are only marked as deleted
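
One way to build an approximate page signature is to hash a coarse profile of the text instead of the raw bytes. The sketch below follows that general idea (word frequencies, quantized, rare words dropped); it is a simplification written for this example, not the signature plugin shipped with Nutch, and the `quant` parameter and tokenization rule are assumptions.

```python
import hashlib
import re
from collections import Counter

def page_signature(text, quant=2):
    # Tokenize into lowercase alphabetic words; digits (dates, counters) vanish.
    tokens = re.findall(r"[a-z]{2,}", text.lower())
    counts = Counter(tokens)
    profile = []
    for token, count in counts.items():
        # Round frequencies down to a multiple of quant; rare words drop out.
        quantized = (count // quant) * quant
        if quantized > 0:
            profile.append((quantized, token))
    # Stable order: most frequent first, ties broken alphabetically.
    profile.sort(key=lambda item: (-item[0], item[1]))
    digest_input = "\n".join(f"{token} {freq}" for freq, token in profile)
    return hashlib.md5(digest_input.encode("utf-8")).hexdigest()

page_a = ("Nutch crawler. Nutch search. Crawler and search toolkit. "
          "Updated 2009-11-02. Banner: buy now")
page_b = ("Nutch crawler. Nutch search. Crawler and search toolkit. "
          "Updated 2009-11-05. Advert: free trial")
page_c = "Hadoop cluster. Hadoop jobs. Cluster and jobs."
```

Here `page_a` and `page_b` differ only in the date and the banner text, so they collapse to the same signature, while `page_c` does not. The trade-off is the one the slide hints at: the coarser the profile, the more real changes are missed.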

Cycles may overlap
[Diagram: Nutch architecture overview]

Nutch distributed search
Multiple search servers
Search front-end dispatches queries and collects search results
• Multiple access methods (API, OpenSearch XML, JSON)
Page servers build snippets
Fault-tolerant, with degraded quality
[Diagram: search front-end(s) (web app, API via NutchBean, XML, JSON) querying search servers (Nutch or Solr), each serving indexed shards]
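
The dispatch-and-collect behaviour of the front-end can be sketched as a merge of per-server result lists. This is an illustration under simple assumptions, not the NutchBean API: each search server is modelled as a callable returning hits sorted by descending score, and an unreachable server is skipped, which gives the "degraded quality" mode.

```python
import heapq

def search_all(servers, query, k):
    """servers: callables returning [(score, doc_id), ...], best hit first."""
    partial_results, failed = [], 0
    for server in servers:
        try:
            partial_results.append(server(query))
        except Exception:
            failed += 1  # degraded quality: this server's hits are missing
    # Merge the already-sorted lists, highest score first, and keep k hits.
    merged = heapq.merge(*partial_results, key=lambda hit: -hit[0])
    top = [hit for _, hit in zip(range(k), merged)]
    return top, failed

# Hypothetical search servers, one of them unreachable.
def shard_one(query):
    return [(0.9, "doc-a"), (0.4, "doc-b")]

def shard_down(query):
    raise ConnectionError("search server unreachable")

def shard_three(query):
    return [(0.7, "doc-c"), (0.1, "doc-d")]

hits, failed = search_all([shard_one, shard_down, shard_three], "web search", k=3)
```

A real front-end would query the servers in parallel with timeouts; the sequential loop keeps the sketch short.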

Search configuration
Nutch syntax is limited - on purpose!
• Some queries are costly, e.g. leading wildcard, very long
• Some queries may need implicit (hidden) expansion
Query plugins
• From user query to Lucene/Solr query
  User query: web search
  Translated query: +(url:web^4.0 anchor:web^2.0 content:web title:web^1.5 host:web^2.0) +(url:search^4.0 anchor:search^2.0 content:search title:search^1.5 host:search^2.0) url:"web search"~10^4.0 anchor:"web search"~4^2.0 content:"web search"~10 title:"web search"~10^1.5 host:"web search"~10^2.0
Search server configuration
• Single search vs. distributed
• Using Nutch searcher, or Solr, or a mix
• Currently no global IDF calculation
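
The hidden expansion of the user query "web search" can be mimicked with a small string builder. This is a sketch only: the field list, boosts and phrase slops are read off the example on the slide (with its stripped punctuation restored as 4.0, 2.0, 1.5, ~10 and ~4, which is an assumption), and the function names are hypothetical rather than the query plugin API.

```python
# (field, boost) pairs in the order they appear in the slide's example.
FIELD_BOOSTS = [("url", 4.0), ("anchor", 2.0), ("content", 1.0),
                ("title", 1.5), ("host", 2.0)]
# Phrase slop per field, as read from the same example.
PHRASE_SLOP = {"url": 10, "anchor": 4, "content": 10, "title": 10, "host": 10}

def boost_suffix(boost):
    # A boost of 1.0 is the default and is left implicit.
    return "" if boost == 1.0 else f"^{boost}"

def translate(user_query):
    terms = user_query.split()
    clauses = []
    # Each term is required (+) and may match in any of the fields.
    for term in terms:
        fields = " ".join(f"{f}:{term}{boost_suffix(b)}" for f, b in FIELD_BOOSTS)
        clauses.append(f"+({fields})")
    # Optional sloppy-phrase clauses reward terms that occur close together.
    if len(terms) > 1:
        phrase = " ".join(terms)
        for f, b in FIELD_BOOSTS:
            clauses.append(f'{f}:"{phrase}"~{PHRASE_SLOP[f]}{boost_suffix(b)}')
    return " ".join(clauses)

expanded = translate("web search")
```

The point of the expansion is that a two-word query silently becomes a dozen field clauses, which is one reason the user-facing syntax is kept small.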

Deployment: single-server
The easiest but the most limited deployment
Centralized storage, centralized processing
Hadoop LocalFS and LocalJobTracker
Drop nutch.war in Tomcat/webapps and point to shards
[Diagram: search front-end, LocalFS storage, Fetcher, Indexer and DB mgmt on a single machine]

Deployment: distributed search
Local storage on search servers preferred (perf.)
[Diagram: search front-end dispatching to several search servers; LocalFS storage; Fetcher, Indexer and DB mgmt]

Deployment: map-reduce processing & distributed search
Fully distributed storage and processing
Local storage on search servers preferred (perf.)
[Diagram: search front-end and several search servers; DFS Name Node (Job Tracker) and DFS Data Nodes (Task Trackers) running Fetcher, Indexer and DB mgmt]

Nutch on a Hadoop cluster
Assumes an up & running Hadoop cluster … which is covered elsewhere
Build the nutch*.job jar and use it as usual:
bin/hadoop jar nutch*.job <className> <args>
Note: Nutch configuration is inside nutch*.job
  − When you change it you need to rebuild the job jar
Searching is not a map-reduce job - often on a separate group of machines
Nutch index
… and press Search

Conclusions
(This overview is just the tip of the iceberg)
Nutch:
Implements all core search engine components
Scales well
Extremely configurable and modular
It's a complete solution - and a toolkit

Future of Nutch
Avoid code duplication
Parsing → Tika
• Almost total overlap
• Still some missing functionality in Tika → contribute
Plugins → OSGi
• Home-grown plugin system has some deficiencies
• Initial port available
Indexing & Search → Solr, Zookeeper
• Distributed and replicated search is difficult
• Initial integration needs significant improvement
• Shard management - Katta?
Web-graph & page repository → HBase
• Combine CrawlDB, LinkDB and shard storage
• Avoid tedious shard management
• Initial port available

Future of Nutch (2)
What's left then?
• Crawling frontier management, discovery
• Re-crawl algorithms
• Spider trap handling
• Fetcher
• Ranking: enterprise-specific, user-feedback
• Duplicate detection, URL aliasing (mirror detection)
• Template detection and cleanup, pagelet-level crawling
• Spam & junk control
Share code with other crawler projects → crawler-commons
Vision: à la carte toolkit, scalable from 1 to 1000s of nodes

Summary
Q&A
Further information:
• http://lucene.apache.org/nutch

Nutch data
[Diagram: Nutch architecture. Components: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher. Data stores: CrawlDB, LinkDB, shards (segments). Plugins: URL filters & normalizers, parsing/indexing filters, scoring plugins]
LinkDB: for each target URL, keeps info on incoming links, i.e. the list of source URLs and their associated anchor text

Nutch data
[Diagram: Nutch architecture overview]
Shards ("segments") keep:
• Raw page content
• Parsed content + discovered metadata + outlinks
• Plain text for indexing and snippets
• Optional per-shard indexes

Nutch shards (a.k.a. "segments")
Unit of work (batch) - easier to process massive datasets
Convenience placeholder, using predefined directory names
Unit of deployment to the search infrastructure
May include per-shard Lucene indexes
Once completed they are basically unmodifiable
  − No in-place updates of content, or replacing of obsolete content
Periodically phased-out
[Diagram: three shard directories (2009043012345, 2009043012345, 200904301234), each containing crawl_generate, crawl_fetch, content, crawl_parse, parse_data and parse_text, one also with indexes; written by Generator, Fetcher, Parser and Indexer; used for snippets and the "cached" view]

URL life-cycle and shards
A time-lapse view of reality
Goals:
• Fresh index → sync with the real rate of changes
• Minimize re-fetching → don't fetch unmodified pages
Each page may need its own crawl schedule
Shard management:
• Page may be present in many shards, but only the most recent record is valid
• Inconvenient to update in-place, just mark as deleted
• Phase-out old shards and force re-fetch of the remaining pages
[Diagram: timeline of one URL across shards S1, S2, S3: injected / discovered → scheduled for fetching → fetched / up-to-date, repeated over time; real page changes vs. observed changes]
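
A per-page crawl schedule that follows the real rate of change can be sketched as a simple feedback rule: shorten the re-fetch interval when a page was found modified, lengthen it when it was not. The rates and bounds below are illustrative assumptions for this sketch, and the function is hypothetical, not the Nutch fetch schedule plugin.

```python
def next_fetch_interval(interval, modified, inc_rate=0.4, dec_rate=0.2,
                        min_interval=60.0, max_interval=365 * 24 * 3600.0):
    """Return the new re-fetch interval in seconds for one page."""
    if modified:
        # Page changed since the last fetch: come back sooner.
        interval *= (1.0 - dec_rate)
    else:
        # Page unchanged: back off and save bandwidth.
        interval *= (1.0 + inc_rate)
    # Keep the interval within sane bounds.
    return max(min_interval, min(max_interval, interval))

# A page that stays unchanged for three fetches, then changes once.
interval = 1000.0
history = []
for modified in (False, False, False, True):
    interval = next_fetch_interval(interval, modified)
    history.append(interval)
```

Detecting "unmodified" reliably is the hard part; it depends on comparing page signatures between fetches rather than raw bytes.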

Crawling frontier
No authoritative catalog of web pages
Search engines discover their view of web universe
  − Start from "seed list"
  − Follow (walk) some (useful? interesting?) outlinks
Many dangers of simply wandering around
  − Explosion or collapse of the frontier
  − Collecting unwanted content (spam, junk, offensive)
"I could use some guidance"

Controlling the crawling frontier
URL filter plugins
  − White-list, black-list, regex
  − May use external resources (DB-s, services ...)
URL normalizer plugins
  − Resolving relative path elements
  − "Equivalent" URLs
Additional controls using scoring plugins
  − Priority, metadata select/block
Crawler traps - difficult!
  − Domain / host / path-level stats
[Diagram: frontier expanding outward from the seed at iterations i = 1, i = 2, i = 3]
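
The two frontier controls above, normalization and regex filtering, can be sketched with the Python standard library. This is an illustration of the technique, not the Nutch plugins: the rule list, the `example.com` domain and both function names are assumptions, and the "first matching rule decides, no match means reject" policy is a choice made for this sketch.

```python
import posixpath
import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize(url):
    # Map "equivalent" URLs to one canonical form.
    parts = urlsplit(url)           # scheme and hostname come back lower-cased
    host = parts.hostname or ""
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(parts.scheme):
        netloc = f"{host}:{parts.port}"
    # Resolve "." and ".." path elements; an empty path becomes "/".
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    # Drop the fragment: it never reaches the server.
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))

# Hypothetical rules: "-" rejects, "+" accepts, first match wins.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$")),
    ("-", re.compile(r"[?*!@=]")),
    ("+", re.compile(r"^https?://([a-z0-9-]+\.)*example\.com/")),
]

def accept(url):
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: stay inside the frontier

canonical = normalize("HTTP://Example.COM:80/a/b/../c/./d.html#top")
```

Normalizing before filtering matters: otherwise the same page can slip into the CrawlDB under several spellings and be fetched more than once.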

Wide vs. focused crawling
Differences:
• Little technical difference in configuration
• Big difference in operations, maintenance and quality
Wide crawling:
  − (Almost) unlimited crawling frontier
  − High risk of spamming and junk content
  − "Politeness" a very important limiting factor
  − Bandwidth & DNS considerations
Focused (vertical or enterprise) crawling:
  − Limited crawling frontier
  − Bandwidth or politeness is often not an issue
  − Low risk of spamming and junk content

Vertical & enterprise search
Vertical search
• Range of selected "reference" sites
• Robust control of the crawling frontier
• Extensive content post-processing
• Business-driven decisions about ranking
Enterprise search
• Variety of data sources and data formats
• Well-defined and limited crawling frontier
• Integration with in-house data sources
• Little danger of spam
• PageRank-like scoring usually works poorly
Face to face with Nutch

Simple set-up and running
You already have Java 5+, right?
Get a 1.0 release or a nightly build (pretty stable)
Simple search setup uses Tomcat for search web app
Command-line bash script: bin/nutch
• Windows users: get Cygwin
• Early version of a web-based UI console
• Hadoop web-based monitoring
14
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch data
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes
15
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable
minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out
2009043012345
crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text
2009043012345
crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text
200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes
Generator
Fetcher
Parser
Indexersnippets
ldquocachedrdquo view
16
Nut
ch ndash
Apa
cheC
on U
S 0
9
URL life-cycle and shardsinjected
discoveredscheduledfor fetching
S1
fetched up-to-date
scheduledfor fetching
fetched up-to-date
A time-lapse view of realityGoals
bull Fresh index rarr sync with the real rate of changes
bull Minimize re-fetching rarr dont fetch unmodified pages
Each page may need its own crawl scheduleShard management
bull Page may be present in many shards but only the most recent record is valid
bull Inconvenient to update in-place just mark as deleted
bull Phase-out old shards and force re-fetch of the remaining pages
scheduledfor fetching
fetched up-to-date
time
S2
S3
real page changesobserved changes
17
Nut
ch ndash
Apa
cheC
on U
S 0
9
Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe
minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks
Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)
I could usesome guidance
18
Nut
ch ndash
Apa
cheC
on U
S 0
9
Controlling the crawling frontierURL filter plugins
minus White-list black-list regexminus May use external resources
(DB-s services )
URL normalizer pluginsminus Resolving relative path
elementsminus ldquoEquivalentrdquo URLs
Additional controls using scoring plugins
minus priority metadata selectblockCrawler traps ndash difficult
minus Domain host path-level stats
seed
i = 1
i = 2i = 3
19
Nut
ch ndash
Apa
cheC
on U
S 0
9
Wide vs focused crawlingDifferences
bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling
minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations
Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content
20
Nut
ch ndash
Apa
cheC
on U
S 0
9
Vertical amp enterprise searchVertical search
bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search
bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly
21
Nut
ch ndash
Apa
cheC
on U
S 0
9
Face to face with Nutch
22
Nut
ch ndash
Apa
cheC
on U
S 0
9
Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch
bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring
23
Nut
ch ndash
Apa
cheC
on U
S 0
9
Configuration filesEdit configuration in confnutch-sitexml
bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files
bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin
24
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch pluginsPlugin-based extensions for
bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)
25
Nut
ch ndash
Apa
cheC
on U
S 0
9
Main Nutch workflow
Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty
Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards
(repeat)
Command-line binnutchinject
generatefetchparseupdatedbupdatelinkdbindex solrindex
26
Nut
ch ndash
Apa
cheC
on U
S 0
9
Injecting new URL-s
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
27
Nut
ch ndash
Apa
cheC
on U
S 0
9
Generating fetchlists
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
28
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow generating fetchlistsWhat to fetch next
bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation
bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable
Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content
minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)
29
Nut
ch ndash
Apa
cheC
on U
S 0
9
Fetching content
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
30
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow fetching
Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind
Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step
(resource hungry and sometimes unstable)
31
Nut
ch ndash
Apa
cheC
on U
S 0
9
Content processing
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
32
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other
metadata) and parseText (plain text content)Nutch supports many popular file formats
33
Nut
ch ndash
Apa
cheC
on U
S 0
9
Link inversion
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
34
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow link inversionPages have outgoing links (outlinks)
hellip I know where Im pointing to
Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either
In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target
src 1
tgt 2
tgt 3
tgt 4tgt 5
tgt 1
tgt 1
src 2
src 3
src 4src 5
src 1
35
Nut
ch ndash
Apa
cheC
on U
S 0
9
Page importance - scoring
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins

Workflow: page scoring
Query-independent importance factor
• Affects search ranking
• Affects crawl prioritization
May include arbitrary decision of the operator
Currently two systems (plugins + tools)
OPIC scoring
• Doesn't require explicit steps - "online"
• Difficult to stabilize
PageRank scoring with loop detection
• Periodically run to update CrawlDb scores
• Computationally intensive, esp. loop detection
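
For orientation, here is a textbook power-iteration PageRank on a small in-memory graph. It is a sketch of the general technique, not the Nutch tool: it has no loop detection, and the function name and parameters are chosen for this example.

```python
def pagerank(outlinks, damping=0.85, iterations=50):
    """outlinks: {url: [target_url, ...]}. Returns {url: score}."""
    nodes = set(outlinks) | {t for ts in outlinks.values() for t in ts}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Pages without outlinks spread their rank evenly over all pages.
        dangling = sum(rank[u] for u in nodes if not outlinks.get(u))
        new = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for src, targets in outlinks.items():
            for tgt in targets:
                new[tgt] += damping * rank[src] / len(targets)
        rank = new
    return rank

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Each iteration touches every known link, which hints at why a periodic batch run over a large web graph is computationally intensive.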

Indexing
[Figure: Nutch architecture diagram]

Workflow: indexing
Indexing plugins
• Create full-text search index from segment data
• Apply adjustments to score per document
• May further post-process the parsed content (e.g. language identification) to facilitate advanced search
Indexers
• Lucene indexer: builds indexes to be served by Nutch search servers
• Solr indexer: submits documents to a Solr instance

De-duplication
[Figure: Nutch architecture diagram]

Workflow: de-duplication
The same page may be present in many shards
• Obsolete versions in older shards
• Mirrored pages or equivalent (a.com → www.a.com)
Many other pages are almost identical
• Template-related differences (banners, current date)
• Font / layout changes, minor re-wording
Hmm … what is a significant change?
• Tricky issue! Hint: what is the page content?
Near-duplicate detection and removal
• Nutch uses approximate page signatures (fingerprints)
• Duplicates are only marked as deleted
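
One way to build an approximate signature is to hash a coarse term-frequency profile instead of the raw bytes, so that small edits do not change the fingerprint. The sketch below illustrates that idea under assumed parameters (`min_len`, `quant`). It is not a copy of the signature plugins shipped with Nutch.

```python
import hashlib
import re
from collections import Counter

def approx_signature(text, min_len=3, quant=2):
    """Fingerprint that ignores rare tokens and small frequency changes."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_len]
    counts = Counter(tokens)
    # Round each frequency down to a multiple of quant; tokens that
    # occur fewer than quant times drop out of the profile entirely.
    profile = sorted((tok, (n // quant) * quant)
                     for tok, n in counts.items() if n >= quant)
    return hashlib.md5(repr(profile).encode("utf-8")).hexdigest()

page_a = "nutch crawl nutch crawl index index banner"
page_b = "nutch crawl nutch crawl index index footer 2009"
page_c = "solr solr search search"
```

Here `page_a` and `page_b` differ only in tokens that occur once, so they get the same signature, while `page_c` gets a different one.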

Cycles may overlap
[Figure: Nutch architecture diagram]

Nutch distributed search
Multiple search servers
Search front-end dispatches queries and collects search results
• Multiple access methods (API, OpenSearch XML, JSON)
Page servers build snippets
Fault-tolerant, with degraded quality
[Figure: search front-end(s) (web app, API, XML, JSON, NutchBean) sending queries to search servers (Nutch or Solr) that hold the indexed shards]

Search configuration
Nutch syntax is limited - on purpose
• Some queries are costly, e.g. leading wildcard, very long
• Some queries may need implicit (hidden) expansion
Query plugins
• From user query to Lucene/Solr query
User query: web search
+(url:web^4.0 anchor:web^2.0 content:web title:web^1.5 host:web^2.0) +(url:search^4.0 anchor:search^2.0 content:search title:search^1.5 host:search^2.0) url:"web search"~10^4.0 anchor:"web search"~4^2.0 content:"web search"~10 title:"web search"~10^1.5 host:"web search"~10^2.0
Search server configuration
• Single search vs. distributed
• Using Nutch searcher, or Solr, or a mix
• Currently no global IDF calculation
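
The hidden expansion shown in the example can be reproduced with a short sketch. The boosts and slop values are read off the slide's example. `expand_query`, `FIELD_BOOSTS` and `PHRASE_SLOP` are hypothetical names, and no escaping of special query characters is attempted.

```python
# Field boosts and phrase slop as they appear in the example above
FIELD_BOOSTS = {"url": 4.0, "anchor": 2.0, "content": 1.0,
                "title": 1.5, "host": 2.0}
PHRASE_SLOP = {"anchor": 4}  # every other field uses a slop of 10

def _boost(b):
    return "" if b == 1.0 else f"^{b}"

def expand_query(user_query):
    """Expand a plain user query into a multi-field Lucene-style query."""
    terms = user_query.split()
    clauses = []
    for term in terms:  # every term is required in at least one field
        per_field = " ".join(f"{f}:{term}{_boost(b)}"
                             for f, b in FIELD_BOOSTS.items())
        clauses.append(f"+({per_field})")
    if len(terms) > 1:  # optional sloppy-phrase clauses reward proximity
        phrase = " ".join(terms)
        for f, b in FIELD_BOOSTS.items():
            slop = PHRASE_SLOP.get(f, 10)
            clauses.append(f'{f}:"{phrase}"~{slop}{_boost(b)}')
    return " ".join(clauses)

expanded = expand_query("web search")
```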

Deployment: single-server
The easiest but the most limited deployment
Centralized storage, centralized processing
Hadoop LocalFS and LocalJobTracker
Drop nutch.war in Tomcat/webapps and point to shards
[Figure: one machine running the search front-end, LocalFS storage, Fetcher, Indexer and DB mgmt]

Deployment: distributed search
Local storage on search servers preferred (perf.)
[Figure: search front-end in front of several search servers; LocalFS storage, Fetcher, Indexer and DB mgmt]

Deployment: map-reduce processing & distrib. search
Fully distributed storage and processing
Local storage on search servers preferred (perf.)
[Figure: search front-end and several search servers; DFS Name Node (Job Tracker) running Fetcher, Indexer and DB mgmt; several DFS Data Nodes, each with a Task Tracker]

Nutch on a Hadoop cluster
Assumes an up & running Hadoop cluster … which is covered elsewhere
Build nutch.job jar and use it as usual:
bin/hadoop jar nutch.job <className> <args>
Note: Nutch configuration is inside nutch.job
- When you change it you need to rebuild the job jar
Searching is not a map-reduce job - often on a separate group of machines
Nutch index

… and press Search

Conclusions
(This overview is a tip of the iceberg)
Nutch
Implements all core search engine components
Scales well
Extremely configurable and modular
It's a complete solution - and a toolkit

Future of Nutch
Avoid code duplication
Parsing → Tika
• Almost total overlap
• Still some missing functionality in Tika → contribute
Plugins → OSGI
• Home-grown plugin system has some deficiencies
• Initial port available
Indexing & Search → Solr, Zookeeper
• Distributed and replicated search is difficult
• Initial integration needs significant improvement
• Shard management - Katta?
Web-graph & page repository → HBase
• Combine CrawlDB, LinkDB and shard storage
• Avoid tedious shard management
• Initial port available

Future of Nutch (2)
What's left then?
• Crawling frontier management, discovery
• Re-crawl algorithms
• Spider trap handling
• Fetcher
• Ranking: enterprise-specific, user-feedback
• Duplicate detection, URL aliasing (mirror detection)
• Template detection and cleanup, pagelet-level crawling
• Spam & junk control
Share code with other crawler projects → crawler-commons
Vision: à la carte toolkit, scalable from 1-1000s nodes

Summary
Q&A
Further information
• http://lucene.apache.org/nutch

Nutch shards (a.k.a. "segments")
Unit of work (batch) - easier to process massive datasets
Convenience placeholder, using predefined directory names
Unit of deployment to the search infrastructure
May include per-shard Lucene indexes
Once completed they are basically unmodifiable
- No in-place updates of content, or replacing of obsolete content
Periodically phased-out
[Figure: timestamp-named shard directories, each containing crawl_generate, crawl_fetch, content, crawl_parse, parse_data, parse_text, and one also containing indexes. Labels: Generator, Fetcher, Parser, Indexer, snippets, "cached" view]

URL life-cycle and shards
A time-lapse view of reality
Goals
• Fresh index → sync with the real rate of changes
• Minimize re-fetching → don't fetch unmodified pages
Each page may need its own crawl schedule
Shard management
• Page may be present in many shards, but only the most recent record is valid
• Inconvenient to update in-place, just mark as deleted
• Phase-out old shards and force re-fetch of the remaining pages
[Figure: timeline of one URL across shards S1, S2, S3: injected / discovered, scheduled for fetching, fetched / up-to-date, then scheduled and fetched again; real page changes vs. observed changes]
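
The rule "only the most recent record is valid" can be sketched as a merge over shards in chronological order. The data layout and the `latest_records` name are assumptions for illustration. Shard names are taken to be timestamps, so sorting them as strings gives time order.

```python
def latest_records(shards):
    """shards: {shard_name: {url: record}}.
    Returns {url: (shard_name, record)} with the newest record per URL."""
    valid = {}
    for name in sorted(shards):  # oldest shard first
        for url, record in shards[name].items():
            valid[url] = (name, record)  # later shards overwrite earlier ones
    return valid

current = latest_records({
    "20090430": {"http://a.com/": "v1", "http://b.org/": "v1"},
    "20090501": {"http://a.com/": "v2"},
})
```

A shard can be phased out safely once none of its records appears in the result.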

Crawling frontier
No authoritative catalog of web pages
Search engines discover their view of web universe
- Start from "seed list"
- Follow (walk) some (useful? interesting?) outlinks
Many dangers of simply wandering around
- explosion or collapse of the frontier
- collecting unwanted content (spam, junk, offensive)
[Figure caption: "I could use some guidance"]

Controlling the crawling frontier
URL filter plugins
- White-list, black-list, regex
- May use external resources (DB-s, services ...)
URL normalizer plugins
- Resolving relative path elements
- "Equivalent" URLs
Additional controls using scoring plugins
- priority, metadata select/block
Crawler traps - difficult!
- Domain / host / path-level stats
[Figure: crawling frontier growing outward from the seed over iterations i = 1, i = 2, i = 3]
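
A sketch of how a normalizer and a regex filter might work together, using only the Python standard library. The rule list, the `example.org` domain and both function names are made up for this example. The first matching rule decides, which is a simplification.

```python
import posixpath
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical rules: the first matching rule wins, "+" accepts, "-" rejects.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$", re.I)),
    ("+", re.compile(r"^https?://([a-z0-9-]+\.)*example\.org/")),
]

def normalize(url):
    """Lower-case scheme and host, resolve ./ and ../, drop the fragment."""
    parts = urlsplit(url)
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and path != "/":
        path += "/"  # normpath strips a trailing slash; put it back
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

def accept(url):
    """Return True if the URL may enter the crawling frontier."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: stay inside the known frontier

clean = normalize("HTTP://WWW.Example.org/a/b/../c/./d.html#top")
```

Normalizing before filtering means that "equivalent" spellings of a URL are judged, and later stored, as one entry.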

Wide vs. focused crawling
Differences
• Little technical difference in configuration
• Big difference in operations, maintenance and quality
Wide crawling
- (Almost) Unlimited crawling frontier
- High risk of spamming and junk content
- "Politeness" a very important limiting factor
- Bandwidth & DNS considerations
Focused (vertical or enterprise) crawling
- Limited crawling frontier
- Bandwidth or politeness is often not an issue
- Low risk of spamming and junk content

Vertical & enterprise search
Vertical search
• Range of selected "reference" sites
• Robust control of the crawling frontier
• Extensive content post-processing
• Business-driven decisions about ranking
Enterprise search
• Variety of data sources and data formats
• Well-defined and limited crawling frontier
• Integration with in-house data sources
• Little danger of spam
• PageRank-like scoring usually works poorly
Face to face with Nutch

Simple set-up and running
You already have Java 5+, right?
Get a 1.0 release or a nightly build (pretty stable)
Simple search setup uses Tomcat for search web app
Command-line bash script: bin/nutch
• Windows users: get Cygwin
• Early version of a web-based UI console
• Hadoop web-based monitoring

Configuration files
Edit configuration in conf/nutch-site.xml
• Check nutch-default.xml for defaults and docs
• You MUST at least fill the name of your agent
Active plugins configuration
Per-plugin properties
External configuration files
• regex-urlfilter.txt
• regex-normalize.xml
• parse-plugins.xml: mapping of MIME type to plugin

Nutch plugins
Plugin-based extensions for:
• Crawl scheduling
• URL filtering and normalization
• Protocol for getting the content
• Content parsing
• Text analysis (tokenization)
• Page signature (to detect near-duplicates)
• Indexing filters (index fields & metadata)
• Snippet generation and highlighting
• Scoring and ranking
• Query translation and expansion (user → Lucene)

Main Nutch workflow
Inject: initial creation of CrawlDB
• Insert seed URLs
• Initial LinkDB is empty
Generate new shard's fetchlist
Fetch raw content
Parse content (discovers outlinks)
Update CrawlDB from shards
Update LinkDB from shards
Index shards
(repeat)
Command-line: bin/nutch
inject, generate, fetch, parse, updatedb, updatelinkdb, index / solrindex
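
The cycle above can be mimicked with a toy in-memory loop. This is only a sketch of the control flow: the `web` dictionary stands in for the network, all names are invented for the example, and indexing is left out.

```python
def crawl(seeds, web, rounds=3, top_n=2):
    """web: {url: [outlink, ...]} plays the role of the real network."""
    crawldb = {url: "unfetched" for url in seeds}  # inject
    linkdb = {}                                    # initially empty
    for _ in range(rounds):
        # generate: pick up to top_n unfetched URLs for a new shard
        fetchlist = sorted(u for u, s in crawldb.items()
                           if s == "unfetched")[:top_n]
        if not fetchlist:
            break
        # fetch + parse: the shard holds the outlinks found on each page
        shard = {u: web.get(u, []) for u in fetchlist}
        for url, outlinks in shard.items():
            crawldb[url] = "fetched"                    # update CrawlDB
            for tgt in outlinks:
                crawldb.setdefault(tgt, "unfetched")    # newly discovered
                linkdb.setdefault(tgt, set()).add(url)  # update LinkDB
    return crawldb, linkdb

crawldb, linkdb = crawl(["1"], {"1": ["2", "3"], "2": ["3"], "3": []})
```

Each round produces one shard, and URLs discovered in one round become fetch candidates in the next.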

Injecting new URL-s
[Figure: Nutch architecture diagram]

Generating fetchlists
[Figure: Nutch architecture diagram]

Workflow: generating fetchlists
What to fetch next?
• Breadth-first - important due to "politeness" limits
• Expired (longer than fetchTime + fetchInterval)
• Highly ranking (PageRank)
• Newly added
Fetchlist generation
• "topN" - select best candidates
- Priority based on many factors, pluggable
Adaptive fetch schedule
• Detect rate of changes & time of change
• Detect unmodified content
- Hmm, and how to recognize this? → approximate page signatures (near-duplicate detection)
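
A sketch of "topN" selection over expired entries, plus a simple adaptive interval rule. The record layout, the rate constants and the bounds are assumptions for illustration, not Nutch defaults.

```python
import heapq

def generate_fetchlist(crawldb, now, top_n):
    """crawldb: {url: {"fetch_time": t, "interval": s, "score": x}}.
    Select the top_n best-scoring URLs whose fetch interval has expired."""
    due = [(rec["score"], url) for url, rec in crawldb.items()
           if rec["fetch_time"] + rec["interval"] <= now]
    return [url for _, url in heapq.nlargest(top_n, due)]

def adapt_interval(interval, changed, inc=1.5, dec=0.5,
                   lo=3600, hi=30 * 86400):
    """Shorten the interval when the page changed, lengthen it otherwise,
    keeping the result between lo and hi seconds."""
    new = interval * (dec if changed else inc)
    return max(lo, min(hi, new))

db = {
    "a": {"fetch_time": 0, "interval": 100, "score": 1.0},
    "b": {"fetch_time": 0, "interval": 100, "score": 3.0},
    "c": {"fetch_time": 0, "interval": 1000, "score": 9.0},
}
fetchlist = generate_fetchlist(db, now=200, top_n=1)
```

Page "c" has the best score but is not yet due, so "b" is selected. Whether a page "changed" would be decided by comparing page signatures between fetches.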

Fetching content
[Figure: Nutch architecture diagram]
Crawling frontier
No authoritative catalog of web pages
Search engines discover their view of the web universe
  − Start from a “seed list”
  − Follow (walk) some (useful? interesting?) outlinks
Many dangers of simply wandering around
  − explosion or collapse of the frontier
  − collecting unwanted content (spam, junk, offensive)
[Figure: a crawler saying “I could use some guidance”]
Controlling the crawling frontier
URL filter plugins
  − White-list, black-list, regex
  − May use external resources (DB-s, services ...)
URL normalizer plugins
  − Resolving relative path elements
  − “Equivalent” URLs
Additional controls using scoring plugins
  − priority, metadata select/block
Crawler traps – difficult!
  − Domain / host / path-level stats
[Figure: crawling frontier growing outward from the seed over iterations i = 1, i = 2, i = 3]
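The filter and normalizer roles above can be sketched in a few lines of Python. This is only an illustration of the idea, not Nutch's plugin code: the rule list, the `example.org` host and the function names `normalize` and `accept` are assumptions made for the example.

```python
import posixpath
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical rule list: the first matching rule wins; "+" accepts, "-" rejects.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$", re.IGNORECASE)),
    ("-", re.compile(r"[?&]sessionid=", re.IGNORECASE)),
    ("+", re.compile(r"^https?://([a-z0-9-]+\.)*example\.org/")),
]


def normalize(url):
    """Lower-case scheme and host, resolve '.' and '..', drop the fragment."""
    parts = urlsplit(url)
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"  # normpath strips a trailing slash; put it back
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def accept(url):
    """Return True if the URL stays inside the crawling frontier."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # nothing matched: reject by default (white-list behaviour)


url = normalize("HTTP://Example.ORG/a/b/../c/./d.html#top")
print(url, accept(url))
```

Normalizing before filtering means that “equivalent” URLs collapse to one form before the white-list and black-list rules are consulted.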
Wide vs. focused crawling
Differences:
  • Little technical difference in configuration
  • Big difference in operations, maintenance and quality
Wide crawling:
  − (Almost) unlimited crawling frontier
  − High risk of spamming and junk content
  − “Politeness” a very important limiting factor
  − Bandwidth & DNS considerations
Focused (vertical or enterprise) crawling:
  − Limited crawling frontier
  − Bandwidth or politeness is often not an issue
  − Low risk of spamming and junk content
Vertical & enterprise search
Vertical search
  • Range of selected “reference” sites
  • Robust control of the crawling frontier
  • Extensive content post-processing
  • Business-driven decisions about ranking
Enterprise search
  • Variety of data sources and data formats
  • Well-defined and limited crawling frontier
  • Integration with in-house data sources
  • Little danger of spam
  • PageRank-like scoring usually works poorly
Face to face with Nutch
Simple set-up and running
You already have Java 5+, right?
Get a 1.0 release or a nightly build (pretty stable)
Simple search setup uses Tomcat for search web app
Command-line bash script: bin/nutch
  • Windows users: get Cygwin
  • Early version of a web-based UI console
  • Hadoop web-based monitoring
Configuration files
Edit configuration in conf/nutch-site.xml
  • Check nutch-default.xml for defaults and docs
  • You MUST at least fill in the name of your agent
Active plugins configuration
Per-plugin properties
External configuration files
  • regex-urlfilter.txt
  • regex-normalize.xml
  • parse-plugins.xml: mapping of MIME type to plugin
Nutch plugins
Plugin-based extensions for:
  • Crawl scheduling
  • URL filtering and normalization
  • Protocol for getting the content
  • Content parsing
  • Text analysis (tokenization)
  • Page signature (to detect near-duplicates)
  • Indexing filters (index fields & metadata)
  • Snippet generation and highlighting
  • Scoring and ranking
  • Query translation and expansion (user → Lucene)
Main Nutch workflow
Inject: initial creation of CrawlDB
  • Insert seed URLs
  • Initial LinkDB is empty
Generate new shard's fetchlist
Fetch raw content
Parse content (discovers outlinks)
Update CrawlDB from shards
Update LinkDB from shards
Index shards
(repeat)
Command-line: bin/nutch
  inject
  generate
  fetch
  parse
  updatedb
  updatelinkdb
  index / solrindex
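To make the cycle concrete, here is a toy in-memory simulation of the steps listed above. It is a sketch under stated assumptions: the `WEB` dictionary stands in for the real web, plain dictionaries stand in for CrawlDB and LinkDB, and the function names merely echo the workflow steps; none of this is Nutch's actual API.

```python
# Hypothetical miniature web: url -> outlinks discovered when the page is parsed.
WEB = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a", "d"],
    "d": [],
}


def inject(crawldb, seeds):
    """Create CrawlDB entries for the seed URLs."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")


def generate(crawldb, top_n):
    """Pick up to top_n unfetched URLs as the next fetchlist."""
    return sorted(u for u, status in crawldb.items() if status == "unfetched")[:top_n]


def fetch_and_parse(fetchlist):
    """Build a shard (segment): fetched url -> outlinks found by the parser."""
    return {url: WEB.get(url, []) for url in fetchlist}


def update_crawldb(crawldb, segment):
    """Mark fetched pages and add newly discovered URLs."""
    for url, outlinks in segment.items():
        crawldb[url] = "fetched"
        for target in outlinks:
            crawldb.setdefault(target, "unfetched")


def update_linkdb(linkdb, segment):
    """Record, for every target, which pages link to it."""
    for url, outlinks in segment.items():
        for target in outlinks:
            linkdb.setdefault(target, set()).add(url)


def crawl(seeds, rounds, top_n):
    crawldb, linkdb = {}, {}
    inject(crawldb, seeds)
    for _ in range(rounds):  # the "(repeat)" step
        fetchlist = generate(crawldb, top_n)
        if not fetchlist:
            break
        segment = fetch_and_parse(fetchlist)
        update_crawldb(crawldb, segment)
        update_linkdb(linkdb, segment)
    return crawldb, linkdb


crawldb, linkdb = crawl(["a"], rounds=3, top_n=2)
print(crawldb)
print(linkdb)
```

Indexing is left out of the sketch; it would consume the segments together with the CrawlDB and LinkDB built here.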
Injecting new URL-s
[Diagram: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer and Searcher connected through CrawlDB, LinkDB and shards (segments); URL filters & normalizers, parsing/indexing filters and scoring plugins apply across the workflow]
Generating fetchlists
[Diagram: Nutch workflow components and data stores]
Workflow: generating fetchlists
What to fetch next?
  • Breadth-first – important due to “politeness” limits
  • Expired (longer than fetchTime + fetchInterval)
  • Highly ranking (PageRank)
  • Newly added
Fetchlist generation:
  • “topN” - select best candidates
    − Priority based on many factors, pluggable
Adaptive fetch schedule
  • Detect rate of changes & time of change
  • Detect unmodified content
    − Hmm, and how to recognize this? → approximate page signatures (near-duplicate detection)
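The selection rules above can be sketched as a small function. It is an illustration only: the tuple layout of the CrawlDB entries, the per-host cap `max_per_host` and the function name are assumptions for this example, and real prioritization is pluggable, as the slide notes.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def generate_fetchlist(crawldb, now, top_n, max_per_host):
    """Select the top_n best-scoring URLs that are due for (re)fetching.

    crawldb maps url -> (fetch_time, fetch_interval, score); a newly added
    page can be represented with fetch_time 0 so that it is always due.
    """
    due = [
        (score, url)
        for url, (fetch_time, interval, score) in crawldb.items()
        if now >= fetch_time + interval  # expired entries only
    ]
    per_host = defaultdict(int)
    selected = []
    # Highest score first; ties broken by URL to keep the result deterministic.
    for score, url in sorted(due, key=lambda item: (-item[0], item[1])):
        host = urlsplit(url).hostname
        if per_host[host] >= max_per_host:
            continue  # spread the list over hosts, for politeness
        per_host[host] += 1
        selected.append(url)
        if len(selected) == top_n:
            break
    return selected


crawldb = {
    "http://a.example/1": (0, 10, 0.9),
    "http://a.example/2": (0, 10, 0.8),
    "http://a.example/3": (0, 10, 0.7),
    "http://b.example/1": (0, 10, 0.5),
    "http://b.example/2": (95, 10, 1.0),  # fetched recently, not due yet
}
print(generate_fetchlist(crawldb, now=100, top_n=3, max_per_host=2))
```

Capping the number of URLs per host keeps a single large site from filling the whole fetchlist, which would otherwise serialize the fetch behind that host's politeness delay.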
Fetching content
[Diagram: Nutch workflow components and data stores]
Workflow: fetching
Multi-protocol: HTTP(s), FTP, file, etc.
Coordinates multiple threads accessing the same host
  • “politeness” issues vs. crawling efficiency
  • Host: an IP or DNS name?
  • Redirections: should follow immediately? What kind?
Other netiquette issues:
  • robots.txt: disallowed paths, Crawl-Delay
Preparation for parsing:
  • Content type detection issues
  • Parsing usually executed as a separate step (resource hungry and sometimes unstable)
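The politeness and robots.txt points can be illustrated with Python's standard `urllib.robotparser`. This is a sketch, not the Nutch fetcher: the robots.txt text, the agent name `toy-crawler` and the `schedule` helper are made up for the example, and requests are only planned, not sent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for one host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def robots_rules(robots_txt):
    """Parse robots.txt text into a rule object (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def schedule(urls_by_host, rules_by_host, agent, default_delay):
    """Plan (start_time, url) pairs so that each host is hit one request at a time."""
    plan = []
    for host, urls in urls_by_host.items():
        rules = rules_by_host[host]
        delay = rules.crawl_delay(agent) or default_delay
        start = 0.0
        for url in urls:
            if not rules.can_fetch(agent, url):
                continue  # disallowed path: never fetched
            plan.append((start, url))
            start += delay  # wait between requests to the same host
    return sorted(plan)


rules = robots_rules(ROBOTS_TXT)
plan = schedule(
    {"example.org": [
        "http://example.org/a",
        "http://example.org/private/x",
        "http://example.org/b",
    ]},
    {"example.org": rules},
    agent="toy-crawler",
    default_delay=1.0,
)
print(plan)
```

Different hosts get independent timelines, which is where the tension between politeness and crawling efficiency shows up: throughput comes from fetching many hosts in parallel, not from hitting one host faster.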
Content processing
[Diagram: Nutch workflow components and data stores]
Workflow: content processing
Protocol plugins retrieve content as plain bytes + protocol-level metadata (e.g. HTTP headers)
Parse plugins
  • Content is parsed by MIME-type specific parsers
  • Content is parsed into parseData (title, outlinks, other metadata) and parseText (plain text content)
Nutch supports many popular file formats
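A minimal sketch of the parseData / parseText split, using only Python's standard `html.parser`. The class and function names are illustrative assumptions and only `text/html` is handled; Nutch's own parse plugins are separate Java components selected by MIME type.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects the title, outlinks with anchor text, and the plain text."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.outlinks = []  # (absolute target URL, anchor text)
        self.text = []
        self._in_title = False
        self._href = None
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._anchor = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href:
            self.outlinks.append((self._href, " ".join(self._anchor)))
            self._href = None

    def handle_data(self, data):
        chunk = data.strip()
        if not chunk:
            return
        if self._in_title:
            self.title = chunk
            return
        if self._href:
            self._anchor.append(chunk)
        self.text.append(chunk)


def parse(content, content_type, base_url):
    """Turn raw bytes into (parse_data, parse_text) for one page."""
    if content_type != "text/html":
        raise ValueError("no parser registered for " + content_type)
    parser = PageParser(base_url)
    parser.feed(content.decode("utf-8", errors="replace"))
    parser.close()
    parse_data = {"title": parser.title, "outlinks": parser.outlinks}
    parse_text = " ".join(parser.text)
    return parse_data, parse_text


raw = b'<html><head><title>Home</title></head><body><p>Hello <a href="/about">about us</a></p></body></html>'
print(parse(raw, "text/html", "http://example.org/"))
```

The outlinks feed the CrawlDB and LinkDB updates, while the plain text is what the indexer and the signature calculation later work on.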
Link inversion
[Diagram: Nutch workflow components and data stores]
Workflow: link inversion
Pages have outgoing links (outlinks)
  … I know where I'm pointing to
Question: who points to me?
  … I don't know, there is no catalog of pages
  … NOBODY knows for sure either!
In-degree indicates importance of the page
Anchor text provides important semantic info
Partial answer: invert the outlinks that I know about, and group by target
[Figure: outlinks listed per source page (src → tgt), shown next to the same links inverted and grouped per target page]
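The “invert and group by target” step is small enough to show directly. The sketch below keeps everything in memory; the input layout (source → list of target/anchor pairs) is an assumption for the example, whereas Nutch runs this as a distributed job.

```python
from collections import defaultdict


def invert_links(outlinks):
    """Turn src -> [(target, anchor)] into target -> [(src, anchor)]."""
    inlinks = defaultdict(list)
    for src, links in outlinks.items():
        for target, anchor in links:
            inlinks[target].append((src, anchor))
    return dict(inlinks)


outlinks = {
    "page1": [("page2", "search engines"), ("page3", "crawlers")],
    "page3": [("page2", "web search")],
}
inlinks = invert_links(outlinks)
for target, sources in sorted(inlinks.items()):
    # In-degree and the collected anchor texts are the two signals of interest.
    print(target, len(sources), [anchor for _, anchor in sources])
```

The result is only a partial answer, as the slide says: it contains just the inlinks from pages that have already been fetched and parsed.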
Page importance - scoring
[Diagram: Nutch workflow components and data stores]
Workflow: page scoring
Query-independent importance factor
  • Affects search ranking
  • Affects crawl prioritization
May include arbitrary decision of the operator
Currently two systems (plugins + tools)
OPIC scoring
  • Doesn't require explicit steps - “online”
  • Difficult to stabilize
PageRank scoring with loop detection
  • Periodically run to update CrawlDb scores
  • Computationally intensive, esp. loop detection
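As a rough illustration of a query-independent, link-based score, here is a textbook PageRank power iteration. It is a simplified stand-in, not the Nutch scoring tool: there is no loop detection, the damping factor and iteration count are conventional defaults chosen for the example, and the graph is a made-up toy.

```python
def pagerank(outlinks, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over src -> [targets]."""
    pages = set(outlinks) | {t for targets in outlinks.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not outlinks.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for src, targets in outlinks.items():
            for target in targets:
                new_rank[target] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, score in sorted(pagerank(graph).items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Every iteration touches the whole graph, which hints at why the slide calls the periodic computation intensive; OPIC, by contrast, updates scores incrementally as pages are fetched.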
Indexing
[Diagram: Nutch workflow components and data stores]
Workflow: indexing
Indexing plugins
  • Create full-text search index from segment data
  • Apply adjustments to score per document
  • May further post-process the parsed content (e.g. language identification) to facilitate advanced search
Indexers
  • Lucene indexer – builds indexes to be served by Nutch search servers
  • Solr indexer – submits documents to a Solr instance
De-duplication
[Diagram: Nutch workflow components and data stores]
Workflow: de-duplication
The same page may be present in many shards
  • Obsolete versions in older shards
  • Mirrored pages or equivalent (a.com → www.a.com)
Many other pages are almost identical
  • Template-related differences (banners, current date)
  • Font / layout changes, minor re-wording
Hmm … what is a significant change?
  • Tricky issue! Hint: what is the page content?
Near-duplicate detection and removal
  • Nutch uses approximate page signatures (fingerprints)
  • Duplicates are only marked as deleted
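One way to picture an approximate page signature is a hash over a coarse word-frequency profile, so that small edits leave the hash unchanged. The sketch below is an assumption-laden illustration, not Nutch's signature implementation: the thresholds `min_count` and `quant`, and the policy of keeping the most recently fetched copy, are choices made for the example.

```python
import hashlib
import re
from collections import Counter


def signature(text, min_count=2, quant=2):
    """Hash a coarse word-frequency profile of the page text."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    # Rare tokens (dates, banner words) are dropped; counts are rounded down.
    profile = sorted(
        (token, (n // quant) * quant) for token, n in counts.items() if n >= min_count
    )
    return hashlib.md5(repr(profile).encode("utf-8")).hexdigest()


def mark_duplicates(docs):
    """docs: (doc_id, fetch_time, text). Returns the ids marked as deleted."""
    newest = {}  # signature -> (doc_id, fetch_time) of the copy kept so far
    deleted = set()
    for doc_id, fetch_time, text in docs:
        sig = signature(text)
        kept = newest.get(sig)
        if kept is None or fetch_time > kept[1]:
            if kept is not None:
                deleted.add(kept[0])  # older copy loses
            newest[sig] = (doc_id, fetch_time)
        else:
            deleted.add(doc_id)
    return deleted


old = "Nutch crawler: Nutch search engine, search with Nutch. Updated 2009-11-02"
new = "Nutch crawler: Nutch search engine, search with Nutch. Updated 2009-11-03 banner"
other = "Lucene index, Lucene index"
print(mark_duplicates([("old", 1, old), ("new", 2, new), ("other", 1, other)]))
```

Documents are only flagged, mirroring the slide's point that duplicates are marked as deleted rather than physically removed.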
Cycles may overlap
[Diagram: Nutch workflow components and data stores]
Nutch distributed search
Multiple search servers
Search front-end dispatches queries and collects search results
  • multiple access methods (API, OpenSearch XML, JSON)
Page servers build snippets
Fault-tolerant (with degraded quality)
[Diagram: search front-end(s) (web app, API, XML, JSON on top of NutchBean) querying search servers (Nutch or Solr), each holding indexed shards]
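The dispatch-and-collect role of the front-end, including the “fault-tolerant with degraded quality” behaviour, can be sketched as follows. The server callables and their return format are hypothetical stand-ins for remote search servers, and queries are sent one after another here for simplicity.

```python
import heapq


def search_all(servers, query, k):
    """Send the query to every search server and merge the best k hits.

    servers: callables taking (query, k) and returning [(score, doc_id), ...];
    a callable may raise to simulate an unreachable server.
    """
    hits, failed = [], 0
    for server in servers:
        try:
            hits.extend(server(query, k))
        except Exception:
            failed += 1  # degraded quality: this shard's hits are missing
    return heapq.nlargest(k, hits), failed


def shard_one(query, k):
    return [(0.9, "doc-1"), (0.4, "doc-2")]


def shard_two(query, k):
    raise ConnectionError("search server down")


def shard_three(query, k):
    return [(0.7, "doc-3")]


print(search_all([shard_one, shard_two, shard_three], "web search", k=2))
```

Merging raw scores from different servers assumes they are comparable; with no global IDF calculation, that is only approximately true.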
Search configuration
Nutch syntax is limited – on purpose!
  • Some queries are costly, e.g. leading wildcard, very long
  • Some queries may need implicit (hidden) expansion
Query plugins
  • From user query to Lucene/Solr query
    User query: web search
    +(url:web^4.0 anchor:web^2.0 content:web title:web^1.5 host:web^2.0)
    +(url:search^4.0 anchor:search^2.0 content:search title:search^1.5 host:search^2.0)
    url:"web search"~10^4.0 anchor:"web search"~4^2.0 content:"web search"~10 title:"web search"~10^1.5 host:"web search"~10^2.0
Search server configuration
  • Single search vs. distributed
  • Using Nutch searcher, or Solr, or a mix
  • Currently no global IDF calculation
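The query translation shown on this slide follows a regular pattern that a short function can reproduce. The sketch assumes the boosts read 4.0, 2.0 and 1.5 and the phrase slops 10 and 4, as far as they can be recovered from the slide's example; the `FIELDS` table and the `expand` function are illustrative, not the query plugins themselves.

```python
# (field, boost, phrase slop), read off the slide's "web search" example.
FIELDS = [
    ("url", 4.0, 10),
    ("anchor", 2.0, 4),
    ("content", 1.0, 10),
    ("title", 1.5, 10),
    ("host", 2.0, 10),
]


def boost_suffix(boost):
    """Lucene-style boost suffix; a boost of 1.0 is left implicit."""
    return "" if boost == 1.0 else "^%s" % boost


def expand(user_query):
    """Expand a plain user query into a fielded, boosted query string."""
    terms = user_query.lower().split()
    clauses = []
    for term in terms:
        # Every term is required, and may match in any of the fields.
        fields = " ".join(
            "%s:%s%s" % (field, term, boost_suffix(boost)) for field, boost, _ in FIELDS
        )
        clauses.append("+(%s)" % fields)
    if len(terms) > 1:
        # Optional sloppy-phrase clauses reward terms that appear close together.
        phrase = " ".join(terms)
        for field, boost, slop in FIELDS:
            clauses.append('%s:"%s"~%d%s' % (field, phrase, slop, boost_suffix(boost)))
    return " ".join(clauses)


print(expand("web search"))
```

Hiding this expansion behind a deliberately limited user syntax is what lets the operator tune field boosts without exposing costly query forms to end users.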
Deployment: single-server
The easiest but the most limited deployment
Centralized storage, centralized processing
Hadoop LocalFS and LocalJobTracker
Drop nutch*.war in Tomcat/webapps and point to shards
[Diagram: search front-end, LocalFS storage, Fetcher, Indexer and DB mgmt on a single machine]
Deployment: distributed search
Local storage on search servers preferred (perf.)
[Diagram: search front-end dispatching to several search servers; Fetcher, Indexer and DB mgmt on LocalFS storage]
Deployment: map-reduce processing & distrib. search
Fully distributed storage and processing
Local storage on search servers preferred (perf.)
[Diagram: search front-end and search servers; DFS Name Node with Job Tracker running Fetcher, Indexer and DB mgmt; DFS Data Nodes with Task Trackers]
Nutch on a Hadoop cluster
Assumes an up & running Hadoop cluster … which is covered elsewhere
Build nutch*.job jar and use it as usual:
  bin/hadoop jar nutch*.job <className> <args>
Note: Nutch configuration is inside nutch*.job
  − When you change it you need to rebuild the job jar
Searching is not a map-reduce job – often on a separate group of machines
Nutch index
… and press Search
Conclusions
(This overview is a tip of the iceberg)
Nutch
  • Implements all core search engine components
  • Scales well
  • Extremely configurable and modular
  • It's a complete solution – and a toolkit
Future of Nutch
Avoid code duplication
Parsing → Tika
  • Almost total overlap
  • Still some missing functionality in Tika → contribute
Plugins → OSGI
  • Home-grown plugin system has some deficiencies
  • Initial port available
Indexing & Search → Solr, Zookeeper
  • Distributed and replicated search is difficult
  • Initial integration needs significant improvement
  • Shard management - Katta?
Web-graph & page repository → HBase
  • Combine CrawlDB, LinkDB and shard storage
  • Avoid tedious shard management
  • Initial port available
Future of Nutch (2)
What's left then?
  • Crawling frontier management, discovery
  • Re-crawl algorithms
  • Spider trap handling
  • Fetcher
  • Ranking: enterprise-specific, user-feedback
  • Duplicate detection, URL aliasing (mirror detection)
  • Template detection and cleanup, pagelet-level crawling
  • Spam & junk control
Share code with other crawler projects → crawler-commons
Vision: à la carte toolkit, scalable from 1 to 1000s of nodes
Summary
Q&A
Further information:
  • http://lucene.apache.org/nutch
19
Nut
ch ndash
Apa
cheC
on U
S 0
9
Wide vs focused crawlingDifferences
bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling
minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations
Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content
20
Nut
ch ndash
Apa
cheC
on U
S 0
9
Vertical amp enterprise searchVertical search
bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search
bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly
21
Nut
ch ndash
Apa
cheC
on U
S 0
9
Face to face with Nutch
22
Nut
ch ndash
Apa
cheC
on U
S 0
9
Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch
bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring
23
Nut
ch ndash
Apa
cheC
on U
S 0
9
Configuration filesEdit configuration in confnutch-sitexml
bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files
bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin
24
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch pluginsPlugin-based extensions for
bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)
25
Nut
ch ndash
Apa
cheC
on U
S 0
9
Main Nutch workflow
Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty
Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards
(repeat)
Command-line binnutchinject
generatefetchparseupdatedbupdatelinkdbindex solrindex
26
Nut
ch ndash
Apa
cheC
on U
S 0
9
Injecting new URL-s
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
27
Nut
ch ndash
Apa
cheC
on U
S 0
9
Generating fetchlists
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
28
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow generating fetchlistsWhat to fetch next
bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation
bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable
Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content
minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)
29
Nut
ch ndash
Apa
cheC
on U
S 0
9
Fetching content
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
30
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow fetching
Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind
Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step
(resource hungry and sometimes unstable)
31
Nut
ch ndash
Apa
cheC
on U
S 0
9
Content processing
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
32
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other
metadata) and parseText (plain text content)Nutch supports many popular file formats
33
Nut
ch ndash
Apa
cheC
on U
S 0
9
Link inversion
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins

Workflow: link inversion
Pages have outgoing links (outlinks)
… I know where I'm pointing to
Question: who points to me?
… I don't know, there is no catalog of pages
… NOBODY knows for sure either!
In-degree indicates importance of the page
Anchor text provides important semantic info
Partial answer: invert the outlinks that I know about, and group by target
[Figure: example of link inversion, with src → tgt pairs regrouped by target]
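
The "invert and group by target" step can be sketched in a few lines. This is an in-memory illustration only; the function name `invert_links` and the data layout are assumptions, and Nutch runs the equivalent grouping as a map-reduce job.

```python
from collections import defaultdict


def invert_links(outlinks):
    """outlinks: {source_url: [(target_url, anchor_text), ...]}
    Returns {target_url: [(source_url, anchor_text), ...]}."""
    inlinks = defaultdict(list)
    for source, links in outlinks.items():
        for target, anchor in links:
            # Emit the edge keyed by its target instead of its source.
            inlinks[target].append((source, anchor))
    return dict(inlinks)


outlinks = {
    "http://a.example/": [("http://c.example/", "great crawler"), ("http://b.example/", "b")],
    "http://b.example/": [("http://c.example/", "search toolkit")],
}
inlinks = invert_links(outlinks)
in_degree = {target: len(sources) for target, sources in inlinks.items()}
```

The result gives both signals named on the slide: the in-degree of each known page, and the anchor texts that other pages use to describe it.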

Page importance - scoring
[Figure: Nutch workflow diagram: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; CrawlDB, LinkDB, shards (segments); URL filters & normalizers, parsing/indexing filters, scoring plugins]

Workflow: page scoring
Query-independent importance factor
• Affects search ranking
• Affects crawl prioritization
May include arbitrary decision of the operator
Currently two systems (plugins + tools)
OPIC scoring
• Doesn't require explicit steps - "online"
• Difficult to stabilize
PageRank scoring with loop detection
• Periodically run to update CrawlDb scores
• Computationally intensive, esp. loop detection
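
To show why OPIC needs no separate scoring pass, here is a simplified sketch of the cash-and-history idea behind it. It is not the code of Nutch's scoring plugin: the function names, the even split among outlinks and the handling of pages without outlinks are assumptions chosen to keep the example small.

```python
def opic_visit(page, cash, history, outlinks):
    """Visiting a page banks its cash into its history and hands the
    same amount on to the pages it links to."""
    amount = cash.get(page, 0.0)
    history[page] = history.get(page, 0.0) + amount
    cash[page] = 0.0
    # Assumption: a page without outlinks shares its cash with every known page.
    targets = outlinks.get(page) or list(cash)
    share = amount / len(targets)
    for target in targets:
        cash[target] = cash.get(target, 0.0) + share


def opic_importance(cash, history):
    """Estimate importance as each page's fraction of all cash seen so far."""
    total = sum(cash.values()) + sum(history.values())
    return {p: (cash.get(p, 0.0) + history.get(p, 0.0)) / total for p in cash}


outlinks = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
cash = {page: 1.0 / 3 for page in outlinks}
history = {}
for page in ["a", "b", "c"]:        # scores update as pages are visited
    opic_visit(page, cash, history, outlinks)
importance = opic_importance(cash, history)
```

Because the estimate depends on the order and frequency of visits, it keeps moving while the crawl progresses, which is one way to read the "difficult to stabilize" remark.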

Indexing
[Figure: Nutch workflow diagram: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; CrawlDB, LinkDB, shards (segments); URL filters & normalizers, parsing/indexing filters, scoring plugins]

Workflow: indexing
Indexing plugins
• Create full-text search index from segment data
• Apply adjustments to score per document
• May further post-process the parsed content (e.g. language identification) to facilitate advanced search
Indexers
• Lucene indexer: builds indexes to be served by Nutch search servers
• Solr indexer: submits documents to a Solr instance

De-duplication
[Figure: Nutch workflow diagram: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; CrawlDB, LinkDB, shards (segments); URL filters & normalizers, parsing/indexing filters, scoring plugins]

Workflow: de-duplication
The same page may be present in many shards
• Obsolete versions in older shards
• Mirrored pages, or equivalent (a.com → www.a.com)
Many other pages are almost identical
• Template-related differences (banners, current date)
• Font / layout changes, minor re-wording
Hmm … what is a significant change?
• Tricky issue! Hint: what is the page content?
Near-duplicate detection and removal
• Nutch uses approximate page signatures (fingerprints)
• Duplicates are only marked as deleted
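
One way to build an approximate signature is to hash a coarse term-frequency profile instead of the raw bytes, so that small edits do not change the fingerprint. The sketch below illustrates that idea under assumed parameters (`min_token_len`, `quant`); it should not be read as the exact algorithm Nutch ships.

```python
import hashlib
import re
from collections import Counter


def profile_signature(text, min_token_len=2, quant=2):
    """Hash the quantized frequencies of the page's frequent terms."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    profile = []
    for token, freq in Counter(tokens).items():
        rounded = (freq // quant) * quant   # coarse buckets hide small changes
        if rounded >= quant:                # rare terms are ignored entirely
            profile.append((token, rounded))
    profile.sort(key=lambda item: (-item[1], item[0]))
    digest = hashlib.md5()
    for token, freq in profile:
        digest.update(f"{token} {freq}\n".encode("utf-8"))
    return digest.hexdigest()


page_a = "Nutch crawls the web. Nutch indexes the web. Updated: Monday"
page_b = "Nutch crawls the web. Nutch indexes the web. Updated: Tuesday"
page_c = "Solr serves the index. Solr serves the index."
```

Here `page_a` and `page_b` differ only in a date that occurs once, so they get the same signature, while `page_c` gets a different one. The choice of quantization decides what counts as a "significant change".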

Cycles may overlap
[Figure: Nutch workflow diagram: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; CrawlDB, LinkDB, shards (segments); URL filters & normalizers, parsing/indexing filters, scoring plugins]

Nutch distributed search
Multiple search servers
Search front-end dispatches queries and collects search results
• multiple access methods (API, OpenSearch XML, JSON)
Page servers build snippets
Fault-tolerant (with degraded quality)
[Figure: search front-end(s) (web app, API, XML, JSON via NutchBean) querying search servers (Nutch or Solr), each serving indexed shards]
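
The dispatch-and-collect role of the front-end, including the "fault-tolerant with degraded quality" behaviour, can be sketched as follows. The servers are modelled as plain Python callables, which is an assumption for illustration; the real front-end talks to remote search servers.

```python
import heapq
from itertools import islice


def search_all(servers, query, top_k=10):
    """Query every search server and merge the per-shard hit lists.

    Each server is a callable returning (score, doc_id) pairs sorted by
    descending score. A failing server is skipped, so the answer is still
    produced but may miss that shard's hits."""
    partial, failed = [], 0
    for server in servers:
        try:
            partial.append(server(query))
        except Exception:
            failed += 1
    merged = heapq.merge(*partial, key=lambda hit: -hit[0])
    return list(islice(merged, top_k)), failed


def shard_one(query):
    return [(0.9, "doc-a"), (0.4, "doc-b")]


def shard_down(query):
    raise ConnectionError("search server unreachable")


def shard_three(query):
    return [(0.7, "doc-c")]


hits, failed = search_all([shard_one, shard_down, shard_three], "web search", top_k=2)
```

Merging by raw score assumes that scores from different shards are comparable, which is only approximately true when each shard computes its own term statistics.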

Search configuration
Nutch syntax is limited - on purpose!
• Some queries are costly, e.g. leading wildcard, very long
• Some queries may need implicit (hidden) expansion
Query plugins
• From user query to Lucene/Solr query
User query: web search
+(url:web^4.0 anchor:web^2.0 content:web title:web^1.5 host:web^2.0) +(url:search^4.0 anchor:search^2.0 content:search title:search^1.5 host:search^2.0) url:"web search"~10^4.0 anchor:"web search"~4^2.0 content:"web search"~10 title:"web search"~10^1.5 host:"web search"~10^2.0
Search server configuration
• Single search vs. distributed
• Using Nutch searcher, or Solr, or a mix
• Currently no global IDF calculation
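
The hidden expansion shown for the query "web search" can be reproduced with a short function. The field names, boosts and slop values below are read off the slide's example; the function itself (`expand_query`) is a hypothetical sketch, not Nutch's query plugin code, and it does no escaping of special characters.

```python
# Field boosts and phrase slops as they appear in the slide's example.
FIELD_BOOSTS = [("url", 4.0), ("anchor", 2.0), ("content", 1.0),
                ("title", 1.5), ("host", 2.0)]
PHRASE_SLOP = {"anchor": 4}
DEFAULT_SLOP = 10


def _boost(value):
    """Render a boost suffix, omitting the neutral boost of 1.0."""
    return "" if value == 1.0 else f"^{value}"


def expand_query(user_query):
    """Turn a plain user query into a multi-field Lucene-style query string."""
    terms = user_query.split()
    clauses = []
    for term in terms:
        # Every term is required, but may match in any of the fields.
        fields = " ".join(f"{name}:{term}{_boost(boost)}"
                          for name, boost in FIELD_BOOSTS)
        clauses.append(f"+({fields})")
    if len(terms) > 1:
        # Optional sloppy phrase clauses reward terms that occur close together.
        phrase = " ".join(terms)
        for name, boost in FIELD_BOOSTS:
            slop = PHRASE_SLOP.get(name, DEFAULT_SLOP)
            clauses.append(f'{name}:"{phrase}"~{slop}{_boost(boost)}')
    return " ".join(clauses)


expanded = expand_query("web search")
```

Keeping the user-facing syntax small means the system, not the user, decides how costly the final query is allowed to be.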

Deployment: single-server
The easiest but the most limited deployment
Centralized storage, centralized processing
Hadoop LocalFS and LocalJobTracker
Drop nutch.war in Tomcat/webapps and point to shards
[Figure: a single machine hosting the search front-end, LocalFS storage, Fetcher, Indexer and DB mgmt]

Deployment: distributed search
Local storage on search servers preferred (perf.)
[Figure: search front-end dispatching to several search servers; LocalFS storage, Fetcher, Indexer and DB mgmt]

Deployment: map-reduce processing & distrib. search
Fully distributed storage and processing
Local storage on search servers preferred (perf.)
[Figure: search front-end and search servers; DFS Name Node and DFS Data Nodes; Job Tracker and Task Trackers running Fetcher, Indexer and DB mgmt]

Nutch on a Hadoop cluster
Assumes an up & running Hadoop cluster … which is covered elsewhere
Build the nutch.job jar and use it as usual:
bin/hadoop jar nutch.job <className> <args>
Note: Nutch configuration is inside nutch.job
  - When you change it, you need to rebuild the job jar
Searching is not a map-reduce job - often on a separate group of machines

Nutch index

… and press Search!

Conclusions
(This overview is a tip of the iceberg)
Nutch
Implements all core search engine components
Scales well
Extremely configurable and modular
It's a complete solution - and a toolkit

Future of Nutch
Avoid code duplication
Parsing → Tika
• Almost total overlap
• Still some missing functionality in Tika → contribute
Plugins → OSGi
• Home-grown plugin system has some deficiencies
• Initial port available
Indexing & Search → Solr, Zookeeper
• Distributed and replicated search is difficult
• Initial integration needs significant improvement
• Shard management - Katta?
Web-graph & page repository → HBase
• Combine CrawlDB, LinkDB and shard storage
• Avoid tedious shard management
• Initial port available

Future of Nutch (2)
What's left then?
• Crawling frontier management, discovery
• Re-crawl algorithms
• Spider trap handling
• Fetcher
• Ranking: enterprise-specific, user-feedback
• Duplicate detection, URL aliasing (mirror detection)
• Template detection and cleanup, pagelet-level crawling
• Spam & junk control
Share code with other crawler projects → crawler-commons
Vision: à la carte toolkit, scalable from 1 to 1000s of nodes

Summary
Q&A
Further information
• http://lucene.apache.org/nutch

Vertical & enterprise search
Vertical search
• Range of selected "reference" sites
• Robust control of the crawling frontier
• Extensive content post-processing
• Business-driven decisions about ranking
Enterprise search
• Variety of data sources and data formats
• Well-defined and limited crawling frontier
• Integration with in-house data sources
• Little danger of spam
• PageRank-like scoring usually works poorly

Face to face with Nutch

Simple set-up and running
You already have Java 5+, right?
Get a 1.0 release or a nightly build (pretty stable)
Simple search setup uses Tomcat for search web app
Command-line bash script: bin/nutch
• Windows users: get Cygwin
• Early version of a web-based UI console
• Hadoop web-based monitoring

Configuration files
Edit configuration in conf/nutch-site.xml
• Check nutch-default.xml for defaults and docs
• You MUST at least fill the name of your agent
Active plugins configuration
Per-plugin properties
External configuration files
• regex-urlfilter.txt
• regex-normalize.xml
• parse-plugins.xml: mapping of MIME type to plugin

Nutch plugins
Plugin-based extensions for:
• Crawl scheduling
• URL filtering and normalization
• Protocol for getting the content
• Content parsing
• Text analysis (tokenization)
• Page signature (to detect near-duplicates)
• Indexing filters (index fields & metadata)
• Snippet generation and highlighting
• Scoring and ranking
• Query translation and expansion (user → Lucene)

Main Nutch workflow
Inject: initial creation of CrawlDB
• Insert seed URLs
• Initial LinkDB is empty
Generate new shard's fetchlist
Fetch raw content
Parse content (discovers outlinks)
Update CrawlDB from shards
Update LinkDB from shards
Index shards
(repeat)
Command-line: bin/nutch followed by one of
inject, generate, fetch, parse, updatedb, invertlinks (updates LinkDB), index / solrindex

Injecting new URL-s
[Figure: Nutch workflow diagram: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; CrawlDB, LinkDB, shards (segments); URL filters & normalizers, parsing/indexing filters, scoring plugins]

Generating fetchlists
[Figure: Nutch workflow diagram: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; CrawlDB, LinkDB, shards (segments); URL filters & normalizers, parsing/indexing filters, scoring plugins]

Workflow: generating fetchlists
What to fetch next?
• Breadth-first - important due to "politeness" limits
• Expired (longer than fetchTime + fetchInterval)
• Highly ranking (PageRank)
• Newly added
Fetchlist generation
• "topN" - select best candidates
  - Priority based on many factors, pluggable
Adaptive fetch schedule
• Detect rate of changes & time of change
• Detect unmodified content
  - Hmm, and how to recognize this? → approximate page signatures (near-duplicate detection)
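
The selection rules above (expired entries first, then the best topN by priority) can be sketched as follows. The record layout `CrawlDatum` with `fetch_time`, `fetch_interval` and `score` is a simplified assumption modelled on the slide's wording, and using the score alone as priority stands in for the pluggable factors.

```python
import heapq
from dataclasses import dataclass


@dataclass
class CrawlDatum:
    url: str
    fetch_time: float       # time of the last fetch; 0 for never-fetched URLs
    fetch_interval: float   # how long a fetched copy is considered fresh
    score: float            # query-independent importance


def generate_fetchlist(crawl_db, now, top_n):
    """Select the top_n highest-scoring entries that are due for (re)fetching."""
    due = [d for d in crawl_db if d.fetch_time + d.fetch_interval <= now]
    return heapq.nlargest(top_n, due, key=lambda d: d.score)


crawl_db = [
    CrawlDatum("http://a.example/", fetch_time=100, fetch_interval=50, score=1.0),
    CrawlDatum("http://b.example/", fetch_time=0, fetch_interval=50, score=0.5),
    CrawlDatum("http://c.example/", fetch_time=100, fetch_interval=500, score=9.0),
    CrawlDatum("http://d.example/", fetch_time=90, fetch_interval=20, score=2.0),
]
fetchlist = generate_fetchlist(crawl_db, now=200, top_n=2)
```

The highest-scoring page (`c.example`) is left out because it is not yet due, which shows how freshness and importance interact. An adaptive schedule would additionally shrink or grow `fetch_interval` depending on whether the page was found modified.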
22
Nut
ch ndash
Apa
cheC
on U
S 0
9
Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch
bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring
23
Nut
ch ndash
Apa
cheC
on U
S 0
9
Configuration filesEdit configuration in confnutch-sitexml
bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files
bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin
24
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch pluginsPlugin-based extensions for
bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)
25
Nut
ch ndash
Apa
cheC
on U
S 0
9
Main Nutch workflow
Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty
Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards
(repeat)
Command-line binnutchinject
generatefetchparseupdatedbupdatelinkdbindex solrindex
26
Nut
ch ndash
Apa
cheC
on U
S 0
9
Injecting new URL-s
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
27
Nut
ch ndash
Apa
cheC
on U
S 0
9
Generating fetchlists
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
28
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow generating fetchlistsWhat to fetch next
bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation
bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable
Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content
minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)
29
Nut
ch ndash
Apa
cheC
on U
S 0
9
Fetching content
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
30
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow fetching
Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind
Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step
(resource hungry and sometimes unstable)
Content processing
[Diagram: Nutch workflow components, data stores and plugins.]
Workflow: content processing
Protocol plugins retrieve content as plain bytes + protocol-level metadata (e.g. HTTP headers)
Parse plugins
• Content is parsed by MIME-type specific parsers
• Content is parsed into parseData (title, outlinks, other metadata) and parseText (plain text content)
Nutch supports many popular file formats
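As an illustration of MIME-type dispatch and the parseData / parseText split, here is a small sketch built on Python's standard `html.parser`. The `PARSERS` registry, the function names and the dictionary layout are assumptions for this example; real parse plugins handle encodings, malformed markup and many more formats.

```python
from html.parser import HTMLParser


class _HtmlCollector(HTMLParser):
    """Collect the title, outlinks with anchor text, and visible text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.outlinks = []    # list of (target URL, anchor text)
        self.text_parts = []
        self._in_title = False
        self._href = None
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href, self._anchor = href, []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            anchor = " ".join("".join(self._anchor).split())
            self.outlinks.append((self._href, anchor))
            self._href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
            return
        if self._href is not None:
            self._anchor.append(data)
        self.text_parts.append(data)


def parse_html(content):
    collector = _HtmlCollector()
    collector.feed(content.decode("utf-8", errors="replace"))
    collector.close()
    parse_data = {"title": collector.title.strip(), "outlinks": collector.outlinks}
    parse_text = " ".join(" ".join(collector.text_parts).split())
    return parse_data, parse_text


def parse_plain(content):
    text = content.decode("utf-8", errors="replace")
    return {"title": "", "outlinks": []}, " ".join(text.split())


# Hypothetical registry: MIME type -> parser function.
PARSERS = {"text/html": parse_html, "text/plain": parse_plain}


def parse(content, mime_type):
    """Dispatch raw bytes to the parser registered for the MIME type."""
    if mime_type not in PARSERS:
        raise ValueError("no parser registered for " + mime_type)
    return PARSERS[mime_type](content)
```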
Link inversion
[Diagram: Nutch workflow components, data stores and plugins.]
Workflow: link inversion
Pages have outgoing links (outlinks)
… I know where I'm pointing to
Question: who points to me?
… I don't know, there is no catalog of pages
… NOBODY knows for sure either
In-degree indicates importance of the page
Anchor text provides important semantic info
Partial answer: invert the outlinks that I know about, and group by target
[Diagram: outlinks of one source (src 1 → tgt 2, tgt 3, tgt 4, tgt 5) inverted into inlinks grouped by target (tgt 1 ← src 2, src 3, src 4, src 5).]
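Inverting the known outlinks and grouping them by target is a short operation on adjacency lists. The sketch below assumes outlinks are held in a plain dictionary of (target, anchor text) pairs; the function name and data layout are illustrative only.

```python
from collections import defaultdict


def invert_links(outlinks):
    """Turn {source: [(target, anchor), ...]} into {target: [(source, anchor), ...]}."""
    inlinks = defaultdict(list)
    for source, links in outlinks.items():
        for target, anchor in links:
            inlinks[target].append((source, anchor))
    return dict(inlinks)


def in_degree(inlinks, url):
    """Number of known pages pointing at url (a partial answer by design)."""
    return len(inlinks.get(url, []))
```

The result only covers links from pages that were actually crawled, which is why the answer stays partial: the anchor texts collected per target can still be indexed alongside the page itself.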
Page importance - scoring
[Diagram: Nutch workflow components, data stores and plugins.]
Workflow: page scoring
Query-independent importance factor
• Affects search ranking
• Affects crawl prioritization
May include arbitrary decision of the operator
Currently two systems (plugins + tools)
OPIC scoring
• Doesn't require explicit steps - “online”
• Difficult to stabilize
PageRank scoring with loop detection
• Periodically run to update CrawlDb scores
• Computationally intensive, esp. loop detection
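To show why link-based scoring is a periodic, computation-heavy step, here is a textbook PageRank power iteration over a small adjacency list. It is a generic sketch, not the Nutch scoring tool: it has no loop detection, and the damping factor and iteration count are conventional defaults chosen for the example.

```python
def pagerank(outlinks, damping=0.85, iterations=50):
    """Return {page: score} for a graph given as {page: [target, ...]}."""
    pages = set(outlinks)
    for targets in outlinks.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the same small base score ...
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = outlinks.get(page, [])
            if targets:
                # ... plus an equal share of the rank of each page linking to it.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

Every iteration touches every link in the graph, so on a web-scale graph each pass is itself a large batch job. An online scheme such as OPIC avoids the separate pass by updating scores while the CrawlDB is being updated.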
Indexing
[Diagram: Nutch workflow components, data stores and plugins.]
Workflow: indexing
Indexing plugins
• Create full-text search index from segment data
• Apply adjustments to score per document
• May further post-process the parsed content (e.g. language identification) to facilitate advanced search
Indexers
• Lucene indexer – builds indexes to be served by Nutch search servers
• Solr indexer – submits documents to a Solr instance
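The idea of a chain of indexing filters can be sketched as a list of functions that each add fields to a document or adjust its boost. The field names, the `page` dictionary layout and the tiny inverted index are assumptions for illustration; they are not the Nutch plugin interfaces or the Lucene index format.

```python
from collections import defaultdict


def basic_fields(doc, page):
    """Copy the core fields from parsed page data."""
    doc["url"] = page["url"]
    doc["title"] = page["title"]
    doc["content"] = page["text"]
    return doc


def anchor_field(doc, page):
    """Add the anchor texts of incoming links as a searchable field."""
    doc["anchor"] = " ".join(page.get("inlink_anchors", []))
    return doc


def score_boost(doc, page):
    """Fold the page's query-independent score into the document boost."""
    doc["boost"] = doc.get("boost", 1.0) * page.get("score", 1.0)
    return doc


def build_document(page, filters):
    """Run the filters in order; a filter may return None to skip the page."""
    doc = {}
    for index_filter in filters:
        doc = index_filter(doc, page)
        if doc is None:
            return None
    return doc


def build_index(docs):
    """Map (field, term) to the set of document ids containing the term."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for field in ("title", "content", "anchor"):
            for term in doc.get(field, "").lower().split():
                index[(field, term)].add(doc_id)
    return index
```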
De-duplication
[Diagram: Nutch workflow components, data stores and plugins.]
Workflow: de-duplication
The same page may be present in many shards
• Obsolete versions in older shards
• Mirrored pages, or equivalent (a.com → www.a.com)
Many other pages are almost identical
• Template-related differences (banners, current date)
• Font / layout changes, minor re-wording
Hmm … what is a significant change?
• Tricky issue! Hint: what is the page content?
Near-duplicate detection and removal
• Nutch uses approximate page signatures (fingerprints)
• Duplicates are only marked as deleted
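One way to make a signature tolerant of small changes is to hash a coarse profile of word frequencies instead of the raw bytes. The sketch below is a simplified illustration of that idea and should not be read as the signature code Nutch ships: the tokenization, the quantization rule and the `fetch_time` field are all assumptions.

```python
import hashlib
import re
from collections import Counter


def profile_signature(text, min_token_len=2):
    """Hash the quantized frequencies of the most common words."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()
    max_freq = max(counts.values())
    # Round counts down to a multiple of quant; rare words drop out entirely.
    quant = 1 if max_freq < 2 else max(2, max_freq // 2)
    profile = sorted((token, (n // quant) * quant)
                     for token, n in counts.items() if n >= quant)
    data = "\n".join(f"{token} {q}" for token, q in profile)
    return hashlib.md5(data.encode("utf-8")).hexdigest()


def mark_duplicates(docs):
    """Keep the newest document per signature; mark the others as deleted."""
    newest = {}
    for doc in docs:
        doc["deleted"] = False
        sig = profile_signature(doc["text"])
        kept = newest.get(sig)
        if kept is None:
            newest[sig] = doc
        elif doc["fetch_time"] > kept["fetch_time"]:
            kept["deleted"] = True
            newest[sig] = doc
        else:
            doc["deleted"] = True
    return docs
```

Digits and one-off words such as a date or a banner caption fall out of the profile, so two fetches of the same page on different days get the same signature. Marking instead of physically removing keeps the operation cheap and reversible.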
Cycles may overlap
[Diagram: Nutch workflow components, data stores and plugins.]
Nutch distributed search
Multiple search servers
Search front-end dispatches queries and collects search results
• multiple access methods (API, OpenSearch XML, JSON)
Page servers build snippets
Fault-tolerant (with degraded quality)
[Diagram: search front-end(s) (web app, NutchBean; API, XML, JSON) dispatching to search servers (Nutch or Solr) over indexed shards.]
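The dispatch-and-collect step, including the "degraded quality" behaviour, can be sketched as follows. Search servers are modelled as plain callables returning (score, url) pairs; this interface is an assumption made for the example, not the NutchBean API, and the calls run one after another rather than in parallel.

```python
import heapq


def search_all(servers, query, k):
    """Query every server, merge the hits, and return the k best.

    Returns (top_hits, failed_count). A failing server only removes its
    own shard from the results instead of failing the whole search.
    """
    hits = []
    failed = 0
    for server in servers:
        try:
            hits.extend(server(query))
        except Exception:
            failed += 1  # degraded quality: results from this shard are missing
    top_hits = heapq.nlargest(k, hits, key=lambda hit: hit[0])
    return top_hits, failed
```

Merging raw scores across servers is only meaningful if the servers score comparably. Without a global IDF each shard computes term statistics from its own documents, so the merged ranking is approximate.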
Search configuration
Nutch syntax is limited – on purpose
• Some queries are costly, e.g. leading wildcard, very long
• Some queries may need implicit (hidden) expansion
Query plugins
• From user query to Lucene/Solr query
User query: web search
+(url:web^4.0 anchor:web^2.0 content:web title:web^1.5 host:web^2.0)
+(url:search^4.0 anchor:search^2.0 content:search title:search^1.5 host:search^2.0)
url:"web search"~10^4.0 anchor:"web search"~4^2.0 content:"web search"~10 title:"web search"~10^1.5 host:"web search"~10^2.0
Search server configuration
• Single search vs. distributed
• Using Nutch searcher, or Solr, or a mix
• Currently no global IDF calculation
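The translation from a user query to a Lucene-style query can be reproduced with a few lines of string building. The field list, boosts and phrase slop values below are read off the slide's example; the function names are hypothetical and real query plugins do considerably more (escaping, stop words, per-field analysis).

```python
# (field, boost, phrase slop) as they appear in the example expansion.
FIELDS = [
    ("url", 4.0, 10),
    ("anchor", 2.0, 4),
    ("content", 1.0, 10),
    ("title", 1.5, 10),
    ("host", 2.0, 10),
]


def boost_suffix(boost):
    """Render a boost as ^N, omitting the default boost of 1.0."""
    return "" if boost == 1.0 else f"^{boost}"


def expand(user_query):
    """Expand plain terms into required per-field clauses plus phrase clauses."""
    terms = user_query.lower().split()
    clauses = []
    for term in terms:
        per_field = " ".join(f"{field}:{term}{boost_suffix(boost)}"
                             for field, boost, _ in FIELDS)
        clauses.append(f"+({per_field})")
    if len(terms) > 1:
        phrase = " ".join(terms)
        clauses.extend(f'{field}:"{phrase}"~{slop}{boost_suffix(boost)}'
                       for field, boost, slop in FIELDS)
    return " ".join(clauses)
```

Every term is required in at least one field, while the sloppy phrase clauses are optional and only raise the score of documents where the terms occur close together.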
Deployment: single-server
The easiest but the most limited deployment
Centralized storage, centralized processing
Hadoop LocalFS and LocalJobTracker
Drop nutch.war in Tomcat/webapps and point to shards
[Diagram: search front-end, LocalFS storage, Fetcher, Indexer, DB mgmt.]
Deployment: distributed search
Local storage on search servers preferred (perf.)
[Diagram: search front-end; several search servers; LocalFS storage; Fetcher, Indexer, DB mgmt.]
Deployment: map-reduce processing & distrib. search
Fully distributed storage and processing
Local storage on search servers preferred (perf.)
[Diagram: search front-end; several search servers; DFS Name Node and several DFS Data Nodes; Job Tracker (Fetcher, Indexer, DB mgmt) and several Task Trackers.]
Nutch on a Hadoop cluster
Assumes an up & running Hadoop cluster … which is covered elsewhere
Build nutch.job jar and use it as usual:
bin/hadoop jar nutch.job <className> <args>
Note: Nutch configuration is inside nutch.job
− When you change it you need to rebuild job jar
Searching is not a map-reduce job – often on a separate group of machines
Nutch index

… and press Search
Conclusions
(This overview is a tip of the iceberg)
Nutch
Implements all core search engine components
Scales well
Extremely configurable and modular
It's a complete solution – and a toolkit
Future of Nutch
Avoid code duplication
Parsing → Tika
• Almost total overlap
• Still some missing functionality in Tika → contribute
Plugins → OSGI
• Home-grown plugin system has some deficiencies
• Initial port available
Indexing & Search → Solr, Zookeeper
• Distributed and replicated search is difficult
• Initial integration needs significant improvement
• Shard management - Katta
Web-graph & page repository → HBase
• Combine CrawlDB, LinkDB and shard storage
• Avoid tedious shard management
• Initial port available
Future of Nutch (2)
What's left then?
• Crawling frontier management, discovery
• Re-crawl algorithms
• Spider trap handling
• Fetcher
• Ranking: enterprise-specific, user-feedback
• Duplicate detection, URL aliasing (mirror detection)
• Template detection and cleanup, pagelet-level crawling
• Spam & junk control
Share code with other crawler projects → crawler-commons
Vision: à la carte toolkit, scalable from 1-1000s nodes
Summary
Q&A
Further information
• http://lucene.apache.org/nutch
26
Nut
ch ndash
Apa
cheC
on U
S 0
9
Injecting new URL-s
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
27
Nut
ch ndash
Apa
cheC
on U
S 0
9
Generating fetchlists
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
28
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow generating fetchlistsWhat to fetch next
bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation
bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable
Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content
minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)
29
Nut
ch ndash
Apa
cheC
on U
S 0
9
Fetching content
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
30
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow fetching
Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind
Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step
(resource hungry and sometimes unstable)
31
Nut
ch ndash
Apa
cheC
on U
S 0
9
Content processing
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
32
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other
metadata) and parseText (plain text content)Nutch supports many popular file formats
33
Nut
ch ndash
Apa
cheC
on U
S 0
9
Link inversion
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
34
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow link inversionPages have outgoing links (outlinks)
hellip I know where Im pointing to
Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either
In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target
src 1
tgt 2
tgt 3
tgt 4tgt 5
tgt 1
tgt 1
src 2
src 3
src 4src 5
src 1
35
Nut
ch ndash
Apa
cheC
on U
S 0
9
Page importance - scoring
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
36
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow page scoringQuery-independent importance factor
bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring
bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection
bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection
37
Nut
ch ndash
Apa
cheC
on U
S 0
9
Indexing
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
38
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow indexingIndexing plugins
bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg
language identification) to facilitate advanced searchIndexers
bull Lucene indexer ndash builds indexes to be served by Nutch search servers
bull Solr indexer ndash submits documents to a Solr instance
39
Nut
ch ndash
Apa
cheC
on U
S 0
9
De-duplication
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
40
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow de-duplicationThe same page may be present in many shards
bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical
bull Template-related differences (banners current date)bull Font layout changes minor re-wording
Hmm hellip what is a significant change bull Tricky issue Hint what is the page content
Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted
41
Nut
ch ndash
Apa
cheC
on U
S 0
9
Cycles may overlap
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
42
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch distributed search
Multiple search serversSearch front-end dispatches queries and collects search results
bull multiple access methods (API OpenSearch XML JSON)
Page servers build snippetsFault-tolerant
with degraded quality
Search front-end(s)Web app
APIXML
NutchBean
JSON
Search server(Nutch or Solr)
indexed shards
43
Nut
ch ndash
Apa
cheC
on U
S 0
9
Search configurationNutch syntax is limited ndash on purpose
bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins
bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20
Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation
44
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards
Searchfront-endSearch
front-end
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
45
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment distributed searchLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
46
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment map-reduce processing amp distrib search
Fully distributed storage and processingLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
DFS Data NodeDFS Data Node
DFS Data Node
DFS Name NodeDFS Name Node(Job Tracker)
FetcherIndexer
DB mgmt
(Job Tracker)FetcherIndexer
DB mgmt
Task TrackerTask Tracker
Task Tracker
47
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch on a Hadoop cluster
Assumes an up amp running Hadoop clusterhellip which is covered elsewhere
Build nutchjob jar and use it as usual
binhadoop jar nutchjob ltclassNamegt ltargsgt
Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar
Searching is not a map-reduce job ndash often on a separate group of machines
48
Nut
ch ndash
Apa
cheC
on U
S 0
9
49
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch index
50
Nut
ch ndash
Apa
cheC
on U
S 0
9
hellip and press Search
51
Nut
ch ndash
Apa
cheC
on U
S 0
9
Conclusions(This overview is a tip of the iceberg)
NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit
52
Nut
ch ndash
Apa
cheC
on U
S 0
9
Future of NutchAvoid code duplicationParsing rarr Tika
bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI
bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper
bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase
bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available
53
Nut
ch ndash
Apa
cheC
on U
S 0
9
Future of Nutch (2)Whats left then
bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons
Vision aacute la carte toolkit scalable from1-1000s nodes
54
Nut
ch ndash
Apa
cheC
on U
S 0
9
Summary
QampA
Further informationbull httpluceneapacheorgnutch
24
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch pluginsPlugin-based extensions for
bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)
25
Nut
ch ndash
Apa
cheC
on U
S 0
9
Main Nutch workflow
Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty
Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards
(repeat)
Command-line binnutchinject
generatefetchparseupdatedbupdatelinkdbindex solrindex
26
Nut
ch ndash
Apa
cheC
on U
S 0
9
Injecting new URL-s
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
27
Nut
ch ndash
Apa
cheC
on U
S 0
9
Generating fetchlists
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
28
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow generating fetchlistsWhat to fetch next
bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation
bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable
Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content
minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)
29
Nut
ch ndash
Apa
cheC
on U
S 0
9
Fetching content
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
30
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow fetching
Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind
Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step
(resource hungry and sometimes unstable)
31
Nut
ch ndash
Apa
cheC
on U
S 0
9
Content processing
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
32
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other
metadata) and parseText (plain text content)Nutch supports many popular file formats
33
Nut
ch ndash
Apa
cheC
on U
S 0
9
Link inversion
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
34
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow link inversionPages have outgoing links (outlinks)
hellip I know where Im pointing to
Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either
In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target
src 1
tgt 2
tgt 3
tgt 4tgt 5
tgt 1
tgt 1
src 2
src 3
src 4src 5
src 1
35
Nut
ch ndash
Apa
cheC
on U
S 0
9
Page importance - scoring
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
36
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow page scoringQuery-independent importance factor
bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring
bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection
bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection
37
Nut
ch ndash
Apa
cheC
on U
S 0
9
Indexing
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
38
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow indexingIndexing plugins
bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg
language identification) to facilitate advanced searchIndexers
bull Lucene indexer ndash builds indexes to be served by Nutch search servers
bull Solr indexer ndash submits documents to a Solr instance
39
Nut
ch ndash
Apa
cheC
on U
S 0
9
De-duplication
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
40
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow de-duplicationThe same page may be present in many shards
bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical
bull Template-related differences (banners current date)bull Font layout changes minor re-wording
Hmm hellip what is a significant change bull Tricky issue Hint what is the page content
Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted
41
Nut
ch ndash
Apa
cheC
on U
S 0
9
Cycles may overlap
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
42
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch distributed search
Multiple search serversSearch front-end dispatches queries and collects search results
bull multiple access methods (API OpenSearch XML JSON)
Page servers build snippetsFault-tolerant
with degraded quality
Search front-end(s)Web app
APIXML
NutchBean
JSON
Search server(Nutch or Solr)
indexed shards
43
Nut
ch ndash
Apa
cheC
on U
S 0
9
Search configurationNutch syntax is limited ndash on purpose
bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins
bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20
Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation
44
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards
Searchfront-endSearch
front-end
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
45
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment distributed searchLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
46
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment map-reduce processing amp distrib search
Fully distributed storage and processingLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
DFS Data NodeDFS Data Node
DFS Data Node
DFS Name NodeDFS Name Node(Job Tracker)
FetcherIndexer
DB mgmt
(Job Tracker)FetcherIndexer
DB mgmt
Task TrackerTask Tracker
Task Tracker
47
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch on a Hadoop cluster
Assumes an up amp running Hadoop clusterhellip which is covered elsewhere
Build nutchjob jar and use it as usual
binhadoop jar nutchjob ltclassNamegt ltargsgt
Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar
Searching is not a map-reduce job ndash often on a separate group of machines
48
Nut
ch ndash
Apa
cheC
on U
S 0
9
49
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch index
50
Nut
ch ndash
Apa
cheC
on U
S 0
9
hellip and press Search
51
Nut
ch ndash
Apa
cheC
on U
S 0
9
Conclusions(This overview is a tip of the iceberg)
NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit
52
Nut
ch ndash
Apa
cheC
on U
S 0
9
Future of NutchAvoid code duplicationParsing rarr Tika
bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI
bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper
bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase
bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available
53
Nut
ch ndash
Apa
cheC
on U
S 0
9
Future of Nutch (2)Whats left then
bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons
Vision aacute la carte toolkit scalable from1-1000s nodes
54
Nut
ch ndash
Apa
cheC
on U
S 0
9
Summary
QampA
Further informationbull httpluceneapacheorgnutch
25
Nut
ch ndash
Apa
cheC
on U
S 0
9
Main Nutch workflow
Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty
Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards
(repeat)
Command-line binnutchinject
generatefetchparseupdatedbupdatelinkdbindex solrindex
26
Nut
ch ndash
Apa
cheC
on U
S 0
9
Injecting new URL-s
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
27
Nut
ch ndash
Apa
cheC
on U
S 0
9
Generating fetchlists
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
28
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow generating fetchlistsWhat to fetch next
bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation
bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable
Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content
minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)
29
Nut
ch ndash
Apa
cheC
on U
S 0
9
Fetching content
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
30
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow fetching
Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind
Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step
(resource hungry and sometimes unstable)
31
Nut
ch ndash
Apa
cheC
on U
S 0
9
Content processing
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
32
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other
metadata) and parseText (plain text content)Nutch supports many popular file formats
33
Nut
ch ndash
Apa
cheC
on U
S 0
9
Link inversion
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
34
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow link inversionPages have outgoing links (outlinks)
hellip I know where Im pointing to
Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either
In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target
src 1
tgt 2
tgt 3
tgt 4tgt 5
tgt 1
tgt 1
src 2
src 3
src 4src 5
src 1
35
Nut
ch ndash
Apa
cheC
on U
S 0
9
Page importance - scoring
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
36
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow: page scoring
Query-independent importance factor
• Affects search ranking
• Affects crawl prioritization
May include arbitrary decisions of the operator
Currently two systems (plugins + tools)
OPIC scoring
• Doesn't require explicit steps: "online"
• Difficult to stabilize
PageRank scoring with loop detection
• Periodically run to update CrawlDb scores
• Computationally intensive, esp. loop detection
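To show why the PageRank variant is a periodic batch job, here is textbook PageRank by power iteration on a tiny graph. This is not Nutch's scoring tool: it has no loop detection, and the damping factor and iteration count are conventional illustrative values, not Nutch defaults.

```python
def pagerank(outlinks, damping=0.85, iterations=50):
    """outlinks: {page: [target pages]}; returns {page: score}, scores sum to 1."""
    nodes = set(outlinks)
    for targets in outlinks.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Score held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[node] for node in nodes if not outlinks.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for src, targets in outlinks.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for tgt in targets:
                    new_rank[tgt] += share
        rank = new_rank
    return rank

scores = pagerank({"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]})
```

Every iteration touches every known link, so the cost grows with the size of the web graph; an online scheme such as OPIC avoids the separate pass at the price of stability.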
Indexing
[Diagram: Nutch workflow components, data stores and plugins]
Workflow: indexing
Indexing plugins
• Create full-text search index from segment data
• Apply adjustments to score per document
• May further post-process the parsed content (e.g. language identification) to facilitate advanced search
Indexers
• Lucene indexer: builds indexes to be served by Nutch search servers
• Solr indexer: submits documents to a Solr instance
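A toy in-memory inverted index gives a feel for what "create a full-text index" and "apply adjustments to score per document" mean. It stands in for Lucene only conceptually; the field names, the `boost` key and the weighting scheme are assumptions made for this sketch.

```python
import re
from collections import defaultdict

def build_index(docs):
    """docs: {url: {"title": str, "content": str, "boost": float}}
    returns {field: {term: {url: weight}}}"""
    index = defaultdict(lambda: defaultdict(dict))
    for url, doc in docs.items():
        boost = doc.get("boost", 1.0)      # per-document score adjustment
        for field in ("title", "content"):
            for term in re.findall(r"\w+", doc[field].lower()):
                postings = index[field][term]
                # Each occurrence adds the document boost to the term weight.
                postings[url] = postings.get(url, 0.0) + boost
    return index

index = build_index({
    "u1": {"title": "Nutch", "content": "web search toolkit", "boost": 2.0},
    "u2": {"title": "Solr", "content": "search server search", "boost": 1.0},
})
```

Keeping one postings map per field is what later allows a query to weight a match in the title differently from a match in the content.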
De-duplication
[Diagram: Nutch workflow components, data stores and plugins]
Workflow: de-duplication
The same page may be present in many shards
• Obsolete versions in older shards
• Mirrored pages, or equivalent (a.com → www.a.com)
Many other pages are almost identical
• Template-related differences (banners, current date)
• Font / layout changes, minor re-wording
Hmm … what is a significant change?
• Tricky issue. Hint: what is the page content?
Near-duplicate detection and removal
• Nutch uses approximate page signatures (fingerprints)
• Duplicates are only marked as deleted
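One way to build an approximate page signature is to hash a coarse term-frequency profile instead of the raw bytes. The sketch below follows that general idea only; the tokenizer, the `min_token_len` and `quant_rate` parameters and their values are assumptions, not the settings of Nutch's signature plugins.

```python
import hashlib
import re
from collections import Counter

def signature(text, min_token_len=3, quant_rate=0.3):
    """Hash of a quantized term-frequency profile; small edits leave it unchanged."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()
    # Quantization step derived from the most frequent token.
    quant = max(1, round(max(counts.values()) * quant_rate))
    # Round frequencies down to a multiple of quant and drop rare tokens.
    profile = sorted((token, (n // quant) * quant)
                     for token, n in counts.items() if n >= quant)
    digest_input = " ".join(f"{token}:{n}" for token, n in profile)
    return hashlib.md5(digest_input.encode("utf-8")).hexdigest()

page_a = "nutch crawler " * 10 + "updated 2009-11-03"
page_b = "nutch crawler " * 10 + "updated 2009-11-04"
page_c = "solr indexer " * 10
```

Here `page_a` and `page_b` differ only in a date and get the same signature, while `page_c` does not. The same signature can also serve the adaptive fetch schedule as an "unmodified content" test.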
Cycles may overlap
[Diagram: Nutch workflow components, data stores and plugins]
Nutch distributed search
Multiple search servers
Search front-end dispatches queries and collects search results
• Multiple access methods (API, OpenSearch XML, JSON)
Page servers build snippets
Fault-tolerant (with degraded quality)
[Diagram: search front-end(s) (web app, API/XML, JSON, NutchBean) dispatching to search servers (Nutch or Solr), each serving indexed shards]
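The dispatch-and-collect role of the front-end, including "fault-tolerant with degraded quality", can be sketched as follows. Shards are modelled as plain callables and a failure as `ConnectionError`; both are assumptions for illustration, and a real front-end would query the servers in parallel over the network.

```python
import heapq

def search_all(shards, query, top_k=3):
    """shards: callables mapping a query to [(score, url), ...]; a shard may raise.
    Returns (best hits across reachable shards, number of failed shards)."""
    hits, failed = [], 0
    for shard in shards:
        try:
            hits.extend(shard(query))
        except ConnectionError:
            failed += 1      # answer from the remaining shards: degraded quality
    return heapq.nlargest(top_k, hits), failed

def shard_a(query):
    return [(0.9, "http://a.example/1"), (0.2, "http://a.example/2")]

def shard_b(query):
    raise ConnectionError("search server down")

def shard_c(query):
    return [(0.5, "http://c.example/1")]

results, failed = search_all([shard_a, shard_b, shard_c], "web search", top_k=2)
```

Merging by raw score assumes that scores from different shards are comparable, which is only approximately true while there is no global IDF calculation.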
Search configuration
Nutch syntax is limited, on purpose
• Some queries are costly, e.g. leading wildcard, very long
• Some queries may need implicit (hidden) expansion
Query plugins
• From user query to Lucene/Solr query
User query: web search
+(url:web^4.0 anchor:web^2.0 content:web title:web^1.5 host:web^2.0)
+(url:search^4.0 anchor:search^2.0 content:search title:search^1.5 host:search^2.0)
url:"web search"~10^4.0 anchor:"web search"~4^2.0 content:"web search"~10 title:"web search"~10^1.5 host:"web search"~10^2.0
Search server configuration
• Single search vs. distributed
• Using Nutch searcher, or Solr, or a mix
• Currently no global IDF calculation
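The hidden expansion of a user query can be sketched as string building. The field list, boosts and phrase slops below are read off the slide's "web search" example, assuming the extraction dropped the colons, quotes and decimal points (40 read as 4.0); the function names are hypothetical and this is not Nutch's query plugin API.

```python
# (field, boost, phrase slop), as read from the slide's example.
FIELDS = [
    ("url", 4.0, 10),
    ("anchor", 2.0, 4),
    ("content", 1.0, 10),
    ("title", 1.5, 10),
    ("host", 2.0, 10),
]

def boost_suffix(boost):
    """A boost of 1.0 is the default and is left out of the query string."""
    return "" if boost == 1.0 else f"^{boost}"

def expand(user_query):
    """Expand a plain user query into a fielded, boosted Lucene-style query."""
    terms = user_query.lower().split()
    clauses = []
    for term in terms:
        # Every term is required (+) but may match in any of the fields.
        per_field = " ".join(f"{field}:{term}{boost_suffix(boost)}"
                             for field, boost, _ in FIELDS)
        clauses.append(f"+({per_field})")
    if len(terms) > 1:
        # Optional sloppy-phrase clauses reward terms that occur close together.
        phrase = " ".join(terms)
        clauses.extend(f'{field}:"{phrase}"~{slop}{boost_suffix(boost)}'
                       for field, boost, slop in FIELDS)
    return " ".join(clauses)

expanded = expand("web search")
```

Keeping the user-facing syntax small means the engine, not the user, decides which fields are searched and how strongly each one counts.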
Deployment: single-server
The easiest but the most limited deployment
Centralized storage, centralized processing
Hadoop LocalFS and LocalJobTracker
Drop nutch.war in Tomcat/webapps and point to shards
[Diagram: search front-end, LocalFS storage, fetcher, indexer and DB mgmt]
Deployment: distributed search
Local storage on search servers preferred (perf.)
[Diagram: search front-end and several search servers; LocalFS storage; fetcher, indexer and DB mgmt]
Deployment: map-reduce processing & distributed search
Fully distributed storage and processing
Local storage on search servers preferred (perf.)
[Diagram: search front-end and search servers; DFS Name Node and DFS Data Nodes; Job Tracker and Task Trackers running fetcher, indexer and DB mgmt]
Nutch on a Hadoop cluster
Assumes an up & running Hadoop cluster … which is covered elsewhere
Build the nutch.job jar and use it as usual:
    bin/hadoop jar nutch.job <className> <args>
Note: Nutch configuration is inside nutch.job
  - When you change it you need to rebuild the job jar
Searching is not a map-reduce job; it often runs on a separate group of machines
Nutch index
… and press Search
Conclusions
(This overview is just the tip of the iceberg)
Nutch:
Implements all core search engine components
Scales well
Extremely configurable and modular
It's a complete solution, and a toolkit
Future of Nutch
Avoid code duplication
Parsing → Tika
• Almost total overlap
• Still some missing functionality in Tika → contribute
Plugins → OSGi
• Home-grown plugin system has some deficiencies
• Initial port available
Indexing & Search → Solr, ZooKeeper
• Distributed and replicated search is difficult
• Initial integration needs significant improvement
• Shard management: Katta?
Web-graph & page repository → HBase
• Combine CrawlDB, LinkDB and shard storage
• Avoid tedious shard management
• Initial port available
Future of Nutch (2)
What's left then?
• Crawling frontier management, discovery
• Re-crawl algorithms
• Spider trap handling
• Fetcher
• Ranking: enterprise-specific, user-feedback
• Duplicate detection, URL aliasing (mirror detection)
• Template detection and cleanup, pagelet-level crawling
• Spam & junk control
Share code with other crawler projects → crawler-commons
Vision: à la carte toolkit, scalable from 1 to 1000s of nodes
Summary
Q&A
Further information
• http://lucene.apache.org/nutch
28
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow generating fetchlistsWhat to fetch next
bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation
bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable
Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content
minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)
29
Nut
ch ndash
Apa
cheC
on U
S 0
9
Fetching content
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
30
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow fetching
Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind
Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step
(resource hungry and sometimes unstable)
31
Nut
ch ndash
Apa
cheC
on U
S 0
9
Content processing
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
32
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other
metadata) and parseText (plain text content)Nutch supports many popular file formats
33
Nut
ch ndash
Apa
cheC
on U
S 0
9
Link inversion
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
34
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow link inversionPages have outgoing links (outlinks)
hellip I know where Im pointing to
Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either
In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target
src 1
tgt 2
tgt 3
tgt 4tgt 5
tgt 1
tgt 1
src 2
src 3
src 4src 5
src 1
35
Nut
ch ndash
Apa
cheC
on U
S 0
9
Page importance - scoring
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
36
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow page scoringQuery-independent importance factor
bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring
bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection
bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection
37
Nut
ch ndash
Apa
cheC
on U
S 0
9
Indexing
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
38
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow indexingIndexing plugins
bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg
language identification) to facilitate advanced searchIndexers
bull Lucene indexer ndash builds indexes to be served by Nutch search servers
bull Solr indexer ndash submits documents to a Solr instance
39
Nut
ch ndash
Apa
cheC
on U
S 0
9
De-duplication
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
40
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow de-duplicationThe same page may be present in many shards
bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical
bull Template-related differences (banners current date)bull Font layout changes minor re-wording
Hmm hellip what is a significant change bull Tricky issue Hint what is the page content
Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted
41
Nut
ch ndash
Apa
cheC
on U
S 0
9
Cycles may overlap
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
42
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch distributed search
Multiple search serversSearch front-end dispatches queries and collects search results
bull multiple access methods (API OpenSearch XML JSON)
Page servers build snippetsFault-tolerant
with degraded quality
Search front-end(s)Web app
APIXML
NutchBean
JSON
Search server(Nutch or Solr)
indexed shards
43
Nut
ch ndash
Apa
cheC
on U
S 0
9
Search configurationNutch syntax is limited ndash on purpose
bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins
bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20
Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation
44
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards
Searchfront-endSearch
front-end
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
45
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment distributed searchLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
46
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment map-reduce processing amp distrib search
Fully distributed storage and processingLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
DFS Data NodeDFS Data Node
DFS Data Node
DFS Name NodeDFS Name Node(Job Tracker)
FetcherIndexer
DB mgmt
(Job Tracker)FetcherIndexer
DB mgmt
Task TrackerTask Tracker
Task Tracker
47
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch on a Hadoop cluster
Assumes an up amp running Hadoop clusterhellip which is covered elsewhere
Build nutchjob jar and use it as usual
binhadoop jar nutchjob ltclassNamegt ltargsgt
Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar
Searching is not a map-reduce job ndash often on a separate group of machines
48
Nut
ch ndash
Apa
cheC
on U
S 0
9
49
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch index
50
Nut
ch ndash
Apa
cheC
on U
S 0
9
hellip and press Search
51
Nut
ch ndash
Apa
cheC
on U
S 0
9
Conclusions(This overview is a tip of the iceberg)
NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit
Future of Nutch
Avoid code duplication
Parsing → Tika
  • Almost total overlap
  • Still some missing functionality in Tika → contribute
Plugins → OSGi
  • Home-grown plugin system has some deficiencies
  • Initial port available
Indexing & Search → Solr, Zookeeper
  • Distributed and replicated search is difficult
  • Initial integration needs significant improvement
  • Shard management - Katta
Web-graph & page repository → HBase
  • Combine CrawlDB, LinkDB and shard storage
  • Avoid tedious shard management
  • Initial port available
Future of Nutch (2)
What's left then?
  • Crawling frontier management, discovery
  • Re-crawl algorithms
  • Spider trap handling
  • Fetcher
  • Ranking: enterprise-specific, user-feedback
  • Duplicate detection, URL aliasing (mirror detection)
  • Template detection and cleanup, pagelet-level crawling
  • Spam & junk control
Share code with other crawler projects → crawler-commons
Vision: à la carte toolkit, scalable from 1 to 1000s of nodes
Summary
Q&A
Further information
  • http://lucene.apache.org/nutch
Fetching content
[Diagram: Nutch workflow with Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer and Searcher around CrawlDB, LinkDB and shards (segments); plus URL filters & normalizers, parsing/indexing filters, scoring plugins]
Workflow: fetching
Multi-protocol: HTTP(s), FTP, file, etc.
Coordinates multiple threads accessing the same host
  • "politeness" issues vs. crawling efficiency
  • Host: an IP or DNS name?
  • Redirections: should follow immediately? What kind?
Other netiquette issues
  • robots.txt: disallowed paths, Crawl-Delay
Preparation for parsing
  • Content type detection issues
  • Parsing usually executed as a separate step (resource hungry and sometimes unstable)
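The politeness trade-off above can be illustrated with a small scheduling sketch. This is not Nutch's fetcher: `schedule_fetches` and its `crawl_delay` parameter are hypothetical names, and the function only plans fetch times (it does no network I/O). It assumes "host" means the DNS name in the URL.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


def schedule_fetches(urls, crawl_delay=5.0):
    """Plan (time, url) pairs so that requests to one host are at least
    crawl_delay seconds apart, while different hosts interleave freely."""
    # One FIFO queue per host (host = DNS name here, an assumption).
    queues = defaultdict(deque)
    for url in urls:
        queues[urlsplit(url).hostname].append(url)

    # Min-heap of (next allowed time, host); every host is ready at t=0.
    ready = [(0.0, host) for host in queues]
    heapq.heapify(ready)

    plan = []
    while ready:
        when, host = heapq.heappop(ready)
        plan.append((when, queues[host].popleft()))
        if queues[host]:
            # The same host may only be contacted again after the delay.
            heapq.heappush(ready, (when + crawl_delay, host))
    return plan
```

With many URLs on few hosts the plan stretches out in time, which is the efficiency cost of being polite; a per-host Crawl-Delay from robots.txt could replace the single `crawl_delay` value.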
Content processing
[Diagram: Nutch workflow with Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer and Searcher around CrawlDB, LinkDB and shards (segments); plus URL filters & normalizers, parsing/indexing filters, scoring plugins]
Workflow: content processing
Protocol plugins retrieve content as plain bytes + protocol-level metadata (e.g. HTTP headers)
Parse plugins
  • Content is parsed by MIME-type specific parsers
  • Content is parsed into parseData (title, outlinks, other metadata) and parseText (plain text content)
Nutch supports many popular file formats
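A minimal sketch of the parse step, assuming HTML input only. The names `parse`, `parse_html` and `PARSERS` are illustrative and are not Nutch classes; the sketch only mirrors the split into parse data (title, outlinks, metadata) and parse text described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SimpleHtmlParser(HTMLParser):
    """Collects the title, outlinks (with anchor text) and plain text."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.outlinks = []       # (absolute target URL, anchor text)
        self.text_parts = []
        self._in_title = False
        self._anchor = None      # [url, text parts] while inside <a href=...>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._anchor = [urljoin(self.base_url, href), []]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._anchor:
            url, parts = self._anchor
            self.outlinks.append((url, " ".join(parts)))
            self._anchor = None

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self._in_title:
            self.title = data
        else:
            self.text_parts.append(data)
            if self._anchor:
                self._anchor[1].append(data)


def parse_html(content, base_url, headers):
    parser = SimpleHtmlParser(base_url)
    parser.feed(content.decode("utf-8", errors="replace"))
    parser.close()
    parse_data = {"title": parser.title, "outlinks": parser.outlinks,
                  "metadata": dict(headers)}
    parse_text = " ".join(parser.text_parts)
    return parse_data, parse_text


# MIME type -> parser function; only HTML is covered in this sketch.
PARSERS = {"text/html": parse_html}


def parse(content, base_url, headers):
    mime = headers.get("Content-Type", "").split(";")[0].strip().lower()
    return PARSERS[mime](content, base_url, headers)
```

The raw bytes and the protocol metadata come in together, and the MIME type taken from the headers selects the parser, which is why content type detection matters so much in the fetching step.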
Link inversion
[Diagram: Nutch workflow with Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer and Searcher around CrawlDB, LinkDB and shards (segments); plus URL filters & normalizers, parsing/indexing filters, scoring plugins]
Workflow: link inversion
Pages have outgoing links (outlinks)
… I know where I'm pointing to
Question: who points to me?
… I don't know, there is no catalog of pages
… NOBODY knows for sure either
In-degree indicates importance of the page
Anchor text provides important semantic info
Partial answer: invert the outlinks that I know about, and group by target
[Diagram: (src, tgt) outlink records inverted and grouped by target]
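The "invert and group by target" idea can be sketched in a few lines. This is an in-memory illustration with a hypothetical `invert_links` function; in Nutch this is a distributed job, and the data layout here (a dict of outlink lists) is an assumption.

```python
from collections import defaultdict


def invert_links(outlinks):
    """outlinks: {source URL: [(target URL, anchor text), ...]}
    Returns {target URL: [(source URL, anchor text), ...]}."""
    inlinks = defaultdict(list)
    for src, targets in outlinks.items():
        for tgt, anchor in targets:
            # Each known outlink becomes an inlink record of its target.
            inlinks[tgt].append((src, anchor))
    return dict(inlinks)
```

The length of each inlink list is the known in-degree of that page, and the collected anchor texts are what other pages say about it. Both are only as complete as the part of the web crawled so far.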
Page importance - scoring
[Diagram: Nutch workflow with Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer and Searcher around CrawlDB, LinkDB and shards (segments); plus URL filters & normalizers, parsing/indexing filters, scoring plugins]
Workflow: page scoring
Query-independent importance factor
  • Affects search ranking
  • Affects crawl prioritization
May include arbitrary decision of the operator
Currently two systems (plugins + tools)
OPIC scoring
  • Doesn't require explicit steps: "online"
  • Difficult to stabilize
PageRank scoring with loop detection
  • Periodically run to update CrawlDb scores
  • Computationally intensive, esp. loop detection
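To show why OPIC counts as "online", here is a heavily simplified sketch of its cash-distribution step. It is not the code of the Nutch scoring plugin: `opic_visit` is a hypothetical name, and dropping the cash of dead-end pages is a simplification of this sketch.

```python
def opic_visit(page, cash, history, outlinks):
    """Process one fetched page: bank its cash, then split it evenly
    among its outlinks. cash and history are dicts updated in place."""
    amount = cash.get(page, 0.0)
    # The accumulated history is the basis of the importance estimate.
    history[page] = history.get(page, 0.0) + amount
    cash[page] = 0.0
    targets = outlinks.get(page, [])
    if not targets:
        return  # simplification: cash of a dead-end page is dropped
    share = amount / len(targets)
    for target in targets:
        cash[target] = cash.get(target, 0.0) + share
```

Each call needs only the page just fetched, so scores evolve during the crawl with no separate pass over the graph. A PageRank-style tool instead iterates over the whole link graph periodically.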
Indexing
[Diagram: Nutch workflow with Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer and Searcher around CrawlDB, LinkDB and shards (segments); plus URL filters & normalizers, parsing/indexing filters, scoring plugins]
Workflow: indexing
Indexing plugins
  • Create full-text search index from segment data
  • Apply adjustments to score per document
  • May further post-process the parsed content (e.g. language identification) to facilitate advanced search
Indexers
  • Lucene indexer: builds indexes to be served by Nutch search servers
  • Solr indexer: submits documents to a Solr instance
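The plugin chain above can be pictured as a list of functions applied to each document before it is handed to an indexer. The sketch below is illustrative only: the filter names, the dict-based document and the 1.2 factor are all assumptions, not Nutch APIs.

```python
def basic_fields(doc, parse_data, parse_text):
    # Copy the core searchable fields from the parse output.
    doc["fields"].update(url=parse_data["url"],
                         title=parse_data["title"],
                         content=parse_text)
    return doc


def title_boost(doc, parse_data, parse_text):
    # Hypothetical score adjustment: slightly favour pages with a title.
    if parse_data["title"]:
        doc["boost"] *= 1.2
    return doc


def skip_empty(doc, parse_data, parse_text):
    # In this sketch, a filter vetoes a document by returning None.
    return doc if parse_text.strip() else None


def build_document(parse_data, parse_text, score, filters):
    """Run all indexing filters in order; None means 'do not index'."""
    doc = {"fields": {}, "boost": score}
    for index_filter in filters:
        doc = index_filter(doc, parse_data, parse_text)
        if doc is None:
            return None
    return doc
```

The resulting document could then go either to a local index writer or to a remote search server; the filter chain stays the same in both cases.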
De-duplication
[Diagram: Nutch workflow with Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer and Searcher around CrawlDB, LinkDB and shards (segments); plus URL filters & normalizers, parsing/indexing filters, scoring plugins]
Workflow: de-duplication
The same page may be present in many shards
  • Obsolete versions in older shards
  • Mirrored pages, or equivalent (a.com → www.a.com)
Many other pages are almost identical
  • Template-related differences (banners, current date)
  • Font / layout changes, minor re-wording
Hmm … what is a significant change?
  • Tricky issue! Hint: what is the page content?
Near-duplicate detection and removal
  • Nutch uses approximate page signatures (fingerprints)
  • Duplicates are only marked as deleted
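One way to build an approximate signature is to hash a coarse word-frequency profile rather than the raw bytes. The sketch below is a simplified illustration of that idea, not the signature implementation shipped with Nutch; `text_signature`, `mark_duplicates` and the `quant` parameter are hypothetical.

```python
import hashlib
import re
from collections import Counter


def text_signature(text, quant=2):
    """Hash a quantized word-frequency profile of the text."""
    # Letters only: digits (dates, counters) do not affect the signature.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if len(t) > 2)
    # Keep only words seen at least quant times; round counts down to a
    # multiple of quant so that small wording changes are ignored.
    profile = sorted((tok, (n // quant) * quant)
                     for tok, n in counts.items() if n >= quant)
    blob = "\n".join(f"{tok} {n}" for tok, n in profile)
    return hashlib.md5(blob.encode("utf-8")).hexdigest()


def mark_duplicates(docs):
    """docs: {doc id: text}. Returns the ids to flag as deleted; the
    first document seen for each signature is kept."""
    seen, deleted = {}, set()
    for doc_id, text in docs.items():
        sig = text_signature(text)
        if sig in seen:
            deleted.add(doc_id)
        else:
            seen[sig] = doc_id
    return deleted
```

Which copy to keep (highest score, most recent fetch) is a policy choice left out here. Returning a set of ids to flag, instead of removing records, follows the "only marked as deleted" point above.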
Cycles may overlap
[Diagram: Nutch workflow with Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer and Searcher around CrawlDB, LinkDB and shards (segments); plus URL filters & normalizers, parsing/indexing filters, scoring plugins]
Nutch distributed search
Multiple search servers
Search front-end dispatches queries and collects search results
  • multiple access methods (API, OpenSearch XML, JSON)
Page servers build snippets
Fault-tolerant (with degraded quality)
[Diagram: search front-end(s) (web app, API, XML, JSON via NutchBean) querying search servers (Nutch or Solr) that hold indexed shards]
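The dispatch-and-collect role of the front-end, including the "fault-tolerant with degraded quality" behaviour, can be sketched as follows. This is a sequential toy version: `search_all` is a hypothetical name, each server is modelled as a plain callable returning (score, doc id) pairs, and a real front-end would query servers in parallel with timeouts.

```python
import heapq


def search_all(servers, query, top_k=10):
    """servers: {name: callable(query, top_k) -> [(score, doc_id), ...]}
    Returns (merged top_k hits, names of servers that failed)."""
    hits, failed = [], []
    for name, search in servers.items():
        try:
            hits.extend(search(query, top_k))
        except Exception:
            # A dead server costs us its shard's results, not the query.
            failed.append(name)
    best = heapq.nlargest(top_k, hits, key=lambda hit: hit[0])
    return best, failed
```

Results from a failed server's shard are simply missing from the merged list, which is the degraded quality mentioned above.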
Search configuration
Nutch syntax is limited, on purpose
  • Some queries are costly, e.g. leading wildcard, very long
  • Some queries may need implicit (hidden) expansion
Query plugins
  • From user query to Lucene/Solr query
    User query: web search
    +(url:web^4.0 anchor:web^2.0 content:web title:web^1.5 host:web^2.0)
    +(url:search^4.0 anchor:search^2.0 content:search title:search^1.5 host:search^2.0)
    url:"web search"~10^4.0 anchor:"web search"~4^2.0 content:"web search"~10 title:"web search"~10^1.5 host:"web search"~10^2.0
Search server configuration
  • Single search vs. distributed
  • Using Nutch searcher, or Solr, or a mix
  • Currently no global IDF calculation
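The hidden expansion of a user query can be reproduced with a short sketch. It assumes the boosts and phrase slops in the example read as url 4.0, anchor 2.0, content 1.0, title 1.5, host 2.0 and slop 10 (4 for anchors). `expand_query` and the two tables are hypothetical names, not Nutch query plugin code.

```python
# Per-field boosts and phrase slops, as read from the example above.
FIELD_BOOSTS = {"url": 4.0, "anchor": 2.0, "content": 1.0,
                "title": 1.5, "host": 2.0}
PHRASE_SLOP = {"url": 10, "anchor": 4, "content": 10,
               "title": 10, "host": 10}


def _boost(value):
    # A boost of 1.0 is the default and is left implicit.
    return "" if value == 1.0 else f"^{value}"


def expand_query(user_query):
    """Expand plain user terms into a fielded, boosted query string."""
    terms = user_query.lower().split()
    clauses = []
    for term in terms:
        # Every term is required (+) in at least one of the fields.
        fields = " ".join(f"{f}:{term}{_boost(b)}"
                          for f, b in FIELD_BOOSTS.items())
        clauses.append(f"+({fields})")
    if len(terms) > 1:
        # Optional sloppy phrase clauses reward terms that occur together.
        phrase = " ".join(terms)
        clauses.append(" ".join(
            f'{f}:"{phrase}"~{PHRASE_SLOP[f]}{_boost(b)}'
            for f, b in FIELD_BOOSTS.items()))
    return " ".join(clauses)
```

Calling `expand_query("web search")` yields the three-part query shown above: one required clause per term, followed by optional phrase clauses.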
33
Link inversion
[Diagram: Nutch workflow. Components: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher. Data stores: CrawlDB, LinkDB, shards (segments). Extension points: URL filters & normalizers, parsing/indexing filters, scoring plugins.]
34
Workflow: link inversion
Pages have outgoing links (outlinks)
  … I know where I'm pointing to
Question: who points to me?
  … I don't know, there is no catalog of pages
  … NOBODY knows for sure either
In-degree indicates importance of the page
Anchor text provides important semantic info
Partial answer: invert the outlinks that I know about, and group by target
[Diagram: outlink lists (one src, several tgt) inverted into inlink lists (one tgt, several src).]
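The inversion step described on this slide can be sketched in a few lines. This is only an illustration of the idea (invert known outlinks and group them by target), not Nutch's actual map-reduce job; the function name invert_links, the page names and the in-memory dictionaries are assumptions made for the example.

```python
from collections import defaultdict


def invert_links(outlinks):
    """Turn {source: [(target, anchor_text), ...]} into {target: [(source, anchor_text), ...]}."""
    inlinks = defaultdict(list)
    for src, links in outlinks.items():
        for tgt, anchor in links:
            # Group by target: each target collects the pages that point to it,
            # together with the anchor text used by the linking page.
            inlinks[tgt].append((src, anchor))
    return dict(inlinks)


# Hypothetical outlink data for three pages.
outlinks = {
    "page1": [("page2", "two"), ("page3", "three")],
    "page2": [("page1", "home")],
    "page3": [("page1", "start")],
}

inlinks = invert_links(outlinks)
# In-degree is only a lower bound: it counts the inlinks we happen to know about.
in_degree = {tgt: len(srcs) for tgt, srcs in inlinks.items()}
```

The result is a partial answer in exactly the sense of the slide: a page's inlink list only contains links from pages that were already fetched and parsed.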
35
Page importance - scoring
[Diagram: Nutch workflow. Components: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher. Data stores: CrawlDB, LinkDB, shards (segments). Extension points: URL filters & normalizers, parsing/indexing filters, scoring plugins.]
36
Workflow: page scoring
Query-independent importance factor
  • Affects search ranking
  • Affects crawl prioritization
May include arbitrary decision of the operator
Currently two systems (plugins + tools)
OPIC scoring
  • Doesn't require explicit steps - "online"
  • Difficult to stabilize
PageRank scoring with loop detection
  • Periodically run to update CrawlDb scores
  • Computationally intensive, esp. loop detection
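To make the "periodically run, computationally intensive" point concrete, here is a minimal sketch of a textbook PageRank power iteration over a small in-memory graph. It is not the Nutch scoring tool: it omits loop detection entirely, and the damping factor, the iteration count and the function name pagerank are assumptions chosen for the example.

```python
def pagerank(outlinks, damping=0.85, iterations=20):
    """Textbook PageRank on {source: [target, ...]}; returns {page: score}."""
    pages = set(outlinks) | {t for targets in outlinks.values() for t in targets}
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base score, then receives shares from its inlinks.
        new = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            targets = outlinks.get(src, [])
            if targets:
                share = damping * score[src] / len(targets)
                for tgt in targets:
                    new[tgt] += share
            else:
                # Dangling page (no known outlinks): spread its score evenly.
                share = damping * score[src] / n
                for p in pages:
                    new[p] += share
        score = new
    return score


# Hypothetical three-page graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
```

Each iteration touches every known link, which is why this kind of scoring is run as a separate periodic job rather than during fetching.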
37
Indexing
[Diagram: Nutch workflow. Components: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher. Data stores: CrawlDB, LinkDB, shards (segments). Extension points: URL filters & normalizers, parsing/indexing filters, scoring plugins.]
38
Workflow: indexing
Indexing plugins
  • Create full-text search index from segment data
  • Apply adjustments to score per document
  • May further post-process the parsed content (e.g. language identification) to facilitate advanced search
Indexers
  • Lucene indexer – builds indexes to be served by Nutch search servers
  • Solr indexer – submits documents to a Solr instance
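The role of indexing plugins can be pictured as a chain of small functions, each of which adds fields to a document or adjusts its score before the document is handed to an indexer. The sketch below is a simplified illustration under that assumption; the function names, the dictionary layout and the naive language check are all hypothetical and do not reproduce the Nutch plugin interfaces.

```python
def basic_fields(doc, parsed):
    """Copy the core searchable fields from the parsed page."""
    doc["fields"]["title"] = parsed["title"]
    doc["fields"]["content"] = parsed["text"]
    return doc


def language_tag(doc, parsed):
    """Naive stand-in for language identification (a real plugin would use a trained model)."""
    words = set(parsed["text"].lower().split())
    doc["fields"]["lang"] = "en" if words & {"the", "and", "of"} else "unknown"
    return doc


def boost_short_urls(doc, parsed):
    """Per-document score adjustment: favour top-level pages (arbitrary example rule)."""
    if parsed["url"].count("/") <= 3:
        doc["boost"] *= 1.5
    return doc


def build_document(parsed, filters):
    """Run the parsed page through each filter in order; a filter may return None to drop it."""
    doc = {"id": parsed["url"], "fields": {}, "boost": 1.0}
    for apply_filter in filters:
        doc = apply_filter(doc, parsed)
        if doc is None:
            return None
    return doc


parsed_page = {"url": "http://example.com/", "title": "Example", "text": "The quick brown fox"}
document = build_document(parsed_page, [basic_fields, language_tag, boost_short_urls])
```

The finished document could then go to either kind of indexer mentioned on the slide; that hand-off is left out here.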
39
De-duplication
[Diagram: Nutch workflow. Components: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher. Data stores: CrawlDB, LinkDB, shards (segments). Extension points: URL filters & normalizers, parsing/indexing filters, scoring plugins.]
40
Workflow: de-duplication
The same page may be present in many shards
  • Obsolete versions in older shards
  • Mirrored pages or equivalent (a.com → www.a.com)
Many other pages are almost identical
  • Template-related differences (banners, current date)
  • Font / layout changes, minor re-wording
Hmm … what is a significant change?
  • Tricky issue! Hint: what is the page content?
Near-duplicate detection and removal
  • Nutch uses approximate page signatures (fingerprints)
  • Duplicates are only marked as deleted
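One way to build an approximate page signature is to hash a coarse term-frequency profile of the text instead of the raw bytes, so that small differences (a date, a banner word) do not change the fingerprint. The sketch below illustrates that general idea; it is not the signature implementation shipped with Nutch, and the quantization step, the token rule and the function names are assumptions.

```python
import hashlib
import re
from collections import Counter


def approx_signature(text, quant=2, min_token_len=2):
    """Hash a quantized term-frequency profile; rare tokens do not affect the result."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if len(t) >= min_token_len)
    # Round each frequency down to a multiple of `quant`; tokens that fall to 0 are dropped.
    profile = {t: (c // quant) * quant for t, c in counts.items()}
    profile = {t: c for t, c in profile.items() if c > 0}
    ordered = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
    digest_input = "\n".join(f"{t} {c}" for t, c in ordered)
    return hashlib.md5(digest_input.encode("utf-8")).hexdigest()


def mark_duplicates(docs):
    """Keep the first document per signature; later ones are only flagged, not removed."""
    seen = set()
    for doc in docs:
        sig = approx_signature(doc["text"])
        doc["deleted"] = sig in seen
        seen.add(sig)
    return docs


# Hypothetical pages: the first two differ only in a date-like word.
docs = mark_duplicates([
    {"url": "http://a.example/", "text": "Nutch crawls the web. Nutch indexes the web. Updated: Monday"},
    {"url": "http://www.a.example/", "text": "Nutch crawls the web. Nutch indexes the web. Updated: Tuesday"},
    {"url": "http://b.example/", "text": "Solr serves search results. Solr serves facets."},
])
```

How coarse the profile should be is exactly the "what is a significant change?" question from the slide: a larger quantization step merges more pages, at the risk of treating genuinely different pages as duplicates.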
41
Cycles may overlap
[Diagram: Nutch workflow. Components: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher. Data stores: CrawlDB, LinkDB, shards (segments). Extension points: URL filters & normalizers, parsing/indexing filters, scoring plugins.]
42
Nutch distributed search
Multiple search servers
Search front-end dispatches queries and collects search results
  • multiple access methods (API, OpenSearch XML, JSON)
Page servers build snippets
Fault-tolerant, with degraded quality
[Diagram labels: search front-end(s) (web app, API, XML, JSON, NutchBean); search server (Nutch or Solr); indexed shards.]
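The dispatch-and-collect behaviour of the front-end, including the "fault-tolerant, with degraded quality" property, can be sketched as follows. The search servers are modelled as plain Python callables returning (score, url) pairs; these stand-ins, the function name search_all and the timeout value are assumptions for illustration and say nothing about the actual Nutch or Solr client code.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_all(servers, query, top_n=3, timeout=1.0):
    """Send the query to every server in parallel and merge the best hits."""
    hits, failed = [], 0
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = [pool.submit(server, query) for server in servers]
        for future in futures:
            try:
                hits.extend(future.result(timeout=timeout))
            except Exception:
                # A failed or slow server only degrades the result set;
                # the query as a whole still succeeds.
                failed += 1
    best = heapq.nlargest(top_n, hits, key=lambda hit: hit[0])
    return best, failed


# Hypothetical search servers, each responsible for its own shards.
def shard_a(query):
    return [(0.9, "http://a.example/1"), (0.4, "http://a.example/2")]


def shard_b(query):
    return [(0.7, "http://b.example/1")]


def shard_down(query):
    raise ConnectionError("search server unreachable")


results, failed_servers = search_all([shard_a, shard_b, shard_down], "web search")
```

Merging by raw score only makes sense if scores from different servers are comparable, which is a real concern when each server computes term statistics over its own shards.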
43
Search configuration
Nutch syntax is limited – on purpose
  • Some queries are costly, e.g. leading wildcard, very long
  • Some queries may need implicit (hidden) expansion
Query plugins
  • From user query to Lucene/Solr query
User query: web search
  +(url:web^4.0 anchor:web^2.0 content:web title:web^1.5 host:web^2.0)
  +(url:search^4.0 anchor:search^2.0 content:search title:search^1.5 host:search^2.0)
  url:"web search"~10^4.0 anchor:"web search"~4^2.0 content:"web search"~10 title:"web search"~10^1.5 host:"web search"~10^2.0
Search server configuration
  • Single search vs. distributed
  • Using Nutch searcher, or Solr, or a mix
  • Currently no global IDF calculation
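The hidden expansion of a user query into a fielded query can be sketched as below. The field names, boosts and phrase slops are read off the example on this slide, assuming the usual field:term^boost and "phrase"~slop notation; the function expand and the table FIELDS are hypothetical, and no escaping of special characters is attempted.

```python
# (field, boost, phrase slop) as read from the slide example; content carries no explicit boost.
FIELDS = [
    ("url", 4.0, 10),
    ("anchor", 2.0, 4),
    ("content", 1.0, 10),
    ("title", 1.5, 10),
    ("host", 2.0, 10),
]


def _boost(value):
    """Render a boost suffix, omitting the default boost of 1.0."""
    return "" if value == 1.0 else f"^{value}"


def expand(user_query):
    """Expand a plain user query into required per-term clauses plus optional phrase clauses."""
    terms = user_query.lower().split()
    clauses = []
    for term in terms:
        # Every term is required, but may match in any of the fields.
        alternatives = " ".join(f"{field}:{term}{_boost(boost)}" for field, boost, _ in FIELDS)
        clauses.append(f"+({alternatives})")
    if len(terms) > 1:
        # Optional sloppy-phrase clauses reward documents where the terms occur close together.
        phrase = " ".join(terms)
        for field, boost, slop in FIELDS:
            clauses.append(f'{field}:"{phrase}"~{slop}{_boost(boost)}')
    return " ".join(clauses)


expanded = expand("web search")
```

Keeping the user-facing syntax small and doing the expansion in code is what lets the operator tune boosts or add fields without changing how users type queries.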
44
Deployment: single-server
The easiest but the most limited deployment
Centralized storage, centralized processing
Hadoop LocalFS and LocalJobTracker
Drop nutch.war in Tomcat/webapps and point to shards
[Diagram labels: search front-end; LocalFS storage; Fetcher; Indexer; DB mgmt.]
45
Deployment: distributed search
Local storage on search servers preferred (perf)
[Diagram labels: search front-end; several search servers; LocalFS storage; Fetcher; Indexer; DB mgmt.]
46
Deployment: map-reduce processing & distrib. search
Fully distributed storage and processing
Local storage on search servers preferred (perf)
[Diagram labels: search front-end; several search servers; DFS Name Node (Job Tracker); several DFS Data Nodes; Task Trackers; Fetcher, Indexer, DB mgmt.]
47
Nutch on a Hadoop cluster
Assumes an up & running Hadoop cluster
  … which is covered elsewhere
Build nutch*.job jar and use it as usual:
  bin/hadoop jar nutch*.job <className> <args>
Note: Nutch configuration is inside nutch*.job
  – When you change it you need to rebuild job jar
Searching is not a map-reduce job – often on a separate group of machines
49
Nutch index
50
… and press Search
51
Conclusions
(This overview is a tip of the iceberg)
Nutch
  Implements all core search engine components
  Scales well
  Extremely configurable and modular
  It's a complete solution – and a toolkit
52
Future of Nutch
Avoid code duplication
Parsing → Tika
  • Almost total overlap
  • Still some missing functionality in Tika → contribute
Plugins → OSGI
  • Home-grown plugin system has some deficiencies
  • Initial port available
Indexing & Search → Solr, Zookeeper
  • Distributed and replicated search is difficult
  • Initial integration needs significant improvement
  • Shard management - Katta
Web-graph & page repository → HBase
  • Combine CrawlDB, LinkDB and shard storage
  • Avoid tedious shard management
  • Initial port available
53
Future of Nutch (2)
What's left then?
  • Crawling frontier management, discovery
  • Re-crawl algorithms
  • Spider trap handling
  • Fetcher
  • Ranking: enterprise-specific, user-feedback
  • Duplicate detection, URL aliasing (mirror detection)
  • Template detection and cleanup, pagelet-level crawling
  • Spam & junk control
Share code with other crawler projects → crawler-commons
Vision: à la carte toolkit, scalable from 1 to 1000s of nodes
54
Summary
Q&A
Further information
  • http://lucene.apache.org/nutch
38
Workflow: indexing
Indexing plugins
• Create full-text search index from segment data
• Apply adjustments to score per document
• May further post-process the parsed content (e.g. language identification) to facilitate advanced search
Indexers
• Lucene indexer – builds indexes to be served by Nutch search servers
• Solr indexer – submits documents to a Solr instance
39
De-duplication
[Diagram: Nutch workflow. Components: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher. Data stores: CrawlDB, LinkDB, Shards (segments). Plugins: URL filters & normalizers, parsing/indexing filters, scoring plugins.]
40
Workflow: de-duplication
The same page may be present in many shards
• Obsolete versions in older shards
• Mirrored pages or equivalent (a.com → www.a.com)
Many other pages are almost identical
• Template-related differences (banners, current date)
• Font, layout changes, minor re-wording
Hmm … what is a significant change?
• Tricky issue! Hint: what is the page content?
Near-duplicate detection and removal
• Nutch uses approximate page signatures (fingerprints)
• Duplicates are only marked as deleted
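As a rough illustration of what an approximate page signature might look like, the Python sketch below fingerprints a page from its quantized term frequencies, so that small differences (a changed date, a reworded banner) tend to produce the same signature. This is a sketch under stated assumptions, not Nutch's actual signature code: `page_signature`, `mark_duplicates`, the `quant_rate` parameter and the `(doc_id, fetch_time, text)` tuple layout are hypothetical names chosen for the example. In line with the slide, duplicates are only collected in a "marked as deleted" set; nothing is physically removed.

```python
import hashlib
import re
from collections import Counter


def page_signature(text, quant_rate=0.01):
    """Approximate fingerprint: hash of the quantized term-frequency profile.

    Hypothetical helper for illustration only.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    if not counts:
        return hashlib.sha256(b"").hexdigest()
    # Quantization step: counts are rounded down to a multiple of `quant`,
    # and terms rarer than `quant` are dropped from the profile entirely.
    quant = max(2, int(max(counts.values()) * quant_rate))
    profile = []
    for term, count in counts.items():
        quantized = (count // quant) * quant
        if quantized > 0:
            profile.append((term, quantized))
    # Sort so the fingerprint does not depend on token order in the page.
    profile.sort(key=lambda item: (-item[1], item[0]))
    payload = "\n".join(f"{term} {q}" for term, q in profile)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def mark_duplicates(pages):
    """Return the ids to mark as deleted, keeping the newest page per signature.

    `pages` is an iterable of (doc_id, fetch_time, text) tuples (assumed layout).
    """
    newest = {}      # signature -> (fetch_time, doc_id) of the page kept so far
    deleted = set()  # ids that are only marked, never physically removed
    for doc_id, fetch_time, text in pages:
        sig = page_signature(text)
        kept = newest.get(sig)
        if kept is None:
            newest[sig] = (fetch_time, doc_id)
        elif fetch_time > kept[0]:
            deleted.add(kept[1])
            newest[sig] = (fetch_time, doc_id)
        else:
            deleted.add(doc_id)
    return deleted
```

The quantization step is what makes the signature approximate: raising `quant_rate` makes more pages collapse into the same fingerprint, which is exactly the "what is a significant change?" trade-off raised above.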
41
Cycles may overlap
[Diagram: Nutch workflow. Components: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher. Data stores: CrawlDB, LinkDB, Shards (segments). Plugins: URL filters & normalizers, parsing/indexing filters, scoring plugins.]
42
Nutch distributed search
Multiple search servers
Search front-end dispatches queries and collects search results
• multiple access methods (API, OpenSearch XML, JSON)
Page servers build snippets
Fault-tolerant (with degraded quality)
[Diagram: Search front-end(s) (web app, API, XML, JSON, NutchBean); Search server (Nutch or Solr); indexed shards]
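The dispatch-and-collect behaviour of the search front-end can be sketched in a few lines of Python. This is a minimal illustration, not NutchBean's actual code: `distributed_search` is a hypothetical function, and each "search server" is assumed to be a plain callable taking `(query, top_k)` and returning a list of `(score, doc_id)` pairs. A server that raises an exception is skipped, which is one way to read "fault-tolerant with degraded quality": the query still returns results, but the hits from the failed shard are missing.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def distributed_search(query, servers, top_k=10):
    """Send `query` to every search server and merge the partial results.

    Returns (merged_hits, failed_count). `servers` is a list of callables
    (an assumption of this sketch), each returning [(score, doc_id), ...].
    """

    def ask(server):
        try:
            return server(query, top_k)
        except Exception:
            # A failing shard degrades result quality but does not fail the query.
            return None

    # Query all servers in parallel; map() preserves the order of `servers`.
    with ThreadPoolExecutor(max_workers=max(1, len(servers))) as pool:
        partials = list(pool.map(ask, servers))

    failed = sum(1 for partial in partials if partial is None)
    hits = [hit for partial in partials if partial is not None for hit in partial]
    # Keep only the globally best top_k hits, highest score first.
    merged = heapq.nlargest(top_k, hits, key=lambda hit: hit[0])
    return merged, failed
```

Note that the merge simply compares scores across shards. Without globally consistent term statistics, scores computed on different shards are not strictly comparable, so a merge like this is only an approximation of a single-index ranking.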
43
Search configuration
Nutch syntax is limited – on purpose
• Some queries are costly, e.g. leading wildcard, very long
• Some queries may need implicit (hidden) expansion
Query plugins
• From user query to Lucene/Solr query
User query: web search
+(url:web^4.0 anchor:web^2.0 content:web title:web^1.5 host:web^2.0)
+(url:search^4.0 anchor:search^2.0 content:search title:search^1.5 host:search^2.0)
url:"web search"~10^4.0 anchor:"web search"~4^2.0 content:"web search"~10 title:"web search"~10^1.5 host:"web search"~10^2.0
Search server configuration
• Single search vs. distributed
• Using Nutch searcher, or Solr, or a mix
• Currently no global IDF calculation
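To make the hidden expansion concrete, here is a small Python sketch that turns a plain user query into a multi-field query string of the shape shown on this slide. It is an illustration only, not the Nutch query plugin code: `expand_query`, `FIELD_BOOSTS` and `PHRASE_SLOP` are hypothetical names, and the boost and slop values are read off the slide's "web search" example, assuming the extraction dropped the decimal points (4.0, 2.0, 1.5).

```python
# Per-field boosts and phrase slops, taken from the slide's example (assumed values).
FIELD_BOOSTS = {"url": 4.0, "anchor": 2.0, "content": 1.0, "title": 1.5, "host": 2.0}
PHRASE_SLOP = {"url": 10, "anchor": 4, "content": 10, "title": 10, "host": 10}


def _boost(value):
    """Render a boost suffix; a boost of 1.0 is the default and is omitted."""
    return "" if value == 1.0 else f"^{value}"


def expand_query(user_query):
    """Expand a plain user query into a multi-field, Lucene-style query string."""
    terms = user_query.lower().split()
    clauses = []
    # Every term is required (+) but may match in any of the fields.
    for term in terms:
        per_field = " ".join(
            f"{field}:{term}{_boost(boost)}" for field, boost in FIELD_BOOSTS.items()
        )
        clauses.append(f"+({per_field})")
    # Optional sloppy-phrase clauses reward documents where the terms occur together.
    if len(terms) > 1:
        phrase = " ".join(terms)
        for field, boost in FIELD_BOOSTS.items():
            clauses.append(f'{field}:"{phrase}"~{PHRASE_SLOP[field]}{_boost(boost)}')
    return " ".join(clauses)
```

Calling `expand_query("web search")` yields two required per-term groups followed by five optional phrase clauses, which is why a two-word query is considerably more expensive to evaluate than it looks to the user.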
44
Deployment: single-server
The easiest but the most limited deployment
Centralized storage, centralized processing
Hadoop LocalFS and LocalJobTracker
Drop nutch.war in Tomcat/webapps and point to shards
[Diagram: Search front-end; LocalFS storage; Fetcher; Indexer; DB mgmt]
45
Deployment: distributed search
Local storage on search servers preferred (perf.)
[Diagram: Search front-end; Search servers; LocalFS storage; Fetcher; Indexer; DB mgmt]
46
Deployment: map-reduce processing & distrib. search
Fully distributed storage and processing
Local storage on search servers preferred (perf.)
[Diagram: Search front-end; Search servers; DFS Data Nodes; DFS Name Node; Job Tracker; Task Trackers; Fetcher, Indexer, DB mgmt]
47
Nutch on a Hadoop cluster
Assumes an up & running Hadoop cluster … which is covered elsewhere
Build nutch.job jar and use it as usual:
bin/hadoop jar nutch.job <className> <args>
Note: Nutch configuration is inside nutch.job
− When you change it you need to rebuild the job jar
Searching is not a map-reduce job – often on a separate group of machines
49
Nutch index
50
… and press Search
51
Conclusions
(This overview is a tip of the iceberg)
Nutch
Implements all core search engine components
Scales well
Extremely configurable and modular
It's a complete solution – and a toolkit
52
Future of Nutch
Avoid code duplication
Parsing → Tika
• Almost total overlap
• Still some missing functionality in Tika → contribute
Plugins → OSGI
• Home-grown plugin system has some deficiencies
• Initial port available
Indexing & Search → Solr, Zookeeper
• Distributed and replicated search is difficult
• Initial integration needs significant improvement
• Shard management - Katta
Web-graph & page repository → HBase
• Combine CrawlDB, LinkDB and shard storage
• Avoid tedious shard management
• Initial port available
53
Future of Nutch (2)
What's left then?
• Crawling frontier management, discovery
• Re-crawl algorithms
• Spider trap handling
• Fetcher
• Ranking: enterprise-specific, user-feedback
• Duplicate detection, URL aliasing (mirror detection)
• Template detection and cleanup, pagelet-level crawling
• Spam & junk control
Share code with other crawler projects → crawler-commons
Vision: à la carte toolkit, scalable from 1-1000s nodes
54
Summary
Q&A
Further information
• http://lucene.apache.org/nutch
39
Nut
ch ndash
Apa
cheC
on U
S 0
9
De-duplication
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
40
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow de-duplicationThe same page may be present in many shards
bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical
bull Template-related differences (banners current date)bull Font layout changes minor re-wording
Hmm hellip what is a significant change bull Tricky issue Hint what is the page content
Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted
41
Nut
ch ndash
Apa
cheC
on U
S 0
9
Cycles may overlap
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
42
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch distributed search
Multiple search serversSearch front-end dispatches queries and collects search results
bull multiple access methods (API OpenSearch XML JSON)
Page servers build snippetsFault-tolerant
with degraded quality
Search front-end(s)Web app
APIXML
NutchBean
JSON
Search server(Nutch or Solr)
indexed shards
43
Nut
ch ndash
Apa
cheC
on U
S 0
9
Search configurationNutch syntax is limited ndash on purpose
bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins
bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20
Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation
44
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards
Searchfront-endSearch
front-end
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
45
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment distributed searchLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
46
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment map-reduce processing amp distrib search
Fully distributed storage and processingLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
DFS Data NodeDFS Data Node
DFS Data Node
DFS Name NodeDFS Name Node(Job Tracker)
FetcherIndexer
DB mgmt
(Job Tracker)FetcherIndexer
DB mgmt
Task TrackerTask Tracker
Task Tracker
47
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch on a Hadoop cluster
Assumes an up amp running Hadoop clusterhellip which is covered elsewhere
Build nutchjob jar and use it as usual
binhadoop jar nutchjob ltclassNamegt ltargsgt
Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar
Searching is not a map-reduce job ndash often on a separate group of machines
48
Nut
ch ndash
Apa
cheC
on U
S 0
9
49
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch index
50
Nut
ch ndash
Apa
cheC
on U
S 0
9
hellip and press Search
51
Nut
ch ndash
Apa
cheC
on U
S 0
9
Conclusions(This overview is a tip of the iceberg)
NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit
52
Nut
ch ndash
Apa
cheC
on U
S 0
9
Future of NutchAvoid code duplicationParsing rarr Tika
bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI
bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper
bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase
bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available
53
Nut
ch ndash
Apa
cheC
on U
S 0
9
Future of Nutch (2)Whats left then
bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons
Vision aacute la carte toolkit scalable from1-1000s nodes
54
Nut
ch ndash
Apa
cheC
on U
S 0
9
Summary
QampA
Further informationbull httpluceneapacheorgnutch
40
Nut
ch ndash
Apa
cheC
on U
S 0
9
Workflow de-duplicationThe same page may be present in many shards
bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical
bull Template-related differences (banners current date)bull Font layout changes minor re-wording
Hmm hellip what is a significant change bull Tricky issue Hint what is the page content
Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted
41
Nut
ch ndash
Apa
cheC
on U
S 0
9
Cycles may overlap
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
42
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch distributed search
Multiple search serversSearch front-end dispatches queries and collects search results
bull multiple access methods (API OpenSearch XML JSON)
Page servers build snippetsFault-tolerant
with degraded quality
Search front-end(s)Web app
APIXML
NutchBean
JSON
Search server(Nutch or Solr)
indexed shards
43
Nut
ch ndash
Apa
cheC
on U
S 0
9
Search configurationNutch syntax is limited ndash on purpose
bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins
bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20
Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation
44
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards
Searchfront-endSearch
front-end
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
45
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment distributed searchLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
46
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment map-reduce processing amp distrib search
Fully distributed storage and processingLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
DFS Data NodeDFS Data Node
DFS Data Node
DFS Name NodeDFS Name Node(Job Tracker)
FetcherIndexer
DB mgmt
(Job Tracker)FetcherIndexer
DB mgmt
Task TrackerTask Tracker
Task Tracker
47
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch on a Hadoop cluster
Assumes an up amp running Hadoop clusterhellip which is covered elsewhere
Build nutchjob jar and use it as usual
binhadoop jar nutchjob ltclassNamegt ltargsgt
Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar
Searching is not a map-reduce job ndash often on a separate group of machines
48
Nut
ch ndash
Apa
cheC
on U
S 0
9
49
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch index
50
Nut
ch ndash
Apa
cheC
on U
S 0
9
hellip and press Search
51
Nut
ch ndash
Apa
cheC
on U
S 0
9
Conclusions(This overview is a tip of the iceberg)
NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit
52
Nut
ch ndash
Apa
cheC
on U
S 0
9
Future of NutchAvoid code duplicationParsing rarr Tika
bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI
bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper
bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase
bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available
53
Nut
ch ndash
Apa
cheC
on U
S 0
9
Future of Nutch (2)Whats left then
bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons
Vision aacute la carte toolkit scalable from1-1000s nodes
54
Nut
ch ndash
Apa
cheC
on U
S 0
9
Summary
QampA
Further informationbull httpluceneapacheorgnutch
41
Nut
ch ndash
Apa
cheC
on U
S 0
9
Cycles may overlap
Fetcher
Parser
Searcher
IndexerUpdater
Generator
Shards(segments)
LinkDB
CrawlDB
Linkinverter
Injector
URL filters amp normalizers parsingindexing filters scoring plugins
42
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch distributed search
Multiple search serversSearch front-end dispatches queries and collects search results
bull multiple access methods (API OpenSearch XML JSON)
Page servers build snippetsFault-tolerant
with degraded quality
Search front-end(s)Web app
APIXML
NutchBean
JSON
Search server(Nutch or Solr)
indexed shards
43
Nut
ch ndash
Apa
cheC
on U
S 0
9
Search configurationNutch syntax is limited ndash on purpose
bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins
bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20
Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation
44
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards
Searchfront-endSearch
front-end
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
45
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment distributed searchLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
46
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment map-reduce processing amp distrib search
Fully distributed storage and processingLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
DFS Data NodeDFS Data Node
DFS Data Node
DFS Name NodeDFS Name Node(Job Tracker)
FetcherIndexer
DB mgmt
(Job Tracker)FetcherIndexer
DB mgmt
Task TrackerTask Tracker
Task Tracker
47
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch on a Hadoop cluster
Assumes an up amp running Hadoop clusterhellip which is covered elsewhere
Build nutchjob jar and use it as usual
binhadoop jar nutchjob ltclassNamegt ltargsgt
Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar
Searching is not a map-reduce job ndash often on a separate group of machines
48
Nut
ch ndash
Apa
cheC
on U
S 0
9
49
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch index
50
Nut
ch ndash
Apa
cheC
on U
S 0
9
hellip and press Search
51
Nut
ch ndash
Apa
cheC
on U
S 0
9
Conclusions(This overview is a tip of the iceberg)
NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit
52
Nut
ch ndash
Apa
cheC
on U
S 0
9
Future of NutchAvoid code duplicationParsing rarr Tika
bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI
bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper
bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase
bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available
53
Nut
ch ndash
Apa
cheC
on U
S 0
9
Future of Nutch (2)Whats left then
bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons
Vision aacute la carte toolkit scalable from1-1000s nodes
54
Nut
ch ndash
Apa
cheC
on U
S 0
9
Summary
QampA
Further informationbull httpluceneapacheorgnutch
42
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch distributed search
Multiple search serversSearch front-end dispatches queries and collects search results
bull multiple access methods (API OpenSearch XML JSON)
Page servers build snippetsFault-tolerant
with degraded quality
Search front-end(s)Web app
APIXML
NutchBean
JSON
Search server(Nutch or Solr)
indexed shards
43
Nut
ch ndash
Apa
cheC
on U
S 0
9
Search configurationNutch syntax is limited ndash on purpose
bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins
bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20
Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation
44
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards
Searchfront-endSearch
front-end
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
45
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment distributed searchLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
46
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment map-reduce processing amp distrib search
Fully distributed storage and processingLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
DFS Data NodeDFS Data Node
DFS Data Node
DFS Name NodeDFS Name Node(Job Tracker)
FetcherIndexer
DB mgmt
(Job Tracker)FetcherIndexer
DB mgmt
Task TrackerTask Tracker
Task Tracker
47
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch on a Hadoop cluster
Assumes an up amp running Hadoop clusterhellip which is covered elsewhere
Build nutchjob jar and use it as usual
binhadoop jar nutchjob ltclassNamegt ltargsgt
Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar
Searching is not a map-reduce job ndash often on a separate group of machines
48
Nut
ch ndash
Apa
cheC
on U
S 0
9
49
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch index
50
Nut
ch ndash
Apa
cheC
on U
S 0
9
hellip and press Search
51
Nut
ch ndash
Apa
cheC
on U
S 0
9
Conclusions(This overview is a tip of the iceberg)
NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit
52
Nut
ch ndash
Apa
cheC
on U
S 0
9
Future of NutchAvoid code duplicationParsing rarr Tika
bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI
bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper
bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase
bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available
53
Nut
ch ndash
Apa
cheC
on U
S 0
9
Future of Nutch (2)Whats left then
bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons
Vision aacute la carte toolkit scalable from1-1000s nodes
54
Nut
ch ndash
Apa
cheC
on U
S 0
9
Summary
QampA
Further informationbull httpluceneapacheorgnutch
43
Nut
ch ndash
Apa
cheC
on U
S 0
9
Search configurationNutch syntax is limited ndash on purpose
bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins
bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20
Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation
44
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards
Searchfront-endSearch
front-end
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
45
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment distributed searchLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
46
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment map-reduce processing amp distrib search
Fully distributed storage and processingLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
DFS Data NodeDFS Data Node
DFS Data Node
DFS Name NodeDFS Name Node(Job Tracker)
FetcherIndexer
DB mgmt
(Job Tracker)FetcherIndexer
DB mgmt
Task TrackerTask Tracker
Task Tracker
47
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch on a Hadoop cluster
Assumes an up amp running Hadoop clusterhellip which is covered elsewhere
Build nutchjob jar and use it as usual
binhadoop jar nutchjob ltclassNamegt ltargsgt
Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar
Searching is not a map-reduce job ndash often on a separate group of machines
48
Nut
ch ndash
Apa
cheC
on U
S 0
9
49
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch index
50
Nut
ch ndash
Apa
cheC
on U
S 0
9
hellip and press Search
51
Nut
ch ndash
Apa
cheC
on U
S 0
9
Conclusions(This overview is a tip of the iceberg)
NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit
52
Nut
ch ndash
Apa
cheC
on U
S 0
9
Future of NutchAvoid code duplicationParsing rarr Tika
bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI
bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper
bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase
bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available
53
Nut
ch ndash
Apa
cheC
on U
S 0
9
Future of Nutch (2)Whats left then
bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons
Vision aacute la carte toolkit scalable from1-1000s nodes
54
Nut
ch ndash
Apa
cheC
on U
S 0
9
Summary
QampA
Further informationbull httpluceneapacheorgnutch
44
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards
Searchfront-endSearch
front-end
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
45
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment distributed searchLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
46
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment map-reduce processing amp distrib search
Fully distributed storage and processingLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
DFS Data NodeDFS Data Node
DFS Data Node
DFS Name NodeDFS Name Node(Job Tracker)
FetcherIndexer
DB mgmt
(Job Tracker)FetcherIndexer
DB mgmt
Task TrackerTask Tracker
Task Tracker
47
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch on a Hadoop cluster
Assumes an up amp running Hadoop clusterhellip which is covered elsewhere
Build nutchjob jar and use it as usual
binhadoop jar nutchjob ltclassNamegt ltargsgt
Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar
Searching is not a map-reduce job ndash often on a separate group of machines
48
Nut
ch ndash
Apa
cheC
on U
S 0
9
49
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch index
50
Nut
ch ndash
Apa
cheC
on U
S 0
9
hellip and press Search
51
Nut
ch ndash
Apa
cheC
on U
S 0
9
Conclusions(This overview is a tip of the iceberg)
NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit
52
Nut
ch ndash
Apa
cheC
on U
S 0
9
Future of NutchAvoid code duplicationParsing rarr Tika
bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI
bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper
bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase
bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available
53
Nut
ch ndash
Apa
cheC
on U
S 0
9
Future of Nutch (2)Whats left then
bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons
Vision aacute la carte toolkit scalable from1-1000s nodes
54
Nut
ch ndash
Apa
cheC
on U
S 0
9
Summary
QampA
Further informationbull httpluceneapacheorgnutch
45
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment distributed searchLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
LocalFSStorage
LocalFSStorage
FetcherFetcher
IndexerIndexer
DB mgmtDB mgmt
46
Nut
ch ndash
Apa
cheC
on U
S 0
9
Deployment map-reduce processing amp distrib search
Fully distributed storage and processingLocal storage on search servers preferred (perf)
Searchfront-endSearch
front-end
Search serverSearch server
Search server
DFS Data NodeDFS Data Node
DFS Data Node
DFS Name NodeDFS Name Node(Job Tracker)
FetcherIndexer
DB mgmt
(Job Tracker)FetcherIndexer
DB mgmt
Task TrackerTask Tracker
Task Tracker
47
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch on a Hadoop cluster
Assumes an up amp running Hadoop clusterhellip which is covered elsewhere
Build nutchjob jar and use it as usual
binhadoop jar nutchjob ltclassNamegt ltargsgt
Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar
Searching is not a map-reduce job ndash often on a separate group of machines
48
Nut
ch ndash
Apa
cheC
on U
S 0
9
49
Nut
ch ndash
Apa
cheC
on U
S 0
9
Nutch index
50
Nut
ch ndash
Apa
cheC
on U
S 0
9
hellip and press Search
51
Nut
ch ndash
Apa
cheC
on U
S 0
9
Conclusions(This overview is a tip of the iceberg)
NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit
52
Nut
ch ndash
Apa
cheC
on U
S 0
9
Future of Nutch
Avoid code duplication
Parsing → Tika
  • Almost total overlap
  • Still some missing functionality in Tika → contribute
Plugins → OSGi
  • Home-grown plugin system has some deficiencies
  • Initial port available
Indexing & Search → Solr, ZooKeeper
  • Distributed and replicated search is difficult
  • Initial integration needs significant improvement
  • Shard management - Katta
Web-graph & page repository → HBase
  • Combine CrawlDB, LinkDB and shard storage
  • Avoid tedious shard management
  • Initial port available
Future of Nutch (2)
What's left then?
  • Crawling frontier management, discovery
  • Re-crawl algorithms
  • Spider trap handling
  • Fetcher
  • Ranking: enterprise-specific, user-feedback
  • Duplicate detection, URL aliasing (mirror detection)
  • Template detection and cleanup, pagelet-level crawling
  • Spam & junk control
Share code with other crawler projects → crawler-commons
Vision: à la carte toolkit, scalable from 1 to 1000s of nodes
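As one illustration of the duplicate detection and URL aliasing item above, exact duplicates can be found by grouping URLs whose fetched content has the same digest. This is a minimal sketch under that assumption only: it catches byte-identical pages, not near-duplicates, and the `group_duplicates` name and the choice of MD5 are illustrative rather than a description of how Nutch does it.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def group_duplicates(pages: Dict[str, bytes]) -> List[List[str]]:
    """Group URLs whose content is byte-for-byte identical.

    `pages` maps URL -> fetched content. Only groups with more than one
    URL (candidate aliases or mirrors) are returned.
    """
    by_digest = defaultdict(list)
    for url, content in pages.items():
        by_digest[hashlib.md5(content).hexdigest()].append(url)
    return sorted(sorted(urls) for urls in by_digest.values() if len(urls) > 1)
```

A crawler could then keep one URL per group and treat the rest as aliases; which one to keep is a policy decision outside this sketch.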
Summary
Q&A
Further information
  • http://lucene.apache.org/nutch