Harnessing the power of Nutch with Scala

Crawling the web, Nutch with Scala

Vikas Hazrati @

about

CTO at Knoldus Software

Co-Founder at MyCellWasStolen.com

Community Editor at InfoQ.com

Dabbling with Scala – last 40 months

Enterprise grade implementations on Scala – 18 months

nutch

Web search software

lucene

solr

crawler

link-graph

parsing

nutch – but we have google!

transparent

understanding

extensible

nutch – basic architecture

crawler

searcher

nutch - architecture

web database

Crawl db

fetchlists

links

pages

segments

crawler

Recursive

nutch – crawl cycle

generate – fetch – update cycle

Create crawldb

Inject root URLs in crawldb

Generate fetchlist

Fetch content

Update crawldb

repeat until depth reached

Update segments

Index fetched pages

deduplication

Merge indexes for searching

bin/nutch crawl urls -dir crawl -depth 3 -topN 5
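
The loop behind that command can be pictured with a toy in-memory model. This is only a sketch under stated assumptions: nothing here is Nutch API, the names `CrawlCycleSketch`, `crawl` and `web` are hypothetical, and `topN` simply takes URLs in discovery order, whereas Nutch ranks candidates by score.

```scala
// Toy model of the generate - fetch - update cycle (not Nutch code).
object CrawlCycleSketch {
  // Hypothetical link graph: each URL maps to the URLs it links to.
  val web: Map[String, List[String]] = Map(
    "a" -> List("b", "c"),
    "b" -> List("d"),
    "c" -> List("d", "e"),
    "d" -> Nil,
    "e" -> List("a")
  )

  /** Runs `depth` rounds, fetching at most `topN` URLs per round,
    * and returns the set of URLs that were fetched. */
  def crawl(seeds: List[String], depth: Int, topN: Int): Set[String] = {
    var known = seeds.distinct       // stands in for the crawldb (inject step)
    var fetched = Set.empty[String]  // URLs already fetched
    for (_ <- 1 to depth) {
      // generate: pick unfetched URLs, capped at topN
      val fetchList = known.filterNot(fetched).take(topN)
      // fetch: collect the outlinks of every URL on the fetchlist
      val outlinks = fetchList.flatMap(url => web.getOrElse(url, Nil))
      fetched ++= fetchList
      // update: add newly discovered URLs to the crawldb
      known = (known ++ outlinks).distinct
    }
    fetched
  }
}
```

Calling `CrawlCycleSketch.crawl(List("a"), 3, 5)` mirrors `-depth 3 -topN 5`: each extra round of depth reaches pages one link further from the seed.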

nutch - plugins

Create crawldb

Inject root URLs in crawldb

Generate fetchlist

Fetch content

Update crawldb

parser

HTMLParserFilter

URL Filter

scoring filter

generate – fetch – update cycle

nutch – extension points

plugin.xml // tells Nutch about the plugin

build.xml // build the plugin

ivy.xml // plugin dependencies

src // plugin source

nutch - example

<plugin id="KnoldusAggregator" name="Knoldus Parse Filter" version="1.0.0" provider-name="nutch.org">
  <runtime>
    <library name="kdaggregator.jar">
      <export name="*" />
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints" />
  </requires>
  <extension id="org.apache.nutch.parse.headings" name="Nutch Headings Parse Filter" point="org.apache.nutch.parse.HtmlParseFilter">
    <implementation id="KDParseFilter" class="com.knoldus.aggregator.server.plugins.DetailParserFilter"></implementation>
  </extension>
</plugin>

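public ParseResult filter(Content content, ParseResult parseResult,
        HTMLMetaTags metaTags, DocumentFragment doc) {
    LOG.debug("Parsing URL: " + content.getUrl());

    // The slide omits how `tags` is computed; `tags`, TAG_KEY and LOG
    // are assumed to be defined elsewhere in the plugin class.
    Parse parse = parseResult.get(content.getUrl());
    Metadata metadata = parse.getData().getParseMeta();
    for (String tag : tags) {
        // Attach each extracted tag to the parse metadata
        metadata.add(TAG_KEY, tag);
    }
    return parseResult;
}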

scala

I have Java!

concurrency

verbose

popular

OO library

Strongly typed

jvm

scala

Java:

class Person {
    private String firstName;
    private String lastName;
    private int age;

    public Person(String firstName, String lastName, int age) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.age = age;
    }

    public void setFirstName(String firstName) { this.firstName = firstName; }
    public String getFirstName() { return this.firstName; }
    public void setLastName(String lastName) { this.lastName = lastName; }
    public String getLastName() { return this.lastName; }
    public void setAge(int age) { this.age = age; }
    public int getAge() { return this.age; }
}

Scala:

class Person(var firstName: String, var lastName: String, var age: Int)

Source: http://blog.objectmentor.com/articles/2008/08/03/the-seductions-of-scala-part-i
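
To see what the one-line Scala class buys, here is a small usage sketch. The `var` constructor parameters give each field a generated getter and setter, so no accessor code is written by hand. The object name `PersonDemo` and the sample values are made up for illustration.

```scala
class Person(var firstName: String, var lastName: String, var age: Int)

object PersonDemo {
  def main(args: Array[String]): Unit = {
    val p = new Person("Ada", "Lovelace", 36)
    p.age = 37               // uses the generated setter
    p.firstName = "Augusta"  // same for the other var fields
    println(s"${p.firstName} ${p.lastName}, ${p.age}")
  }
}
```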

scala

Java – everything is an object unless it is primitive

Scala – everything is an object. period.

Java – has operators (+, -, < ..) and methods

Scala – operators are methods

Java – statically typed – Thing thing = new Thing()

Scala – statically typed but uses type inference – val thing = new Thing
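
A short sketch of these three points in Scala. `OperatorsSketch` and `Money` are hypothetical names used only for illustration.

```scala
object OperatorsSketch {
  // A user-defined type can declare operators, because they are ordinary methods.
  case class Money(cents: Long) {
    def +(other: Money): Money = Money(cents + other.cents)
    def <(other: Money): Boolean = cents < other.cents
  }

  val sum = 1 + 2          // infix form
  val sameSum = (1).+(2)   // the same call written as a method invocation
  val text = 42.toString   // an Int literal is an object with methods

  // No type annotation needed: the compiler infers Money.
  val total = Money(150) + Money(250)
  val cheaper = Money(150) < Money(250)
}
```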

evolution

scala and concurrency

Fine-grained

coarse-grained

Actors
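
The slides do not show actor code at this point. As a rough picture of the idea (a mailbox drained by a single worker, with an asynchronous `!` send), here is a minimal sketch built on plain JVM threads. It is a stand-in and not the actor library used in the talk; `MiniActor` is a hypothetical name.

```scala
import java.util.concurrent.LinkedBlockingQueue

// Stand-in for an actor: one mailbox, one worker thread,
// messages handled strictly one at a time.
class MiniActor[A](handler: A => Unit) {
  private val mailbox = new LinkedBlockingQueue[A]()

  private val worker = new Thread(new Runnable {
    def run(): Unit = while (true) handler(mailbox.take())
  })
  worker.setDaemon(true) // do not keep the JVM alive for this thread
  worker.start()

  /** Asynchronous send: enqueue the message and return immediately. */
  def !(msg: A): Unit = mailbox.put(msg)
}
```

Because only the worker thread runs `handler`, state touched inside the handler needs no locks, which is the coarse-grained style of concurrency the slide contrasts with fine-grained locking.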

actors

problem context

Aggregator

UGC

solution

Aggregator

Supplier 1

Supplier 2

Supplier 3

Create crawldb

Inject root URLs in crawldb

Generate fetchlist

Fetch content

Update crawldb

Supplier URLs

plugins written in Scala

logic

Crawl the supplier

Is URL interesting?

Parse

Pass extraction to actor

seed database

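plugin - scala

import org.apache.nutch.parse.{HTMLMetaTags, HtmlParseFilter, ParseResult}
import org.apache.nutch.protocol.Content
import org.w3c.dom.DocumentFragment

// The slide shows only the parts below; other members the interface
// requires (such as configuration accessors) are not shown.
class DetailParserFilter extends HtmlParseFilter {

  def filter(content: Content, parseResult: ParseResult,
             metaTags: HTMLMetaTags, doc: DocumentFragment): ParseResult = {
    // Only detail pages are of interest; everything else passes through untouched
    if (isDetailURL(content.getUrl)) {
      val rawHtml = content.getContent
      if (rawHtml.length > 0) processContent(rawHtml)
    }
    parseResult
  }

  private def isDetailURL(url: String): Boolean =
    url.matches(AggregatorConfiguration.regexEventDetailPages)

  // Hand the raw page to a new actor so extraction runs off the crawl thread
  private def processContent(rawHtml: Array[Byte]) = {
    (new DetailProcessor).start ! rawHtml
  }
}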
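
The "is URL interesting" check in `isDetailURL` relies on `String.matches`, which succeeds only when the whole URL matches the pattern. A minimal sketch of that behaviour follows, assuming a made-up pattern: the real value of `AggregatorConfiguration.regexEventDetailPages` is not shown in the slides, and `DetailUrlSketch`, `detailPattern` and the example URLs are hypothetical.

```scala
object DetailUrlSketch {
  // Hypothetical pattern: a detail page is <host>/events/<numeric id>.
  val detailPattern = """https?://[^/]+/events/\d+"""

  // String.matches anchors the pattern to the entire string.
  def isDetailURL(url: String): Boolean = url.matches(detailPattern)

  def main(args: Array[String]): Unit = {
    println(isDetailURL("http://supplier.example/events/42"))        // detail page
    println(isDetailURL("http://supplier.example/events/"))          // listing page
    println(isDetailURL("http://supplier.example/events/42/photos")) // extra path segment
  }
}
```

Whole-string matching keeps listing pages and sub-pages out of the extraction step without needing explicit `^` and `$` anchors in the configured pattern.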

result

5 suppliers crawled

Crawl cycles ran continuously for a few days

> 500K seed data collected

All with Nutch and 823 lines of Scala code

demo

in action ….

resources

http://blog.knoldus.com

http://wiki.apache.org/nutch/NutchTutorial

http://www.scala-lang.org/
