Harnessing the power of Nutch with Scala

  • Published on 11-Nov-2014

DESCRIPTION

Introduction to Nutch with Scala

Transcript

1. Crawling the web: Nutch with Scala. Vikas Hazrati.

2. About: CTO at Knoldus Software; co-founder at MyCellWasStolen.com; community editor at InfoQ.com; dabbling with Scala for the last 40 months; enterprise-grade implementations on Scala for 18 months.

3. Nutch: web-search crawler with link-graph and parsing software; works with Solr and Lucene.

4. Nutch: but we have Google! Transparent, understandable, extensible.

5. Nutch basic architecture: crawler and searcher.

6. Nutch architecture: recursive. The crawler maintains a web database (crawl db) of pages and links, generates fetchlists, and writes fetched pages into segments.

7. Nutch crawl cycle (the generate-fetch-update cycle): create the crawldb and inject root URLs into it; generate a fetchlist; fetch content; update the crawldb; repeat until the configured depth is reached. Then update segments, index fetched pages, deduplicate, and merge indexes for searching.

   bin/nutch crawl urls -dir crawl -depth 3 -topN 5

8. Nutch plugins hook into the generate-fetch-update cycle: parser, HtmlParseFilter, URL filter, scoring filter.

9. Nutch extension points: a plugin consists of

   plugin.xml  // tells Nutch about the plugin
   build.xml   // builds the plugin
   ivy.xml     // plugin dependencies
   src         // plugin source

10. Nutch: an example.

11. A parse filter in Java:

   public ParseResult filter(Content content, ParseResult parseResult,
       HTMLMetaTags metaTags, DocumentFragment doc) {
     LOG.debug("Parsing URL: " + content.getUrl());
     Parse parse = parseResult.get(content.getUrl());
     Metadata metadata = parse.getData().getParseMeta();
     for (String tag : tags) {
       metadata.add(TAG_KEY, tag);
     }
     return parseResult;
   }

12. Scala? "I have Java!": concurrency, verbose, popular, strongly typed, JVM, OO, libraries.

13. Java:

   class Person {
     private String firstName;
     private String lastName;
     private int age;

     public Person(String firstName, String lastName, int age) {
       this.firstName = firstName;
       this.lastName = lastName;
       this.age = age;
     }

     public void setFirstName(String firstName) { this.firstName = firstName; }
     public String getFirstName() { return this.firstName; }
     public void setLastName(String lastName) { this.lastName = lastName; }
     public String getLastName() { return this.lastName; }
     public void setAge(int age) { this.age = age; }
     public int getAge() { return this.age; }
   }

   Scala:

   class Person(var firstName: String, var lastName: String, var age: Int)

   Source: http://blog.objectmentor.com/articles/2008/08/03/the-seductions-of-scala-part-i

14. Java: everything is an object unless it is a primitive. Scala: everything is an object, period. Java has operators (+, -, < ...) and methods; in Scala, operators are methods. Java is statically typed: Thing thing = new Thing(). Scala is statically typed but uses type inference: val thing = new Thing.

15. Evolution.

16. Scala and concurrency: fine-grained, coarse-grained, actors.

17. Actors.

18. (image)

19. Problem context: an aggregator of user-generated content (UGC).

20. Solution: the aggregator crawls Supplier 1, Supplier 2, and Supplier 3.

21. Create the crawldb and inject root URLs (the supplier URLs) into it; generate a fetchlist; fetch content; update the crawldb, with the plugins written in Scala.

22. Logic: crawl the supplier; parse; check whether the URL is interesting; pass extraction to an actor; seed the database.

23. The plugin in Scala:

   class DetailParserFilter extends HtmlParseFilter {

     def filter(content: Content, parseResult: ParseResult,
         metaTags: HTMLMetaTags, doc: DocumentFragment): ParseResult = {
       if (isDetailURL(content.getUrl)) {
         val rawHtml = content.getContent
         if (rawHtml.length > 0) processContent(rawHtml)
       }
       parseResult
     }

     private def isDetailURL(url: String): Boolean =
       url.matches(AggregatorConfiguration.regexEventDetailPages)

     private def processContent(rawHtml: Array[Byte]) =
       (new DetailProcessor).start ! rawHtml
   }

24. Result: 5 suppliers crawled; crawl cycles running continuously for a few days; more than 500K seed records collected; all with Nutch and 823 lines of Scala code.

25. Demo: in action.

26. Resources: http://blog.knoldus.com, http://wiki.apache.org/nutch/NutchTutorial, http://www.scala-lang.org/, vikas@knoldus.com
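The generate-fetch-update cycle described on slide 7 can be sketched as a fold over the crawl depth. Everything here (`CrawlDb`, `generate`, `fetch`, `update`) is an illustrative stand-in for the idea, not the real Nutch API:

```scala
// Sketch of Nutch's generate-fetch-update cycle. All names are hypothetical.
object CrawlCycleSketch {
  case class CrawlDb(urls: Set[String], fetched: Set[String])

  // generate: pick up to topN not-yet-fetched URLs as the fetchlist
  def generate(db: CrawlDb, topN: Int): Seq[String] =
    (db.urls -- db.fetched).take(topN).toSeq

  // fetch: pretend to download each page and discover one outlink per page
  def fetch(fetchlist: Seq[String]): Map[String, Seq[String]] =
    fetchlist.map(url => url -> Seq(url + "/link")).toMap

  // update: fold fetched pages and their outlinks back into the crawl db
  def update(db: CrawlDb, results: Map[String, Seq[String]]): CrawlDb =
    CrawlDb(db.urls ++ results.values.flatten, db.fetched ++ results.keys)

  // repeat until depth reached, as in: bin/nutch crawl ... -depth N -topN M
  def crawl(seeds: Set[String], depth: Int, topN: Int): CrawlDb =
    (1 to depth).foldLeft(CrawlDb(seeds, Set.empty)) { (db, _) =>
      update(db, fetch(generate(db, topN)))
    }
}
```

Starting from one seed at depth 2, the sketch fetches the seed on the first pass and its discovered outlink on the second, which mirrors how each real cycle deepens the link graph by one level.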
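The plugin on slide 23 hands raw HTML to an actor with a fire-and-forget send: `(new DetailProcessor).start ! rawHtml`. That uses scala.actors, current at the time of the talk but since removed from the standard library; the sketch below approximates the same mailbox-plus-`!` pattern with a plain queue and thread. `MailboxActor` is a made-up name, not an API from the talk:

```scala
import java.util.concurrent.LinkedBlockingQueue

// Actor-style worker with a mailbox: messages are queued by `!` and
// processed one at a time on a dedicated thread, like the DetailProcessor
// hand-off in the talk. None acts as a poison pill to stop the worker.
class MailboxActor[A](handle: A => Unit) {
  private val mailbox = new LinkedBlockingQueue[Option[A]]()
  private val worker = new Thread(() => {
    Iterator
      .continually(mailbox.take())
      .takeWhile(_.isDefined)
      .foreach(m => handle(m.get))
  })
  worker.start()

  def !(msg: A): Unit = mailbox.put(Some(msg)) // fire-and-forget send
  def stop(): Unit = { mailbox.put(None); worker.join() }
}
```

The point of the hand-off is the same as in the talk: the parse filter returns immediately while extraction and database seeding happen concurrently on the actor's own thread.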
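The `isDetailURL` check on slide 23 delegates to `AggregatorConfiguration.regexEventDetailPages`, a regex the talk never shows. A minimal sketch of that step, with a hypothetical pattern standing in for the real one:

```scala
// Sketch of the URL-filtering step. The regex is a made-up stand-in for
// AggregatorConfiguration.regexEventDetailPages, which the talk does not show.
object DetailUrlFilter {
  val regexEventDetailPages = """.*/events/\d+$""" // hypothetical pattern

  // String.matches requires the regex to match the whole URL,
  // so only "detail" pages pass and everything else is skipped.
  def isDetailURL(url: String): Boolean =
    url.matches(regexEventDetailPages)
}
```

Filtering early like this keeps the expensive extraction (and the actor hand-off) limited to the small fraction of crawled pages that actually carry supplier detail data.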
