SLASHPack: Collector Performance Improvement and Evaluation
Rudd Stevens
CS 690
Spring 2006
SLASHPack Collector - 5/4/2006 2
Outline
1. Introduction, system overview and design.
2. Performance modifications, re-factoring and re-structuring.
3. Performance testing results and evaluation.
Introduction
SLASHPack Toolkit (Semi-LArge Scale Hypertext Package)
Sponsored by Prof. Chris Brooks, engineered for initial clients Nancy Montanez and Ryan King.
Collector component: a framework for collecting documents.
Project goal: evaluate and improve its performance.
Contact and Information Sources
Contact Information:
Rudd Stevens
rstevens (at) cs.usfca.edu
Project Website: http://www.cs.usfca.edu/~rstevens/slashpack/collector/
Project Sponsor: Professor Christopher Brooks
Department of Computer Science
University of San Francisco
cbrooks (at) cs.usfca.edu
Stages
1. Add a protocol module for the Weblog data set.
2. Performance-test using the Weblog and HTTP modules; identify problem areas.
3. Modify the Collector to improve scalability and performance.
4. Repeat performance testing and evaluate the improvements.
Implementation
Language: Python
Platform: any OS with Python 2.4 or later (developed and tested under Linux).
Progress: fully built, newly re-factored for performance and usability.
High level design
SLASHPack is designed as a framework.
Modular components that contain sub-modules.
The Collector is pluggable: protocol modules, parsers, filters, output writers, etc.
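The pluggable design can be sketched as a scheme-keyed registry of protocol modules. This is a minimal illustration with hypothetical class names; the actual SLASHPack classes and interfaces differ.

```python
class ProtocolModule:
    """Base interface every protocol module implements (hypothetical)."""
    def fetch(self, url):
        raise NotImplementedError

class HttpModule(ProtocolModule):
    def fetch(self, url):
        # A real module would issue an HTTP GET here.
        return "<html>...</html>"

class WeblogModule(ProtocolModule):
    def fetch(self, url):
        # A real module would read a record from the Weblog data set.
        return "<post>...</post>"

# The framework dispatches on the URL scheme, so new protocols plug in
# without touching the core Collector loop.
REGISTRY = {"http": HttpModule(), "weblog": WeblogModule()}

def collect(url):
    scheme = url.split("://", 1)[0]
    return REGISTRY[scheme].fetch(url)
```

The same registry idea extends to parsers, filters, and output writers: each sub-module satisfies a small interface and is selected by configuration rather than hard-coded.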
High level design (cont.)
Performance Testing
Large-scale text collection: Weblog data set; long web crawls.
Performance monitoring: Python profiling; integrated statistics.
Functionality testing: Python logging; functionality test runs.
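The profiling step might look like the following with the standard-library profiler (modern `cProfile` shown for illustration; a Python 2.4-era run would have used the `profile` or `hotshot` modules instead, and `parse_document` here is a stand-in for real Collector work):

```python
import cProfile
import io
import pstats

def parse_document(text):
    # Stand-in for real Collector work (parsing, filtering, writing).
    return [line.split() for line in text.splitlines()]

def run_collection():
    doc = "some words in a line\n" * 200
    for _ in range(50):
        parse_document(doc)

profiler = cProfile.Profile()
profiler.enable()
run_collection()
profiler.disable()

# Rank functions by cumulative time to find the hot spots.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
stats_text = report.getvalue()
```

Sorting by cumulative time surfaces the call chains that dominate a crawl, which is how problem areas like the robots look-up and DOM parsing show up.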
Collector Runtime Statistics

UrlFrontier
  Url Frontier size, current number of links: 3465
  Url Frontier, current number of server queues: 78
  Urls requested from frontier: 659
  Urls delivered from frontier: 639

Collector
  Documents per second: 3.70328865405
  Total runtime: 2 minutes 31.4869761467 seconds

UrlSentry
  Urls processed: 5881
  Urls filtered using robots: 38
  Urls filtered for depth: 9
  Urls filtered using filters: 165

UrlBookkeeper
  Urls recorded: 4104
  Duplicate Urls: 1557
Collector Runtime Statistics (cont.)

DocFingerprinter
  Documents written: 386
  Average document size (bytes): 20570
  Duplicate documents: 51
  Total documents collected: 561

  HTTP status responses:
    200: 394   204: 10   301: 8    302: 25
    404: 91    403: 7    401: 1    400: 24   500: 1

  Documents by mimetype:
    text/html: 451   text/plain: 106   text/xml: 1
    image/jpeg: 1    image/gif: 1      application/octet-stream: 1
Challenges
Large text (XML) files: 21 1-GB XML files; ~450,000 files per XML file; ~10 million files after processing.
Memory/Storage: disk space; memory usage during (XML) processing.
Weblog raw data

<post>
  <weblog_url> http://www.livejournal.com/users/chuckdarwin </weblog_url>
  <weblog_title> ""Evolve!"" </weblog_title>
  <permalink>http://www.livejournal.com/users/chuckdarwin/1001264.html</permalink>
  <post_title> Flickr </post_title>
  <author_name> Darwin (chuckdarwin) </author_name>
  <date_posted> 2005-07-09 </date_posted>
  <time_posted> 000000 </time_posted>
  <content> <html><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/><title>""Evolve!""</title></head><body><div style="text-align: center;"><font size="+1"><a href="http://www.nytimes.com/2005/07/09/arts/09boxe.html?ei=5088&amp;en=61cfcd5835008b1a&amp;ex=1278561600&amp;partner=rssnyt&amp;emc=rss&amp;pagewanted=print">7/7 and 9/11?</a></font></div></body></html> </content>
  <outlinks>
    <outlink>
      <url> http://www.nytimes.com/2005/07/09/arts/09boxe.html </url>
      <site> http://www.nytimes.com </site>
      <type> Press </type>
    </outlink>
  </outlinks>
</post>
Weblog processed data

<spdata>
  <url>http://www.livejournal.com/users/chuckdarwin</url>
  <date>20060212</date>
  <crawlname>WeblogPosts20050709</crawlname>
  <weblog>
    <weblog_title>""Evolve!""</weblog_title>
    <permalink>http://www.livejournal.com/users/chuckdarwin/1001264.html</permalink>
    <post_title>Flickr</post_title>
    <author_name>Darwin (chuckdarwin)</author_name>
    <date_posted>2005-07-09</date_posted>
    <time_posted>000000</time_posted>
    <outlinks>
      <outlink>
        <type>Press</type>
        <url>http://www.nytimes.com/2005/07/09/arts/09boxe.html</url>
        <site>http://www.nytimes.com</site>
      </outlink>
    </outlinks>
  </weblog>
  <tags></tags>
  <size>493</size>
  <mimetype>text/plain</mimetype>
  <fingerprint>9949bba4ac535d18c3f11db66cdb194e</fingerprint>
  <content>Jmx0O2h0bWwmZ3Q7CiZsdDtoZWFkJmd0OwombHQ7bWV0YSBjb250ZW50P…</content>
</spdata>
Original Design
Problems to Address
Overall collection performance: streamline processing.
Robot file look-up: incredibly slow and inefficient. (Not mine!)
Thread interaction: efficient use of threads and queues to process data.
Inefficient code: Python code is not always the fastest; minidom XML parsing.
Faster data structures: re-work collection protocols and DNS prefetch; re-structure the URL Frontier and URL Bookkeeper.
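The slow robots look-up can be attacked by caching one parsed robots.txt per host, so repeated URLs for the same server hit memory instead of re-parsing. A sketch of the general technique using the standard-library parser (Python 3 names; the 2.4-era module was `robotparser`, and this is not SLASHPack's actual code):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

_robot_cache = {}  # one parsed robots.txt per host

def allowed(url, agent="SLASHPack"):
    host = urlparse(url).netloc
    if host not in _robot_cache:
        rp = RobotFileParser()
        # Stand-in rules; a real crawler fetches http://<host>/robots.txt once.
        rp.parse(["User-agent: *", "Disallow: /private/"])
        _robot_cache[host] = rp
    return _robot_cache[host].can_fetch(agent, url)
```

With the cache in place, the per-URL cost drops to a dictionary look-up plus a path match.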
New Design
Performance Modifications
Structure re-design (threading): more queues, more independence.
Robot parser: string creation, debug calls.
URL Frontier: more efficient data structures.
Protocol modules: more efficient data structures; re-factoring for reliable collection.
XML parsing: switch to a faster parser; removal of the DOM parser.
DNS pre-fetching: more efficient structuring.
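The DOM-to-streaming switch in miniature: `xml.etree.ElementTree.iterparse` visits elements as they close and lets you discard each one immediately, so a multi-gigabyte file never has to fit in memory as a DOM the way minidom requires. (Illustrative sketch, not the Collector's actual parser.)

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a 1 GB Weblog dump with ~450,000 <post> records.
data = "<posts>" + "<post><url>http://a</url></post>" * 3 + "</posts>"

urls = []
for event, elem in ET.iterparse(io.StringIO(data), events=("end",)):
    if elem.tag == "post":
        urls.append(elem.find("url").text)
        elem.clear()  # release the element's children; memory stays flat
```

The `elem.clear()` call is the key: without it, the tree built behind the iterator grows to the size of the whole file.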
New data structures
Dictionary fields for the base data type (must be implemented by any data protocol). Now passed as a dictionary to the storage component.

Key          Value                        Type
datatype     user-defined datatype name   string
status       HTTP document status         string
url          URL of document              string
date         collection date              string
crawlname    name of current crawl        string
size         byte length of content       string
mimetype     mime type of document        string
fingerprint  md5sum hash of content       string
content      raw text of document         string
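Assembling the base record might look like this (field names follow the table above; the helper function itself and its defaults are illustrative, not SLASHPack's API):

```python
import hashlib

def make_record(url, content, crawlname, date, datatype="http", status="200"):
    # Every field is a string, matching the table's Type column.
    return {
        "datatype": datatype,
        "status": status,
        "url": url,
        "date": date,
        "crawlname": crawlname,
        "size": str(len(content)),
        "mimetype": "text/html",  # a real module would detect this
        "fingerprint": hashlib.md5(content.encode()).hexdigest(),
        "content": content,
    }

rec = make_record("http://example.com", "<html></html>", "testcrawl", "20060504")
```

Because every protocol module emits the same keys, the storage component and the fingerprinter can stay protocol-agnostic.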
Performance Comparison

Initial results:
  Weblog data set, w/o parsing, robots: 161 doc/s, 50 min.
  Weblog data set, w/ parsing, robots: 3.9 doc/s, 162 min. (killed)
  HTTP web crawl, 100 docs w/ parsing, robots: 0.2 doc/s, 16 min 13 s
  HTTP web crawl, 150 docs w/ parsing, robots: 0.3 doc/s, 21 min 3 s

Modified results:
  Weblog data set, w/o parsing, robots: 170 doc/s, 42 min.
  Weblog data set, w/ parsing, robots: 186 doc/s, 63 min.
  HTTP web crawl, 100 docs w/ parsing, robots: 2.2 doc/s, 1 min 10 s
  HTTP web crawl, 150 docs w/ parsing, robots: 2.9 doc/s, 1 min 14 s
Performance Comparison (cont.)
Hardware considerations - HTTP web crawl for 500 documents:
  Pentium 4 2.4 GHz, 1 GB RAM: 3.7 doc/s, 3 min 18 s, 728 docs total (faster connection)
  Pentium 4 2.0 GHz, 1 GB RAM: 3.7 doc/s, 4 min 25 s, 725 docs total
  Pentium 4 3.2 GHz HT, 2 GB RAM: 4.3 doc/s, 2 min 47 s, 717 docs total (faster connection)
Performance Comparison (cont.)
Comparison to other web crawlers (published results, 1999):
  Google: 33.5 doc/s
  Internet Archive: 46.3 doc/s
  Mercator: 112 doc/s
Consideration of functionality: more than just a web crawler; mime types.
Available Documentation
Pydoc API, generated with Epydoc.
Use and configuration guide (README); quick-start guide.
Full report: full specification of the Collector, its use, configuration, and development background.
Future Work
Addition of pluggable modules.
Improved fingerprint sets.
Improved Python memory management and threading.
References
Allan Heydon and Marc Najork. Mercator: A scalable, extensible web crawler. http://research.compaq.com/SRC/mercator/papers/www/paper.pdf
Soumen Chakrabarti, Mining the Web, 2002. Ch. 2, pages 17-43.
Heritrix, Internet Archive. http://crawler.archive.org/
Python Performance Tips http://wiki.python.org/moin/PythonSpeed/PerformanceTips
Prof. Chris Brooks and the SLASHPack Team.
Conclusion

Four stages:
1. Addition of a protocol module for the Weblog data set.
2. Performance testing and identification of problem areas.
3. Modification of the Collector to improve scalability and performance.
4. Repeated performance testing and evaluation of the improvements.

Results: expanded functionality for data types; modifications improved performance; a more stable and flexible design.