Web Robots, Crawlers, & Spiders
Webmaster - Fort Collins, CO
Copyright © XTR Systems, LLC

Introduction to Web Robots, Crawlers & Spiders


Page 1: Introduction to Web Robots, Crawlers & Spiders


Introduction to Web Robots, Crawlers & Spiders

Instructor: Joseph DiVerdi, Ph.D., MBA

Page 2: Introduction to Web Robots, Crawlers & Spiders


Web Robot Defined

• A Web Robot Is a Program
  – That Automatically Traverses the Web Using Hypertext Links
  – Retrieving a Particular Document, Then Recursively Retrieving All Documents That Are Referenced
• "Recursive" Doesn't Limit the Definition
  – To Any Specific Traversal Algorithm
  – Even If a Robot Applies Some Heuristic to the Selection & Order of Documents to Visit & Spaces Out Requests Over a Long Time Period
    • It Is Still a Robot
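The traversal just described can be sketched in a few lines of Python. This is a hypothetical toy crawler for illustration only, not any production robot; the page limit, timeout, and link-handling choices are assumptions:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(url, seen=None, max_pages=10):
    """Retrieve a document, then recursively retrieve every document it references."""
    if seen is None:
        seen = set()
    if url in seen or len(seen) >= max_pages:
        return seen
    seen.add(url)
    try:
        with urlopen(url, timeout=5) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):  # unreachable host, bad URL, etc.
        return seen
    parser = LinkParser()
    parser.feed(html)
    for href in parser.links:
        crawl(urljoin(url, href), seen, max_pages)  # the recursive step
    return seen
```

A well-behaved robot would additionally check /robots.txt and space out its requests, as discussed later in these slides.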

Page 3: Introduction to Web Robots, Crawlers & Spiders


Web Robot Defined

• Normal Web Browsers Are Not Robots
  – Because They Are Operated by a Human
  – They Don't Automatically Retrieve Referenced Documents
    • Other Than Inline Images

Page 4: Introduction to Web Robots, Crawlers & Spiders


Web Robot Defined

• Sometimes Referred to As
  – Web Wanderers
  – Web Crawlers
  – Spiders
• These Names Are a Bit Misleading
  – They Give the Impression the Software Itself Moves Between Sites
    • Like a Virus
  – This Is Not the Case
    • A Robot Visits Sites by Requesting Documents From Them

Page 5: Introduction to Web Robots, Crawlers & Spiders


Agent Defined

• The Term Agent Is (Over)Used These Days
• Specific Agents Include:
  – Autonomous Agent
  – Intelligent Agent
  – User-Agent

Page 6: Introduction to Web Robots, Crawlers & Spiders


Autonomous Agent Defined

• An Autonomous Agent Is a Program
  – That Automatically Travels Between Sites
  – Makes Its Own Decisions
    • When To Move, When To Stay
  – Is Limited to Travel Between Selected Sites
  – Currently Not Widespread on the Web

Page 7: Introduction to Web Robots, Crawlers & Spiders


Intelligent Agent Defined

• An Intelligent Agent Is a Program
  – That Helps Users With Certain Activities
    • Choosing a Product
    • Filling Out a Form
    • Finding Particular Items
  – Generally Has Little to Do With Networking
  – Usually Created & Maintained by an Organization
    • To Assist Its Own Viewers

Page 8: Introduction to Web Robots, Crawlers & Spiders


User-Agent Defined

• A User-Agent Is a Program
  – That Performs Networking Tasks for a User
• Web User-Agents
  – Navigator
  – Internet Explorer
  – Opera
• Email User-Agents
  – Eudora
• FTP User-Agents
  – HTML-Kit
  – Fetch
  – CuteFTP

Page 9: Introduction to Web Robots, Crawlers & Spiders


Search Engine Defined

• A Search Engine Is a Program
  – That Examines a Database
    • Upon Request or Automatically
    • Delivers Results or Creates a Digest
  – In the Context of the Web, a Search Engine Is
    • A Program That Examines Databases of HTML Documents
      – Databases Gathered by a Robot
    • Upon Request, Delivers Results Via an HTML Document
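The database-examination idea can be illustrated with a toy inverted index, where each word maps to the documents containing it. This is a minimal sketch under simplifying assumptions (whitespace tokenization, no ranking), not how any real search engine is implemented:

```python
def build_index(documents):
    """Build an inverted index: each word maps to the set of documents containing it."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def search(index, query):
    """Return the documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical two-document database, as a robot might have gathered.
docs = {
    "a.html": "web robots traverse the web automatically",
    "b.html": "spiders index documents on the web",
}
index = build_index(docs)
print(search(index, "web"))          # both documents
print(search(index, "web robots"))   # only a.html
```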

Page 10: Introduction to Web Robots, Crawlers & Spiders


Robot Purposes

• Robots Are Used for a Number of Tasks
  – Indexing
    • Just Like a Book Index
  – HTML Validation
  – Link Validation
    • Searching for Broken Links
  – What's New Monitoring
  – Mirroring
    • Making a Copy of a Primary Web Site
    • On a Separate Server
      – More Local to Some Users
      – Shares the Work Load With the Primary Server

Page 11: Introduction to Web Robots, Crawlers & Spiders


Other Popular Names

• All Names for the Same Sort of Program
  – With Slightly Different Connotations
• Web Spiders
  – Sounds Cooler in the Media
• Web Crawlers
  – WebCrawler Is a Specific Robot
• Web Worms
  – A Worm Is a Replicating Program
• Web Ants
  – Distributed Cooperating Robots

Page 12: Introduction to Web Robots, Crawlers & Spiders


Robot Ethics

• Robots Have Enjoyed a Checkered History
  – Certain Robot Programs Can Overload Networks & Servers
    • And Have in the Past
    • With Numerous Requests
• This Happens Especially With Programmers
  – Just Starting to Write a Robot Program
• These Days There Is Sufficient Information on Robots to Prevent Many of These Mistakes
  – But Does Everyone Read It?

Page 13: Introduction to Web Robots, Crawlers & Spiders


Robot Ethics

• Robots Have Enjoyed a Checkered History
  – Robots Are Operated by Humans
    • Who Can Make Mistakes in Configuration
    • Who Don't Consider the Implications of Their Actions
• This Means
  – Robot Operators Need to Be Careful
  – Robot Authors Need to Make It Difficult for Operators to Make Mistakes
    • With Bad Effects

Page 14: Introduction to Web Robots, Crawlers & Spiders


Robot Ethics

• Robots Have Enjoyed a Checkered History
  – Indexing Robots Build a Central Database of Documents
  – Which Doesn't Always Scale Well
    • To Millions of Documents
    • On Millions of Sites
  – Many Different Problems Occur
    • Missing Sites & Links
    • High Server Loads
    • Broken Links

Page 15: Introduction to Web Robots, Crawlers & Spiders


Robot Ethics

• Robots Have Enjoyed a Checkered History
  – The Majority of Robots
    • Are Well Designed
    • Are Professionally Operated
    • Cause No Problems
    • Provide a Valuable Service
• Robots Aren't Inherently Bad
  – Nor Are They Inherently Brilliant
• They Just Need Careful Attention

Page 16: Introduction to Web Robots, Crawlers & Spiders


Robot Visitation Strategies

• Generally Start From a Historical URL List
  – Especially Documents With Many or Certain Links
    • Server Lists
    • What's New Pages
    • Most Popular Sites on the Web
• Other Sources for URLs Are Used
  – Scans Through USENET Postings
  – Published Mailing List Archives
• The Robot Selects URLs to Visit, Index, & Parse
  – And to Use As a Source for New URLs

Page 17: Introduction to Web Robots, Crawlers & Spiders


Robot Indexing Strategies

• If an Indexing Robot Is Aware of a Document
  – The Robot May Decide to Parse the Document
  – And Insert the Document's Content Into the Robot's Database
• The Decision Depends on the Robot; Some Robots Index
  – HTML Titles
  – The First Few Paragraphs
  – Or Parse the Entire HTML & Index All Words
    • With Weightings Depending on HTML Constructs
  – Or Parse the META Tag
    • Or Other Special Internal Tags

Page 18: Introduction to Web Robots, Crawlers & Spiders


Robot Visitation Strategies

• Many Indexing Services Also Allow Web Developers to Submit URLs Manually
  – Each Submission Is Queued
  – Then Visited by the Robot
• The Exact Process Depends on the Robot Service
  – Many Services Have a Link to a URL Submission Form on Their Search Page
• Certain Aggregators Exist
  – Which Purport to Submit to Many Robots at Once
    • http://www.submit-it.com/

Page 19: Introduction to Web Robots, Crawlers & Spiders


Determining Robot Activity

• Examine Server Logs
  – Examine the User-Agent, If Available
  – Examine the Host Name or IP Address
  – Check for Many Accesses in a Short Time Period
  – Check for Robot Exclusion Document Access
    • Found at: /robots.txt

Page 20: Introduction to Web Robots, Crawlers & Spiders


Apache Access Log Snippet

"GET /robots.txt HTTP/1.0" 200 0 "-" "Scooter-3.2.EX"

"GET / HTTP/1.0" 200 4591 "-" "Scooter-3.2.EX"

"GET /robots.txt HTTP/1.0" 200 64 "-" "ia_archiver"

"GET / HTTP/1.1" 200 4205 "-" "libwww-perl/5.63"

"GET /robots.txt HTTP/1.0" 200 64 "-" "FAST-WebCrawler/3.5 (atw-crawler at fast dot no; http://fast.no/support.php?c=faqs/crawler)"

"GET /robots.txt HTTP/1.0" 200 64 "-" "Mozilla/3.0 (Slurp/si; [email protected]; http://www.inktomi.com/slurp.html)"

Page 21: Introduction to Web Robots, Crawlers & Spiders


After Robot Visitation

• Some Webmasters Panic After Being Visited
  – Generally Not a Problem
  – Generally a Benefit
  – No Relation to Viruses
  – Little Relation to Hackers
  – Close Relation to Lots of Visits

Page 22: Introduction to Web Robots, Crawlers & Spiders


Controlling Robot Access

• Excluding Robots Is Feasible Using Server Authentication Techniques
  – .htaccess File & Directives
    • Deny from 0.0.0.0 (an IP Address)
    • SetEnvIf User-Agent Robot is_a_robot
  – Can Increase Server Load
  – Seldom Required
    • More Often (Mis)Desired

Page 23: Introduction to Web Robots, Crawlers & Spiders


Robot Exclusion Standard

• A Robot Exclusion Standard Exists
  – Consists of a Single Site-Wide File
    • /robots.txt
    • Contains Directives, Comment Lines, & Blank Lines
  – Not a Locked Door
  – More of a "No Entry" Sign
  – Represents a Declaration of the Owner's Wishes
  – May Be Ignored by Incoming Traffic
    • Much Like a Red Traffic Light
  – If Everyone Follows the Rules, the World's a Better Place

Page 24: Introduction to Web Robots, Crawlers & Spiders


Sample robots.txt File

# /robots.txt file for http://webcrawler.com/

# mail [email protected] for constructive criticism

User-agent: webcrawler

Disallow:

User-agent: lycra

Disallow: /

User-agent: *

Disallow: /tmp

Disallow: /logs
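Python's standard library includes urllib.robotparser, which implements this exclusion standard. The snippet below feeds it the sample file above and queries it; 'OtherBot' is a hypothetical agent name used to exercise the '*' record:

```python
from urllib.robotparser import RobotFileParser

SAMPLE = """\
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""

rp = RobotFileParser()
rp.parse(SAMPLE.splitlines())  # parse the contents directly, no network fetch

print(rp.can_fetch("webcrawler", "/tmp/x"))     # True: nothing is disallowed for it
print(rp.can_fetch("lycra", "/index.html"))     # False: every URL begins with '/'
print(rp.can_fetch("OtherBot", "/tmp/x"))       # False: '*' disallows /tmp
print(rp.can_fetch("OtherBot", "/index.html"))  # True
```

A polite robot would call rp.set_url(...) and rp.read() to fetch a site's real /robots.txt before requesting any other document.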

Page 25: Introduction to Web Robots, Crawlers & Spiders


Exclusion Standard Syntax

# /robots.txt file for http://webcrawler.com/

# mail [email protected] for constructive criticism

• Lines Beginning With '#' Are Comments
• Comment Lines Are Ignored
  – Comments May Not Appear Mid-Line

Page 26: Introduction to Web Robots, Crawlers & Spiders


Exclusion Standard Syntax

User-agent: webcrawler

Disallow:

• Specifies That the Robot Named 'webcrawler' Has Nothing Disallowed
  – It May Go Anywhere on This Site

Page 27: Introduction to Web Robots, Crawlers & Spiders


Exclusion Standard Syntax

User-agent: lycra

Disallow: /

• Specifies That the Robot Named 'lycra' Has All URLs Starting With '/' Disallowed
  – It May Go Nowhere on This Site
  – Because All URLs on This Server Begin With a Slash

Page 28: Introduction to Web Robots, Crawlers & Spiders


Exclusion Standard Syntax

User-agent: *

Disallow: /tmp

Disallow: /logs

• Specifies That All Other Robots Have URLs Starting With '/tmp' & '/logs' Disallowed
  – They May Not Access Any URLs Beginning With Those Strings
• Note: The '*' Is a Special Token
  – Meaning "Any Other User-Agent"
• Regular Expressions Cannot Be Used

Page 29: Introduction to Web Robots, Crawlers & Spiders


Exclusion Standard Syntax

• Two Common Configuration Errors
  – Wildcards Are Not Supported
    • Do Not Use 'Disallow: /tmp/*'
    • Use 'Disallow: /tmp'
  – Put Only One Path on Each Disallow Line
    • This May Change in a Future Version of the Standard

Page 30: Introduction to Web Robots, Crawlers & Spiders


robots.txt File Location

• The Robot Exclusion File Must Be Placed at the Server's Document Root
• For Example:

  Site URL                   Corresponding robots.txt URL
  http://www.w3.org/      -> http://www.w3.org/robots.txt
  http://www.w3.org:80/   -> http://www.w3.org:80/robots.txt
  http://www.w3.org:1234/ -> http://www.w3.org:1234/robots.txt
  http://w3.org/          -> http://w3.org/robots.txt
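The mapping shown above can be computed with Python's urllib.parse: keep the scheme and host (with any port) and replace the path with /robots.txt. A small sketch:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(site_url):
    """Derive the robots.txt URL: same scheme and host[:port], path /robots.txt."""
    parts = urlsplit(site_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.w3.org:1234/some/deep/page.html"))
# http://www.w3.org:1234/robots.txt
```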

Page 31: Introduction to Web Robots, Crawlers & Spiders


Common Mistakes

• URLs Are Case Sensitive
  – "/robots.txt" Must Be All Lower-Case
• Pointless robots.txt URLs
  – http://www.w3.org/admin/robots.txt
  – http://www.w3.org/~timbl/robots.txt
• On a Server With Multiple Users
  – Like linus.ulltra.com
  – robots.txt Cannot Be Placed in Individual Users' Directories
  – It Must Be Placed in the Server Root
    • By the Server Administrator

Page 32: Introduction to Web Robots, Crawlers & Spiders


For Non-System Administrators

• Sometimes Users Have Insufficient Authority to Install a /robots.txt File
  – Because They Don't Administer the Entire Server
• Use a META Tag in Individual HTML Documents to Exclude Robots

<META NAME="ROBOTS" CONTENT="NOINDEX">

– Prevents Document From Being Indexed

<META NAME="ROBOTS" CONTENT="NOFOLLOW">

– Prevents Document Links From Being Followed
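A robot that honors these tags must parse them out of each document before indexing it. A sketch using Python's html.parser; the class name is illustrative:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Records NOINDEX/NOFOLLOW directives from <META NAME="ROBOTS"> tags."""
    def __init__(self):
        super().__init__()
        self.noindex = False
        self.nofollow = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)  # HTMLParser lower-cases tag and attribute names
        if (d.get("name") or "").lower() != "robots":
            return
        content = (d.get("content") or "").lower()
        self.noindex = self.noindex or "noindex" in content
        self.nofollow = self.nofollow or "nofollow" in content

page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX"></head><body>...</body></html>'
parser = RobotsMetaParser()
parser.feed(page)
print(parser.noindex, parser.nofollow)  # True False
```

An indexing robot would skip the document when noindex is set and ignore its links when nofollow is set.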

Page 33: Introduction to Web Robots, Crawlers & Spiders


Bottom Line

• Use Robot Exclusion to Prevent Time-Variant Content From Being Improperly Indexed
• Don't Use It to Exclude Visitors
• Don't Use It to Secure Sensitive Content
  – Use Authentication If It's Important
  – Use SSL If It's Really Important