promptcloud
The Pains of Web Crawling
POINTING OUT THE PITFALLS IN THE PROCESS
Web Search Engines & Web Crawlers
Web Crawlers are the building blocks of any search engine.
Each time a search query is made, the engine calls upon its bots to crawl web pages and return relevant matches.
How It Works?
Crawl web page
Discover links
Follow page links
Extract information
Index links/URLs
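The steps above can be sketched with the standard library alone. This is a minimal, illustrative link-discovery pass: parse a fetched page, resolve its links, and collect them as the frontier to follow and index. The HTML snippet and URLs are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the <a href=...> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# Illustrative page content; a real crawler would fetch this over HTTP
page = '<a href="/about">About</a> <a href="https://example.org/">Ext</a>'
frontier = extract_links("https://example.com/index.html", page)
# frontier now holds the discovered URLs to follow and index next
```

A real crawler repeats this loop: pop a URL from the frontier, fetch it, extract links, and enqueue the ones it has not yet seen.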
Major Pain Areas
Threat to Privacy
Public Attitude
Website Interface
Strict Legal Policies
Threat to Privacy
Web crawling bots can reach almost any public website
Crawled data ends up in easily machine-readable formats
This makes misuse, such as hacking, easier
Individuals expect privacy constraints to be respected
Implementing and enforcing such privacy constraints is difficult
Public Attitude
Web crawling is generally viewed in a negative light, as behavioural data is available in ever-increasing quantities and web crawlers are extremely cheap to set up and operate.
Website owners manually try to prevent their sites from being crawled, which also blocks legitimate web crawlers.
Website Interface
Most websites are designed for human interaction
Multi-level checks such as login and CAPTCHA verification
Information is published as unstructured text
All of this makes pages difficult for web crawlers to access
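The unstructured-text problem above can be illustrated with a small sketch: pages built for humans mix content with markup, so a crawler must strip the tags just to recover the raw text, with no guarantee of structure. The HTML snippet here is illustrative.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keep only the visible text chunks of a page, dropping all markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

# Illustrative page: the price sits in free-form prose, not in any
# machine-readable field, so downstream parsing is still needed
page = "<html><body><h1>Pricing</h1><p>Plans start at $9.</p></body></html>"
parser = TextExtractor()
parser.feed(page)
text = " ".join(parser.chunks)
```

Even after this step the crawler only has a flat string; mapping it into structured fields is a separate, often site-specific, effort.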
Legal Policies
Crawlers must adhere to multiple legal policies, violations of which are punishable by law:
Data should not be archived beyond a definite period.
Limited crawling is allowed but large scale is strictly prohibited.
Public content can be crawled only in adherence with copyright policies.
Robots.txt informs you which pages may be crawled and which may not.
Terms-of-use of a website must be checked to ensure transparency.
Many authentication-based websites discourage crawling to ensure their servers are not hit too hard.
Robots.txt may also specify the time-lag that must be kept between consecutive requests. Hitting the server too frequently can lead to an IP block.
Personal information should not be linked with other databases.
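Honoring robots.txt, as the policies above require, can be done with the standard library's `urllib.robotparser`. The rules below are an illustrative example of a site's robots.txt, not a real one; a crawler would normally fetch the file from the target host instead of parsing it inline.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt rules: block /private/, ask for a 10 s gap
rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 10
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Check each URL before fetching it
allowed = rp.can_fetch("MyCrawler", "https://example.com/public/page")
blocked = rp.can_fetch("MyCrawler", "https://example.com/private/data")

# Respect the requested gap between consecutive requests (None if unset)
delay = rp.crawl_delay("MyCrawler")
```

In a live crawler, `rp.set_url(".../robots.txt")` followed by `rp.read()` fetches the rules from the site itself, and the loop sleeps `delay` seconds between requests to avoid an IP block.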