The Pains of Web Crawling
POINTING OUT THE PITFALLS IN THE PROCESS



Web Search Engines & Web Crawlers

Web Crawlers are the building blocks of any search engine.

Crawlers continuously visit web pages to build the engine's index; each time a query is made, the engine searches that index and returns relevant matches.


How It Works

Crawl web page

Discover links

Follow page links

Extract information

Index links/URLs
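The steps above can be sketched as a small breadth-first crawl loop. This is a minimal illustration, not a production crawler: the `SITE` dictionary stands in for real HTTP fetches, and `LinkExtractor` is a hypothetical helper built on Python's standard-library `html.parser`.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML body (stands in for real HTTP fetches).
SITE = {
    "/home": '<a href="/about">About</a><a href="/blog">Blog</a>',
    "/about": '<a href="/home">Home</a>',
    "/blog": '<a href="/about">About</a><a href="/home">Home</a>',
}

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (the 'discover links' step)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def crawl(start):
    index = {}                    # URL -> discovered outgoing links
    queue = deque([start])
    seen = {start}
    while queue:
        url = queue.popleft()
        body = SITE.get(url, "")  # "crawl web page" step (fetch, simulated here)
        parser = LinkExtractor()
        parser.feed(body)         # "extract information" step
        index[url] = parser.links # "index links/URLs" step
        for link in parser.links: # "follow page links" step
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

print(crawl("/home"))
```

A real crawler would replace the `SITE` lookup with an HTTP request and add politeness controls (rate limiting, robots.txt checks), which later slides discuss.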


Major Pain Areas

Threat to Privacy

Public Attitude

Website Interface

Strict Legal Policies


Threat to Privacy

Web crawling bots can crawl virtually any public website

Crawled data ends up in easily machine-readable formats

This makes malicious reuse, such as hacking, easier

Individuals expect privacy constraints on their data

Implementing and enforcing such privacy constraints is difficult


Public Attitude

Web crawling is generally viewed in a negative light: behavioural data is available in ever-increasing quantities, and web crawlers are extremely cheap to set up and operate.

Website owners manually try to block their sites from being crawled, which also affects genuine web crawlers.


Website Interfaces

Most websites are designed for human interaction

Multi-level checks such as logins and CAPTCHAs for verification

Information is often unstructured text

All of this makes websites difficult for web crawlers to access


Legal Policies

There are multiple legal policies to adhere to, violations of which are punishable by law:

Data should not be archived beyond a definite period.

Limited crawling is allowed, but large-scale crawling is strictly prohibited.

Public content can be crawled only in adherence with copyright policies.

Robots.txt tells you which parts of a website may be crawled and which may not.

The terms of use of a website must be checked to ensure transparency.

Many authentication-based websites discourage crawling to ensure you are not hitting their servers too hard.

Robots.txt may specify the time lag that must be kept between consecutive crawls; hitting the server repeatedly can lead to IP blockage.

Personal information should not be linked with other databases.
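Checking robots.txt before crawling can be done with Python's standard-library `urllib.robotparser`. The sketch below parses a sample robots.txt (the file content here is made up for illustration) and queries both the allow/disallow rules and the crawl delay mentioned above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice this is fetched
# from https://example.com/robots.txt before crawling the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# May this user agent fetch these paths?
print(rp.can_fetch("MyCrawler", "/public/page.html"))   # True
print(rp.can_fetch("MyCrawler", "/private/data.html"))  # False

# Time lag to keep between consecutive requests, in seconds.
print(rp.crawl_delay("MyCrawler"))                      # 10
```

A polite crawler would sleep for the reported crawl delay between requests to avoid hammering the server and risking an IP block.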


Thanks!

FEATURED BY: PROMPTCLOUD