THE SCRAPING THREAT REPORT 2015 www.scrapesentry.com [email protected]


INTRODUCTION

The Scraping Threat Report 2015 is the result of months of analysis by ScrapeSentry’s anti-scraping analysts. The data is gathered from the ScrapeSentry Global Scraping Intelligence Platform and covers our clients across the globe, from smaller ecommerce companies to large established enterprises like Ladbrokes, easyJet and Move.com.

The data gives us a unique opportunity to identify, analyse and comment on the latest trends in scraping and the behaviour of scrapers themselves. With this report, we aim to identify statistically significant data and provide insights into the use of scraping globally.

For questions regarding the report or its findings, contact us at [email protected].

What is scraping?

Scraping (also known as web scraping, screen scraping or data scraping) is the copying of large amounts of data from a web site, either manually or with a script or program. Malicious scraping is the systematic theft of intellectual property in the form of data accessible on a web site. An online directory illustrates this. Its listings are published intellectual property in the form of names, addresses and other business information. Anyone is free to use the information as long as they comply with the terms and conditions of the site. Unfortunately, scrapers do not care about terms and conditions, and will abuse the service by systematically downloading large amounts of data for personal gain.

The online directory loses control over its data, which has taken time and money to compile, maintain and publish as part of the site owner’s service offering. In a worst-case scenario, the site owner wakes up one day to a new competitor offering the very same data.

HIGHLIGHTS FROM THE REPORT

KEY FINDINGS

17 % increase in scraping attacks in 2014.

Overall scraping activity increased for the fifth year in a row.

22 % of all site visitors are considered to be scrapers.

Professional scrapers use hijacked IP addresses in botnets to avoid detection.

49 % of the total scraping traffic originates from the US. The ratio of scraper traffic to total traffic is worst for traffic originating in China, which accounts for 1.40 % of the total traffic but 17.13 % of the scraper traffic.

Scraping plugins for web browsers and WordPress have made it easier for inexperienced users to scrape websites.

Companies in the travel industry remain the top targets for scrapers, closely followed by online directories and online classifieds.

One notable scraping botnet infected more than 1.3 million IP addresses, compromising both home users and businesses.

THE SCRAPING THREAT GROWS

Figure 1.1: Average scraping traffic per site in %. 2010: 17 %, 2011: 20 %, 2012: 21 %, 2013: 23 %, 2014: 27 %.

Figure 1.1 shows the average percentage of all site searches that are identified as scraping traffic. For 2014, the average scraping traffic to our clients was 27 % of their total traffic. The hardest-hit clients received 60 % of their traffic from scrapers.

We have been accumulating scraping-related data since 2010 in our ScrapeSentry Global Intelligence Platform, which makes it possible for us to compare the number of attacks and the methods used over time.

The amount of scraping traffic identified in 2014 increased by 17 % compared to 2013 and by 59 % compared to 2010, when we started to measure this data. Scraping continues to be a growing threat.
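
The 17 % and 59 % growth figures quoted for scraping traffic are relative increases, not percentage-point differences. As a minimal sketch, assuming the bar values of Figure 1.1 are read as 17 % for 2010 through 27 % for 2014, they can be reproduced like this (the function and variable names are illustrative only):

```python
# Average share of scraping traffic per site in %, as read from Figure 1.1.
# The assignment of bar values to years is our reading of the chart.
scraping_share = {2010: 17, 2011: 20, 2012: 21, 2013: 23, 2014: 27}


def relative_increase(old: float, new: float) -> float:
    """Return the relative change from old to new, in percent."""
    return (new - old) / old * 100


year_on_year = relative_increase(scraping_share[2013], scraping_share[2014])
since_2010 = relative_increase(scraping_share[2010], scraping_share[2014])

print(f"2013 -> 2014: {year_on_year:.0f} %")
print(f"2010 -> 2014: {since_2010:.0f} %")
```

In percentage points the same change is much smaller (4 points since 2013, 10 points since 2010), which is why it matters which of the two measures is quoted.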

Figure 1.2: Proportion of unique visitors defined as scrapers, in %. 2010: 7 %, 2011: 9 %, 2012: 12 %, 2013: 18 %, 2014: 22 %.

Our data shows that an average of 22 % of site users were identified as scrapers in 2014 (Figure 1.2).

The number of unique users defined as scrapers follows the same trend as the scraping traffic.

Scrapers have become more aggressive and elusive, using a larger number of IP addresses to conduct their activity and to avoid detection.

In 2014 we registered a larger number of botnets of infected devices used by scrapers. This past year has seen higher levels of infected devices, such as computers, servers and smartphones, used to generate HTTP/HTTPS requests in an automated way.

Scrapers using botnets are generally considered the most skilled ones. They know that all anti-bot solutions already block “dirty nets”, but that these solutions are very cautious when it comes to blocking private IP addresses from trusted broadband providers.

SCRAPING BY INDUSTRY

Figure 1.3: Average scraping traffic per industry in %. 1. Travel: 39 %, 2. Online Classifieds: 30 %, 3. Online Directories: 30 %, 4. Ticketing: 26 %, 5. Betting: 25 %, 6. Ecommerce: 15 %.

1. Travel industry

Figure 1.3 shows that the travel industry is the most scraped, with 39 % scraping traffic compared to last year’s 15 %. Airlines in particular have been the most affected by scrapers.

Travel industry scrapers can be separated into two major groups, “Fare-Scrapers” and “Automated-Bookers”.

Fare-Scrapers: Scrape prices from a travel website in order to compare them with prices on other sites. Their behaviour is very similar to that of scrapers of online directories and classifieds.

Since data (prices) are very dynamic, they need to scrape continuously. This means that they need reliable servers, and a very responsive network to be able to crawl in real time.

Automated-Bookers: Several travel industry clients sell their data through an API via their own B2B websites. Scrapers are looking for a method to avoid paying these fees. Saving millions of dollars in fees pays for sophisticated scraping services such as dedicated servers, anonymization services, and botnets.

2. Online classifieds

Online classifieds (e.g. car sales sites, estate agents, buy-and-sell sites) have been highly targeted by scrapers for years, since their data is publicly available and easily accessible. In 2014, 30 % of the traffic to online classifieds sites was scraping traffic, up from 14 % the previous year. Scrapers are aware that the data published on online classifieds is of great value. They gather the data and redisplay it on other sites in slightly different ways, to drive traffic away from the original source and earn money without bearing the cost of collecting it.

3. Online directories

Historically, online directories have been the main target for scrapers. Even though the sector now accounts for a smaller share of the scraping we track, scraping traffic to online directories still increased from 22 % to 30 % last year. To attract visitors and advertisers, online directories strive to deliver unique databases. Since the data is the core business of every online directory, they are frequent targets of systematic scraping attacks.

Online directories risk losing control over their data which they have invested time and money to gather, maintain and make available as a part of their service offering.

4. Ticketing industry

The ticketing industry also saw a huge increase in scraping traffic in 2014 (27 %) compared to 2013 (9 %), mostly from entities that resell tickets at a higher price.

This activity shares the same behaviour as the Automated-Bookers we are seeing in the Travel sector. And since the scraping here is all about saving money in the booking process, the scrapers are prepared to invest in the process that helps them save money in the long run.


5. Betting industry

The betting industry experienced 25 % scraping traffic during 2014, a huge increase compared to 2013 (8 %), driven by the World Cup in Brazil.

Sports betting websites are targeted by arbitrage bots that take advantage of differences in odds. These crawlers aim to always win by placing one bet on each outcome, each on a different betting website. This procedure, also known as sure-betting, is frowned on by bookmakers.

A certain amount of scraping is an effective way for online bookies to discover and rectify bad odds.
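
The arithmetic behind sure-betting helps explain why arbitrage bots need fresh odds from many sites. The sketch below is a generic illustration, not a description of any particular bot: it assumes decimal odds, and the odds values are invented for the example. If the inverse odds for all outcomes, each taken from the bookmaker offering the best price, sum to less than 1, staking in proportion to the inverse odds pays out the same amount whatever the result.

```python
def arbitrage_margin(odds: list[float]) -> float:
    """Sum of implied probabilities (1 / decimal odds), one bet per outcome.

    A value below 1.0 means that backing every outcome locks in a profit.
    """
    return sum(1.0 / o for o in odds)


def stakes_for_equal_payout(odds: list[float], total_stake: float) -> list[float]:
    """Split total_stake so that every outcome pays out the same amount."""
    margin = arbitrage_margin(odds)
    return [total_stake * (1.0 / o) / margin for o in odds]


# Hypothetical best decimal odds for a two-outcome event,
# each taken from a different bookmaker.
best_odds = [2.10, 2.05]

margin = arbitrage_margin(best_odds)
stakes = stakes_for_equal_payout(best_odds, total_stake=100.0)
payout = stakes[0] * best_odds[0]

print(f"margin: {margin:.4f}")
print(f"stakes: {[round(s, 2) for s in stakes]}")
print(f"payout for either outcome: {payout:.2f}")
```

Such gaps are small and short-lived, which is consistent with the continuous, high-frequency scraping described for this sector.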

6. Ecommerce industry

The ecommerce sector was slightly more targeted by scrapers in 2014 (15 %) than in 2013 (12 %). Ecommerce companies are seeing an increase in scraping traffic for a somewhat different reason than the other sectors: they are targeted by competitors who collect pricing and inventory information in order to automatically adapt their own offerings.


ORIGIN OF SCRAPERS

By tracing IP addresses we are able to identify the country of origin of requests. When examining the data, we need to accept that the first point of origination of a scraper is generally unknown. Scrapers often lease IP addresses close to their target website’s client base, but that does not mean that the scrapers themselves originate from that country. Nonetheless, some interesting data can be gleaned by analysing traffic from a global perspective.

Figure 1.4: Total traffic in %. USA: 31.6 %, Sweden: 13.4 %, Poland: 2.6 %, Great Britain: 2.2 %, China: 1.4 %.

The USA, where hosting is cheap and readily available, perhaps unsurprisingly accounts for most of the total traffic (31.6 %). When analysing the origin of scraper traffic, however, we compare each country’s share of the total traffic with its share of the traffic we classify as scraping.

For example, while Sweden accounts for 13.4 % of the total traffic, only 3.7 % of the scraper traffic originates from Sweden.

China accounts for 1.4 % of the total traffic but 17.1 % of the scraper traffic. Countries whose share of scraper traffic is higher than their share of total traffic host disproportionately many scrapers.
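
One way to make the comparison between total traffic and scraper traffic concrete is to divide a country's share of scraper traffic by its share of total traffic. The sketch below uses only the figures the report gives for the USA, Sweden and China; the name "over-representation ratio" and the function are our own illustration, not a metric defined by the report. A ratio above 1 means the country contributes more scraper traffic than its overall traffic share would suggest.

```python
# Shares in %, as given in the report for these three countries.
total_share = {"USA": 31.6, "Sweden": 13.4, "China": 1.4}
scraper_share = {"USA": 49.4, "Sweden": 3.7, "China": 17.1}


def over_representation(country: str) -> float:
    """Share of scraper traffic divided by share of total traffic."""
    return scraper_share[country] / total_share[country]


ratios = {country: over_representation(country) for country in total_share}

# Print the countries from most to least over-represented.
for country, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country}: {ratio:.2f}")
```

On these numbers China is over-represented by roughly a factor of twelve, the USA by about one and a half, and Sweden is clearly under-represented.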

In comparison with last year’s report, we can see an increase in scraping from the US, Germany and France, and a decrease in scraping from China and Russia.

Figure 1.5: Scraper traffic in %. USA: 49.4 %, Sweden: 3.7 %, Poland: 1.5 %, Great Britain: 1.9 %, China: 17.1 %.

SCRAPING ATTACK LEVELS

We still find it best to categorize scrapers into three groups describing their dedication, motivation and technical ability to attack a target. According to the Global Scraping Intelligence Platform, the three levels are:

Amateur Scrapers: These scrapers use a small number of IP addresses and user-agent strings, and are blatantly visible in traffic logs. They have little dedication and few resources, and will jump from site to site.

Professional Scrapers: These scrapers are much more elusive, and usually redistribute what they scrape to other companies for a profit. They have more IP addresses at their disposal, and will change user-agent strings and browsing methods periodically, over the course of a couple of hours, days or longer periods.

Advanced Scrapers: These scrapers are extremely dedicated and have a wide range of IP addresses. They change their browsing tactics and user agents moments after a block. CAPTCHAs do not stop these scrapers, as their motivation is usually high enough to solve the problem. For many businesses, adding another roadblock like a CAPTCHA is too costly in terms of the end user’s attention. Advanced Scrapers spend their time not looking for information, but rather looking to break the anti-scraping system.

During 2014 there were many new developments that make it easier for an average internet user to scrape and use competitors’ data effectively. These come in the form of better software, both free and paid, plugins, and new scraping companies.
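
Amateur scrapers, as characterised in the attack levels above, use few IP addresses and user-agent strings and are therefore easy to spot in traffic logs. The sketch below shows what such a check might look like. It is not ScrapeSentry's detection method: the log format (pairs of IP address and user agent), the function name and the request threshold are all assumptions made for this example, and the log entries are synthetic.

```python
from collections import Counter, defaultdict


def flag_heavy_clients(requests, max_requests=100):
    """Return {ip: (request_count, distinct_user_agents)} for noisy IPs.

    `requests` is an iterable of (ip, user_agent) pairs. An IP is flagged
    when it sent more than `max_requests` requests (hypothetical threshold).
    """
    counts = Counter()
    agents = defaultdict(set)
    for ip, user_agent in requests:
        counts[ip] += 1
        agents[ip].add(user_agent)
    return {
        ip: (n, len(agents[ip]))
        for ip, n in counts.items()
        if n > max_requests
    }


# Synthetic log: one noisy client and two ordinary visitors.
log = (
    [("203.0.113.7", "python-requests/2.4")] * 500
    + [("198.51.100.2", "Mozilla/5.0 (Windows NT 6.1)")] * 12
    + [("192.0.2.44", "Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X)")] * 8
)

flagged = flag_heavy_clients(log, max_requests=100)
print(flagged)
```

A simple per-IP count like this would miss the professional and advanced scrapers described above, who spread their requests over many IP addresses and rotate user agents.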

Figure 1.6: Scraping attack levels in %. Amateur: 20 %, Professional: 75 %, Advanced: 5 %.

One technology being exploited is browser plugins, which allow scraping through the browser with minimal technical skills. Web Scraper, a plugin for Chrome, has easy-to-use tools and visualization of the data. It can scrape any website, so a user who gets blocked can easily jump to another site. This has certainly increased the number of amateur scrapers in the world. The good news is that the browsing patterns and user agents are not very dynamic, which makes detection easy.

Another development in 2014 was a WordPress plugin, WP Web Scraper. This plugin uses different programming and query languages to gather information, and it has methods to display the information on any WordPress webpage. It is a little more advanced than Web Scraper for Chrome. These are just a few examples of new scraping software.

Figure 1.6 depicts the distribution of scraping attack levels in 2014. The new easy-to-use scraping tools have led to a slight increase in amateur scraping. Since it is now so easy to become a scraper, the total number of scrapers is expected to rise drastically in the future.


BEHAVIOUR OF SCRAPERS

With the addition of browser plugins, scrapers can now effectively mask their browser and operating system. For example, with every GET request generated by the Web Scraper plugin for Google Chrome, the user can customize the user-agent string and OS. They can even omit the user-agent string entirely if they choose to. This makes it much harder to tell real users from non-users, and shows that scraping detection requires a human to sift through the traffic and identify the non-users.

Spoofed operating systems

Windows is the most spoofed operating system with 60 %, followed by Linux (22 %) and Mac (18 %). However, there has been an increase in spoofed mobile devices. The most commonly spoofed mobile user agent is that of the iPhone/iPad, which runs iOS.

Figure 1.7: Spoofed operating systems in %. Windows: 60 %, Linux: 22 %, Mac OS: 18 %.
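
The user-agent string is a request header chosen by the client, which is why it is so easy to customize or leave out. The sketch below reads the operating system a client claims to run and treats a missing or blank header as its own case. The function name and the token list are assumptions made for this example, not an exhaustive parser, and the result only reflects what the client says about itself, not whether the claim is true.

```python
def classify_os(user_agent):
    """Return a coarse label for the OS a User-Agent header claims.

    Returns None when the header is missing or blank.
    """
    if not user_agent or not user_agent.strip():
        return None
    ua = user_agent.lower()
    # Order matters: mobile agents often also contain "mac os x" or "linux",
    # and Windows Phone agents can mention Android and iPhone.
    for token, label in (
        ("windows phone", "Windows Phone"),
        ("iphone", "iOS"),
        ("ipad", "iOS"),
        ("android", "Android"),
        ("windows", "Windows"),
        ("mac os x", "Mac OS"),
        ("linux", "Linux"),
    ):
        if token in ua:
            return label
    return "Other"


samples = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B411 Safari/600.1.4",
    "",
]
for ua in samples:
    print(classify_os(ua))
```

Because the header is self-reported, a check like this can only surface obvious anomalies such as missing headers. Deciding whether a claimed Windows or iOS client is genuine needs other signals.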

Spoofed mobile operating systems

Scrapers try to look like normal, legitimate users: either a desktop user on a Windows machine or a mobile user on iOS or Android.

Figure 1.8: Spoofed mobile operating systems in %. iOS: 48 %, Android: 45 %, Blackberry OS: 5 %, Windows Phone: 2 %.

Spoofed web browsers

Figure 1.9: Spoofed web browsers in %. Firefox: 25 %, Chrome: 24 %, Safari: 23 %, Internet Explorer: 18 %, Other: 10 %.

APPENDIX: SCRAPING BOTNET

In 2014, we identified and tracked a large botnet affecting many of our clients. This botnet has infected more than 1.3 million IPs, and the number is still growing by 10,000 IPs each week.

Many of the IPs exploited by the botmaster belong to private home devices, but we have also seen a large number of infected machines generating this traffic from organisations and companies.

Figure 2.1: Infected devices per country (country codes) in %. US: 5 %, DE: 4 %, GB: 4 %, AU: 3 %, CA: 3 %, FR: 3 %, BR: 2 %, DK: 2 %, DZ: 2 %, EG: 2 %, ES: 2 %, IN: 2 %, SA: 2 %, TH: 2 %, Others: 62 %.

OUR CONCLUSION

The Scraping Threat Report 2015 shows scraping activity growing for the fifth year in a row, along with an increasing number of unique visitors defined as scrapers. Professional scrapers have become more aggressive and elusive, changing their behaviour and IP addresses in order to avoid detection.

A huge difference compared to previous years is that scraping is becoming more common among people with less technical background, thanks to user-friendly browser extensions and WordPress plugins.

Even though traditional automatic blocking tools can stop the most obvious and least advanced attacks, they are quickly becoming ineffective against more professional and advanced scrapers. Motivated scrapers with established revenue streams constantly change their behaviour to collect data. They use multiple IP addresses and spoof ordinary browsers and operating systems to imitate an ordinary user.

This is when it becomes important to separate scrapers from legitimate users. Automatic blocking tools see everything as black or white, block or no block, which causes problems for legitimate users browsing the site. To stop motivated scrapers without interfering with legitimate users, the answer is a tailored approach that combines the best detection and prevention technologies with 24/7 security monitoring by dedicated anti-scraping analysts. With this kind of solution, unauthorized usage is either blocked automatically or raises an alert that an anti-scraping analyst investigates manually.

The most scraped industries all share the same problem. They have a lot of publicly available data and rely on it for their business success. If competitors or other operators steal the data and use it for their own purposes, it affects these businesses negatively and in the long run threatens their business model.
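
The conclusion's contrast between black-or-white blocking and a block-or-alert approach can be pictured as a three-way decision instead of a two-way one. The sketch below is purely illustrative: the suspicion score, the two thresholds and all names are hypothetical and do not describe how ScrapeSentry's platform works.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    ALERT = "alert"  # hand over to a human analyst for manual review
    BLOCK = "block"


def triage(score: float, alert_at: float = 0.5, block_at: float = 0.9) -> Action:
    """Map a suspicion score in [0, 1] to an action.

    The score and both thresholds are hypothetical placeholders.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= block_at:
        return Action.BLOCK
    if score >= alert_at:
        return Action.ALERT
    return Action.ALLOW


decisions = [triage(s) for s in (0.1, 0.6, 0.95)]
print([d.value for d in decisions])
```

The middle band is the design point: traffic that is suspicious but not clear-cut goes to an analyst instead of being blocked outright, which limits the impact on legitimate users.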

ABOUT SCRAPESENTRY

About ScrapeSentry Inc.

ScrapeSentry Inc. is headquartered in Boston, USA and has an office in London, England. The company is a spin-off from Sentor Managed Security Services (Sentor MSS), headquartered in Stockholm, Sweden. Sentor MSS provides innovative security services to clients across the world.

We have been providing anti-scraping services 24/7 since 2006 and have a team constantly developing our service platform. We have over 30 experts working in research and development to ensure an efficient service today and in the future. We are proud to protect some of the world’s best-known online brands.

The ScrapeSentry Anti Scraping Service

ScrapeSentry stops unwanted scrapers from benefitting from our clients' intellectual property. Differentiating good from bad scrapers, whether human or bot, provides business intelligence to make real decisions that affect our clients' bottom line.

The technology is in its fourth release and is well positioned to handle the most advanced scrapers. In the past the market was limited to extreme cases of scraping, but companies are now waking up to the idea of taking a defensive position on scraping to match their offensive tactics.

ScrapeSentry is a fully managed anti-scraping service based on a proprietary technology platform and 24/7 services delivered from the Sentor Security Operations Centre (SOC). These services include monitoring, analysis, investigation, blocking policy development, enforcement, and support.