Dawn Anderson @ dawnieando
The indexed Web contains at least 4.73 billion pages (13/11/2015)
TOO MUCH CONTENT
[Chart: total number of websites, 2000 to 2014, y-axis from 250,000,000 up to 1,000,000,000]
SINCE 2013 THE WEB IS THOUGHT TO HAVE INCREASED IN SIZE BY 1/3
Capacity limits on Google’s crawling system
How have search engines responded?
By prioritising URLs for crawling
By assigning crawl period intervals to URLs
By creating work ‘schedules’ for Googlebots
TOO MUCH CONTENT
9 types of Googlebot
THE KEY PERSONAS
SUPPORTING ROLES
Indexer / Ranking Engine
The URL Scheduler
LOOKING AT ‘PAST DATA’
History Logs
Link Logs
Anchor Logs
GOOGLEBOT’S JOBS
• ‘Ranks nothing at all’
• Takes a list of URLs to crawl from the URL Scheduler
• Job varies based on ‘bot’ type
• Runs errands & makes deliveries for the URL server, indexer / ranking engine and logs
• Makes notes of outbound linked pages and additional links for future crawling
• Takes notes of ‘hints’ from the URL scheduler when crawling
• Tells tales of URL accessibility status and server response codes, notes relationships between links, and collects content checksums (the binary data equivalent of web content) for comparison with past visits by the history and link logs
ROLES: MAJOR PLAYERS: A ‘BOSS’ (THE URL SCHEDULER)
Think of it as Google’s line manager or ‘air traffic controller’ for Googlebots in the web crawling system
• Schedules Googlebot visits to URLs
• Decides which URLs to ‘feed’ to Googlebot
• Uses data from the history logs about past visits
• Assigns visit regularity of Googlebot to URLs
• Drops ‘hints’ to Googlebot to guide on types of content NOT to crawl, and excludes some URLs from schedules
• Analyses past ‘change’ periods and predicts future ‘change’ periods for URLs for the purposes of scheduling Googlebot visits
• Checks ‘page importance’ when scheduling visits
• Assigns URLs to ‘layers / tiers’ for crawling schedules
Scheduler checks URLs for ‘importance’, ‘boost factor’ candidacy, ‘probability of modification’
GOOGLEBOT’S BEEN PUT ON A URL CONTROLLED DIET
The URL Scheduler controls the meal planner
Carefully controls the list of URLs Googlebot visits
‘Budgets’ are allocated
CRAWL BUDGET: WHAT IS IT?
Roughly proportionate to Page Importance (Link Equity) & speed
Pages with a lot of healthy links get crawled more (Can include internal links??)
Apportioned by the URL scheduler to Googlebots
WHAT IS A CRAWL BUDGET? An allocation of ‘crawl visit frequency’ apportioned to URLs on a site
But there are other factors affecting frequency of Googlebot visits aside from importance / speed
The vast majority of URLs on the web don’t get a lot of budget allocated to them
POSITIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY
• Current capacity of the web crawling system is high
• Your URL is ‘important’
• Your URL changes a lot, with critical material content change
• Probability and predictability of critical material content change is high for your URL
• Your website speed is fast and Googlebot gets the time to visit your URL
• Your URL has been ‘upgraded’ to a daily or real-time crawl layer
NEGATIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY
• Current capacity of the web crawling system is low
• Your URL has been detected as a ‘spam’ URL
• Your URL is in an ‘inactive’ base layer segment
• Your URLs are ‘tripping hints’ built into the system to detect non-critical-change dynamic content
• Probability and predictability of critical material content change is low for your URL
• Your website speed is slow and Googlebot doesn’t get the time to visit your URL
• Your URL has been ‘downgraded’ to an ‘inactive’ base layer segment
• Your URL has returned an ‘unreachable’ server response code recently
FIND GOOGLEBOT
AUTOMATE SERVER LOG RETRIEVAL VIA CRON JOB
grep Googlebot access_log > googlebot_access.txt
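A minimal automation sketch, assuming an Apache server and cron; the schedule, log path and output directory below are illustrative placeholders, not values from the deck:

# Hypothetical crontab entry: at 02:00 daily, copy Googlebot hits from the access
# log into a dated file for later analysis (% must be escaped as \% inside crontab)
0 2 * * * grep Googlebot /var/log/apache2/access_log > /home/seo/logs/googlebot_$(date +\%Y\%m\%d).txt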
LOOK THROUGH ‘SPIDER EYES’ VIA LOG ANALYSIS: ANALYSE GOOGLEBOT
PREPARE TO BE HORRIFIED (a quick way to check is sketched after this list)
• Incorrect URL header response codes (e.g. 302s)
• 301 redirect chains
• Old files or XML sitemaps left on the server from years ago
• Infinite / endless loops (circular dependency)
• On parameter-driven sites, URLs crawled which produce the same output
• URLs generated by spammers
• Dead image files being visited
• Old CSS files still being crawled and loading legacy images
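A quick way to start surfacing these issues from the googlebot_access.txt file extracted above. This is a sketch assuming a standard combined log format, where the status code is field 9 and the request path is field 7; adjust the field numbers to your own log format:

# Requests per HTTP status code: spikes in 302s, 404s or 5xx are worth investigating
awk '{print $9}' googlebot_access.txt | sort | uniq -c | sort -rn

# The 20 URLs Googlebot requests most often: are these really your most important pages?
awk '{print $7}' googlebot_access.txt | sort | uniq -c | sort -rn | head -20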
SEARCH ENGINE VIEW EMULATOR
http://www.ovrdrv.com/search_view
Lynx Browser: 4 options to view through search engine eyes, human eyes, page source or page analysis
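If Lynx is installed locally, a rough command-line equivalent (the URL is a placeholder):

# Render a page as plain text, with its link list, roughly as a text-only crawler sees it
lynx -dump https://www.example.com/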
LOOK THROUGH ‘SPIDER EYES’
• GSC Crawl Stats
• Google Search Console (all tools)
• Deepcrawl
• Screaming Frog
• Server Log Analysis
• SEMRush (auditing tools)
• Webconfs (header responses / similarity checker)
• Powermapper (bird’s-eye view of site)
• Search Engine View Emulator
FIX GOOGLEBOT’S JOURNEY: SPEED UP YOUR SITE TO ‘FEED’ GOOGLEBOT MORE
TECHNICAL ‘FIXES’
Speed up your site
Implement compression, minification and caching (see the .htaccess sketch after this list)
Fix incorrect header response codes
Fix nonsensical ‘infinite loops’ generated by database driven parameters or ‘looping’ relative URLs
Use absolute versus relative internal links
Ensure no part of the content is blocked from crawlers (e.g. in carousels, concertinas and tabbed content)
Ensure no CSS or JavaScript files are blocked from crawlers
Unpick 301 redirect chains
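To illustrate the compression, caching and redirect-chain fixes above, a minimal .htaccess sketch for an Apache server; the file types, cache lifetimes, paths and domain are example values only:

# Compress text-based responses (mod_deflate)
<IfModule mod_deflate.c>
  AddOutputFilterByType DEFLATE text/html text/css application/javascript
</IfModule>

# Let browsers cache static assets (mod_expires)
<IfModule mod_expires.c>
  ExpiresActive On
  ExpiresByType image/png "access plus 1 month"
  ExpiresByType text/css "access plus 1 week"
</IfModule>

# Collapse a redirect chain: point the oldest URL straight at the final destination
Redirect 301 /old-page/ https://www.example.com/new-page/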
SPEED TOOLS
• YSlow
• Pingdom
• Google Page Speed Tests
• Minification: JS Compress and CSS Minifier
• Image compression: compressjpeg.com, tinypng.com
URL IMPORTANCE TOOLS
• GSC Internal Links report (URL importance)
• Link Research Tools (Strongest sub pages reports)
• GSC Internal links (add site categories and sections as additional profiles)
• Powermapper
STOP YOURSELF ‘VOTING’ FOR THE WRONG INTERNAL LINKS IN YOUR SITE
‘IT CANNOT BE EMPHASISED ENOUGH HOW IMPORTANT IT IS TO EMPHASISE IMPORTANCE’
[Diagram: internal links pointing to Most Important Page 1, Most Important Page 2 and Most Important Page 3]
ONLINE DEMO OF XML GENERATOR
https://www.xml-sitemaps.com/generator-demo/
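For reference, the generator produces a standard sitemap protocol file along these lines (the URL and dates below are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/most-important-page/</loc>
    <lastmod>2015-11-13</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>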
15 THINGS YOU CAN DO
1. Use XML sitemaps
2. Add site sections (e.g. categories) as profiles in Google Search Console for more granularity
3. Keep 301 redirections to a minimum
4. Use regular expressions in .htaccess files to implement rules and reduce crawl lag
5. Look out for redirect chains
6. Look out for infinite loops (spider traps)
7. Check URL parameters in Google Search Console
8. Check if URLs return the exact same content and choose one as the preferred URL
9. Block or canonicalise duplicate content
10. Use absolute versus relative URLs
11. Improve site speed
12. Use front-facing HTML sitemaps for important pages
13. Use noindex on pages which add no value but may be useful for visitors to traverse your site
14. Use ‘If-Modified-Since’ headers to keep Googlebot out of low-importance pages (see the curl check after this list)
15. Build server log analysis into your regular SEO activities
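A quick way to test point 14: send a conditional request and check whether the server answers 304 Not Modified for an unchanged page. A minimal sketch; the URL and date are placeholders:

# Prints only the status code; a well-configured server returns 304 when the page
# hasn't changed since the If-Modified-Since date
curl -s -o /dev/null -w "%{http_code}\n" -H "If-Modified-Since: Fri, 13 Nov 2015 00:00:00 GMT" https://www.example.com/some-page/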
REMEMBER
“WHEN GOOGLEBOT PLAYS ‘SUPERMARKET SWEEP’ YOU WANT TO FILL THE SHOPPING TROLLEY WITH LUXURY ITEMS”
Dawn Anderson @ dawnieando