Dawn Anderson @dawnieando

How to Optimize Your Website for Crawl Efficiency


Page 1: How to Optimize Your Website for Crawl Efficiency

Dawn Anderson @dawnieando

Page 2: How to Optimize Your Website for Crawl Efficiency

The indexed Web contains at least 4.73 billion pages (13/11/2015)

TOO MUCH CONTENT

[Chart: total number of websites, 2000-2014, climbing towards 1,000,000,000]

SINCE 2013 THE WEB IS THOUGHT TO HAVE INCREASED IN SIZE BY 1/3

Page 3: How to Optimize Your Website for Crawl Efficiency

TOO MUCH CONTENT

How have search engines responded?

• Capacity limits on Google's crawling system
• By prioritising URLs for crawling
• By assigning crawl period intervals to URLs
• By creating work 'schedules' for Googlebots

Page 4: How to Optimize Your Website for Crawl Efficiency

THE KEY PERSONAS

9 types of Googlebot

SUPPORTING ROLES:

• Indexer / Ranking Engine
• The URL Scheduler

LOOKING AT 'PAST DATA':

• History Logs
• Link Logs
• Anchor Logs

Page 5: How to Optimize Your Website for Crawl Efficiency

GOOGLEBOT'S JOBS

• 'Ranks nothing at all'
• Takes a list of URLs to crawl from the URL Scheduler
• Job varies based on 'bot' type
• Runs errands and makes deliveries for the URL server, indexer / ranking engine and logs
• Makes notes of outbound linked pages and additional links for future crawling
• Takes note of 'hints' from the URL Scheduler when crawling
• Tells tales of URL accessibility status and server response codes, notes relationships between links, and collects content checksums (the binary data equivalent of web content) for comparison with past visits by the history and link logs

Page 6: How to Optimize Your Website for Crawl Efficiency

ROLES – MAJOR PLAYERS – THE 'BOSS': THE URL SCHEDULER

Think of it as Google's line manager or 'air traffic controller' for Googlebots in the web crawling system

• Schedules Googlebot visits to URLs
• Decides which URLs to 'feed' to Googlebot
• Uses data from the history logs about past visits
• Assigns visit regularity of Googlebot to URLs
• Drops 'hints' to Googlebot to guide on types of content NOT to crawl and excludes some URLs from schedules
• Analyses past 'change' periods and predicts future 'change' periods for URLs for the purposes of scheduling Googlebot visits
• Checks 'page importance' in scheduling visits
• Assigns URLs to 'layers / tiers' for crawling schedules

Page 7: How to Optimize Your Website for Crawl Efficiency

GOOGLEBOT'S BEEN PUT ON A URL-CONTROLLED DIET

• The URL Scheduler controls the meal planner
• The scheduler checks URLs for 'importance', 'boost factor' candidacy and 'probability of modification'
• It carefully controls the list of URLs Googlebot visits
• 'Budgets' are allocated

Page 8: How to Optimize Your Website for Crawl Efficiency

CRAWL BUDGET – WHAT IS IT?

WHAT IS A CRAWL BUDGET? – An allocation of 'crawl visit frequency' apportioned to URLs on a site

• Apportioned by the URL Scheduler to Googlebots
• Roughly proportionate to page importance (link equity) and speed
• Pages with a lot of healthy links get crawled more (can include internal links?)
• But there are other factors affecting frequency of Googlebot visits aside from importance / speed
• The vast majority of URLs on the web don't get a lot of budget allocated to them

Page 9: How to Optimize Your Website for Crawl Efficiency

POSITIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY

• Current capacity of the web crawling system is high
• Your URL is 'important'
• Your URL changes a lot, with critical material content change
• Probability and predictability of critical material content change is high for your URL
• Your website speed is fast and Googlebot gets the time to visit your URL
• Your URL has been 'upgraded' to a daily or real-time crawl layer

Page 10: How to Optimize Your Website for Crawl Efficiency

NEGATIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY

• Current capacity of the web crawling system is low
• Your URL has been detected as a 'spam' URL
• Your URL is in an 'inactive' base layer segment
• Your URLs are 'tripping hints' built into the system to detect dynamic content with non-critical change
• Probability and predictability of critical material content change is low for your URL
• Your website speed is slow and Googlebot doesn't get the time to visit your URL
• Your URL has been 'downgraded' to an 'inactive' base layer segment
• Your URL has returned an 'unreachable' server response code recently

Page 11: How to Optimize Your Website for Crawl Efficiency

FIND GOOGLEBOT

AUTOMATE SERVER LOG RETRIEVAL VIA CRON JOB

grep Googlebot access_log > googlebot_access.txt
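
A minimal sketch of the cron approach above, assuming an Apache-style access log at /var/log/apache2/access.log (the log path, output location and schedule are assumptions; adjust them to your server):

# /etc/cron.d/googlebot-log – append each day's Googlebot entries at 01:00
0 1 * * * root grep "Googlebot" /var/log/apache2/access.log >> /var/log/seo/googlebot_access.txt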

Page 12: How to Optimize Your Website for Crawl Efficiency

LOOK THROUGH 'SPIDER EYES' VIA LOG ANALYSIS – ANALYSE GOOGLEBOT

PREPARE TO BE HORRIFIED

• Incorrect URL header response codes (e.g. 302s)
• 301 redirect chains
• Old files or XML sitemaps left on the server from years ago
• Infinite / endless loops (circular dependency)
• On parameter-driven sites, URLs crawled which produce the same output
• URLs generated by spammers
• Dead image files being visited
• Old CSS files still being crawled and loading legacy images
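
Once the filtered log exists, two one-liners can surface the horrors above. Both assume the common 'combined' log format, where the request path is field 7 and the status code is field 9 (check your own format first):

# status codes served to Googlebot, most frequent first – look for 302s and 404s
awk '{print $9}' googlebot_access.txt | sort | uniq -c | sort -rn

# the URLs Googlebot hits most often – spot parameter loops and dead image files
awk '{print $7}' googlebot_access.txt | sort | uniq -c | sort -rn | head -20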

Page 13: How to Optimize Your Website for Crawl Efficiency

SEARCH ENGINE VIEW EMULATOR

http://www.ovrdrv.com/search_view

Lynx Browser – 4 options to view through search engine eyes, human eyes, page source or page analysis

Page 14: How to Optimize Your Website for Crawl Efficiency

LOOK THROUGH 'SPIDER EYES'

• GSC Crawl Stats
• Google Search Console (all tools)
• Deepcrawl
• Screaming Frog
• Server Log Analysis
• SEMRush (auditing tools)
• Webconfs (header responses / similarity checker)
• Powermapper (bird's-eye view of site)
• Search Engine View Emulator

Page 15: How to Optimize Your Website for Crawl Efficiency

FIX GOOGLEBOT'S JOURNEY – SPEED UP YOUR SITE TO 'FEED' GOOGLEBOT MORE

TECHNICAL 'FIXES'

• Speed up your site
• Implement compression, minification and caching
• Fix incorrect header response codes
• Fix nonsensical 'infinite loops' generated by database-driven parameters or 'looping' relative URLs
• Use absolute versus relative internal links
• Ensure no parts of content are blocked from crawlers (e.g. in carousels, concertinas and tabbed content)
• Ensure no CSS or JavaScript files are blocked from crawlers
• Unpick 301 redirect chains (a sketch of these fixes follows this list)
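
A sketch of the .htaccess side of these fixes, assuming Apache with mod_deflate, mod_expires and mod_alias enabled (module availability and the example paths are assumptions; test before deploying):

# compression for text assets
AddOutputFilterByType DEFLATE text/html text/css application/javascript

# browser caching for images
ExpiresActive On
ExpiresByType image/png "access plus 1 month"

# collapse a redirect chain (/old -> /interim -> /new) into single hops
Redirect 301 /old /new
Redirect 301 /interim /new

To confirm a chain is unpicked, print every hop (the URL is a placeholder); a fixed URL should show at most one 301:

curl -sIL http://www.example.com/old | grep -Ei '^(HTTP|location)'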

Page 16: How to Optimize Your Website for Crawl Efficiency

SPEED TOOLS

• YSlow
• Pingdom
• Google Page Speed Tests
• Minification – JS Compress and CSS Minifier
• Image Compression – Compressjpeg.com, tinypng.com
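
The tools above run in the browser; a quick command-line spot check that compression is actually being served (the URL is a placeholder):

# expect 'Content-Encoding: gzip' in the response headers
curl -sI -H 'Accept-Encoding: gzip' http://www.example.com/ | grep -i 'content-encoding'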

Page 17: How to Optimize Your Website for Crawl Efficiency

URL IMPORTANCE TOOLS

• GSC Internal Links report (URL importance)
• Link Research Tools (Strongest Sub Pages reports)
• GSC Internal Links (add site categories and sections as additional profiles)
• Powermapper

Page 18: How to Optimize Your Website for Crawl Efficiency

STOP YOURSELF 'VOTING' FOR THE WRONG INTERNAL LINKS IN YOUR SITE

'IT CANNOT BE EMPHASISED ENOUGH HOW IMPORTANT IT IS TO EMPHASISE IMPORTANCE'

[Diagram: internal links concentrated on Most Important Page 1, Most Important Page 2 and Most Important Page 3]

Page 19: How to Optimize Your Website for Crawl Efficiency

ONLINE DEMO OF XML GENERATOR

https://www.xml-sitemaps.com/generator-demo/
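
The generator produces standard sitemap-protocol XML; a minimal excerpt of the sort of output to expect (the URL and dates are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/most-important-page/</loc>
    <lastmod>2015-11-13</lastmod>
    <changefreq>weekly</changefreq>
  </url>
</urlset>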

Page 20: How to Optimize Your Website for Crawl Efficiency

15 THINGS YOU CAN DO

1. Use XML sitemaps
2. Add site sections (e.g. categories) as profiles in Google Search Console for more granularity
3. Keep 301 redirections to a minimum
4. Use regular expressions in .htaccess files to implement rules and reduce crawl lag
5. Look out for redirect chains
6. Look out for infinite loops (spider traps)
7. Check URL parameters in Google Search Console
8. Check if URLs return the exact same content and choose one as the preferred URL
9. Block or canonicalise duplicate content
10. Use absolute versus relative URLs
11. Improve site speed
12. Use front-facing HTML sitemaps for important pages
13. Use noindex on pages which add no value but may be useful for visitors to traverse your site
14. Use 'if modified' headers to keep Googlebot out of low-importance pages (see the check sketched below)
15. Build server log analysis into your regular SEO activities
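
For item 14, a hedged way to test whether a URL already honours conditional requests (the URL and date are placeholders); a well-configured server answers '304 Not Modified', so Googlebot need not re-download an unchanged page:

# expect 'HTTP/1.1 304 Not Modified' if If-Modified-Since is supported
curl -sI -H 'If-Modified-Since: Fri, 13 Nov 2015 00:00:00 GMT' http://www.example.com/page/ | head -1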

Page 21: How to Optimize Your Website for Crawl Efficiency

REMEMBER

"WHEN GOOGLEBOT PLAYS 'SUPERMARKET SWEEP' YOU WANT TO FILL THE SHOPPING TROLLEY WITH LUXURY ITEMS"

Dawn Anderson @dawnieando