SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

Page 1: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

SEO ‘Crawl Tank’ - ‘Death and Resurrection’

WHY YOU SHOULD CARE ABOUT TAKING CARE OF CRAWLS (INTELLIGENT USE OF CRAWL ALLOCATION (BUDGET))

THE QUEST FOR ‘CRAWL RANK’

Dawn Anderson @dawnieando

Page 2: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

THE WEB IS ‘BIG’

Indexed Web contains at least 4.73 billion pages (13/11/2015)

[Chart: Total number of websites, 2000 to 2014, on an axis running from 250,000,000 to 1,000,000,000]

SINCE 2013 THE WEB IS THOUGHT TO HAVE INCREASED IN SIZE BY 1/3

Page 3: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

THE ABILITY TO ‘SELF PUBLISH’ EASILY HAS CLEARLY INFLUENCED THIS – WE ALL ‘LOVE CONTENT’

IMPORTANT TO NOTE THAT 75% OF WEBSITES ONLINE ARE DORMANT (E.G. PARKED DOMAINS)

IMAGINE HOW MANY UNIQUE URLs COMBINED THIS AMOUNTS TO?

– A LOT

http://www.internetlivestats.com/total-number-of-websites/

Page 4: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

TOO MUCH CONTENT

How have search engines responded?

• Capacity limits on Google’s crawling system
• By prioritising URLs for crawling
• By assigning crawl period intervals to URLs
• By creating work ‘schedules’ for Googlebots

Page 5: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

HERE’S WHY -> EVERYTHING HAS A FINITE CAPACITY (EVEN CRAWLING)

“While web pages can be manually selected for crawling, this becomes impracticable as the number of web pages grows. Moreover, to keep within the capacity limits of the crawler, automated selection mechanisms are needed to determine not only which web pages to crawl, but which web pages to avoid crawling. For instance, as of the end of 2003, the WWW is believed to include well in excess of 10 billion distinct documents or web pages, while a search engine may have a crawling capacity that is less than half as many documents.” - Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al)

Page 6: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

SOME GOOGLE CRAWL SCHEDULER PATENTS

Include:

• ‘Managing items in crawl schedule’ - US 8666964 B1
• ‘Scheduling a recrawl’ - US 8386459 B1
• ‘Web crawler scheduler that utilizes sitemaps from websites’ - US 8037054 B2
• ‘Document reuse in a search engine crawler’ - US 8707312 B1
• ‘Minimizing visibility of stale content in web searching including revising web crawl intervals of documents’ - US 8407204 B2
• ‘Scheduler for search engine crawler’ - US 8042112 B1
• ‘Distributed crawling of hyperlinked documents’ - US 7305610 B1

IT SEEMS PRIORITIZATION AND GOOGLEBOT CRAWL EFFICIENCY ARE IMPORTANT TO SEARCH ENGINES

Page 7: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

“MANAGING ITEMS IN A CRAWL SCHEDULE” (GOOGLE PATENT US 8666964 B1)

PAGE ‘IMPORTANCE’ AND URL SCHEDULING

3 layers / tiers:

• Real Time Crawl: crawled multiple times daily
• Daily Crawl: crawled daily or bi-daily
• Base Layer Crawl: crawled least, on a ‘round robin’ basis; split into segments on random rotation, and only the ‘active’ segment is crawled

URLs are moved in and out of layers based on past visits data (retrieved from logs)
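A rough sketch of how URLs might be placed into the three layers from past-visit data. The `VisitHistory` fields, the scoring rule and the cutoff values are all assumptions made up for illustration; the patent does not publish them.

```python
from dataclasses import dataclass


@dataclass
class VisitHistory:
    """Past-visit data for one URL (hypothetical fields, as if read from history logs)."""
    url: str
    visits: int            # number of past crawl visits
    changes_detected: int  # visits on which the content had changed
    importance: float      # assumed page importance score between 0.0 and 1.0


def assign_layer(history: VisitHistory,
                 real_time_cutoff: float = 0.6,
                 daily_cutoff: float = 0.2) -> str:
    """Pick a crawl layer from observed change rate and importance (illustrative rule)."""
    if history.visits == 0:
        return "base"  # nothing learned yet, so start in the base layer
    change_rate = history.changes_detected / history.visits
    score = change_rate * history.importance
    if score >= real_time_cutoff:
        return "real_time"
    if score >= daily_cutoff:
        return "daily"
    return "base"


pages = [
    VisitHistory("/news/", visits=50, changes_detected=48, importance=0.9),
    VisitHistory("/category/widgets/", visits=30, changes_detected=12, importance=0.6),
    VisitHistory("/terms/", visits=10, changes_detected=0, importance=0.3),
]
layers = {page.url: assign_layer(page) for page in pages}
```

In this toy data the frequently changing, important page lands in the real time layer, and the page that never changes stays in the base layer.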

Page 8: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

THE KEY SEARCH ENGINE (THE APPLIANCE) CHARACTERS

• 10 types of Googlebot
• The URL Scheduler
• Indexer / Ranking Engine

SUPPORTING ROLES (LOG MANAGERS & PAGE RANKERS)

• History Logs
• Link Logs / Link Maps
• Anchor Logs / Anchor Maps
• Status Logs
• Page Rankers

Page 9: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

THE ‘LOG’ MANAGERS (‘The Clerks’)

Consider these as ‘record-keepers’ (they record info on the crawled URLs)

History Logs - JOBS INCLUDE:

• Retrieves previous copies of documents for comparison with newly retrieved copies for purposes of ‘change frequency’ and ‘change weight’ calculation (last modified & update rate)

Link Logs - JOBS INCLUDE:

• “identifies all the links (e.g., URLs, also called outbound links) that are found in the document associated with the record and the text that surrounds the link” (Brawer et al, Google Patent). INFO USED TO MAKE LINK MAPS

Other Logs - INCLUDE:

• Anchor Logs & Maps
• Status Logs

A LOT MORE INFO ON LOGS AT: Scheduler for Search Engine Crawler, US 20100241621 A1

Page 10: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

SUPERVISOR - TEAM LEADER – ‘THE URL SCHEDULER’

Think of it as Google’s line manager or ‘air traffic controller’ for Googlebots in the web crawling system

JOBS

• Schedules Googlebot visits to URLs
• Decides which URLs to ‘feed’ to Googlebot
• Uses data from the history logs about past visits
• Assigns visit regularity of Googlebot to URLs
• Drops ‘hints’ to Googlebot to guide on types of content NOT to crawl and excludes some URLs from schedules
• Analyses past ‘change’ periods and predicts future ‘change’ periods for URLs (BASED ON PAST VISIT DATA) for the purposes of scheduling Googlebot visits
• Checks ‘page importance’ in scheduling visits (PRIORITIES)
• Assigns URLs to ‘layers / tiers’ for crawling schedules (REAL TIME, DAILY, BASE LAYER SEGMENT)

The URL Scheduler controls the meal planner

Scheduler checks URLs for ‘importance’, ‘boost factor’ candidacy, ‘probability of modification’

‘Budgets’ are allocated

Carefully controls the list of URLs Googlebot visits

Page 11: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

THE 10 GOOGLEBOTS

BOT TYPES HAVE VARYING DEGREES OF ‘BUSY-NESS’

• GOOGLEBOT WEB SEARCH
• MEDIA TYPES: Image (crawls images only), Video, News
• PAID SEARCH TYPES: Adsense, Adsbot (quality checks)
• MOBILE TYPES: Smartphone, Apps, Featurephone, Mobile Adsense
• Babybot (‘the Noob’)

Page 12: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

GOOGLEBOT JOBS

JOBS

• ‘Ranks nothing at all’
• Takes a list of URLs to crawl from URL Scheduler
• Job varies based on ‘bot’ type (e.g. Image bot seems a bit of a ‘part timer’ (images change less frequently))
• Runs errands & makes deliveries for the URL server, indexer / ranking engine and logs
• Makes notes of outbound linked pages and additional links for future crawling (in order for them to be assigned to future crawling schedules)
• Takes notes of ‘hints’ from URL scheduler when crawling
• Tells tales of URL accessibility status, server response codes, notes relationships between links and collects content checksums (binary data equivalent of web content) for comparison with past visits by history and link logs
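To make the ‘content checksum’ idea concrete, here is a minimal sketch of comparing a freshly crawled copy of a page with the checksum stored from the last visit. SHA-256 is an assumption chosen for illustration; the slides do not say which checksum Google uses, and the function names are hypothetical.

```python
import hashlib


def content_checksum(html: str) -> str:
    """Return a hex digest standing in for the 'content checksum' of a page."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def has_changed(previous_checksum: str, new_html: str) -> bool:
    """Compare the checksum from the last visit with the freshly crawled copy."""
    return content_checksum(new_html) != previous_checksum


# Checksum recorded on the previous visit (as a history log might store it).
last_visit = content_checksum("<h1>Blue widget</h1><p>Price: £10</p>")

same_page = "<h1>Blue widget</h1><p>Price: £10</p>"
new_price = "<h1>Blue widget</h1><p>Price: £12</p>"

unchanged = has_changed(last_visit, same_page)   # identical content
changed = has_changed(last_visit, new_price)     # the price differs
```

A whole-page checksum only says that something changed, not what changed, which is why the later slides on ‘critical material change’ matter.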

Page 13: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

‘INDEXER’

Looks at all of the evidence from the various logs (and the page rankers) of the search engine to index the URLs

• Uses the combined data collected in order to index the results for a given query

• TAKES DATA FROM THE LOGS TO GENERATE INDEXES

“The indexer(s) 724 use the anchor maps 718 and other logs 716 to generate index(es) 726. The index(es) are used by the search engine to identify documents matching queries entered by users of the search engine.” (Web crawler scheduler that utilizes sitemaps from websites, US 8037054 B2, Google Patent, Brawer et al, pub 2011)

Page 14: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

I ASKED JOHN MUELLER AT WEBMASTER HANGOUT ABOUT URL QUEUES

GOOGLE WEBMASTER HANGOUT QUESTION ON ‘URL QUEUEING’

BUT WHAT OTHER EVIDENCE DO WE HAVE TO SUPPORT OUR THEORIES?

“URLS ARE NOT ALL CRAWLED IN ORDER, BUT THAT SOME RECEIVE MULTIPLE DAILY CRAWLS, SOME DAILY, SOME WEEKLY AND SOME VERY INFREQUENTLY” https://www.seroundtable.com/google-explains-why-the-search-console-has-reporting-delays-21688.html

LOW IMPORTANCE URLs APPEAR TO BE ‘QUEUED FOR LATER’ AND VISITED INFREQUENTLY WHEN THERE IS SPARE CAPACITY (LOWER PRIORITY) (SCHEDULES)

Page 15: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

WHICH APPEARED TO SUPPORT…

“Priority scores are computed for each remaining document identifier based on predetermined criteria (e.g., a page importance score of the document).” (Zhu et al, 2011)

PATENT - Scheduler for search engine crawler, US 8042112 B1
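As a minimal sketch of priority-based selection under a capacity limit: score every candidate URL, crawl the top few, and leave the rest ‘queued for later’. The importance scores, the `capacity` parameter and the function name are hypothetical values chosen for illustration, not figures from the patent.

```python
import heapq
from typing import Dict, List


def select_for_crawl(importance_by_url: Dict[str, float], capacity: int) -> List[str]:
    """Return the `capacity` highest-scoring URLs, highest score first."""
    return heapq.nlargest(capacity, importance_by_url, key=importance_by_url.get)


# Hypothetical page importance scores for five candidate URLs.
candidates = {
    "/": 0.95,
    "/category/": 0.70,
    "/product/123": 0.40,
    "/product/123?sort=asc": 0.05,
    "/tag/misc/": 0.10,
}

scheduled = select_for_crawl(candidates, capacity=3)
deferred = [url for url in candidates if url not in scheduled]
```

Here the parameter variant and the tag page fall outside the capacity of three and are deferred, which mirrors the ‘queued for later’ behaviour described on the previous slide.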

Page 16: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

CRAWL BUDGET

1. CRAWL BUDGET – “AN ALLOCATION OF CRAWL VISITS TO A HOST”

2. ROUGHLY PROPORTIONATE TO PAGERANK AND HOST SPEED / CAPACITY

3. PAGES WITH A LOT OF LINKS GET CRAWLED MORE

4. THE VAST MAJORITY OF URLS ON THE WEB DON’T GET A LOT OF BUDGET ALLOCATED TO THEM (LOW TO 0 PAGERANK URLS).

Mostly taken from Eric Enge’s interview with Matt Cutts (@mattcutts) from 2010

https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/

Page 17: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

I ASKED SOME STUFF ABOUT CRAWL BUDGET ALLOCATION

DISTRIBUTED CRAWLING OF HYPERLINKED DOCUMENTS - Patent Abstract – “Hyperlinked documents to be crawled are grouped by host and the host to be crawled next is selected according to a stall time of the host. The stall time can indicate the earliest time that the host should be crawled and the stall times can be a predetermined amount of time, vary by host and be adjusted according to actual retrieval times from the host” (Dean et al (Google, 2014))

IT SEEMS – BUDGET IS ASSIGNED TO THE HOST (I.P) AND THEN SHARED BETWEEN THE SITES THERE
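The ‘stall time’ idea in the abstract can be sketched with a priority queue of hosts ordered by the earliest time each may be crawled again. The `HostQueue` class and the rule of multiplying retrieval time by a `factor` are assumptions for illustration; the abstract only says stall times can be adjusted according to actual retrieval times.

```python
import heapq


class HostQueue:
    """Hosts ordered by stall time (the earliest time each host may be crawled again)."""

    def __init__(self):
        self._heap = []  # entries are (stall_time, host)

    def add(self, host, stall_time):
        heapq.heappush(self._heap, (stall_time, host))

    def next_host(self, now):
        """Pop the host with the earliest stall time, if that time has been reached."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[1]
        return None

    def reschedule(self, host, now, retrieval_seconds, factor=10.0):
        """Set the next stall time from the observed retrieval time (assumed rule)."""
        self.add(host, now + retrieval_seconds * factor)


queue = HostQueue()
queue.add("fast.example", stall_time=0.0)
queue.add("slow.example", stall_time=5.0)

first = queue.next_host(now=1.0)      # fast.example is due
too_early = queue.next_host(now=1.0)  # slow.example is not due until t=5
queue.reschedule("fast.example", now=1.0, retrieval_seconds=0.2)  # due again at t=3
second = queue.next_host(now=4.0)     # fast.example again, slow.example still waiting
```

Under this assumed rule a host that responds slowly gets a later stall time and so fewer visits, which fits the ‘host speed / capacity’ point from the Matt Cutts interview.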

Page 18: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

I ASKED SOME STUFF ABOUT LINKS AND CRAWL BUDGET (in light of 2012 ‘DISAVOW TOOL’)

TIP (IMHO - DAWN) – YOU MAY NEED TO RESTRUCTURE / FLATTEN SO ‘BUDGET’ CAN REACH IMPORTANT URLS

“Thanks John” - Waving :)

Page 19: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

IT SEEMS THERE ARE MORE FACTORS AFFECTING ‘CRAWL BUDGET’??

Transcript: https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/

WEB PROMOS Q & A WITH GOOGLE’S ANDREY LIPATTSEV

Andrey chatting with Ammon J seemed to imply that a lot more things affect crawl frequency now than just PageRank

Page 20: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

ARE THERE OTHER FACTORS AFFECTING BUDGET AND / OR ‘CRAWL RANK’ AS WELL AS PAGERANK AND SPEED?

I ASKED @johnmu IF I COULD ASK WHETHER THE FACTORS AFFECTING CRAWL BUDGET HAD CHANGED?

JOHN SAID – “Sure…You can always ask” :) :) – “But, he didn’t tell me what they were (if any)”

SO I ASKED IF I COULD ASK IF FACTORS AFFECTING CRAWL BUDGET / CRAWL FREQUENCY HAD CHANGED – I.E. ADDITIONAL FACTORS?

Page 21: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

GOOGLE PATENT – ‘NOT ALL ‘CHANGE’ IS CONSIDERED EQUAL’ (CRITICAL & NON-CRITICAL)

“Changes can be described as critical or non-critical and that determination may depend on the portion of the document changed, or the context of the changes, rather than the amount of text or content changed. Sometimes a change to a document may be insubstantial, e.g., the change of advertisements associated with a document. In this case, it is more appropriate to ignore those accessory materials in a document prior to making content comparisons. In other cases, e.g., as part of a product search, not every piece of information in a document is weighted equally by a potential user. For instance, the user may care more about the unit price of the product and the availability of the product. In this case, it is more appropriate to focus on the changes associated with information that is deemed critical to a potential user rather than something that is less significant, e.g., a change in a product's colour” (Minimizing Visibility of Stale Content in Web Searching Including Revising Web Crawl Intervals of Documents - Anton Carver, Google Patent - US 20130226897 A1, pub 2013)

Probability & predictability of future ‘freshness’ (newness or critical material change) (‘CHANGE RATE’ APPEARS TO BE ‘LEARNED’)

‘CHANGE RATE & CHANGE WEIGHT THRESHOLDS’

Page 22: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

CRITICAL MATERIAL CONTENT CHANGE (IMPORTANT CHANGE) & FEATURE WEIGHTS

$$C = \sum_{i=0}^{n-1} \mathrm{weight}_i \cdot \mathrm{feature}_i$$

NOT JUST ‘RANDOM’ CHANGE like Shuffle($variable) or RAND($variable)

NOT ALL ‘FEATURES’ ARE CREATED EQUAL ACCORDING TO THIS LINE IN PATENTS – “weight_i * feature_i”

EXAMPLE FEATURES – E.G. A CHANGE IN PRICE (FEATURE) MAY BE WEIGHTED HIGHER THAN A CHANGE IN COLOUR (FEATURE) – FEATURE WEIGHT PRICE > FEATURE WEIGHT COLOUR

“DEPENDS ON HOW OFTEN THE PAGE CHANGES” IS MENTIONED A LOT IN WEBMASTER HANGOUTS

Minimizing Visibility of Stale Content in Web Searching Including Revising Web Crawl Intervals of Documents - Anton Carver, Google Patent - US 20130226897 A1, pub 2013
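A worked sketch of the weighted sum on this slide, showing why a price change can count as ‘critical’ while a colour or advert change does not. The weights, the 0/1 feature encoding and the threshold are hypothetical numbers picked for illustration; the patent does not publish real values.

```python
def change_score(weights, features):
    """C = sum of weight_i * feature_i over all weighted features."""
    return sum(weights[name] * features.get(name, 0.0) for name in weights)


# Hypothetical weights: price and availability matter more than colour or adverts.
WEIGHTS = {"price": 0.5, "availability": 0.4, "colour": 0.1, "adverts": 0.0}

# Feature value is 1.0 if that part of the page changed since the last crawl.
price_update = {"price": 1.0}
cosmetic_update = {"colour": 1.0, "adverts": 1.0}

CRITICAL_THRESHOLD = 0.3  # assumed cutoff for a 'critical material change'

price_is_critical = change_score(WEIGHTS, price_update) >= CRITICAL_THRESHOLD
cosmetic_is_critical = change_score(WEIGHTS, cosmetic_update) >= CRITICAL_THRESHOLD
```

With these made-up weights the price update scores 0.5 and clears the threshold, while the colour-plus-adverts update scores 0.1 and does not, even though more of the page changed.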

Page 23: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

“BE CONSISTENT” - (@johnmu, Nov 2015)

SMX MILAN (November 2015), reported here by SERoundtable on quote from Google’s John Mueller @johnmu: https://www.seroundtable.com/google-number-one-seo-advice-be-consistent-21196.html

DA - I HAVE A FEELING CONSISTENCY IS IMPORTANT FOR ‘HISTORY LOGS’ TO ‘LEARN’ CHANGE RATES / THRESHOLDS

Page 24: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

URL EXCLUSIONS FOR TRIPPING ‘MINIMUM-CRAWL-THRESHOLD’ REVISIT ‘HINTS’ AND ‘SPAM’ URLs

‘RANDOM’ CHANGE created programmatically, like Shuffle($variable) or RAND($variable), may even be seen as ‘hints’ TO GOOGLEBOT TO ‘NOT’ CRAWL

HINTS = ‘MEH CHANGES’ (E.G. PATTERNS OF ‘SAME OLD, SAME OLD STUFF’ DUPLICATES, PROGRAMMATICALLY GENERATED CONTENT)

“Hints may also be employed on pages that are automatically generated and/or contain dynamically generated elements that result in the page having a different checksum every time it is crawled” (Managing Items In A Crawl Schedule, Google Patent - US 8666964 B1)

Page 25: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

GOOGLE THINKS CRAWL BUDGET IS IMPORTANT FOR SEO

CIRCA JULY 2015

BUT… NO ONE HAS EVER OFFICIALLY SAID THAT THERE’S ANY KIND OF RANKING BENEFIT FROM POSITIVE CRAWL ACTIVITY

Page 26: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

ENTER ‘CRAWL RANK’ - A BENEFIT OF CRAWL OPTIMISATION??

“The pages that aren’t crawled as often are pages with little to no PageRank. CrawlRank is the difference in this very large pool of pages. You win if you get your low PageRank pages crawled more frequently than the competition.”

“I’m still not entirely convinced this is what is happening, but I’m seeing success using this philosophy.” - A J Kohn @ajkohn

OTHERS SEEM TO BE TRACKING IT TOO – E.G. SEO CLARITY

DOES THE MYTHOLOGICAL ‘CRAWL RANK’ BENEFIT EVEN EXIST?

Page 27: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

DOES ‘CRAWL RANK’ STILL APPLY?

I ASKED A J KOHN IF HE STILL THOUGHT IT APPLIED NOW?

“Thanks A.J” - Waving :)

“I still see evidence that getting pages crawled frequently (within 7-10 days) seems to have an impact on their ability to rank well” (AJ Kohn, 2016)

Page 28: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

IS LONG-TAIL ‘LEAP-FROGGING’ (AND SOME CLUSTERING) WHAT ‘CRAWL RANK’ LOOKS LIKE?

SITES JUMPING OVER EACH OTHER ON ‘LONG TAILED QUERIES’ IN AN ENDLESS LAST LAP RACE?

Page 29: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

HOW IT APPEARS TO WORK – ‘YOU DON’T ALWAYS HAVE TO FIGHT THE ‘BOSS’ URLS’

Why fight with the Hulk when you can be Yoda?

Image credit: Flickr

Page 30: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

EVEN STRONGER DOMAINS HAVE WEAKER URLS

THE SITES MAY ALL BE STRONGER THAN YOU BUT THERE ARE A LOT OF PAGES ON BIG SITES WITH NO STRENGTH

YOU WON’T BEAT THE STRONG URLs WITH CRAWL OPTIMISATION ALONE

You are unlikely to beat these URLs with crawl optimisation techniques alone. These URLs are not the intended target for these tactics – TOO STRONG

SAVE SOME BATTLES FOR LATER

Strong URLs

Page 31: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

FIGHT AT A URL V URL OR TEMPLATE V TEMPLATE LEVEL WITH LOW TO 0 PAGE RANK URLS

PICK OFF THE WEAKER URLS WHEN BATTLING WITH A BIG SITE – LOW TO NO PAGE RANK URLS

• TARGET THE LOW STRENGTH PAGES FURTHER DOWN IN THE SITES OF COMPETITORS (SUBCATEGORY PAGES, E.G. IN ECOMMERCE SITES)

• THERE ARE A LOT OF PAGES (MILLIONS) WITH LITTLE TO NO PAGE RANK

• YOU’RE AIMING TO BEAT THOSE

VIRTUALLY NO STRENGTH IN 1,000s OF URLS

POWERFUL WELL KNOWN BRANDS BUT NO STRENGTH LOWER DOWN THE ARCHITECTURE

MANY LOW VOL / DEEP URLs ARE COMPLETE WEEDS ON BEHEMOTH SITES

Weak URLs

Page 32: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

A BIG FACTOR? - ‘EMPHASIS OF ‘URL IMPORTANCE’’ (E.G. ON PARAMETERS)

FULL TRANSCRIPT - https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/

THIS WAS IN THE ORIGINAL INTERVIEW WITH MATT CUTTS

ALSO LOTS OF THE PATENTS MENTION “PAGE IMPORTANCE (WHICH MAY INCLUDE PAGERANK)”

Page 33: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

WHICH SEEMS TO SUPPORT THIS PAPER BY PAGE ET AL ON IMPORTANCE

“Thanks Bill” - Waving :)

THIS REFERENCES THE PROBLEM OF THE SIZE OF THE WEB AND PRIORITIZES IMPORTANT PAGES

Efficient Crawling Through URL Ordering, Page et al

Page 34: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

‘POINT TO THE NEEDLE IN THE HAY’ – EMPHASISE IMPORTANCE

• Googlebot is also ‘hunting’… Hunting for relevant ‘needles’ in 1,000,000,000s of straws of ‘hay’ on the web

• It’s about making your ‘one needle’ stand out in importance in not just your own site’s haystack, but tens of thousands of competing similar straws of hay in other sites’ haystacks… (DON’T JUST MAKE YOUR HAYSTACK BIGGER)

“Hey, you Googlebot… This is the needle” via architectural internal linking without blur of duplication or too many redirects or canonicalization

Page 35: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

WHICH OF YOUR URLs ARE IMPORTANT?

“If you don’t consistently indicate via clean internal individual URL importance emphasis, the importance of your URLs, how will Googlebot know which are the most important?”

Page 36: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

INTERNAL LINKS COUNT (A LOT) (RELATIVE IMPORTANCE VOTES ON URL IMPORTANCE FROM YOUR OWN SITE)

THESE ARE YOUR ‘VOTES’ TO GOOGLEBOT ON THE IMPORTANCE OF EACH URL

EMPLOY ‘CONSISTENT’ INTERNAL LINK STRATEGIES

THINK OF THESE AS ‘WALL-TIES’ HOLDING YOUR BUILDING (SITE ARCHITECTURE) TOGETHER

STOP VOTING FOR THE WRONG URLS FROM WITHIN YOUR OWN SITE.

WRONG TARGETS RANKING?… CHECK INTERNAL LINKS

From Google Support Pages

Consistent internal & external emphasis of a URL’s ‘IMPORTANCE’

Page 37: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

NEGATIVE CONSEQUENCES FROM POOR CRAWL VISITS (E.G. SPIDER TRAPS (INFINITE LOOPS), INDIVIDUAL URLS VISITED LESS AND LESS FREQUENTLY BECAUSE THERE ARE TOO MANY)

BUT IS THERE PERHAPS AN OPPOSITE OF ‘CRAWL RANK’? - ‘CRAWL TANK’?? IS THERE AN ADVERSE EFFECT WHEN CRAWLING GOES BAD?

Page 38: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

WELL - I’VE SEEN ‘CRAWL TANK’ – IT AIN’T PRETTY

SITE SEO DEATH BY TOO MANY URLS AND INSUFFICIENT CRAWL BUDGET TO SUPPORT THEM (EITHER DUMPING A NEW THIN PARAMETER INTO A SITE OR AN INFINITE LOOP (CODING ERROR) (SPIDER TRAP))

“BEEN THERE, DONE THAT”

Page 39: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

IT KIND OF LOOKS A BIT LIKE THIS

”BEEN THERE, DONE THAT”

DEFINITELY

Page 40: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

‘EXPONENTIAL URL UNIMPORTANCE’?

Your URLs exponentially, CONSISTENTLY confirmed unimportant to queries with each iterative crawl visit to other similar or duplicate content checksum URLs?

MULTIPLE RANDOM URLs competing for same query confirm irrelevance of all competing in-site URLs with no dominant relevant IMPORTANT URL?

Page 41: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

STILL…SILVER LININGS

“EVERY SEO NEEDS A ‘FLATLINER’ SITE TO RESURRECT AND MAKE BETTER…” RIGHT?

Page 42: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

LIKES

• Going ‘where the action is’ in sites
• The ‘need for speed’
• Logical structure
• Correct ‘response’ codes
• XML sitemaps
• ‘Successful crawl visits’
• ‘Seeing everything’ on a page
• Taking ‘hints’
• Clear unique single ‘URL fingerprints’ (no duplicates)
• Predicting likelihood of ‘future change’

DISLIKES

• Slow sites
• Too many redirects
• Being bored (Meh) (‘Hints’ are built in by the search engine systems – Takes ‘hints’)
• Being lied to (e.g. On XML sitemap priorities)
• Crawl traps and dead ends
• Going round in circles (Infinite loops)
• Spam URLs
• Crawl wasting minor content change URLs
• ‘Hidden’ and blocked content
• Uncrawlable URLs
• Duplicate URLs

CHANGE IS KEY

• Not just any change
• Critical material change
• Predicting future change
• Dropping ‘hints’ to Googlebot
• Sending Googlebot where ‘the action is’

BASED ON DATA FROM THE HISTORY LOGS - CAN WE INFLUENCE VIA CRAWL OPTIMISATION TO ESCAPE THE ‘BASE LAYER HOME’ OF THE ‘UNIMPORTANT’ URLS?

Page 43: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

HERE’S ONE I MADE EARLIER…

SOME CAVEATS

THIS IS A PERSONAL PROJECT – MY 20 IN 70:20:10 MIX

IT’S NOT MOBILE FRIENDLY OR HTTPS (HANGS HEAD IN SHAME), AND YES, IT NEEDS A MAKEOVER… BUT… TIME… , RESOURCES, BUDGET…BLAH BLAH

THERE IS NO ‘BIG BRAND’ MARKETING, VC BACKING, TV OR RADIO ADS (LIKE COMPETITORS) – JUST ME - ‘CHIPPING AWAY’

90%+ OF TRAFFIC IS NON-BRANDED GENERIC ORGANIC

Page 44: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

URL CRAWL FREQUENCY ‘CLOCKING’

Spreadsheet provided by @johnmu during Webmaster Hangout

https://goo.gl/1pToL8

ARE THE URLS THAT YOU WANT BEING CRAWLED ‘REAL TIME’, DAILY OR INFREQUENTLY? (REGULAR LOG ANALYSIS AND INTERVENTION TO EMPHASISE IMPORTANCE)

MY THOUGHTS (DA) - You need to find out which ones are getting crawled in the ‘real time’ schedule, the ‘daily crawl’ schedule and via random selection in the ‘dross’ (or UNLIKELY TO CHANGE A LOT / UNIMPORTANT) ‘base layer’ section. If it’s not the URLs that you want to be there, then formulate a plan to improve the ‘importance’ of URLS. (NOTE: JOHN DID NOT SAY THIS)
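One way to ‘clock’ crawl frequency yourself is to count Googlebot requests per URL in your server access logs. The sketch below assumes the common combined log format and matches on the user agent string only (a user agent can be spoofed; stricter checks use reverse DNS). The bucket cutoffs, helper names and sample log lines are invented for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# Combined log format: host ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<when>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Map each URL path to the list of dates on which Googlebot requested it."""
    hits = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match.group("agent"):
            when = datetime.strptime(match.group("when"), "%d/%b/%Y:%H:%M:%S %z")
            hits[match.group("path")].append(when.date())
    return hits


def crawl_bucket(dates, days_in_log):
    """Label a URL by average Googlebot visits per day (assumed cutoffs)."""
    per_day = len(dates) / days_in_log
    if per_day > 1:
        return "multiple daily"
    if per_day >= 0.5:
        return "daily or bi-daily"
    return "infrequent"


def sample_line(day, path, agent):
    """Build one made-up log line for the demonstration below."""
    return (f'66.249.66.1 - - [{day:02d}/Apr/2016:06:00:00 +0000] '
            f'"GET {path} HTTP/1.1" 200 5120 "-" "{agent}"')


GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
lines = [sample_line(day, "/news/", GOOGLEBOT) for day in (1, 1, 2, 3)]
lines += [sample_line(day, "/category/", GOOGLEBOT) for day in (1, 3)]
lines += [sample_line(2, "/terms/", GOOGLEBOT),
          sample_line(2, "/news/", "Mozilla/5.0 (Windows NT 10.0)")]  # not Googlebot

hits = googlebot_hits(lines)
buckets = {path: crawl_bucket(dates, days_in_log=3) for path, dates in hits.items()}
```

Running this regularly over real logs would show whether the URLs you care about sit in the frequent buckets or have drifted into the infrequent one.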

Page 45: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

LOSE THE ‘DEAD WOOD’ SO GOOGLEBOT DETECTS ‘IMPORTANCE’

FIX IT FOR A BETTER CRAWL

EMBRACE THE ‘410 GONE’

FLATTENING ARCHITECTURES, CONSISTENTLY AVOIDING CANNIBALISATION, INTERNAL LINK STRATEGIES, LINKING RELEVANT CONTENT TO RELEVANT CONTENT, UTILISING XML & FRONT FACING SITEMAPS AND STRONG HUB PAGES TO ‘HERD’ GOOGLEBOT AROUND THE SITE
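A small sketch of what ‘embracing the 410’ could look like in application code: deliberately retired URLs answer 410 Gone, while unknown URLs still answer 404 Not Found. The path sets and the `status_for` helper are hypothetical; a real site would more likely configure this in its web server or CMS.

```python
from http import HTTPStatus

# Hypothetical URL sets for illustration only.
RETIRED_PATHS = {"/old-category/", "/discontinued-product/"}  # removed on purpose
LIVE_PATHS = {"/", "/category/widgets/"}


def status_for(path):
    """Choose the response status for a requested path."""
    if path in RETIRED_PATHS:
        return HTTPStatus.GONE       # 410: removed on purpose, not coming back
    if path in LIVE_PATHS:
        return HTTPStatus.OK         # 200
    return HTTPStatus.NOT_FOUND      # 404: unknown URL


retired_status = status_for("/old-category/")
unknown_status = status_for("/never-existed/")
```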

Page 46: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

40,000 TOWNS, CITIES & VILLAGES

40,000+ towns, cities and villages across the UK multiplied by X site categories (THAT’S A LOT OF LONG TAIL QUERY VOLUME)

Page 47: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

FWIW – LONG TAIL CRAWL TECHNIQUES SEEM TO APPLY TO OTHER SEARCH ENGINES TOO

By shortening crawl paths and crawl frequency intervals, and emphasising the importance of subcategory URLs on frequently changed (fresh) URLs, it appears you may gain a competitive advantage on long tail queries

Page 48: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

IT’S ALIVE… NEEDS WORK… BUT ALIVE

CAVEAT: IT’S TOO COMPLEX TO ANSWER WITH A SIMPLE FEW EXAMPLES OF COURSE (TOO MANY FACTORS) – BUT… FOOD FOR THOUGHT

‘CRITICAL MATERIAL CHANGE FREQUENCY’ (FRESHNESS) AND DETECTED URL IMPORTANCE EMPHASIS VIA EXTERNAL OR INTERNAL SIGNALS (INC PAGERANK) SEEM KEY

IS IT ‘CRAWL RANK’, OR ‘EMPHASISING URL IMPORTANCE’ BETTER THAN COMPETITORS EMPHASISE THE IMPORTANCE OF LOW TO NO PAGERANK PAGES, WHERE FEW OTHER FACTORS SEPARATE?

Page 49: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

CRAWL BUDGET & ‘CRAWL RANK’ – OTHER FACTORS??

1. IT APPEARS TO BE APPORTIONED BY THE URL SCHEDULER (BUDGET)

2. PAGES WITH A LOT OF (HEALTHY??) LINKS GET CRAWLED MORE (EXTERNAL AND INTERNAL?) (BUDGET AND RANK?)

3. THERE ARE URL EXCLUSIONS – (‘HINT TRIPPERS’, OBJECTIONABLE CONTENT AND ‘SPAM URLS’??) (BUDGET)

4. ‘CRITICAL MATERIAL CHANGE’ (FRESHNESS) AND THE PROBABILITY AND PREDICTABILITY OF CHANGE CORRELATE (BUDGET)

5. ‘CONSISTENT’ EMPHASIS OF URL IMPORTANCE (BUT I THINK THAT THIS WAS ALWAYS THERE) MAY BE ‘CRAWL RANK’ (BUDGET AND RANK??)

‘CRAWL RANK’ - IS IT CORRELATION OR CAUSATION? (DO IMPORTANT PAGES GET CRAWLED MORE, OR IS IT BECAUSE THEY ARE CRAWLED MORE THEY ARE IMPORTANT?)

Page 50: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

CAN WEB PAGES CRAWLED INFREQUENTLY STILL RANK?

YES

THEY CAN STILL BE ‘IMPORTANT’

IT’S THE ONES YOU’RE INDICATING ARE UNIMPORTANT THAT YOU WANT TO KEEP AN EYE ON - #JUSTSAYING ;)

Page 51: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

“BE SMART ABOUT YOUR TAGS AND SITE ARCHITECTURE, STAY FRESH AND RELEVANT” (@maileohye, 2016)

SLIDE FROM APRIL 2016’S SEJSUMMIT ON SEO INSTRUCTIONS 2016 FROM GOOGLE’S @maileohye

Page 52: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

EITHER WAY - ARE ALL THE CHECKS AND BALANCES INDICATING YOU ARE STILL ON TRACK?

BECAUSE - BRINGING A ROCKET BACK ON COURSE IS ‘CHALLENGING’

REGULAR TESTS AND EARLY DIAGNOSIS ARE CRUCIAL – STOP, CHECK AND KEEP CHECKING

‘TANK’ OR ‘RANK’? – YOU DECIDE

Page 53: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

TWITTER - @dawnieando
GOOGLE+ - +DawnAnderson888
LINKEDIN - msdawnanderson

THANKS FOR LISTENING FOLKS :) Dawn Anderson @dawnieando

ENJOY BRIGHTON SEO

Page 54: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

REFERENCES

• http://www.internetlivestats.com/total-number-of-websites/
• Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al) - https://www.google.com/patents/US8707313
• Managing items in crawl schedule – Google Patent (Alpert) - http://www.google.ch/patents/US8666964
• Document reuse in a search engine crawler - Google Patent (Zhu et al) - https://www.google.com/patents/US8707312
• Web crawler scheduler that utilizes sitemaps (Brawer et al) - http://www.google.com/patents/US8037054
• Distributed crawling of hyperlinked documents (Dean et al) - http://www.google.co.uk/patents/US7305610
• Minimizing visibility of stale content (Carver) - http://www.google.ch/patents/US20130226897

Page 55: SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

REFERENCES

• Efficient Crawling Through URL Ordering (Page et al) - http://oak.cs.ucla.edu/~cho/papers/cho-order.pdf
• Crawl Optimisation (Blind Five Year Old – A J Kohn - @ajkohn) - http://www.blindfiveyearold.com/crawl-optimization
• Scheduling a recrawl (Auerbach) - http://www.google.co.uk/patents/US8386459
• Scheduler for search engine crawler (Zhu et al) - http://www.google.co.uk/patents/US8042112
• Google Explains Why The Search Console Reporting Is Not Real Time (SERoundtable) - https://www.seroundtable.com/google-explains-why-the-search-console-has-reporting-delays-21688.html
• Crawl Data Aggregation Propagation (Mueller) - https://goo.gl/1pToL8
• Matt Cutts Interviewed By Eric Enge - https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
• Web Promo Q and A with Google’s Andrey Lipattsev - https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/
• Google Number 1 SEO Advice – Be Consistent - https://www.seroundtable.com/google-number-one-seo-advice-be-consistent-21196.html