A Guide to Log Analysis with BigQuery

2009

God it’s bad.

-$1.5 Billion

Why hasn’t Google seen the changes on my page?

How should I prioritise errors in Search Console?

Are my canonicals being respected?

Does Google think this page is important?

PART 1: THE WHY

What can you do with logs?

PART 2: THE HOW

Getting logs

Analysing Logs

Processing Logs

What does a log look like?

123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] "GET /my_homepage HTTP/1.1" 200 2262 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

IP Address: 123.65.150.10

Timestamp: [23/Aug/2010:03:50:59 +0000]

Request type: GET

Page requested: /my_homepage (the homepage)

Protocol: HTTP/1.1

Status Code: 200

Size of the page (in bytes): 2262

User Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
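Each of those fields can be pulled out programmatically. A minimal sketch of parsing a combined-format log line with a regular expression (the field names are our own choice, not a standard):

```python
import re

# Combined Log Format: IP, identity, user, timestamp, request, status,
# response size, referer, user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_log_line(line):
    """Parse one access-log line into a dict, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

line = ('123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] '
        '"GET /my_homepage HTTP/1.1" 200 2262 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
entry = parse_log_line(line)
```

Real logs vary (some servers log extra fields, or hostname first), so the pattern usually needs adjusting per site.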


5 things

1. Diagnose crawling & indexation issues

[Chart: Five folders Googlebot crawled the most, by number of requests]

[Chart: % of organic sessions vs % of crawl budget]

2. Prioritisation

Prioritising 1: example.com/article, with Full and Print versions

example.com/article/full

example.com/article/print

Prioritising 2: example.com/article/pdf

Prioritising 3

3. Spot bugs & view site health

Delayed errors with a limit of 1000

4. How important does Google consider parts of your site?

My SEO was as bad as my design

But at least my hair was better

teflsearch.com

teflsearch.com/job-results

teflsearch.com/job-results/country/china

teflsearch.com/job-advert3455

Average number of times Googlebot crawled a template

1. teflsearch.com

2. teflsearch.com/job-results

3. teflsearch.com/job-results/country/china

4. teflsearch.com/job-advert3455

teflsearch.com/job-results: 35%

5. How fresh does it think your content is?

bit.ly/moz-fresh

Average number of times a page template is crawled by Googlebot

● Improve our internal linking
● Build trust with last modified date in sitemap


Talk to a developer and ask for information

Are all the logs in one place?

Hi x

I’m {x} from {y}. We’ve been asked to do some log analysis to understand better how Google is behaving on the website, and I was hoping you could help with some questions about the log set-up (as well as with getting the logs!).

What we’d ideally like is 3-6 months of historical logs for the website. Our goal is to look at all the different pages search engines are crawling on our website, discover where they’re spending their time, the status code errors they’re finding, etc.

There are also some things that are really helpful for us to know when getting logs.

Do the logs have any personal information in them?

We’re only concerned with the various search crawler bots like Google and Bing; we don’t need any logs from users, so any logs with emails, telephone numbers, etc. can be removed.

Do you have any sort of caching which would create separate sets of logs?

Is there anything like Varnish running on the server, or a CDN, which might create logs in a different location to the rest of your server? If so then we will need those logs as well as the ones from the server. (Although we’re only concerned about a CDN if it’s caching pages or serving from the same hostname; if you’re just using Cloudflare, for example, to cache external images then we don’t need it.)

Are there any sub-parts of your site which log to a different place?

Have you got anything like an embedded WordPress blog which logs to a different location? If so then we’ll need those logs as well.

Do you log hostname?

It’s really useful for us to be able to see the hostname in the logs. By default a lot of common server logging set-ups don’t log hostname, so if it’s not turned on, it would be very useful to have it turned on now for any future analysis.
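On Apache, for example, that might mean switching to a log format that puts the virtual host in front of the standard combined fields. A sketch only; the exact directive and format name depend on the server set-up:

```apache
# %v logs the canonical server name (hostname) ahead of the usual combined fields
LogFormat "%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined_with_host
CustomLog "logs/access_log" combined_with_host
```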

Is there anything else we should know?

Best,

{x}

Email for a developer

So we might have something that looks like this


BigQuery

Google’s online database for data analysis.

1. Ask powerful questions

2. Repeatable

3. Scalable

4. Combine with crawl data

5. Easy to set up

6. Easy to learn

What do we want from analysing our logs?

9,000,000 rows of data for 2 months.

400 - 800 queries


Format the logs so we can import them into BigQuery

Separate the Googlebot logs from all the other logs

Screaming Frog Log Analyser

Code something


bit.ly/logs-code
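The "code something" route can be as simple as a short script: read the raw access logs, keep only lines whose user agent claims to be Googlebot, and write the parsed fields out as CSV ready to upload to BigQuery. A minimal sketch (the column names are our own; a production version should also verify Googlebot by reverse DNS, since user agents can be spoofed):

```python
import csv
import re

# Combined Log Format; we only keep the fields we'll query later
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) "[^"]*" "(?P<user_agent>[^"]*)"'
)

def googlebot_rows(lines):
    """Yield parsed fields for lines whose user agent claims to be Googlebot."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and "Googlebot" in match.group("user_agent"):
            yield match.groupdict()

def write_csv(lines, out_path):
    """Write the Googlebot hits to a CSV file that BigQuery can import."""
    fields = ["ip", "timestamp", "method", "path", "status", "bytes", "user_agent"]
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        for row in googlebot_rows(lines):
            writer.writerow(row)
```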


Our data in BQ

We make sure we got what we wanted

THE QUESTION: What is the total number of requests Googlebot makes each day to our site?

Our first SQL query, built up step by step:

SELECT timestamp
FROM [mydata.log_analysis]

SELECT DATE(timestamp)
FROM [mydata.log_analysis]

SELECT DATE(timestamp) AS date
FROM [mydata.log_analysis]

SELECT DATE(timestamp) AS date, COUNT(*)
FROM [mydata.log_analysis]
GROUP BY date

SELECT DATE(timestamp) AS date, COUNT(*) AS number_of_requests
FROM [mydata.log_analysis]
GROUP BY date
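The same aggregation is easy to sanity-check locally before running it against the full dataset. A sketch using SQLite as a stand-in for BigQuery (the table and column names mirror the example above; SQLite's date() plays the role of BigQuery's DATE()):

```python
import sqlite3

# In-memory stand-in for the BigQuery table [mydata.log_analysis]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_analysis (timestamp TEXT, page TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO log_analysis VALUES (?, ?, ?)",
    [
        ("2010-08-23 03:50:59", "/my_homepage", 200),
        ("2010-08-23 09:12:01", "/jobs", 200),
        ("2010-08-24 11:00:00", "/my_homepage", 404),
    ],
)

# Equivalent of the final BigQuery query: requests per day
rows = conn.execute(
    """
    SELECT date(timestamp) AS date, count(*) AS number_of_requests
    FROM log_analysis
    GROUP BY date
    ORDER BY date
    """
).fetchall()
```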

Comparing logs to GSC crawl volume


Run queries

Find something weird

Go look at crawl & website

Our data in BQ

1. Diagnose crawling & indexation issues

2. Prioritisation

3. Spot bugs & view site health

4. How important does Google consider parts of your site?

5. How fresh does it think your content is?

1. Diagnose crawling & indexation issues

4. How important does Google consider parts of your site?

What are the top 20 URLs crawled by Google over our logs?

Login is my top crawled page and then search?

What are the top 20 page_path_1 folders crawled by Google over our logs?
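page_path_1 here is just the first folder in the URL path; if the processing step didn't already split it out, it's simple to derive. A sketch (the function name and sample paths are our own):

```python
from collections import Counter

def page_path_1(path):
    """Return the first folder of a URL path, e.g. '/jobs/china?page=2' -> 'jobs'."""
    path = path.split("?")[0].split("#")[0]  # drop query string and fragment
    segments = [s for s in path.split("/") if s]
    return segments[0] if segments else "/"

# Hypothetical request paths pulled from the logs
paths = ["/jobs/china", "/jobs/spain?page=2", "/about", "/", "/jobs"]
top_folders = Counter(page_path_1(p) for p in paths).most_common()
```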

Location folders are taking more than 70% of my budget

Getting data by the day

Page     Number of Googlebot Requests
page1    200,000
page2    120,000

Number of Googlebot requests day by day

3. Spot bugs & view site health

How many of each status code does Google find per day over our logs?

Number of Googlebot requests day by day

What are the most requested 404 URLs by Googlebot over the past 30 days?
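Against the same kind of table, that question is one more GROUP BY with a status filter. A sketch, again using SQLite locally with assumed column names (in BigQuery you would also filter the timestamp to the last 30 days):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_analysis (timestamp TEXT, page TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO log_analysis VALUES (?, ?, ?)",
    [
        ("2010-08-23 03:50:59", "/old-page", 404),
        ("2010-08-23 09:12:01", "/old-page", 404),
        ("2010-08-24 11:00:00", "/ad-snippet.js", 404),
        ("2010-08-24 12:00:00", "/my_homepage", 200),
    ],
)

# Most-requested 404 URLs, most hit first
top_404s = conn.execute(
    """
    SELECT page, count(*) AS hits
    FROM log_analysis
    WHERE status = 404
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 20
    """
).fetchall()
```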

Boy does it want that ad-tech snippet

5. How fresh does it think your content is?

How many times on average is each page in a page template crawled a day?

Average number of times a page template is crawled by Googlebot

How long does it take for a page to be discovered after being published?

What are the top 20 combinations of page_path_1 & page_path_2 folders crawled by Google over the time period of our logs?

Which pages have requests from Googlebot but don't appear in our crawl?

What are the top non-canonical pages being crawled?

Which are the most crawled parameters on the website?

How often are the most visited parameters crawled each day?

Which directories have the most 301 & 404 error codes?

Which pages are crawled with parameters and without parameters?

Which pages are only partly downloaded?

How many hits does each section get, when the sections are classified in an external dataset?

What percentage of a directory was crawled over the past 30 days?

What is the total number of requests across two different time periods?

That’s a lot of questions

bit.ly/logs-resource


In Summary

This is the thing you’re probably not doing

bit.ly/logs-resource

@dom_woodman
