How Google Works: A Ranking Engineer's Perspective By Paul Haahr

How Google Works

A Ranking Engineer’s Perspective

Paul Haahr

SMX West

March 3, 2016

Google Search Today

Mobile First

Features

• spelling suggestions

• autocomplete

• related searches

• related questions

• calculator

• knowledge graph

• answers

• featured snippets

• maps

• images

• videos

• in-depth articles

• movie showtimes

• sports scores

• weather

• flight status

• package tracking

• …

Erin Simon

not legally mandated, but typically we call it "knowledge graph" externally rather than "knowledge panels"

Kara Berman

Agreed from a PR perspective.

Paul Haahr

Fixed. (I thought knowledge graph was the underlying data and knowledge panels were the presentation. We do refer to them as knowledge panels, at least in the local case. E.g., https://support.google.com/business/answer/6331288?hl=en. But it's no issue to change, so I changed it.)

Ranking

10 Blue Links

What documents do we show?

What order do we show them in?

Life of a Query

Two Parts of a Search Engine

• Ahead of time (before the query)

• Query processing

Before the Query

• Crawl the web

• Analyze the crawled pages

  • Extract links

  • Render contents

  • Annotate semantics

  • …

• Build an index

The Index

• Like the index of a book

• For each word, a list of pages it appears on

• Broken up into groups of millions of pages

  • At Google, these are called “shards”

  • 1000s of shards for the web index

• Plus per-document metadata

Kara Berman

Is this something we've said before? If not, do we want to share it?

Paul Haahr

I'm pretty sure Jeff, at least, has talked about it. (E.g., http://web.stanford.edu/class/cs276/Jeff-Dean-Stanford-CS276-April-2015.pdf) And it raised no issues for the engineers who reviewed it.
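The index slide above (for each word, a list of pages, split into shards) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function name, the round-robin shard assignment, and the whitespace tokenizer are hypothetical, and per-document metadata is not modeled. It is not a description of how the real index is built.

```python
from collections import defaultdict


def build_sharded_index(pages, num_shards):
    """Build a toy sharded inverted index.

    pages: dict mapping page id -> page text.
    Returns a list of shards; each shard maps word -> sorted page ids.
    """
    shards = [defaultdict(set) for _ in range(num_shards)]
    for position, (page_id, text) in enumerate(sorted(pages.items())):
        # Hypothetical assignment: pages are dealt out round-robin.
        shard = shards[position % num_shards]
        for word in text.lower().split():
            shard[word].add(page_id)
    return [{word: sorted(ids) for word, ids in shard.items()} for shard in shards]


PAGES = {
    "page1": "gm trucks for sale",
    "page2": "gm corn yields",
    "page3": "trucks and corn",
}
INDEX = build_sharded_index(PAGES, num_shards=2)
```

Each shard holds a complete word-to-pages list for its own subset of pages, so a query has to consult every shard.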

Query Processing

• Query understanding and expansion

• Retrieval and scoring

• Post-retrieval adjustments

Query Understanding

• Does the query name any known entities?

  • [san jose convention center]

  • [matt cutts]

• Are there useful synonyms?

  • [gm trucks]: “gm” → “general motors”

  • [gm corn]: “gm” → “genetically modified”

• Context matters
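The “gm” examples above show the same term expanding differently depending on the other words in the query. A minimal sketch of that idea follows; the lookup table and the function name are hypothetical, and a real system would not rely on a hand-written table like this.

```python
# Hypothetical table: (term, co-occurring query word) -> expansion.
CONTEXT_SYNONYMS = {
    ("gm", "trucks"): "general motors",
    ("gm", "corn"): "genetically modified",
}


def expand_query(query):
    """Return a dict of term -> expansion chosen from the query's own context."""
    words = query.lower().split()
    expansions = {}
    for word in words:
        for other in words:
            expansion = CONTEXT_SYNONYMS.get((word, other))
            if expansion:
                expansions[word] = expansion
    return expansions
```

With this table, `expand_query("gm trucks")` and `expand_query("gm corn")` give different expansions for “gm”, and a bare `expand_query("gm")` gives none because there is no context to decide.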

Retrieval and Scoring

• Send the query to all the shards

• Each shard

  • Finds matching pages

  • Computes a score for query+page

  • Sends back the top N pages by score

• Combine all the top pages

• Sort by score
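The retrieval steps above follow a scatter-gather pattern, sketched below under stated assumptions. The scoring function (a count of query-word hits) is a stand-in chosen only to make the example run, and `search_shard` and `search` are hypothetical names.

```python
import heapq


def search_shard(shard, query_words, top_n):
    """Score every matching page in one shard and return its top N.

    shard: dict mapping page id -> page text.
    The score is a toy stand-in: the number of query-word occurrences.
    """
    scored = []
    for page_id, text in shard.items():
        words = text.lower().split()
        score = sum(words.count(q) for q in query_words)
        if score > 0:  # keep matching pages only
            scored.append((score, page_id))
    return heapq.nlargest(top_n, scored)


def search(shards, query, top_n=2):
    """Send the query to all shards, combine their top pages, sort by score."""
    query_words = query.lower().split()
    candidates = []
    for shard in shards:
        candidates.extend(search_shard(shard, query_words, top_n))
    return sorted(candidates, reverse=True)


SHARDS = [
    {"a": "gm trucks", "b": "gm corn"},
    {"c": "trucks trucks gm", "d": "weather"},
]
RESULTS = search(SHARDS, "gm trucks")
```

Because each shard returns only its own top N, the combining step sorts a small candidate list rather than every matching page.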

Post-retrieval adjustments

• Host clustering, sitelinks

• Is there too much duplication?

• Spam demotions, manual actions

• …

Erin Simon

I'd avoid using the phrase "manual actions." is there another way you could talk about what you're doing so that it sounds less like we're deliberately interfering with the fair and neutral process of the algorithm? maybe something like 'legally mandated removals' to indicate that we are not just tweaking results for our own reasons.

Paul Haahr

This is meant to be about spam (consider the audience) and not legal removals. I originally had "manual penalties" and Cody said, per Larry's request, they now always use the terminology "manual actions." But I reversed the order to "Spam demotions, manual actions" to make it clear that they're related.

What do ranking engineers do? (version 1)

Write code for those servers

Scoring Signals

Signal

• A piece of information used in scoring

• Query independent – feature of page

  • PageRank, language, mobile friendliness, ...

• Query dependent – feature of page & query

  • keyword hits, synonyms, proximity, …
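The two kinds of signal above differ in their inputs: one needs only the page, the other needs the page and the query. The sketch below makes that distinction concrete. The weighted sum and its weights are purely assumptions for illustration; the slides do not say how signals are combined.

```python
from dataclasses import dataclass


@dataclass
class Page:
    text: str
    mobile_friendly: bool  # query-independent signal: known before any query


def keyword_hits(page, query):
    """Query-dependent signal: needs both the page and the query."""
    words = page.text.lower().split()
    return sum(words.count(q) for q in query.lower().split())


def score(page, query, hit_weight=1.0, mobile_weight=0.5):
    """Hypothetical combination of the two signals as a weighted sum."""
    return hit_weight * keyword_hits(page, query) + mobile_weight * float(page.mobile_friendly)
```

A query-independent signal such as `mobile_friendly` can be computed ahead of time and stored with the page, while `keyword_hits` can only be computed once the query arrives.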


Look for new signals.

Combine old signals in new ways.

Metrics

“If you can not measure it, you can not improve it.”

–Lord Kelvin (sort of)

Key Metrics

• Relevance

  • Does a page usefully answer the user’s query?

  • Ranking’s top-line metric

• Quality

  • How good are the results we show?

• Time to result (faster is better)

• ...

Higher results matter

• “Position weighted”

• “Reciprocally ranked” metrics

  • Position 1 is worth 1

  • Position 2 is worth ½

  • Position 3 is worth ⅓

  • Position 4 is worth ¼

  • …
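The reciprocal weights above (1, ½, ⅓, ¼, …) can be turned into a single number for a ranked list. This is a minimal sketch; the function name and the idea of feeding it one numeric rating per result are assumptions for illustration, not the actual metric definition.

```python
def position_weighted_score(ratings):
    """Sum per-result ratings, weighting position k by 1/k.

    ratings: ratings of the results in ranked order, position 1 first.
    """
    return sum(rating / position for position, rating in enumerate(ratings, start=1))


# The same two results in different orders: one good (1) and one bad (0).
GOOD_FIRST = position_weighted_score([1, 0])
GOOD_SECOND = position_weighted_score([0, 1])
```

Putting the good result first scores 1, while putting it second scores ½, so the metric rewards moving good results up.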


Optimize for our metrics

But where do the metrics come from?

Evaluation

How do we measure ourselves?

• Live Experiments

• Human Rater Experiments

Live Experiments

Live Experiments

• A/B experiments on real traffic

  • Similar to what many other websites do

• Look for changes in click patterns

  • Harder to understand than you might expect

• A lot of traffic is in one experiment or another

Interpreting Live Experiments

• Both pages P1 and P2 answer user’s need

• For P1, answer is on the page

• For P2, answer is on the page and in the snippet

• Algorithm A puts P1 before P2 ⇒ user clicks on P1 ⇒ “good”

• Algorithm B puts P2 before P1 ⇒ no click ⇒ “bad”

• Do we really think A is better than B?

Human Rater Experiments

Human Rater Experiments

• Show real people experimental search results

• Ask how good the results are

• Ratings aggregated across raters

• Published guidelines explain criteria for raters

• Tools support doing this in an automated way

Result Rating Task

Two Scales

• Needs Met

  • Does this page address the user’s need?

  • Our current relevance metric

• Page Quality

  • How good is the page?

Mobile First

Mobile First Rating

“Needs Met rating tasks ask [raters] to focus on mobile user needs and think about how helpful and satisfying the result is for the mobile users.”

How do we make it mobile-centric?

• More mobile queries than desktop in samples

• Pay attention to user’s location

• Tools display mobile user experience

• Raters visit websites on smartphones

Needs Met Rating

Needs Met Rating

• Fully Meets

• Highly Meets

• Moderately Meets

• Slightly Meets

• Fails to Meet

(Following examples are from Rater Guidelines)

Fully Meets

(Very) Highly Meets

Highly Meets

(More) Highly Meets

Moderately Meets

Slightly Meets

Fails to Meet

Page Quality Rating

Page Quality Concepts

• Expertise

• Authoritativeness

• Trustworthiness

High Quality Pages

• A satisfying amount of high quality main content

• The page and website are expert, authoritative, and trustworthy for the topic of the page

• The website has a good reputation for the topic of the page

Low Quality Pages

• The quality of the main content is low

• There is an unsatisfying amount of main content

• The author does not have expertise or is not trustworthy or authoritative for the topic

• The website has a negative reputation

• The secondary content is distracting or unhelpful

Optimizing Our Metrics

Ranking engineers

• Team of a few hundred computer scientists

• Focused on our metrics and signals

• Run lots of experiments

• Make lots of changes

Development Process

• Idea

• Repeat until ready:

  • Write code

  • Generate data

  • Run experiments

  • Analyze

• Launch report by Quantitative Analyst

• Launch review


Move results with good ratings up.

Move results with bad ratings down.

What Goes Wrong?

(And how do we fix it?)

Two kinds of problems

• Systematically bad ratings

• Metrics don’t capture things we care about

Bad Ratings

[texas farm fertilizer]

• User is looking for a brand of fertilizer

• Unlikely to want to go to the manufacturer’s headquarters

• Rater average called map of headquarters almost “Highly Meets”

Patterns of Losses

• Look for things we think are bad in results

  • Either live or from experiments

• Create examples for rater guidelines

New rater example

Missing Metrics

Low Quality Content in 2009-2011

• Lots of complaints about low quality content

• But our relevance metric kept going up

  • Low quality pages can be very relevant

  • We thought we were doing great

• ⇒ We weren’t measuring what we needed to

Quality Metric

• Gets directly at the quality issue

• Not the same as relevance

• Enabled development of quality-related signals

When the Metrics Miss Something


Fix rater guidelines or develop new metrics

(when necessary)

Thank you!

Questions?

