Making the Web Searchable - Keynote ICWE 2015

Making the Web Searchable

Presented by Peter Mika, Yahoo Labs, June 25, 2015


Agenda

Web Search› How it works… and where it fails

Semantic Web› The promise and the reality

Semantic Search› Research and Applications at Yahoo

What’s next?› More intelligence!

Search is really fast, but not particularly intelligent

What is it like to be a machine?

Roi Blanco

What is it like to be a machine?

↵⏏☐ģ

✜Θ♬♬ţğ √∞ §®ÇĤĪ✜★♬☐✓✓ţğ★✜

✪✚✜ΔΤΟŨŸÏĞÊϖυτρ℠≠⅛⌫Γ≠=⅚ ©§ ★✓♪ΒΓΕ℠

✖Γ♫⅜±⏎↵⏏☐ģğğğμλκσςτ⏎⌥°¶§ ΥΦΦΦ ✗✕☐

⏎↵⏏☐ģğğğ

Real problem

Search needs (more) intelligence

Current paradigm: ad-hoc document retrieval › Set of documents that each may fulfill the information need› Relevance of those documents is clearly established by

• Textual similarity

• Authority

Queries outside this paradigm are hard› Semantic gap between the query and the content

• Queries that require a deeper understanding of the world at large

› Complex needs where no document contains the answer

Analysis of aggregate behavior is a challenge› We may answer two queries perfectly, without knowing how they are related

Semantic gap› Ambiguity

• jaguar

• paris hilton

› Secondary meaning

• george bush (and I mean the beer brewer in Arizona)

› Subjectivity

• reliable digital camera

• paris hilton sexy

› Imprecise or overly precise searches

• jim hendler

Complex needs› Missing information

• brad pitt zombie

• florida man with 115 guns

• 35 year old computer scientist living in barcelona

› Category queries

• countries in africa

• barcelona nightlife

› Transactional or computational queries

• 120 dollars in euros

• digital camera under 300 dollars

• world temperature in 2020

Examples of hard queries

Are there even any true keyword queries?

Users may have stopped asking them

The Semantic Web


Enter the Semantic Web

A social-technical ecosystem where the meaning of content is explicit and shared among agents (humans and machines)› Shared identifiers for real-world entities› Standards for exchanging structured data

• Data modeled as a graph

› Shared formal, schema languages

• Names of entity and relationship types

• Constraints on the entities, relationships and attributes
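The three ingredients above (shared identifiers, graph-shaped data, a schema with constraints) can be sketched in a few lines. This is only an illustration: the `ex:` identifiers, the facts and the single domain constraint are hypothetical and not taken from any real vocabulary.

```python
from collections import defaultdict

# Hypothetical identifiers and facts, chosen only for illustration.
TRIPLES = [
    ("ex:BradPitt", "rdf:type", "ex:Actor"),
    ("ex:BradPitt", "ex:name", "Brad Pitt"),
    ("ex:FightClub", "rdf:type", "ex:Movie"),
    ("ex:FightClub", "ex:actor", "ex:BradPitt"),
]

# Toy schema constraint: the type a subject must have to use a relationship.
SCHEMA_DOMAIN = {"ex:actor": "ex:Movie"}


def build_graph(triples):
    """Index (subject, predicate, object) triples as subject -> predicate -> objects."""
    graph = defaultdict(lambda: defaultdict(set))
    for subject, predicate, obj in triples:
        graph[subject][predicate].add(obj)
    return graph


def domain_violations(graph, schema_domain):
    """Return (subject, predicate) pairs whose subject lacks the required type."""
    violations = []
    for subject, props in graph.items():
        types = props.get("rdf:type", set())
        for predicate in props:
            required = schema_domain.get(predicate)
            if required and required not in types:
                violations.append((subject, predicate))
    return violations
```

Because every agent uses the same identifier for the same entity, facts from different sources about `ex:BradPitt` land on the same node without any string matching.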


“At the doctor's office, Lucy instructed her Semantic Web agent through her handheld Web browser. The agent promptly retrieved information about Mom's prescribed treatment from the doctor's agent, looked up several lists of providers, and checked for the ones in-plan for Mom's insurance within a 20-mile radius of her home and with a rating of excellent or very good on trusted rating services. It then began trying to find a match between available appointment times (supplied by the agents of individual providers through their Web sites) and Pete's and Lucy's busy schedules.”

› Tim Berners-Lee, James Hendler, Ora Lassila: The Semantic Web. Scientific American 284(5):34-43 (May 2001)

Huge expectations

In a perfect world, the Semantic Web is the end-game for IR

#ROI_BLANCO



The view from IR: skepticism

Mismatch to the IR problem› End-users

• Do not know the identifiers of things

• Not aware of the schema of the data

• Can’t be expected to learn complex query languages

› Focus on document retrieval

• Limited recognition of the value of direct answers (initially)

“Who is going to do the annotation?”› Focus still largely on text› Automated annotation/extraction tools produce poor results

Early experiments using external knowledge are unsuccessful› Query/document expansion using thesauri etc.


The Semantic Web off to a slow start

Complex technology › Set of standards sold as a stack

• URI, RDF, RDF/XML, RDFa, JSON-LD, OWL, RIF, SPARQL, OWL-S, POWDER …

› Not very developer friendly

Ideals of knowledge representation hard to enforce on the Web

No clear value proposition› Chicken and egg problem

• No users/use cases, hence no data

• No data, because no users/use cases


… and became practical

Simplified technology› Fewer, more developer friendly representations› Focusing on the lower layers of the stack› Data first, schemas/logic second

Giving up ideals about knowledge representation› Shared identifiers› Logical consistency› Distinction between real-world entities and web resources

Motivation for adoption› Search engines, Facebook, Pinterest, Twitter etc. investing in information extraction


Two important achievements

Linked Open Data› Social movement to (re)publish existing datasets

• 100B+ triples of data

• Encyclopedic, governmental, geo, scientific datasets

• Impact: background knowledge (basis for knowledge graphs in search engines)

Metadata inside HTML pages› Facebook’s OGP and schema.org

• Over 15% of all pages have schema.org markup (2013)

• Personal information, images, videos, reviews, recipes etc.

• Impact: remove the need for automated extraction


Example


View source


Caveat: not your perfect Semantic Web

Outdated, incorrect or incomplete data› Lack of write access or feedback mechanisms

Mistakes made by tools› Noisy information extraction› Entity linking (reconciliation)

Limited or no reuse of identifiers

Metadata not always representative of content

Semantic Search at Yahoo


Semantic Search research (2007-)

Emergence of the Semantic Search field › Intersection of IR, NLP, DB and SemWeb

• ESAIR at SIGIR

• SemSearch at ESWC/WWW

• EOS and JIWES at SIGIR

• Semantic Search at VLDB

Exploiting semantic understanding in the retrieval process› User intent and resources are represented using semantic models

• Semantic models typically differ across NLP, DB and Semantic Web

› Semantic models are exploited in the matching and ranking of resources

Semantic Search – a process view

Query Construction› Keywords› Forms› NL› Formal language

Query Processing› IR-style matching & ranking› DB-style precise matching› KB-style matching & inferences

Result Presentation› Query visualization› Document and data presentation› Summarization

Query Refinement› Implicit feedback› Explicit feedback› Incentives

[Figure labels: Document Representation, Knowledge Representation, Semantic Models, Resources, Documents]

Result presentation using metadata

Personal and private homepage of the same person (clear from the snippet, but it could also be automatically de-duplicated)

Conferences he plans to attend and his vacations from homepage, plus bio events from LinkedIn

Geolocation

“Microsearch” internal prototype (2007)

Yahoo SearchMonkey (2008)

1. Extract structured data› Semantic Web markup

• Example:

<span property="vcard:city">Santa Clara</span>

<span property="vcard:region">CA</span>

› Information Extraction

2. Presentation› Fixed presentation templates

• One template per object type

› Applications

• Third-party modules to display data (SearchMonkey)
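The extraction step for markup like the `vcard:city` example can be sketched with the Python standard library. This is a minimal sketch, not the SearchMonkey implementation: it only reads the `property` attribute, ignores the rest of RDFa (prefixes, `content`, nesting semantics), and assumes every opened element is also closed.

```python
from html.parser import HTMLParser


class PropertyExtractor(HTMLParser):
    """Collect (property, text) pairs from elements carrying a `property` attribute."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        # One entry per open element: its property name, or None.
        # Simplification: unclosed void elements such as <br> would desynchronize this stack.
        self._stack = []

    def handle_starttag(self, tag, attrs):
        self._stack.append(dict(attrs).get("property"))

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack and self._stack[-1]:
            self.pairs.append((self._stack[-1], text))


def extract_properties(html):
    """Return the list of (property, text) pairs found in an HTML fragment."""
    parser = PropertyExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs


SNIPPET = (
    '<div><span property="vcard:city">Santa Clara</span>, '
    '<span property="vcard:region">CA</span></div>'
)
```

The resulting pairs are what a fixed presentation template or a third-party module would then consume.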

Effectiveness of enhanced results

Explicit user feedback› Side-by-side editorial evaluation (A/B testing)

• Editors are shown a traditional search result and enhanced result for the same page

• Users prefer enhanced results in 84% of the cases and traditional results in 3% (N=384)

Implicit user feedback› Click-through rate analysis

• Long dwell time limit of 100s (Ciemiewicz et al. 2010)

• 15% increase in ‘good’ clicks

› User interaction model

• Enhanced results lead users to relevant documents (IV) even though they are less likely to be clicked than textual results (III)

• Enhanced results effectively reduce bad clicks!

See› Kevin Haas, Peter Mika, Paul Tarjan, Roi Blanco: Enhanced results for web search. SIGIR 2011: 725-734

Adoption among consumers of web content

Google announces Rich Snippets - June, 2009› Faceted search for recipes - Feb, 2011

Bing tiles - Feb, 2011

Facebook’s Like button and the Open Graph Protocol (2010)› Shows up in profiles and news feed› Site owners can later reach users who have liked an object

schema.org

Collaborative effort sponsored by large consumers of Web data› Bing, Google, and Yahoo! as initial founders (June, 2011)› Yandex joins schema.org in Nov, 2011

Agreement on a shared set of schemas for the Web› Available at schema.org in HTML and machine readable formats› Free to use under W3C Royalty Free terms

http://www.schema.org/docs/terms.html

Yahoo’s Knowledge Graph

[Figure: example fragment of the knowledge graph. Entities: Chicago Cubs, Chicago, Barack Obama, Carlos Zambrano, Brad Pitt, Angelina Jolie, Steven Soderbergh, George Clooney, Ocean’s Twelve, E/R, Fight Club, Dust Brothers. Relations: plays for, plays in, lives in, partner, directs, casts in, takes place in, music by, 10% off tickets for]

Nicolas Torzec: Making knowledge reusable at Yahoo!: a Look at the Yahoo! Knowledge Base (SemTech 2013)

Building the Knowledge Graph

Information extraction› Automated information extraction

• e.g. wrapper induction

› Metadata from HTML pages

• Focused crawler

› Public datasets (e.g. DBpedia)› Proprietary data

Data fusion› Manual mapping from the source schemas to the ontology› Supervised entity reconciliation

Ontology management› Editorially maintained OWL ontology with 300+ classes› Covering the domains of interest of Yahoo

Curation and quality assessment› Editors and user feedback still play a large role

Bellare et al.: WOO: A Scalable and Multi-tenant Platform for Continuous Knowledge Base Synthesis. PVLDB 2013

Welch et al.: Fast and accurate incremental entity resolution relative to an entity knowledge base. CIKM 2012

Entity linking/entity retrieval› Identifying the most relevant entity to the query

Entity recommendation› Given that the user is interested in one entity, which entity to recommend next?

Roi Blanco, Berkant Barla Cambazoglu, Peter Mika, Nicolas Torzec: Entity Recommendations in Web Search. ISWC 2013

Entity displays in web search


The importance of entities

Entity mention query = <entity> {+ <intent>}› ~70% of queries contain a named entity (entity mention queries)

• brad pitt height

› ~50% of queries have an entity focus (entity seeking queries)

• brad pitt attacked by fans

› ~10% of queries are looking for a class of entities

• brad pitt movies

› Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010: 771-780

Intent is typically an additional word or phrase

• Disambiguate, most often by type e.g. brad pitt actor

• Specify action or aspect e.g. brad pitt net worth, toy story trailer
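The pattern query = <entity> {+ <intent>} can be illustrated with a longest-match lookup. This is a sketch under strong assumptions: the alias dictionary and entity identifiers are hypothetical, and a production system would score competing matches rather than take the longest one.

```python
# Hypothetical alias dictionary; a real system would mine aliases at scale.
ALIASES = {
    "brad pitt": "Brad_Pitt",
    "toy story": "Toy_Story",
}


def split_entity_and_intent(query, aliases=ALIASES):
    """Return (entity_id, intent) using the longest alias found in the query.

    The intent is whatever remains of the query once the alias is removed.
    Returns (None, query) when no alias matches.
    """
    tokens = query.lower().split()
    n = len(tokens)
    for length in range(n, 0, -1):  # try the longest spans first
        for start in range(0, n - length + 1):
            span = " ".join(tokens[start:start + length])
            if span in aliases:
                rest = tokens[:start] + tokens[start + length:]
                return aliases[span], " ".join(rest)
    return None, query
```

For example, `split_entity_and_intent("brad pitt height")` separates the entity from the aspect the user asks about, while a category query such as "countries in africa" yields no entity at all.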

[Figure: example query “brad pitt height”, with related phrasings “how tall is”, “tall”, …]

Inverted index› Inspired by text retrieval

• Match individual keywords

• Score and aggregate

Parsing› Inspired by text parsing

• Find potential mentions of entities (spots) in query

• Score candidates for each spot

Two broad approaches to entity retrieval

[Figure: the spots “brad”, “pitt” and “brad pitt” in the query, each with candidate entities of type (actor), (boxer), (city) or (lake)]

Retrieval-based approach

Experimented with different index structures› Horizontal: one field for text and one for property name

› Vertical: One field per property

› Combination: one field per property weight (best performance in both AND/OR mode)

[Figure: index layouts compared: Horizontal, Vertical, R-Vertical]
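The three index layouts can be sketched as plain dictionaries built from one entity's property-value pairs. This is an illustration only: the example entity, the property names and the weight classes are hypothetical, and a real index would store postings rather than token lists.

```python
# Hypothetical entity description as (property, value) pairs.
EXAMPLE_ENTITY = [
    ("name", "Peter Mika"),
    ("affiliation", "Yahoo Labs"),
    ("comment", "Semantic search researcher"),
]

# Hypothetical assignment of properties to weight classes.
WEIGHT_CLASS = {"name": "important", "comment": "unimportant"}


def horizontal(entity):
    """Two aligned fields: the i-th token and the property it came from."""
    tokens, props = [], []
    for prop, value in entity:
        for token in value.lower().split():
            tokens.append(token)
            props.append(prop)
    return {"token": tokens, "property": props}


def vertical(entity):
    """One field per property."""
    fields = {}
    for prop, value in entity:
        fields.setdefault(prop, []).extend(value.lower().split())
    return fields


def by_weight_class(entity, weight_class, default="neutral"):
    """One field per property weight: properties sharing a weight share a field."""
    fields = {}
    for prop, value in entity:
        cls = weight_class.get(prop, default)
        fields.setdefault(cls, []).extend(value.lower().split())
    return fields
```

The trade-off is visible in the shapes: the horizontal layout has a fixed number of fields regardless of how many properties exist, the vertical layout grows with the vocabulary, and grouping by weight keeps the field count small while still letting the ranker weight fields differently.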


Retrieval-based approach

Ranking based on BM25F› R. Blanco, P. Mika, S. Vigna: Effective and Efficient Entity Search in RDF Data. ISWC 2011

› 42% improvement in MAP over best method in SemSearch 2010

› <100ms time for simple conjunctive queries

Open source implementation and demo using WebDataCommons data› glimmer.research.yahoo.com

› https://github.com/yahoo/Glimmer/
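For readers unfamiliar with BM25F, the idea is to combine per-field term frequencies into one weighted, length-normalized pseudo-frequency before applying BM25 saturation. The sketch below is a simplified textbook form, not the variant or the parameters used in the cited paper or in Glimmer; the field names, weights and the `k1` and `b` defaults are assumptions.

```python
import math


def bm25f_score(query_terms, doc_fields, field_weights, avg_field_len,
                doc_freq, num_docs, k1=1.2, b=0.75):
    """Score one document (field name -> token list) against the query terms."""
    score = 0.0
    for term in query_terms:
        # Weighted, length-normalized term frequency summed over all fields.
        pseudo_tf = 0.0
        for field, tokens in doc_fields.items():
            tf = tokens.count(term)
            if tf == 0:
                continue
            norm = 1.0 - b + b * len(tokens) / avg_field_len[field]
            pseudo_tf += field_weights.get(field, 1.0) * tf / norm
        if pseudo_tf == 0.0:
            continue
        df = doc_freq.get(term, 0)
        # Non-negative idf variant (an assumption; several variants exist).
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score
```

With a higher weight on a name-like field than on a free-text field, a match on the entity's name outranks the same match buried in a comment, which is the effect the fielded index layouts are meant to enable.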



Entity linking approach

Large-scale entity/alias dictionaries› Alias mining from usage data, Wikipedia etc.

Dynamic segmentation

Novel method for scoring alias matches

› Completely unsupervised› Combination of

• Keyphraseness: how likely is a segment to be an entity mention?

• Commonness: how likely is it that a linked segment refers to a particular entity?

• Context-model based on word2vec representation

› Roi Blanco, Giuseppe Ottaviano and Edgar Meij. Fast and space-efficient entity linking in queries. WSDM 2015
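Keyphraseness, commonness and dynamic segmentation can be put together in a small sketch. This is not the scoring function of the cited paper: the counts are hypothetical, the segment score (keyphraseness times commonness, summed over segments) is an assumed simplification, and the word2vec context model is left out.

```python
# Hypothetical counts, e.g. as they might be mined from anchor text or query logs.
ALIAS_STATS = {
    "brad pitt": {"linked": 90, "total": 100,
                  "entities": {"Brad_Pitt": 85, "Brad_Pitt_(boxer)": 5}},
    "brad": {"linked": 10, "total": 100,
             "entities": {"Brad_Pitt": 6, "Brad,_Romania": 4}},
    "pitt": {"linked": 20, "total": 100,
             "entities": {"University_of_Pittsburgh": 12, "Brad_Pitt": 8}},
}


def keyphraseness(alias, stats=ALIAS_STATS):
    """Fraction of the alias's occurrences that are entity mentions."""
    s = stats.get(alias)
    return s["linked"] / s["total"] if s else 0.0


def commonness(alias, entity, stats=ALIAS_STATS):
    """Fraction of the alias's linked occurrences that refer to this entity."""
    s = stats.get(alias)
    if not s:
        return 0.0
    return s["entities"].get(entity, 0) / sum(s["entities"].values())


def best_entity(alias, stats=ALIAS_STATS):
    """Most common entity for the alias and its (assumed) segment score."""
    s = stats.get(alias)
    if not s:
        return None, 0.0
    entity = max(s["entities"], key=s["entities"].get)
    return entity, keyphraseness(alias, stats) * commonness(alias, entity, stats)


def link_query(query, stats=ALIAS_STATS):
    """Dynamic programming over all segmentations, maximizing the summed score."""
    tokens = query.lower().split()
    n = len(tokens)
    best = [(0.0, [])] + [None] * n  # best[i]: (score, segments) for tokens[:i]
    for end in range(1, n + 1):
        for start in range(end):
            segment = " ".join(tokens[start:end])
            entity, score = best_entity(segment, stats)
            total = best[start][0] + score
            if best[end] is None or total > best[end][0]:
                best[end] = (total, best[start][1] + [(segment, entity)])
    return best[n][1]
```

On "brad pitt height" the two-word segment wins over linking "brad" and "pitt" separately because its keyphraseness and commonness are both high; "height" stays unlinked and can be read as the intent.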

Results: effectiveness


Significant improvement over external baselines and internal system› Measured on public Webscope dataset Yahoo Search Query Log to Entities

Search over Bing, top Wikipedia result

State-of-the-art in literature

A trivial search engine over Wikipedia

Our method: Fast Entity Linker (FEL)

FEL + context

http://webscope.sandbox.yahoo.com/catalog.php?datatype=l&did=66




Two orders of magnitude faster than state-of-the-art› Simplifying assumptions at scoring time› Adding context independently› Dynamic pruning

Small memory footprint› Compression techniques, e.g. 10x reduction in word2vec storage

Results: efficiency

Related entity recommendations

Some users are short on time › Need for direct answers› Query expansion, question-answering, information boxes, rich results…

Other users have time on their hands› Long term interests such as sports, celebrities, movies and music› Long running tasks such as travel planning

Example user sessions

Spark system for related entity recommendations

Machine learned ranking

Features from the Knowledge Graph and large-scale text sources› Unary

• Popularity features from text: probability, entropy, wiki id popularity …

• Graph features: PageRank on the entity graph, wikipedia, web graph

• Type features: entity type

› Binary

• Co-occurrence features from text: conditional probability, joint probability …

• Graph features: common neighbors …

• Type features: relation type

Regression model using Gradient Boosted Decision Trees (GBDT)› Trained on editorial data (cf. clicks)
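A few of the binary features listed above can be computed as follows. This is a sketch of feature extraction only: the session data and the graph are hypothetical, the definitions are the plain textbook ones, and the GBDT training step is not shown.

```python
def cooccurrence_features(sessions, a, b):
    """Joint and conditional probability of entity b co-occurring with entity a.

    `sessions` is a list of sets of entity identifiers seen together.
    """
    n = len(sessions)
    n_a = sum(1 for s in sessions if a in s)
    n_ab = sum(1 for s in sessions if a in s and b in s)
    return {
        "joint_probability": n_ab / n if n else 0.0,
        "conditional_probability": n_ab / n_a if n_a else 0.0,  # P(b | a)
    }


def common_neighbors(graph, a, b):
    """Number of entities linked to both a and b in the entity graph."""
    return len(graph.get(a, set()) & graph.get(b, set()))


# Hypothetical data for illustration.
SESSIONS = [
    {"Brad_Pitt", "Angelina_Jolie"},
    {"Brad_Pitt", "Fight_Club"},
    {"George_Clooney"},
    {"Brad_Pitt", "Angelina_Jolie", "George_Clooney"},
]
GRAPH = {
    "Brad_Pitt": {"Fight_Club", "Oceans_Twelve"},
    "George_Clooney": {"Oceans_Twelve", "ER"},
}
```

Each candidate pair (query entity, related entity) becomes one feature vector of this kind, and the regression model learns to map it to an editorial relevance grade.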


Evaluation

1. 10-fold cross-validation

2. Side-by-side testing› More appropriate for judging sets of results

• “Blondie and Mickey Gilley are 70’s performers and do not belong on a list of 60’s musicians.”

3. Online evaluation (bucket testing)› Small % of search traffic redirected to test system, another small % to the baseline system› Data collection over at least a week, looking for statistically significant differences that are also stable over time› Metrics

• Coverage and Click-through Rate (CTR)

• Searches per browser-cookie (SPBC)

• Other key metrics should not be impacted negatively, e.g. Abandonment and retry rate, Daily Active Users (DAU), Revenue Per Search (RPS), etc.
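One common way to check whether a CTR difference between two buckets is statistically significant is a two-proportion z-test. The sketch below assumes that test; the slides do not say which test was used, and it treats each impression as an independent trial, which is itself a simplification.

```python
import math


def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Compare the CTR of bucket B against bucket A.

    Returns (z, two_sided_p_value). A positive z means B has the higher CTR.
    """
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        return 0.0, 1.0  # no clicks at all, or all clicks: nothing to compare
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

Running the test separately on each day of the collection period is one simple way to see whether a difference is stable over time rather than driven by a single day.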

Click-through rate (CTR) before and after the new system

Before release:

Gradually degrading performance due to lack of fresh data

After release:

Learning effect: users are starting to use the tool again

What’s next?

Summary

Information Retrieval› Reached the limits of the ad-hoc text retrieval paradigm› Needs to go beyond syntactic representations

Semantic Web› Provides means for knowledge representation and reasoning across the Web› Adoption has been slow, but picking up steadily

Applications in Web Search› Entity-based experiences

• Rich results, information boxes and related entities

› Question-answering


Search needs even more intelligence

Representation› Modeling the World, not just what is on the Web› Modeling personal information and preferences› Modeling of intents (actions that can be taken on the World)

Understanding› Need better understanding of context› User profile, history and current state

Retrieval› (Guided) interaction› Predictive search

Q&A

Many thanks to members of the Semantic Search team at Yahoo Labs London and to Yahoos around the world

Contact me› [email protected]› @pmika› http://www.slideshare.net/pmika/
