Making the Web Searchable Peter Mika Senior Researcher and Data Architect Yahoo! Inc.

Peter Mika, "Making the web searchable"


DESCRIPTION

Scientific and technical seminar "Web semantics: teaching search robots to 'understand' texts" at the St. Petersburg office of Yandex, October 9, 2012. Peter Mika, Senior Researcher, Yahoo! Research, Barcelona.


Page 1: Питер Мика "Making the web searchable"

Making the Web Searchable

Peter Mika

Senior Researcher and Data Architect

Yahoo! Inc.

Page 2: Питер Мика "Making the web searchable"

- 2 -

Agenda

•  Web Directions
   –  Convergence of Search and Online Media
•  Semantic technologies (th)at work
   –  Semantics for search
      •  RDFa, microdata
   –  Semantics for data integration
      •  RDF, OWL, SPARQL
•  Take home: use what works!

Page 3: Питер Мика "Making the web searchable"

More than just ten blue links

Page 4: Питер Мика "Making the web searchable"

- 4 -

It used to be pretty simple…

Page 5: Питер Мика "Making the web searchable"

- 5 -

Yahoo! today is a global network of online media sites

Page 6: Питер Мика "Making the web searchable"

- 6 -

Information box with content from and links to Yahoo! Travel

... with search as an important entry point to content

Points of interest in Vienna, Austria

Since Aug, 2010, ‘regular’ search results are ‘Powered by Bing’

Faceted search for Shopping results

Information from the Knowledge Graph

Page 7: Питер Мика "Making the web searchable"

- 7 -

Conversely, online media as an entry point to search

Hovering over an underlined phrase triggers a search for related news items.

Page 8: Питер Мика "Making the web searchable"

- 8 -

Aggregation across space: hyperlocal pages

Hyperlocal: showing content from across Yahoo that is relevant to a particular neighbourhood.

Page 9: Питер Мика "Making the web searchable"

- 9 -

Aggregation across entity types: special events

Page 10: Питер Мика "Making the web searchable"

- 10 -

Personalization

Yahoo’s Content Optimization Relevance Engine (CORE) technology uses machine learning to predict click behavior based on user profile

Display advertising is also personalized by default. Users can opt out of behavioral targeting through AdChoices.

Page 11: Питер Мика "Making the web searchable"

- 11 -

Contextualization

Show related content

Social discovery: connect with friends watching the same

Page 12: Питер Мика "Making the web searchable"

- 12 -

Convergence of search and online media

•  Complex answers in search
   –  Using structured data, not just text
   –  Search over owned content and the best of the Web
•  Aggregation
   –  Content aggregation around events, persons, other entities
   –  From creating topic pages to creating entire new websites
•  Personalization and contextualization
   –  Understand user interests at a fine-grained level
   –  Build and carry user profiles across search and media
•  Common to these is a need for a more advanced understanding of the Web and our content

Page 13: Питер Мика "Making the web searchable"

Semantic technologies for Search

Page 14: Питер Мика "Making the web searchable"

- 14 -

Search is really fast, without necessarily being intelligent

Page 15: Питер Мика "Making the web searchable"

- 15 -

State of Search

•  Improvements in search are harder and harder to come by
   –  Machine learning using hundreds of signals
      •  From text to the web graph
   –  Heavy investment in computational power
      •  e.g. real-time indexing and instant search
•  Remaining challenges are not computational, but in modeling human understanding
   –  A machine is intelligent if it reasons and acts the way we would
   –  But could Watson explain why the answer is Toronto?
•  How do we teach the computer about our world?
   –  How do we give meaning to documents and data?

Page 16: Питер Мика "Making the web searchable"

- 16 -

Not just search…

Page 17: Питер Мика "Making the web searchable"

- 17 -

What is it like to be a machine?

Roi Blanco

Page 18: Питер Мика "Making the web searchable"

- 18 -

What is it like to be a machine?

ë✜Θ♬♬ţğ

ë✜Θ♬♬ţğ√∞ñ§®ÇĤĪ✜★¤♬☐✓✓ ţğ★¤✜èééééñ

u✪✚✜ΔΤΟŨŸÏĞÊϖυτρ℠≠⅛⌫¤Γ ≠=⅚©§★✓♪ΒΓΕññ¤℠

¢✖Γ♫⅜±⏎↵⏏v☐ģğğğμλκσςτn nnnu⏎ñ⌥°¶§ΥΦΦΦ✗✕☐vuwwwww

Page 19: Питер Мика "Making the web searchable"

- 19 -

If machines are dumb, how to make their job easier?

•  HTML is intended for human consumption
   –  A mix of text, data and styling
•  Let’s make it easier to process for machines
   –  Languages to publish data in HTML
•  Agree between publishers and search engines on the meaning of certain symbols (ontologies)
   –  e.g. ⏎⅙¥ means that this page describes a Person
   –  Annotate HTML pages using these symbols
   –  (This is just an example… the actual markup is human readable)
•  For data in particular, agree on what the types of objects are in the world, and what their attributes are
   –  e.g. between §℗ and §⌥⌘ is the age of the Person
•  Leverage this understanding for more precise matching and ranking

Page 20: Питер Мика "Making the web searchable"

- 20 -

Semantic Web

•  Publish information in a way that is easier to process for machines
•  Web of Data instead of Web of Documents
•  Two main architectural challenges
   –  A common format for sharing data
   –  Sharing the meaning of data
      •  Through social means (shared schemas)
      •  By using powerful schema languages
•  Semantic Web standards from W3C
   –  Languages (RDF, OWL, RIF)
   –  Serializations (RDF/XML, RDFa)
   –  Protocols (SPARQL, HTTP)
•  Semantic Web research into knowledge representation and reasoning, data integration, data quality and many other topics
•  Community efforts to publish data and develop schemas

Page 21: Питер Мика "Making the web searchable"

- 21 -

Resource Description Framework (RDF)

•  Each resource (thing, entity) is identified by a URI
   –  Globally unique identifiers
•  RDF represents knowledge as a set of triples
   –  Each triple is a single fact about the entity (an attribute or a relationship)
•  A set of triples forms an RDF graph

[Figure: RDF document in which example:roi has name “Roi Blanco” and type foaf:Person]
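
The two facts on this slide can be written down as subject-predicate-object triples. The following is a minimal Python sketch, not part of the original deck: the graph is a plain set of tuples, the URI chosen for `example:roi` is an assumed expansion (the slide only shows the short name), and `objects` is a hypothetical helper.

```python
# Namespace URIs for FOAF and rdf:type are the published ones;
# ROI is an ASSUMED expansion of the slide's short name "example:roi".
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ROI = "http://example.org/roi"

# An RDF graph is a set of (subject, predicate, object) triples.
graph = {
    (ROI, FOAF + "name", "Roi Blanco"),   # an attribute
    (ROI, RDF_TYPE, FOAF + "Person"),     # a relationship to a class
}


def objects(graph, subject, predicate):
    """Hypothetical helper: all objects for a given subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


if __name__ == "__main__":
    print(objects(graph, ROI, FOAF + "name"))
```

Because every subject and predicate is a globally unique URI, two such sets from different sites can simply be unioned to form a larger graph.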

Page 22: Питер Мика "Making the web searchable"

- 22 -

Linking across the Web

[Figure: RDF graph linking data across the Web. Nodes: example:roi (name “Roi Blanco”, type foaf:Person), #roi2, #peter (email [email protected], type). Edges: sameAs, worksWith, knows. Sources labelled: Roi’s homepage, Yahoo!’s website, Friend-of-a-Friend ontology.]

Page 23: Питер Мика "Making the web searchable"

- 23 -

History of metadata in HTML

•  1995: HTML meta tags
•  1998: RDF/XML
   –  RDF/XML in HTML
   –  RDF linked from HTML
•  2003: Web 2.0
   –  Tagging, machine tags
   –  Microformats
•  2005: eRDF
•  2008: RDFa 1.0
•  2011: RDFa 1.1, Microdata

Page 24: Питер Мика "Making the web searchable"

- 24 -

HTML meta tags

<HTML>
<HEAD profile="http://dublincore.org/documents/dcq-html/">
  <META name="DC.author" content="Peter Mika">
  <LINK rel="DC.rights copyright" href="http://www.example.org/rights.html" />
  <LINK rel="meta" type="application/rdf+xml" title="FOAF" href="http://www.cs.vu.nl/~pmika/foaf.rdf">
</HEAD>
…
</HTML>
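
To illustrate how a consumer might read this kind of header metadata, here is a small Python sketch using only the standard library's `html.parser`. The class name `MetaCollector` and the shortened `PAGE` string are illustrative assumptions, not something from the talk.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta name=... content=...> pairs and <link rel=... href=...>."""

    def __init__(self):
        super().__init__()
        self.meta = {}    # e.g. {"DC.author": "Peter Mika"}
        self.links = []   # list of (rel tokens, href)

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        a = dict(attrs)
        if tag == "meta" and "name" in a:
            self.meta[a["name"]] = a.get("content", "")
        elif tag == "link" and "rel" in a:
            self.links.append((a["rel"].split(), a.get("href")))


# Shortened copy of the slide's markup.
PAGE = """<HTML><HEAD profile="http://dublincore.org/documents/dcq-html/">
<META name="DC.author" content="Peter Mika">
<LINK rel="DC.rights copyright" href="http://www.example.org/rights.html" />
</HEAD></HTML>"""

collector = MetaCollector()
collector.feed(PAGE)

if __name__ == "__main__":
    print(collector.meta)
    print(collector.links)
```

Note that the meaning of `DC.author` is only known by convention (the profile URI); nothing in the syntax itself ties it to a shared schema.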

Page 25: Питер Мика "Making the web searchable"

- 25 -

Microformats (µf)

•  Agreements on the way to describe certain objects in HTML (persons, events, recipes…)
   –  Reuse of semantic-bearing HTML elements, e.g. class
   –  Based on existing standards, e.g. hCard
   –  Minimal: small number of types, most common attributes
•  Community centered around microformats.org
   –  Centralized process, but not a formal standards body
   –  Wiki for specifications, mailing list

Page 26: Питер Мика "Making the web searchable"

- 26 -

Example: the hCard microformat

<cite class="vcard">
  <a class="fn url" rel="friend colleague met" href="http://meyerweb.com/">Eric Meyer</a>
</cite>
wrote a post (<cite><a href="http://meyerweb.com/eric/thoughts/2005/12/16/tax-relief/">Tax Relief</a></cite>)
about an unintentionally humorous letter he received from the
<span class="vcard">
  <a class="fn org url" href="http://irs.gov/">Internal Revenue Service</a>
</span>.

<div class="vcard">
  <a class="email fn" href="mailto:[email protected]">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>
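
As a rough illustration of what extracting an hCard involves, the sketch below pulls a few text-valued properties out of the second example. It is a deliberately simplified Python sketch, not a conforming microformats parser: it only handles the assumed subset `fn`, `tel` and `title`, ignores properties whose value lives in an attribute (such as `email` or `url`), and assumes well-nested markup without void elements.

```python
from html.parser import HTMLParser

TEXT_PROPS = {"fn", "tel", "title"}  # assumed subset of hCard properties


class HCardParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.cards = []          # finished cards, one dict per vcard
        self._stack = []         # class tokens of each open element
        self._current = None     # card being filled, if inside a vcard
        self._card_depth = 0     # stack depth at which the vcard opened

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self._stack.append(classes)
        if "vcard" in classes and self._current is None:
            self._current = {}
            self._card_depth = len(self._stack)

    def handle_endtag(self, tag):
        if not self._stack:
            return
        # Closing the element that opened the vcard finishes the card.
        if self._current is not None and len(self._stack) == self._card_depth:
            self.cards.append(self._current)
            self._current = None
        self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._current is not None and self._stack:
            for name in self._stack[-1]:
                if name in TEXT_PROPS:
                    self._current.setdefault(name, text)


# Copy of the slide's second example; the href is left out because the
# address is redacted in the source.
SAMPLE = """<div class="vcard">
  <a class="email fn">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>"""

parser = HCardParser()
parser.feed(SAMPLE)

if __name__ == "__main__":
    print(parser.cards)
```

The property names are hard-coded in `TEXT_PROPS`, so a different microformat (events, recipes) would need its own extraction code.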

Page 27: Питер Мика "Making the web searchable"

- 27 -

Microformats: limitations

•  Syntax shared with HTML
   –  You need to implement extraction for each microformat separately
•  Lack of formal schemas
   –  Limited reuse, extensibility of schemas
   –  Unclear which combinations are allowed
•  Lack of a datatype system
•  No unique identifiers (URIs)
   –  No linking, e.g. sameAs
•  Always appears in the HTML <body>
   –  Not always clear how it relates to the main topic of the page
•  Instability
   –  Everything is a draft…
   –  Varying degrees of support

Page 28: Питер Мика "Making the web searchable"

- 28 -

RDFa

•  W3C recommendation for embedding RDF data in HTML
   –  A set of new HTML attributes to be used in head or body
   –  A specification of how to extract the data from these attributes
   –  RDFa is just a syntax, you have to choose (or create) a vocabulary separately
•  Addresses the limitations of microformats
   –  Syntax different from HTML
   –  Semantic Web schema languages (reuse, extend schemas)
   –  Unique identifiers for objects (interlinking, sameAs)
   –  Markup in head or body
•  Alternative to publishing data as RDF/XML (Linked Data)
   –  Search engine friendly
•  See also
   –  http://rdfa.info/

Page 29: Питер Мика "Making the web searchable"

- 29 -

RDFa evolution

•  RDFa 1.0 has been a W3C Recommendation since October, 2008
•  RDFa 1.1 is a small update on RDFa to reduce complexity, make it compatible with HTML5
   –  Recommendation (June 7, 2012)
   –  Updated version of the RDFa Primer (June 7, 2012)
   –  HTML+RDFa Working Draft (Sept 11, 2012)
•  New in RDFa 1.1
   –  New vocab attribute to define the default namespace for the document or subtree
   –  The prefix attribute as a recommended replacement of xmlns
   –  You can use URIs even where only CURIEs were allowed before
•  RDFa API for accessing RDFa data in a webpage in the browser from JavaScript
   –  Currently Working Draft (April 19, 2011)
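
The effect of the `prefix` and `vocab` attributes can be sketched as a small term-expansion routine. This is a simplified Python illustration under stated assumptions, not the full RDFa 1.1 processing rules: `parse_prefix_attr` and `expand` are hypothetical helper names, and inherited context, reserved terms and initial-context prefixes are all ignored.

```python
def parse_prefix_attr(value):
    """Parse a prefix attribute such as 'og: http://ogp.me/ns#' into a dict."""
    tokens = value.split()
    mapping = {}
    for i in range(0, len(tokens) - 1, 2):
        name, uri = tokens[i], tokens[i + 1]
        if name.endswith(":"):
            mapping[name[:-1]] = uri
    return mapping


def expand(term, prefixes, vocab=None):
    """Expand a CURIE or bare term to a full URI where possible."""
    if "://" in term:
        return term                      # already a full URI
    if ":" in term:
        prefix, reference = term.split(":", 1)
        if prefix in prefixes:
            return prefixes[prefix] + reference
        return term                      # unknown prefix, leave untouched
    if vocab:
        return vocab + term              # bare term resolved against vocab
    return term


prefixes = parse_prefix_attr("og: http://ogp.me/ns# foaf: http://xmlns.com/foaf/0.1/")

if __name__ == "__main__":
    print(expand("og:title", prefixes))
    print(expand("name", prefixes, vocab="http://schema.org/"))
```

The first branch reflects the last "New in RDFa 1.1" point: a full URI is accepted where a CURIE would otherwise be expected.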

Page 30: Питер Мика "Making the web searchable"

- 30 -

RDFa intro: metadata in the header

•  More info in the
<html prefix="og: http://ogp.me/ns#">
<head>
  <title>The Trouble with Bob</title>
  <meta property="og:title" content="The Trouble with Bob" />
  <meta property="og:type" content="text" />
  <meta property="og:image" content="http://example.com/alice/bob-ugly.jpg" />
  ...
</head>

Page 31: Питер Мика "Making the web searchable"

- 31 -

RDFa intro: links with a flavor

•  More info in the
All content on this site is licensed under
<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">a Creative Commons License</a>.

Page 32: Питер Мика "Making the web searchable"

- 32 -

RDFa links: talking about subjects other than the page

•  More info in the
The trouble with Bob is that he takes much better photos than me:
<div about="http://example.com/bob/photos/sunset.jpg">
  <img src="http://example.com/bob/photos/sunset.jpg" />
  <span property="og:title">Beautiful Sunset</span>
  by <span property="dc:creator">Bob</span>.
</div>

Page 33: Питер Мика "Making the web searchable"

- 33 -

RDFa links: talking about subjects other than the page

•  More info in the

<div typeof="foaf:Person">
  <p property="foaf:name">Alice Birpemswick</p>
  <p>Email: <a rel="foaf:mbox" href="mailto:[email protected]">[email protected]</a></p>
  <p>Phone: <a rel="foaf:phone" href="tel:+1-617-555-7332">+1 617.555.7332</a></p>
</div>

Page 34: Питер Мика "Making the web searchable"

- 34 -

The process of annotating with RDFa

•  Find a vocabulary that fits your needs and is supported by your consumers
   –  A vocabulary describes a set of types and attributes within a given domain
   –  If you don’t find a good candidate, extend an existing one or create a new one
•  Annotate your page
   –  Before you start, you might want to validate your page for (X)HTML conformance using the W3C’s (X)HTML Validator to reduce the chance of errors. Choose Document Type XHTML + RDFa.
   –  Use an HTML or XML editor that supports DTDs, or an RDFa editor such as RDFaCE
   –  Use the RDFa Distiller to check which data can be extracted from your page.
   –  If you fancy, use the RDF Validator to graphically visualize the RDF graph that is output.
•  Put the annotated page online
   –  The data will be extracted by your favorite search engine the next time your page is crawled and indexed
   –  The data will be available to browser extensions, bookmarklets etc.
•  See http://rdfa.info/rdfa-implementations for new tools and APIs

Page 35: Питер Мика "Making the web searchable"

- 35 -

Example: Yahoo! Enhanced Results (was: SearchMonkey)

•  First major adopter of RDFa
   –  Launched in May, 2008
•  Guide for publishers to mark up their pages for common types of objects
   –  Product, Local, News, Video, Events, Documents, Discussion, Games
•  Using popular microformats and RDF vocabularies
   –  Copy-paste code
   –  Validator
•  Yahoo as a consumer
   –  Enhanced Results

Page 36: Питер Мика "Making the web searchable"

- 36 -

Example: Google’s Rich Snippets

•  Launched in May, 2009
•  Google encourages publishers to use popular microformats and its own RDFa vocabulary
   –  data-vocabulary.org
•  Validator to check if the markup is correct
•  Google displays enhanced results based on this metadata
   –  Rich Snippets

Page 37: Питер Мика "Making the web searchable"

- 37 -

Example: Facebook’s Like and the Open Graph Protocol

•  Launched April, 2010
•  The ‘Like’ button provides publishers with a way to promote their content on Facebook and build communities
   –  Shows up in profiles and news feed
   –  Site owners can later reach users who have liked an object
   –  Facebook Graph API allows 3rd party developers to access the data
•  Open Graph Protocol is an RDFa-based format for describing the object that the user ‘Likes’

Page 38: Питер Мика "Making the web searchable"

- 38 -

Example: Facebook’s Open Graph Protocol

•  RDF vocabulary to be used in conjunction with RDFa
   –  Simplify the work of developers by restricting the freedom in RDFa
•  Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment
•  Only HTML <head> accepted

<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="movie" />
  <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
  <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
  …
</head>
...

Page 39: Питер Мика "Making the web searchable"

- 39 -

Example: rNews

•  RDFa vocabulary for news articles
   –  Easier to implement than NewsML
   –  Easier to consume for news search and other readers, aggregators
•  Under development at the IPTC
   –  Version 0.5

Page 40: Питер Мика "Making the web searchable"

- 40 -

Microdata

•  Developed by the HTML5 working group at the W3C
   –  RDFa was perceived as too complex and thus error prone
•  Currently a companion document to HTML5 (working draft)
•  Incompatible with RDFa

<div itemscope itemid="http://www.yahoo.com/resource/person">
  <p>My name is <span itemprop="name">Neil</span>.</p>
  <p>My band is called <span itemprop="band">Four Parts Water</span>.
  I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
  <img itemprop="image" src="me.png" alt="me"> </p>
</div>
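
To show how the microdata above turns into name/value pairs, here is a minimal Python sketch using the standard library's `html.parser`. It is an illustration under stated assumptions rather than a conforming microdata parser: it handles a single, non-nested item; it takes the value of a `time` element from `datetime` and of an `img` from `src`, and otherwise uses the first text node; the class name and the `"@id"` key are invented for the example. The sample is a cleaned-up copy of the slide's markup with straight quotes and a closed `</div>`.

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.item = {}          # property name -> value
        self._pending = None    # itemprop waiting for its text content

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "itemid" in a:
            self.item["@id"] = a["itemid"]   # "@id" is an illustrative key
        prop = a.get("itemprop")
        if prop is None:
            return
        if tag == "time" and "datetime" in a:
            self.item[prop] = a["datetime"]  # machine-readable date
        elif tag == "img":
            self.item[prop] = a.get("src")   # value is the image URL
        else:
            self._pending = prop             # value is the element's text

    def handle_data(self, data):
        if self._pending is not None and data.strip():
            self.item[self._pending] = data.strip()
            self._pending = None


SAMPLE = """<div itemscope itemid="http://www.yahoo.com/resource/person">
<p>My name is <span itemprop="name">Neil</span>.</p>
<p>My band is called <span itemprop="band">Four Parts Water</span>.
I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
<img itemprop="image" src="me.png" alt="me"> </p>
</div>"""

md = MicrodataParser()
md.feed(SAMPLE)

if __name__ == "__main__":
    print(md.item)
```

Unlike the microformats case, the property names come from the markup itself, so the same extraction code works for any vocabulary.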

Page 41: Питер Мика "Making the web searchable"

- 41 -

Competing formats, competing schemas

•  Multiple incompatible formats: microformats, RDFa, microdata
   –  Varying degrees of adoption
   –  Not all formats are supported by all search engines
•  Multiple competing schemas (ontologies)
   –  Different schemas for marking up the same information (RDFa and microdata)
      •  Major search engines support different existing alternatives or create their own (Google, Facebook)
   –  Not clear which schemas have adoption, who is responsible for maintaining them
   –  Slow convergence

Page 42: Питер Мика "Making the web searchable"

- 42 -

schema.org

•  Agreement on a shared set of schemas for common types of web content
   –  Bing, Google, and Yahoo! as initial founders (June, 2011)
   –  Similar in intent to sitemaps.org
•  Use a single format to communicate the same information to all three search engines
•  schema.org covers areas of interest to all search engines
   –  Business listings (local), creative works (video), recipes, reviews

Page 43: Питер Мика "Making the web searchable"

- 43 -

schema.org evolution

•  Yandex joins schema.org in Nov, 2011
   –  Yandex.Slovari, Yandex.Spravochnik, Yandex.Kartinki, Yandex.Video
•  RDFa Lite 1.1
   –  Subset of the features of RDFa 1.1
   –  W3C Recommendation since June, 2012
•  Two W3C task forces within the SW Interest Group (SWIG)
   –  Web schemas TF for ongoing collaborations on schema extensions, mappings, tooling etc.
      •  schema.org discussions are at [email protected]
   –  HTML Data TF finished in December, 2011
      •  HTML Data Guide
      •  Microdata RDF: Transformation from HTML+Microdata to RDF
•  Growing number of 3rd party contributions
   –  rNews (news)
   –  GoodRelations (e-commerce)
   –  Health and Life Sciences
   –  Technical Publishing

Page 44: Питер Мика "Making the web searchable"

- 44 -

Documentation and OWL ontology

Page 45: Питер Мика "Making the web searchable"

- 45 -

Current state of semantic search

•  Limited usage in commercial search engines
   –  Enhanced results
   –  Faceted search
      •  Google’s Recipe Search
   –  Navigation to related entities
      •  Yahoo’s Vertical Intent Search
•  Positive SEO effects
   –  Enhanced results are clicked more
   –  Enhanced results help users find relevant results
•  Increased adoption of data markup

Page 46: Питер Мика "Making the web searchable"

- 46 -

Semantic Search development

•  Research
   –  RDF indexing and ranking
   –  Searching over annotated web pages
   –  Search result summarization
   –  Question answering
   –  Task completion
   –  Semantic log analysis
•  Prototype ‘pure’ RDF search engines
   –  Sindice and Sig.ma from DERI

Page 47: Питер Мика "Making the web searchable"

- 47 -

Current state of metadata on the Web

•  31% of webpages, 5% of domains contain some metadata
   –  Analysis of the Bing Crawl (US crawl, January, 2012)
   –  RDFa is most common format
      •  By URL: 25% RDFa, 7% microdata, 9% microformat
      •  By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat
   –  Adoption is stronger among large publishers
      •  Especially for RDFa and microdata
•  See also
   –  P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012
   –  H. Mühleisen, C. Bizer. Web Data Commons - Extracting Structured Data from Two Large Web Corpora, LDOW 2012

Page 48: Питер Мика "Making the web searchable"

- 48 -

Exponential growth in RDFa data

Percentage of URLs with embedded metadata in various formats

Five-fold increase between March, 2009 and October, 2010

Another five-fold increase between October 2010 and January, 2012
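
The two five-fold increases compound. A short Python sketch of the arithmetic follows; the month counts are an assumption made here (each date taken at month granularity), and `monthly_rate` is a hypothetical helper, so the implied rates are only indicative.

```python
# Growth factors as stated on the slide.
first_period = 5.0     # March 2009 -> October 2010
second_period = 5.0    # October 2010 -> January 2012

# ASSUMED period lengths in months, taking each date at month granularity.
months_first = 19
months_second = 15

# Two successive five-fold increases multiply.
overall = first_period * second_period


def monthly_rate(factor, months):
    """Constant monthly growth rate that yields `factor` after `months`."""
    return factor ** (1.0 / months) - 1.0


if __name__ == "__main__":
    print("overall factor:", overall)
    print("implied monthly rate, first period:", monthly_rate(first_period, months_first))
    print("implied monthly rate, second period:", monthly_rate(second_period, months_second))
```

Under these assumptions the same factor was reached in fewer months the second time, so the implied monthly rate is higher in the second period.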

Page 49: Питер Мика "Making the web searchable"

Semantic technologies for Data Integration

Page 50: Питер Мика "Making the web searchable"

- 50 -

Today’s world is a Web of Pages

Page 51: Питер Мика "Making the web searchable"

- 51 -

All these pages come from structured knowledge about people, places, and things

[Figure: graph of entities (MLB team, Chicago Cubs, Chicago, Barack Obama, Carlos Zambrano, 10% off tickets) connected by relations (is a, plays for, plays in, from, for)]

Page 52: Питер Мика "Making the web searchable"

- 52 -

This underlying world is WOO—the Web of Objects

[Figure: the same graph of entities (MLB team, Chicago Cubs, Chicago, Barack Obama, Carlos Zambrano, 10% off tickets) and relations (is a, plays for, plays in, from, for)]

Page 53: Питер Мика "Making the web searchable"

- 53 -

Today our knowledge of this world is siloed, incomplete, inconsistent, inaccurate, and hard to reuse

[Figure: graph of entities (MLB team, Chicago Cubs, Chicago, Scott Roy, Carlos Zambrano, 10% off tickets) and relations (is a, plays for, plays in, from, for), split across silos labelled Sports, Entertainment, Finance, Local, Shopping, Upcoming]

Page 54: Питер Мика "Making the web searchable"

- 54 -

Our vision is a single shared knowledge base—accurate, scalable, and easy to reuse

[Figure: a single graph of entities (MLB team, Chicago Cubs, Chicago, Barack Obama, Carlos Zambrano, 10% off tickets) and relations (is a, plays for, plays in, from, for)]

Page 55: Питер Мика "Making the web searchable"

- 55 -

Knowledge comes from many sources

[Figure: grid with entities on one axis and attributes on the other. Labels: “Show times and other information for US movies from source B”; entity “Harry Potter and the Deathly Hallows part II”; attribute “Show times”; cell “Show times for Harry Potter and the Deathly Hallows part II”]

Page 56: Питер Мика "Making the web searchable"

- 56 -

Combining these requires working with complementary, parallel, and overlapping sources

[Figure: grid of attributes by entities with overlapping regions: “Cast information for global movies from Wikipedia”; “Cast information for US movies from source A”; “Cast and show time information for global movies from licensed feeds”]

Page 57: Питер Мика "Making the web searchable"

- 57 -

There is a tremendous opportunity to do this directly from Web pages, reverse engineering the Web

[Figure: grid of attributes by entities, covered by “Information from structured data extraction on billions of Web pages”]

Page 58: Питер Мика "Making the web searchable"

- 58 -

Semantic technologies for data integration

•  Semantic Web provides the basic technologies for Linked Data
   –  URIs as unique identifiers
      •  Retrieve data from the (internal) web
      •  Follow links in the data that is returned
   –  RDF as a common data format
   –  OWL as a powerful schema language for validation and reasoning
   –  SPARQL for queries, reasoning and transformations
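
The "retrieve, then follow links" pattern behind Linked Data can be sketched without any network access. In the Python sketch below, `WEB` is a hypothetical in-memory stand-in for dereferencing URIs, the URIs and property names are invented for illustration, and `crawl` is a plain breadth-first traversal rather than any particular production crawler.

```python
from collections import deque

# HYPOTHETICAL in-memory "web": dereferencing a URI returns triples about it.
WEB = {
    "http://example.org/roi": [
        ("http://example.org/roi", "worksWith", "http://example.org/peter"),
    ],
    "http://example.org/peter": [
        ("http://example.org/peter", "name", "Peter Mika"),
    ],
}


def fetch(uri):
    """Stand-in for an HTTP GET that returns RDF triples for a URI."""
    return WEB.get(uri, [])


def crawl(start, fetch, max_hops=2):
    """Collect triples reachable from `start` by following URI-valued objects."""
    seen = {start}
    triples = []
    queue = deque([(start, 0)])
    while queue:
        uri, hops = queue.popleft()
        for s, p, o in fetch(uri):
            triples.append((s, p, o))
            is_link = isinstance(o, str) and o.startswith("http://")
            if is_link and hops < max_hops and o not in seen:
                seen.add(o)
                queue.append((o, hops + 1))
    return triples


result = crawl("http://example.org/roi", fetch)

if __name__ == "__main__":
    for triple in result:
        print(triple)
```

Because identifiers are URIs, the object of one triple doubles as the address of more data, which is what lets the traversal move from one source to the next.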

Page 59: Питер Мика "Making the web searchable"

- 59 -

Components

•  Data is ingested from web extraction, feeds, editorial content (billions of objects)
•  Data integration using Hadoop clusters
   –  Schema matching to the WOO ontology
   –  Object reconciliation
   –  Blending
•  Data quality assessment
•  Information extraction
   –  Text, e.g. news content
   –  Webpages
•  Enrichment
   –  Feature computation based on user behavior, social signals and web content
•  Serving and ranking
   –  Selecting the right objects to show by query, user, geography etc.
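
Object reconciliation and blending can be pictured with a toy example. The Python sketch below is not Yahoo's pipeline (which, per the slide, runs on Hadoop clusters); it is an in-memory illustration under simple assumptions: records are matched on a normalized name, and blending keeps the first non-empty value seen for each attribute. The attribute values are placeholders, not real data.

```python
def normalize(name):
    """Crude matching key: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


def reconcile_and_blend(records):
    """Group records describing the same object and merge their attributes."""
    merged = {}
    for record in records:
        key = normalize(record["name"])          # reconciliation step
        target = merged.setdefault(key, {})
        for attr, value in record.items():       # blending step
            if value is not None and attr not in target:
                target[attr] = value              # first source wins
    return merged


# Placeholder records standing in for two complementary sources.
records = [
    {"name": "Harry Potter and the Deathly Hallows part II",
     "cast": ["<cast from source A>"]},
    {"name": "harry potter and the  deathly hallows part ii",
     "show_times": ["<show times from source B>"]},
]

merged = reconcile_and_blend(records)

if __name__ == "__main__":
    for key, obj in merged.items():
        print(key, sorted(obj))
```

Real systems need much more robust matching than a normalized name, and a policy for conflicting values; "first source wins" is only the simplest possible choice.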

Page 60: Питер Мика "Making the web searchable"

- 60 -

WOO ontology

•  Primary use case is data validation
   –  During information extraction and throughout the WOO platform
   –  No reasoning
•  OWL2 ontology
   –  Automatic documentation
   –  Change management
   –  Conversion to Yahoo internal schema language
   –  Protégé OWL as editorial tool
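
Using an ontology purely for validation, with no reasoning, amounts to checking objects against declared types and properties. The Python sketch below illustrates the idea with a hypothetical two-property schema; it does not reflect the actual WOO ontology (which is internal) or OWL2 semantics.

```python
# HYPOTHETICAL schema: class name -> allowed properties and their value types.
SCHEMA = {
    "Movie": {"name": str, "releaseYear": int},
}


def validate(obj, schema=SCHEMA):
    """Return a list of validation errors; an empty list means the object is valid."""
    properties = schema.get(obj.get("type"))
    if properties is None:
        return ["unknown type: {!r}".format(obj.get("type"))]
    errors = []
    for key, value in obj.items():
        if key == "type":
            continue
        if key not in properties:
            errors.append("unknown property: {}".format(key))
        elif not isinstance(value, properties[key]):
            errors.append("{}: expected {}".format(key, properties[key].__name__))
    return errors


if __name__ == "__main__":
    print(validate({"type": "Movie", "name": "The Rock", "releaseYear": 1996}))
    print(validate({"type": "Movie", "name": "The Rock", "releaseYear": "1996"}))
```

Nothing is inferred here: an object either conforms to the declared schema or produces an error, which is the kind of check that can run at every stage of a pipeline.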

Page 61: Питер Мика "Making the web searchable"

- 61 -

WOO ontology (cont’d)

•  Covers Yahoo’s domains of interest
   –  Movies, Music, TV, Business listings, Events, Finance, Sports, Autos, …
   –  250 classes and 800 properties (Sept, 2011)
   –  Available only internally
•  Developed over 1.5 years by Yahoo’s editorial team
•  Aligned with schema.org
   –  schema.org covers only a subset of the WOO ontology

Page 62: Питер Мика "Making the web searchable"

- 62 -

Value #1 — Breadth, depth, and accuracy at scale

Real entities Dups, errors, and outdated entities

Up-to-date correct entities

Incorrect store URL

No photo

We show many entities we shouldn’t

No business hours

WOO improves our breadth, depth, and accuracy by combining knowledge from alternative sources, and by modernizing how we do matching, blending, and de-duping

Page 63: Питер Мика "Making the web searchable"

- 63 -

Value #2 — Agility launching new experiences

Answers instead of links

WOO lets us quickly create entity centric DD modules using the existing knowledge in the KB

Related knowledge in context

The integrated KB lets us show relevant knowledge from one Yahoo property on other properties and off network

Emerging markets and tail pages

The KB gets us deep into the tail by combining and blending knowledge from many sources

Page 64: Питер Мика "Making the web searchable"

- 64 -

Other potential benefits

•  Dynamic interlinking of content
   –  E.g. direct links from Yahoo! News to background information in Yahoo! Music about an artist
•  Dynamic composition of web pages
   –  Topic-entity pages
•  Better understanding of user intent
   –  Semantic analysis of query logs
   –  Semantic analysis of navigation paths
•  Exposure of Yahoo! content using standard technologies
   –  Linking to external sources to make it part of the Linked Data cloud

Page 65: Питер Мика "Making the web searchable"

- 65 -

Innovative media companies are moving in this direction

Courtesy of Silver Oliver (BBC)

Page 66: Питер Мика "Making the web searchable"

- 66 -

Innovative media companies are moving in this direction

Courtesy of Evan Sandhaus (NYT).

Page 67: Питер Мика "Making the web searchable"

- 67 -

Take home: use what works!

•  The W3C’s semantic technology stack is daunting
   –  The basics are simple:
      •  URIs for entity identifiers, RDF for data exchange
•  Standards for embedding data in HTML
   –  Useful in search and at other points of content consumption
•  Standards for expressing the meaning of data
   –  Useful in data integration
•  Do your bit!

Page 68: Питер Мика "Making the web searchable"

- 68 -

The End

•  Credits to many people from Yahoo! around the world
•  Contact me at
   –  [email protected]
   –  @pmika
