Adam Rae, Vanessa Murdock, Adrian Popescu, Hugues Bouchard
SIGIR 2012, Portland, Oregon, Entities Session

Mining the Web for Points of Interest


Page 1: Mining the Web for Points of Interest

Adam Rae, Vanessa Murdock, Adrian Popescu, Hugues Bouchard

SIGIR 2012, Portland, Oregon, Entities Session

Page 2: Mining the Web for Points of Interest

Mining the Web for Points of Interest

Using social media to increase our knowledge of the world

[Figure: example check-in message, "I’m at Adam’s Bar…"]

Page 3: Mining the Web for Points of Interest

Contents

§ Motivation

§ Point Of Interest (POI) extraction using user generated data

§ POI localisation using social media

§ Conclusions

Page 4: Mining the Web for Points of Interest

Motivation

§ Geographic Points of Interest are valuable representations of important places in the world around us.

§ Browsing and search of POIs are increasingly important
  ›  Web search
  ›  Mobile
  ›  Navigation

Page 5: Mining the Web for Points of Interest

Where do POIs come from?

§ Editing listings coming from NMAs, commercial directories etc.
  ›  Costly process
  ›  Expensive to maintain freshness
  ›  Coverage

§ Do they reflect the kind of places that people are interested in looking for?

Page 6: Mining the Web for Points of Interest

Can we get them from the web?

§ Un/semi-structured mentions of POIs throughout text on the web
  ›  Lots of context

§ Structured mentions of POIs in micro-blogging systems and Wikipedia articles
  ›  Easy to extract

Page 7: Mining the Web for Points of Interest

When is a POI not a POI?

1  The White House is at 1600 Pennsylvania Avenue, Washington DC.

2  The White House released a statement today suggesting the moon is made of cheese.

3  The people living in the white house at the end of the street turned out to be Martians.

Page 8: Mining the Web for Points of Interest

Europe According to Foursquare

Page 9: Mining the Web for Points of Interest

The World According to Foursquare

Page 10: Mining the Web for Points of Interest

The World According to Gowalla

Page 11: Mining the Web for Points of Interest

The World According to Wikipedia

Page 12: Mining the Web for Points of Interest

Can we bootstrap using social media?

§ Train Conditional Random Fields (CRF) using web snippets bootstrapped from structured mentions in micro-blog entries
  ›  Extract POI, use as query to search engine
  ›  Resultant snippets filtered to those that contain POI
  ›  Sanitise

§ Also from geocoded Wikipedia articles (according to Yago2)
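The bootstrapping steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the check-in pattern ("I'm at <name> ..."), the BIO-style labels, the helper names and the `search` callable are all assumptions, and the search engine is replaced by a stub so that the example runs offline.

```python
import re
from typing import Callable, List, Optional, Tuple

# Assumed check-in template; real services format their messages differently.
CHECK_IN = re.compile(r"^I['’]m at (?P<poi>[^(\n]+?)(?:\s*\(|\s+https?://|\s*$)")


def extract_poi(message: str) -> Optional[str]:
    """Pull the POI name out of a structured check-in message."""
    match = CHECK_IN.match(message.strip())
    if not match:
        return None
    return match.group("poi").strip() or None


def sanitise(snippet: str) -> str:
    """Strip markup remnants and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", snippet)
    return re.sub(r"\s+", " ", text).strip()


def _norm(token: str) -> str:
    """Lower-case a token and drop surrounding punctuation for matching."""
    return token.lower().strip(".,;:!?\"()")


def label_snippet(snippet: str, poi: str) -> Optional[List[Tuple[str, str]]]:
    """Return (token, tag) pairs if the snippet contains the POI, else None."""
    tokens = snippet.split()
    goal = [_norm(t) for t in poi.split()]
    if not goal:
        return None
    lowered = [_norm(t) for t in tokens]
    n = len(goal)
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == goal:
            tags = ["O"] * len(tokens)
            tags[i] = "B-POI"                # first token of the mention
            for k in range(i + 1, i + n):
                tags[k] = "I-POI"            # remaining tokens of the mention
            return list(zip(tokens, tags))
    return None                              # snippet filtered out


def bootstrap(check_ins: List[str],
              search: Callable[[str], List[str]]) -> List[List[Tuple[str, str]]]:
    """Check-in -> POI -> search snippets -> filtered, labelled training data."""
    examples = []
    for message in check_ins:
        poi = extract_poi(message)
        if poi is None:
            continue
        for raw in search(poi):
            labelled = label_snippet(sanitise(raw), poi)
            if labelled is not None:
                examples.append(labelled)
    return examples


def fake_search(query: str) -> List[str]:
    """Hypothetical stand-in for a web search API returning result snippets."""
    return [
        "was only after he had left the <b>Marriott Hotel</b> that he remembered",
        "Hotels near you",
    ]


examples = bootstrap(["I'm at Marriott Hotel (1 Example St)"], fake_search)
print(examples[0])
```

Only the first stub snippet survives the filter, because the second never mentions the POI. The labelled token sequences are the kind of input a sequence tagger could then be trained on.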

Page 13: Mining the Web for Points of Interest

Ground Truth Data

§ Created by manual assessors given explicit instructions
  ›  1,337 examples of POIs in (some) context
  ›  1,066 unique POIs
  ›  Inter-assessor agreement:

Ground Truth Assessor   Precision   Recall   F-Measure
1                       0.749       0.792    0.770
2                       0.814       0.716    0.762
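The F-Measure values in the agreement table are consistent with the balanced F-measure, that is, the harmonic mean of precision and recall. The slide does not say which variant was used, so the balanced form is an assumption; the short check below recomputes the column from the reported precision and recall.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for each ground-truth assessor.
for assessor, p, r in [(1, 0.749, 0.792), (2, 0.814, 0.716)]:
    print(assessor, f"{f_measure(p, r):.3f}")
# prints:
# 1 0.770
# 2 0.762
```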

Page 14: Mining the Web for Points of Interest

Sequential Tagging Model

$$p(Y \mid X, \lambda) = \frac{1}{Z(X)} \exp\left( \sum_j \lambda_j F_j(Y, X) \right)$$

$$\operatorname*{argmax}_{\Lambda} \left\{ \frac{1}{Z(X)} \exp\left( \sum_j \lambda_j F_j(Y, X) \right) \right\}$$
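To make the model concrete, the sketch below evaluates the conditional probability of a label sequence for a toy sentence by enumerating every labelling to obtain Z(X). The two-label set, the feature templates and the weights are invented for illustration and are not taken from the talk. Practical CRF implementations use dynamic programming (forward-backward, Viterbi) instead of enumeration, which grows exponentially with sentence length.

```python
import itertools
import math
from typing import Dict, Sequence, Tuple

LABELS = ("O", "POI")  # assumed label set for the toy example


def feature_counts(y: Sequence[str], x: Sequence[str]) -> Dict[str, float]:
    """Global feature vector F(Y, X): sums of local indicator features."""
    counts: Dict[str, float] = {}
    prev = "START"
    for word, label in zip(x, y):
        names = (
            f"cap={word[:1].isupper()}|{label}",  # capitalisation + label
            f"trans={prev}>{label}",              # label transition
        )
        for name in names:
            counts[name] = counts.get(name, 0.0) + 1.0
        prev = label
    return counts


def score(y: Sequence[str], x: Sequence[str], weights: Dict[str, float]) -> float:
    """Weighted feature sum: sum_j lambda_j * F_j(Y, X)."""
    return sum(weights.get(n, 0.0) * v for n, v in feature_counts(y, x).items())


def conditional_probability(y: Sequence[str], x: Sequence[str],
                            weights: Dict[str, float]) -> float:
    """p(Y | X, lambda) = exp(score) / Z(X), with Z(X) found by enumeration."""
    z = sum(math.exp(score(c, x, weights))
            for c in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y, x, weights)) / z


def decode(x: Sequence[str], weights: Dict[str, float]) -> Tuple[str, ...]:
    """Most probable labelling; Z(X) cancels, so comparing scores suffices."""
    return max(itertools.product(LABELS, repeat=len(x)),
               key=lambda c: score(c, x, weights))


# Hypothetical weights; in practice these are learned from training data.
weights = {"cap=True|POI": 2.0, "cap=False|O": 1.5, "trans=POI>POI": 0.5}
sentence = ["left", "the", "Marriott", "Hotel", "that"]
best = decode(sentence, weights)
print(best, round(conditional_probability(best, sentence, weights), 3))
```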

Page 15: Mining the Web for Points of Interest

Features

§ Lexical
  ›  Word identity, shape, position, etc.

§ Grammatical
  ›  Part of Speech, Apache OpenNLP

§ Statistical
  ›  Normalised Point-wise Mutual Information of mobile search query logs

§ Geographic
  ›  Gazetteer attributes from Yahoo! Placemaker
  ›  http://developer.yahoo.com/geo/placemaker/
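Two of the feature families listed above can be illustrated briefly. The sketch below shows one common way to compute a word-shape feature and the usual definition of normalised point-wise mutual information, NPMI = PMI / (-log p(x, y)). The slides do not give the exact feature templates or say which term pairs from the query logs were scored, so the function names, the shape encoding and the counts in the usage lines are assumptions.

```python
import math
import re
from typing import Dict, List, Union


def word_shape(token: str) -> str:
    """Map characters to classes (X, x, d) and collapse repeated classes."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)


def lexical_features(tokens: List[str], i: int) -> Dict[str, Union[str, int, bool]]:
    """Word identity, shape and position features for token i."""
    token = tokens[i]
    return {
        "word": token.lower(),
        "shape": word_shape(token),
        "position": i,
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }


def npmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Normalised PMI in [-1, 1] from co-occurrence and marginal counts."""
    if count_xy == 0:
        return -1.0                      # the pair never co-occurs
    p_xy = count_xy / total
    if p_xy == 1.0:
        return 1.0                       # avoid dividing by -log(1) = 0
    p_x, p_y = count_x / total, count_y / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


print(lexical_features(["the", "Marriott", "Hotel"], 1))
# Made-up counts: the two terms always appear together in 10 of 100 queries.
print(round(npmi(10, 10, 10, 100), 3))
```

An NPMI close to 1 indicates terms that almost always occur together, a value near 0 indicates independence, and -1 indicates terms that never co-occur.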

Page 16: Mining the Web for Points of Interest

Process Overview

[Diagram: pipeline from three data sources to three POI taggers]

Sources: Geocoded Wikipedia Articles; Check-Ins (Foursquare); Check-Ins (Gowalla)

Stages: Extract Article Titles / Extract POI Mentions; Search Engine (Bing); Snippet Processing; CRF Model Training

Intermediate data: Wikipedia, Foursquare and Gowalla Bootstrapped Raw Web Snippets

Outputs: Wikipedia based, Foursquare based and Gowalla based POI Taggers

Example snippet: "… was only after he had left the Marriott Hotel that he remembered…"

Page 17: Mining the Web for Points of Interest

Results

Training Data    Testing Data    Precision   Recall
Y! Placemaker    Manual Data     0.237       0.228
Wikipedia        Manual Data     0.514       0.337
Foursquare       Manual Data     0.276       0.655
Gowalla          Manual Data     0.360       0.414
Wikipedia        10-fold CV      0.879       0.955
Foursquare       10-fold CV      0.689       0.468
Gowalla          10-fold CV      0.857       0.868

Page 18: Mining the Web for Points of Interest

Language Modelling

§ Partition the world into 1km cells

§ For each, create model from Flickr photos taken in that area

§ Treat problem as IR, match a POI (query) against the cells (document)
  ›  Return centroid of best matching cell

$$P(t \mid \theta_L) = \frac{c_{user}(t, L)}{|L|}, \qquad |L| = \sum_{t_i \in L} c_{user}(t_i, L)$$
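The localisation step can be sketched as a small query-likelihood ranker. Several details below are assumptions rather than facts from the talk: c_user(t, L) is read as the number of distinct users who used term t in cell L, the grid is approximated by 0.01-degree squares (roughly 1 km only near the equator), a small constant stands in for whatever smoothing the authors used, and the photo records are made up.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Cell = Tuple[int, int]
Photo = Tuple[str, float, float, List[str]]  # (user_id, lat, lon, tags)


def cell_of(lat: float, lon: float, size_deg: float = 0.01) -> Cell:
    """Assumed gridding: fixed-size squares in degrees, not true 1km cells."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))


def centroid(cell: Cell, size_deg: float = 0.01) -> Tuple[float, float]:
    """Centre point (lat, lon) of a grid cell."""
    return ((cell[0] + 0.5) * size_deg, (cell[1] + 0.5) * size_deg)


def build_models(photos: Iterable[Photo]) -> Dict[Cell, Dict[str, int]]:
    """Per cell, count the distinct users who used each tag: c_user(t, L)."""
    users = defaultdict(lambda: defaultdict(set))
    for user, lat, lon, tags in photos:
        cell = cell_of(lat, lon)
        for tag in tags:
            users[cell][tag.lower()].add(user)
    return {cell: {tag: len(who) for tag, who in tag_users.items()}
            for cell, tag_users in users.items()}


def p_term(term: str, counts: Dict[str, int]) -> float:
    """P(t | theta_L) = c_user(t, L) / |L|, with |L| the sum over all terms."""
    size = sum(counts.values())
    return counts.get(term, 0) / size if size else 0.0


def locate(query_terms: List[str], models: Dict[Cell, Dict[str, int]],
           epsilon: float = 1e-9) -> Tuple[float, float]:
    """Rank cells by query log-likelihood; return the best cell's centroid."""
    def log_likelihood(counts: Dict[str, int]) -> float:
        # epsilon is an assumed floor so unseen terms do not yield log(0)
        return sum(math.log(p_term(t.lower(), counts) + epsilon)
                   for t in query_terms)
    best = max(models, key=lambda cell: log_likelihood(models[cell]))
    return centroid(best)


# Made-up photo records for illustration only.
photos = [
    ("u1", 45.5231, -122.6765, ["adams", "bar"]),
    ("u2", 45.5233, -122.6768, ["adams", "bar"]),
    ("u1", 45.5120, -122.6590, ["bridge"]),
]
models = build_models(photos)
print(locate(["Adams", "Bar"], models))
```

Counting users rather than photos keeps a single prolific photographer from dominating a cell's model, which is one plausible reason for a user-based count.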

Page 19: Mining the Web for Points of Interest

Performance

                        Placemaker   Cascade   Geo Scope   # Examples
Placemaker POIs         0.29         0.29      0.29        134
Placemaker Other Locs   4.19         2.90      2.12        131
All Known Locs          1.17         0.82      0.79        265
New Locations           -            439.0     5.88        130
All Data                -            1.20      0.96        395

Page 20: Mining the Web for Points of Interest

Conclusions and Implications

§ POIs are valuable, but useful ones are difficult to define

§ Generating evaluation data is hard

§ Can use web snippets bootstrapped with check-ins, and articles on Wikipedia, to train POI tagger
  ›  Up to 88% precision on unlabelled data
  ›  Reflect the POIs users visit
  ›  Easily updated
  ›  Can be located accurately using hybrid gazetteer + Flickr language model technique

Page 21: Mining the Web for Points of Interest

Benefits of this approach

§ Discover POIs:
  ›  that we already know about (replace/extend existing sources)
  ›  we didn’t already know about (novel POIs)
  ›  of more diverse types (increasing coverage)
  ›  that are fresher

§ Increase relevance of local and hyperlocal search using wisdom of the crowds

Page 22: Mining the Web for Points of Interest

Research Areas

- Automatic POI detection in UGC
  -  Learning how users refer to places
  -  Localising media
- Generating evaluation data
  -  (This is hard)
- Multi-source combination
- Quality & Credibility

Page 23: Mining the Web for Points of Interest

Adam Rae, Vanessa Murdock, Adrian Popescu, Hugues Bouchard

Thank you