Learning to Extract Local Events from the Web
John Foley, Michael Bendersky, and Vanja Josifovski
August 11, 2015


Learning to Extract Local Events from the Web

John Foley, Michael Bendersky, and Vanja Josifovski

August 11, 2015

[email protected] [email protected] [email protected]

Entertain me!

● Similar to the TREC Contextual Suggestion track
  – Cities are queries
  – Venues are the documents, marked relevant based on preferences
● Events are different than venues
  – There are possibly many events in a venue
  – Many events and venues on the same page, e.g. the homepage of a local band that plays in a few different towns

Event (n.)

An event occurs at a certain location, has a start date and time, and a title or description.

● What?
● When?
● Where?

Why don't you…?

● Existing approaches rely upon:
  – Repetitive structure
    ● Table-based approaches
    ● Wrapper induction
  – Human annotation (time, $$)
  – Expensive visual features
    ● Region extraction

Linked Data / Social Media Pages?


Linked Data

Tweet: DESIGNER GARAGE SALE
Where: Shed 6, Auckland
When: Saturday 26 Mar 2011, 10:00 a.m.
Price: STANDARD - ADULT - R18, $15.00
A glass of wine will be served. The Designer Garage Sale - VIP Limited Entry Pass, including a glass of Icon Methode Traditionelle - limited to 100 tickets.

Example: Garage Sale

Our Approach:

Have you been collecting items for the next Forest Heights Community Garage Sale? This year’s spring community garage sale will be held on Saturday, June 2nd from 9:00 am to 3:00 pm. Forest Heights homeowners who wish to participate in the garage sale register online for the event….


Everyone vs. Small Community

Learning to extract from Linked Data

● Linked Data for Information Extraction workshop (LD4IE) at ISWC
  – Learning regular expressions for the extraction of product attributes from e-commerce microdata (Petrovski et al. 2014)
  – Self-training wrapper induction with linked data (Gentile et al. 2014): learning a dictionary from RDF tuples which is then applied to extraction (learned XPaths)

Event Extraction Model

[Pipeline figure: The Internet → Event Pages → Event Regions → Event Fields; stages: Document Scoring, Event Scoring, Field Scoring]

Field-First Intuition

● Assume P(E | webpage) is high:
  – If we see a date-time and a place together, it's probably an event.
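As a toy illustration of this field-first intuition, here is a hypothetical sketch with deliberately crude, made-up regexes; the talk builds on real rule-based taggers (e.g. SUTime/HeidelTime-style date rules), not these patterns.

```python
import re

# Hypothetical field detectors -- illustrative assumptions, far
# simpler than the rule-based When/Where taggers the talk assumes.
DATE_RE = re.compile(
    r"\b(?:mon|tues?|wed(?:nes)?|thu(?:rs)?|fri|sat(?:ur)?|sun)(?:day)?\.?"
    r"\s+\d{1,2}\s+[a-z]{3,}", re.I)
TIME_RE = re.compile(r"\b\d{1,2}(?::\d{2})?\s*(?:a\.?m\.?|p\.?m\.?)\b", re.I)
PLACE_RE = re.compile(r"\b(?:at|in)\s+[A-Z][\w' ]+")

def looks_like_event(text):
    """Field-first heuristic: a date-time and a place candidate seen
    together make the page likely to describe an event."""
    has_when = bool(DATE_RE.search(text) or TIME_RE.search(text))
    has_where = bool(PLACE_RE.search(text))
    return has_when and has_where

print(looks_like_event(
    "Garage sale Saturday 26 Mar at 10:00 a.m. at Shed 6, Auckland"))  # True
```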

[Pipeline figure, repeated: The Internet → Event Pages → Event Regions → Event Fields; stages: Document Scoring, Event Scoring, Field Scoring]

Existing Work on Fields

● When?
  – Dates and times are studied through TimeML
  – SUTime, HeidelTime, etc.
● Where?
  – Addresses are somewhat standardized
    ● Rule-based approaches
  – NER LOCations (entities)
● What?
  – More abstract, more domain-specific

Field Scoring

● Generate candidates
  – Use rule-based methods for When and Where
  – For What, consider every small HTML tag
● (Optionally) Classify candidates
  – Assign scores to each candidate
  – Trained on linked-data examples
● Output:
  – Scored set of fields on a page
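The What candidate step above could look something like this minimal sketch, where "every small HTML tag" is interpreted, as an assumption, as short text inside heading- or emphasis-style tags:

```python
from html.parser import HTMLParser

# Tags treated as What candidates -- an assumption; the talk does not
# list the exact tags it considers.
CANDIDATE_TAGS = {"h1", "h2", "h3", "title", "b", "strong", "em", "a"}
MAX_WORDS = 8

class WhatCandidates(HTMLParser):
    """Collects short text spans found inside candidate tags."""
    def __init__(self):
        super().__init__()
        self.stack = []        # stack of currently open tags
        self.candidates = []   # harvested What candidates

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())
        if (self.stack and self.stack[-1] in CANDIDATE_TAGS
                and 0 < len(text.split()) <= MAX_WORDS):
            self.candidates.append(text)

parser = WhatCandidates()
parser.feed("<h2>Library Book Sale</h2>"
            "<p>Next Tues. from 1p-7pm the library holds its book sale.</p>")
print(parser.candidates)  # ['Library Book Sale']
```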

Event Record Grouping Algorithm

Library Book Sale

Tues. Aug 18 1p-7p

Next Tues. from 1p - 7pm the library will be holding its yearly book sale. Come support your local...

Posted in 2015
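The slide above shows the fields of one record in context. A minimal sketch of a bottom-up grouping step might merge nearby fields into records and keep only complete ones; the character offsets, gap threshold, and completeness rule here are illustrative assumptions, not the paper's actual algorithm.

```python
# fields: (char_offset, kind, text) triples, kind in {"what","when","where"}.
def group_fields(fields, max_gap=200):
    """Greedy bottom-up grouping: fields closer than max_gap characters
    fall into the same candidate record; records missing a title or a
    date are discarded."""
    records, current = [], []
    for field in sorted(fields):
        if current and field[0] - current[-1][0] > max_gap:
            records.append(current)
            current = []
        current.append(field)
    if current:
        records.append(current)
    # Keep only groups that carry at least a What and a When.
    return [r for r in records if {k for _, k, _ in r} >= {"what", "when"}]

page = [(10, "what", "Library Book Sale"),
        (40, "when", "Tues. Aug 18 1p-7p"),
        (900, "what", "Posted in 2015")]
print(group_fields(page))  # one record: the title + date pair
```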

Event Extraction Model

[Pipeline figure, repeated: The Internet → Event Pages → Event Regions → Event Fields; stages: Document Scoring, Region Scoring, Field Scoring]

Experimental Setup

● ClueWeb12
  – 700 million pages
  – Semantic Web annotations
    ● 149,000 pages in 2,700 domains with Event markup
    ● 900,000 events on those pages
  – Duplicate detection and field requirements
    ● 430,000 unique events after exact matches removed
    ● 217,000 with complete markup (What, Where & When)
  – Test set
    ● Only pages with no semantic web annotations
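The exact-match step above (430,000 unique events out of 900,000) can be sketched as keying each event on its field tuple; the lowercasing/whitespace normalization here is an illustrative assumption, not necessarily what the paper does.

```python
def dedup(events):
    """Drop exact duplicates: events whose (what, when, where) fields
    match after lowercasing and whitespace normalization count once."""
    seen, unique = set(), []
    for ev in events:
        key = tuple(" ".join(v.lower().split()) for v in ev)
        if key not in seen:
            seen.add(key)
            unique.append(ev)
    return unique

events = [("Book Sale", "Aug 18, 1p-7p", "Town Library"),
          ("book  sale", "aug 18, 1p-7p", "town library"),
          ("Garage Sale", "June 2, 9am-3pm", "Forest Heights")]
print(len(dedup(events)))  # 2
```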

Collecting Judgments

● Judgments were fairly quick:
  o Just the event ($0.05): 998 judgments, mean = 1.7 minutes, median = 0.7 minutes
  o Event and all fields ($0.10): 655 judgments, mean = 4.2 minutes, median = 2.2 minutes

Field Scoring Evaluation

● Dataset
  – Top 30,000 pages by document score
● Methods
  – No classification
    ● Rule-based approach for Where and When
    ● Any tag is a candidate for What
  – What classification
    ● Rank What fields before grouping algorithm
  – What-When-Where classification
    ● Rank all fields before grouping algorithm
● Event pools
  – Each method generated different event candidates, so we judged each by random sampling.
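Pooled judging as described in the last bullet might be sketched as follows; `per_method`, the seed, and the sampling scheme are illustrative assumptions, not the paper's protocol.

```python
import random

def build_pool(method_outputs, per_method=100, seed=0):
    """Pool a random sample of each method's distinct candidates
    into one judging set, so every method is represented."""
    rng = random.Random(seed)
    pool = set()
    for candidates in method_outputs:
        distinct = sorted(set(candidates))
        rng.shuffle(distinct)
        pool.update(distinct[:per_method])
    return pool

runs = [["event-%d" % i for i in range(500)],        # method A
        ["event-%d" % i for i in range(250, 750)]]   # method B, overlapping
pool = build_pool(runs)
print(len(pool))  # between 100 and 200, depending on sample overlap
```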

Field Scoring Methods

Event Prediction Results

● Recall
  – Cannot properly be measured without labeling 700 million pages
  – We show recall as a percentage of the Schema.org marked-up data we had available for training.
● Precision
  – Break results into four levels of performance

Precision Evaluation

                  Very High   High      Medium    Low
New Events        25,833      201,531   452,274   1,575,909
Event Precision   0.92        0.85      0.65      0.55
% Training Data   12%         93%       208%      725%

Summary

● We've presented an automatic approach to extracting events with pretty good precision behavior
  – Doubled our recall at 85% precision
  – Bottom-up field classification and grouping algorithm
  – No training labels created for this task
● Can we do better?

Supervised Extension

● 1.1 million events predicted at Low Precision
  o 800 judgments from M+ (Train & Validate)
  o 300 in L (Evaluate)
  o ~30 hours of annotation effort for a 30% improvement in precision
  o Simple features

Summary

● We've presented an automatic approach to extracting events with pretty good precision behavior
  – Doubled our recall at 85% precision
  – Bottom-up field classification and grouping algorithm
  – No training labels created for this task
● We can improve with supervision
  – 30 hours of labeling led to a 30% improvement in precision on another million events

So what about coverage?


California et al.

English in Europe

City coverage evaluation

● 200 random cities

● Judged to a depth of 5

● Pacific, Missouri

● Fox Point, Wisconsin

● Palo Alto, California

● Lahore, Punjab

● Harrogate, North Yorkshire

● Duncan, British Columbia

● Docklands, Victoria, Australia

● Charleston, South Carolina

● Edmonton, Alberta, Canada

● Accrington, England

● Long Beach, California

Metric   Score
MRR      0.78
P@1      0.71
P@2      0.70
P@3      0.69
P@4      0.70
P@5      0.70
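The MRR and P@k figures in the table follow the standard definitions; the relevance labels below are made up purely to show the arithmetic.

```python
def mrr(rankings):
    """Mean reciprocal rank of the first relevant hit per query."""
    total = 0.0
    for rels in rankings:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

def precision_at_k(rankings, k):
    """Mean fraction of relevant results among the top k."""
    return sum(sum(rels[:k]) / k for rels in rankings) / len(rankings)

# Two hypothetical city queries, judged to depth 5 (1 = relevant).
judged = [[1, 1, 0, 1, 1],
          [0, 1, 1, 1, 0]]
print(mrr(judged))                          # 0.75
print(round(precision_at_k(judged, 5), 2))  # 0.7
```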

Summary

● We've presented an automatic approach to extracting events with pretty good precision behavior
  – Doubled our recall at 85% precision
  – Bottom-up field classification and grouping algorithm
  – No training labels created for this task
● We can improve with supervision
  – 30 hours of labeling led to a 30% improvement in precision on another million events
● We have improved city coverage
  – 70% precision for 5 results on 200 random cities

Thank you.

Learning to Extract Local Events from the Web