
EXPECTATION: EXtracting asPECT-based Travel informATION

By Aditya Chetan

Advisors: Dr. Paramita Mirza and Dr. Andrew Yates

Overview

● Motivation
● Data collection and Analysis
  ○ WikiVoyage
  ○ TripAdvisor
● Seed word extraction
  ○ What are seed words?
  ○ How to extract them?
● Weakly supervised Aspect Extraction
  ○ Student-teacher model

Motivation

● Existing Travel Recommender Systems:
  ○ Can: Make bookings, give rating-based recommendations
  ○ Can’t: Consider fine-grained aspect-based preferences from unstructured feedback

Consider travel history

Consider an aspect-level preference like “warm” CLIMATE

Negative sentiment towards “parasailing”, an adventure sport (ACTIVITY)

Make suggestions using previous utterances

Capture multiple aspects (CLIMATE, ACTIVITY, FOOD, etc.)

My focus: TravelDB Construction

Creating aspect-based profiles of locations

Why TravelDB?

● Resource for providing recommendations for destinations
● Existing data sources do not focus on Travel-related aspects like ACTIVITY, FOOD, etc.
● Re-usable, transparent asset when providing recommendations

New Delhi:
  ACTIVITY: bus tour, museums, forts, tomb
  FOOD: north indian, vegan food, indian street food
  SHOP: sarees, indian textiles, handmade goods
  CLIMATE: hellish 😰

Challenges

● Lack of structured data sources (like IMDb, etc.) with travel-related information
● Unstructured data is not annotated.

[Figure: Table of contents on the Wikipedia article of New Delhi (https://en.wikipedia.org/wiki/New_Delhi)]

Roadmap till now

Data Collection & Analysis
● Selecting and scraping data sources.
● Cleaning and preprocessing.
● Selecting aspects to focus on.

Seed Word Extraction
● “Seed words” are representative words for an aspect.
● E.g. ACTIVITY has key-phrases like “hiking”, “museums”, etc.

Extracting Aspects for new locations
● Use seed words as weak supervision to classify text segments into 4 aspects:
  ○ ACTIVITY
  ○ FOOD
  ○ SHOP
  ○ CLIMATE

I. Data Collection & Analysis

● We collected data from two different sources:
  a. WikiVoyage Data Dump: Collection of travel-related information of various places, written by volunteers.
  b. TripAdvisor Discussion Forums: Location-specific discussion forums on travel.

I. i. WikiVoyage Data Dump

● Articles have a summary, plus section headings for different aspects of travel (e.g. “See & Do”, “Eat & Drink”, etc.), written in WikiText.
● Wrote a tool that could extract content from the dump.
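
The extraction tool itself is not shown in the slides. As a minimal sketch of what such a tool might do, assuming each article's WikiText marks top-level sections with `== Heading ==` lines (standard MediaWiki syntax), the hypothetical function `split_sections` below separates the summary from the sectioned content. The sample article is made up for illustration.

```python
import re

# Matches a level-2 MediaWiki heading such as "== See & Do ==".
# Deeper headings ("=== ... ===") are not matched and stay in the section body.
HEADING_RE = re.compile(r"^==[ \t]*([^=\n][^\n]*?)[ \t]*==[ \t]*$", re.MULTILINE)

def split_sections(wikitext):
    """Return {heading: body}; text before the first heading is the summary."""
    sections = {}
    matches = list(HEADING_RE.finditer(wikitext))
    first = matches[0].start() if matches else len(wikitext)
    sections["Summary"] = wikitext[:first].strip()
    for i, m in enumerate(matches):
        # A section runs until the next heading or the end of the article.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        sections[m.group(1)] = wikitext[m.end():end].strip()
    return sections

# Hypothetical sample article.
sample = """New Delhi is the capital of India.

== See & Do ==
Visit forts and museums.

== Eat & Drink ==
Try the street food.
"""
sections = split_sections(sample)
```

A real dump would also need template and link markup stripped from each body, which this sketch leaves out.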

I. i. WikiVoyage Data Dump

● WikiVoyage is volunteer-driven, so headings can have variations, e.g. “See”, “See & Do”, “Do”, etc. We cluster them based on lexical similarity.
● Heading clusters are then assigned to an aspect of travel.

ACTIVITY: See & Do, See, Do, Things to do, ...
FOOD: Drinks, Eat & Drink, Eat/Drink, Eat, ...
SHOPPING: Buy, Pricing...
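
The slides do not say which lexical similarity measure was used. As one hedged illustration, the sketch below groups headings whose word sets overlap (Jaccard similarity) with a greedy single-link pass. The helper names `tokens`, `jaccard`, `cluster_headings` and the 0.3 threshold are assumptions made for this example.

```python
import re

def tokens(heading):
    """Lower-cased word set of a heading, e.g. 'See & Do' -> {'see', 'do'}."""
    return set(re.findall(r"[a-z]+", heading.lower()))

def jaccard(a, b):
    """Jaccard similarity of two word sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_headings(headings, threshold=0.3):
    """Greedy single-link clustering: a heading joins the first cluster that
    contains a sufficiently similar heading, otherwise it starts a new one."""
    clusters = []
    for h in headings:
        for c in clusters:
            if any(jaccard(tokens(h), tokens(other)) >= threshold for other in c):
                c.append(h)
                break
        else:
            clusters.append([h])
    return clusters

headings = ["See", "See & Do", "Do", "Things to do",
            "Eat", "Eat & Drink", "Eat/Drink", "Buy"]
clusters = cluster_headings(headings)
```

Exact word overlap would not link “Drinks” to “Eat & Drink” without stemming, and assigning each cluster to an aspect remains a separate step, as on the slide.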

I. i. WikiVoyage Data Dump

● Total no. of locations: 6382
● No. of locations with all aspects present: 4902

ASPECT      No. of locations
ACTIVITY    6367
FOOD        6346
SHOPPING    4925
CLIMATE     6382

● CLIMATE data on WikiVoyage is extremely limited
● It was collected instead by linking WikiVoyage articles to Wikipedia

I. ii. TripAdvisor Discussion Forums

● Some statistics from the dataset:
  ○ Total no. of locations: 187
  ○ Total no. of users: 149779

II. Seed word extraction

● What are seed words?
  ○ Describing aspects is hard. Easier to “define by example”
    ■ E.g. ACTIVITY: “cross country skiing”, “trekking”, “museums”, “art gallery”, etc.
● Seed words are important w.r.t. travel for most locations
  ○ Seed words are mostly key phrases, but not vice versa
    ■ E.g. “Louvre”: keyphrase for Paris, but not a seed word for ACTIVITY as it is exclusive to Paris.

* YAKE is an unsupervised keyphrase extraction algorithm (Campos et al., ECIR 2018).

II. Seed word extraction

Seed word extraction pipeline

● Input: WikiVoyage articles, with content split by ASPECT (ACTIVITY, FOOD, SHOPPING, CLIMATE)
● For every ASPECT, for every article: YAKE* extracts keyphrases kp1, kp2, kp3, ... from the ASPECT content
● The keyphrases form the ASPECT Vocabularies, one each for ACTIVITY, FOOD, SHOPPING and CLIMATE
● For every ASPECT: TF-IDF with the ASPECT Vocabulary over the ASPECT content from every WikiVoyage article → TF-IDF matrix
● Rank by mean TF-IDF score, giving a rank across articles (e.g. 1. kp3, 2. kp1, 3. kp2): the final ranked list of seed words
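
The TF-IDF and ranking steps of the pipeline can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: the keyphrase vocabulary is taken as given (standing in for YAKE output), the smoothed IDF formula is one common variant chosen here because the slides give none, and `count_phrase`, `rank_seed_words` and the toy articles are hypothetical.

```python
import math
import re

def count_phrase(phrase, text):
    """Whole-word occurrences of a (possibly multi-word) phrase in text."""
    return len(re.findall(r"\b" + re.escape(phrase) + r"\b", text))

def rank_seed_words(articles, vocabulary):
    """Rank lower-case vocabulary phrases by mean TF-IDF score across articles."""
    docs = [a.lower() for a in articles]
    # Document length in words, used to normalise term frequency.
    lengths = [max(len(re.findall(r"\w+", d)), 1) for d in docs]
    n_docs = len(docs)
    scores = {}
    for phrase in vocabulary:
        counts = [count_phrase(phrase, d) for d in docs]
        df = sum(1 for c in counts if c > 0)
        # Smoothed IDF (an assumed variant; the slides do not give the formula).
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        tfidf = [c / n * idf for c, n in zip(counts, lengths)]
        scores[phrase] = sum(tfidf) / n_docs
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy ACTIVITY content from three articles; the vocabulary stands in for YAKE output.
articles = [
    "Visit the museum and the Louvre museum",
    "The museum offers hiking tours nearby",
    "Go hiking or visit a museum",
]
ranked = rank_seed_words(articles, ["louvre", "museum", "hiking"])
```

On this toy input the location-specific phrase “louvre” ranks last, because averaging over all articles penalises phrases that occur in only one of them.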

II. Seed word extraction

ASPECT      No. of seed words   Representative Examples
ACTIVITY    44014               museum, art gallery, rock climbing, guided tours, unesco world heritage
FOOD        19098               restaurant, live music, ice cream, vegan restaurant, selection of beers
SHOPPING    7673                shopping, gift shop, variety of shops, homemade red wine
CLIMATE     2222                precipitation, annual rainfall, oceanic climate, humid subtropical climate, humid continental climate

III. Weakly supervised Aspect Extraction (WAE)

● Seed words give a good anecdotal definition for aspects, but are not exhaustive.
● Need: A framework that can:
  ○ Extract aspect-related words from new utterances.
  ○ Perform this task under minimal supervision.
● Experimenting with a framework based on Karamanolakis et al. [1].
● It utilises iterative co-training that trains teacher-student model pairs that provide feedback to each other.

[1] Leveraging Just a Few Keywords for Fine-Grained Aspect Detection Through Weakly Supervised Co-Training. Giannis Karamanolakis, Daniel Hsu, Luis Gravano. Proceedings of EMNLP-IJCNLP 2019

III. Weakly supervised Aspect Extraction (WAE)

● Task
  ○ Let there be k aspects (including a general aspect), and for every aspect i let a set of seed words d_i be given.
  ○ Train a classifier that, for a fresh text segment s, predicts the aspect k_s that it belongs to.
  ○ This training is unsupervised in the sense that no labelled segments are used; the seed words are the only supervision.

III. Weakly supervised Aspect Extraction (WAE)

[Figure: Teacher Model (BoSW CLF) and Student Model (CLF), each mapping the i-th segment to ASPECT probabilities. Equations are from [1].]
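
The slide's equations did not survive extraction. As a simplified sketch in the spirit of the bag-of-seed-words teacher of [1], the code below turns per-aspect seed-word counts into aspect probabilities with a softmax and falls back to the general aspect when a segment contains no seed word. The seed lists, the `GENERAL` label and the function name `teacher_predict` are illustrative assumptions, and the seed-word quality weights learned during co-training are omitted.

```python
import math
import re

# Hypothetical seed words per aspect (the real lists come from the pipeline above).
SEEDS = {
    "ACTIVITY": {"museum", "hiking", "trekking"},
    "FOOD": {"restaurant", "vegan"},
    "SHOPPING": {"shop", "market"},
    "CLIMATE": {"rainfall", "humid"},
}
GENERAL = "GENERAL"

def teacher_predict(segment):
    """Return {aspect: probability} from seed-word counts in the segment."""
    words = re.findall(r"[a-z]+", segment.lower())
    counts = {a: sum(w in seeds for w in words) for a, seeds in SEEDS.items()}
    if sum(counts.values()) == 0:
        # No seed word present: put all probability mass on the general aspect.
        probs = {a: 0.0 for a in SEEDS}
        probs[GENERAL] = 1.0
        return probs
    # Softmax over counts; the general aspect has no seed words, so its count is 0.
    counts[GENERAL] = 0
    z = sum(math.exp(c) for c in counts.values())
    return {a: math.exp(c) / z for a, c in counts.items()}

probs = teacher_predict("The museum was great and the hiking trail even better")
```

In a co-training loop of the kind the slides describe, a student classifier would be trained on these probabilities as soft labels and would in turn provide feedback to the teacher.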

III. Weakly supervised Aspect Extraction (WAE)

● Advantages
  ○ Learn a student model that is better than teacher
  ○ Get a quality score for each seed word
  ○ Modular framework
  ○ Unsupervised!

III. Weakly supervised Aspect Extraction (WAE)

● Testing the feasibility of the framework
  ○ Initial results look promising.
  ○ Noisy seed words → Class imbalance → Bad Teacher
  ○ Student performance held back by a bad teacher

Goals for the future

● Short term
  ○ Evaluate performance on hand labelled test set
  ○ Improve the Student by replacing BERT embeddings with ALBERT
  ○ Replace Teacher model with Snorkel [2] model based on handcrafted labeling functions
  ○ Shift from using sentences to Elementary Discourse Units (EDUs) for input segments
    ■ E.g. The food looks nice but the restaurant was very costly → The food looks nice <s> but the restaurant was very costly

[2] Snorkel: Rapid Training Data Creation with Weak Supervision. Ratner et al. VLDB 2018
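
To illustrate what handcrafted labeling functions could look like, here is a plain-Python sketch. It deliberately does not use Snorkel's API: the three functions, their keyword lists and the majority vote (a crude stand-in for a learned label model) are assumptions for illustration only.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion

def has_any(segment, words):
    """True if any of the given words occurs as a whole token in the segment."""
    return any(w in segment.lower().split() for w in words)

def lf_food_words(segment):
    return "FOOD" if has_any(segment, ("food", "restaurant")) else ABSTAIN

def lf_shopping_words(segment):
    return "SHOPPING" if has_any(segment, ("bargain", "souvenir")) else ABSTAIN

def lf_weather_words(segment):
    return "CLIMATE" if has_any(segment, ("weather", "rain")) else ABSTAIN

LABELING_FUNCTIONS = [lf_food_words, lf_shopping_words, lf_weather_words]

def weak_label(segment):
    """Majority vote over non-abstaining labeling functions (None if all abstain)."""
    votes = [lf(segment) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

# The two EDUs from the example above.
labels = [weak_label(s) for s in ("The food looks nice",
                                  "but the restaurant was very costly")]
```

Each segment here is one EDU from the example above, so the two halves of the sentence receive their own weak labels.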

Goals for the future

● Mid term
  ○ Extend this to a multi-label setting (an utterance can have multiple aspects)
    ■ E.g. I really like the warm weather, and the skydiving facilities were great too!
  ○ Extend this to a multi-task setting by incorporating sentiment and post-level saliency classification.

Goals for the future

● Long term
  ○ Extend this framework to a sequence labelling task for granular ASPECT-related keyphrase extraction and sentiment classification.
    ■ E.g. Given “The view from the top was amazing but the trek was not so great”, we want to get to:

The  view  from  the  top  was  amazing  but  the  trek  was  not  so  great
*    A     *     *    *    *    O        *    *    A     *    O    O   O
*    +     *     *    *    *    *        *    *    -     *    *    *   *

Thank you!

Questions?