EXPECTATION: EXtracting asPECT-based Travel informATION
By Aditya Chetan
Advisors: Dr. Paramita Mirza and Dr. Andrew Yates
Overview
● Motivation
● Data collection and Analysis
○ WikiVoyage
○ TripAdvisor
● Seed word extraction
○ What are seed words?
○ How to extract them?
● Weakly supervised Aspect Extraction
○ Student-teacher model
Motivation
● Existing Travel Recommender Systems:
○ Can: Make bookings, give rating-based recommendations
○ Can’t: Consider fine-grained aspect-based preferences from unstructured feedback
● Consider travel history
● Consider an aspect-level preference like “warm” CLIMATE
● Detect negative sentiment towards “parasailing”, an adventure sport (ACTIVITY)
● Make suggestions using previous utterances
● Capture multiple aspects (CLIMATE, ACTIVITY, FOOD, etc.)
My focus: TravelDB Construction
Creating aspect-based profiles of locations
Why TravelDB?
● Resource for providing recommendations for destinations
● Existing data sources do not focus on Travel-related aspects like ACTIVITY, FOOD, etc.
● Re-usable, transparent asset when providing recommendations
Example entry for New Delhi:
ACTIVITY: bus tour, museums, forts, tomb
FOOD: north indian, vegan food, indian street food
SHOP: sarees, indian textiles, handmade goods
CLIMATE: hellish 😰
Challenges
● Lack of structured data sources (like IMDb, etc.) with travel-related information
● Unstructured data is not annotated.
[Figure: Table of contents of the Wikipedia article on New Delhi (https://en.wikipedia.org/wiki/New_Delhi)]
Roadmap till now
Data Collection & Analysis
● Selecting and scraping data sources.
● Cleaning and preprocessing.
● Selecting aspects to focus on.
Seed Word Extraction
● “Seed words” are representative words for an aspect.
● E.g. ACTIVITY has key-phrases like “hiking”, “museums”, etc.
Extracting Aspects for new locations
● Use seed words as weak supervision to classify text segments into 4 aspects: ACTIVITY, FOOD, SHOP, CLIMATE
I. Data Collection & Analysis
● We collected data from two different sources:
a. WikiVoyage Data Dump: Collection of travel-related information of various places, written by volunteers.
b. TripAdvisor Discussion Forums: Location-specific discussion forums on travel.
I. i. WikiVoyage Data Dump
● Articles have a summary, plus section headings for different aspects of travel, e.g. “See & Do”, “Eat & Drink”, etc., written in WikiText.
● Wrote a tool to extract content from the dump.
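The slides do not show the extraction tool itself. As a rough sketch of the idea only, the snippet below splits one article's WikiText into sections at level-2 headings such as `==See==`. The function name `split_sections` and the "Summary" key for the lead text are illustrative assumptions, not the author's code.

```python
import re

# Level-2 WikiText heading, e.g. "==See==" or "== Eat & Drink ==".
# Deeper headings such as "===Museums===" are deliberately not matched.
HEADING_RE = re.compile(r"^==[ \t]*([^=\n][^\n]*?)[ \t]*==[ \t]*$", re.MULTILINE)


def split_sections(wikitext):
    """Return {heading: body}; text before the first heading goes under 'Summary'.

    Hypothetical helper for illustration: it ignores templates, links and
    other markup that a real dump-processing tool would have to handle.
    """
    sections = {}
    last_heading, last_end = "Summary", 0
    for match in HEADING_RE.finditer(wikitext):
        # Close the previous section at the start of this heading line.
        sections[last_heading] = wikitext[last_end:match.start()].strip()
        last_heading, last_end = match.group(1), match.end()
    sections[last_heading] = wikitext[last_end:].strip()
    return sections


sample = (
    "Delhi is the capital.\n\n"
    "==See==\nRed Fort\n===Museums===\nNational Museum\n\n"
    "== Eat & Drink ==\nStreet food\n"
)
sections = split_sections(sample)
print(list(sections))
```

Sub-headings stay inside the body of their parent section, so each top-level heading can later be mapped to a single aspect.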
● WikiVoyage is volunteer-driven, so headings can have variations, e.g. “See”, “See & Do”, “Do”, etc. We cluster them based on lexical similarity.
● Heading clusters are then assigned to an aspect of travel:
○ ACTIVITY: See & Do, See, Do, Things to do, ...
○ FOOD: Drinks, Eat & Drink, Eat/Drink, Eat, ...
○ SHOPPING: Buy, Pricing, ...
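The slides do not say which lexical similarity measure was used for clustering headings. The sketch below is one possible reading, assuming a token-overlap coefficient and single-link clustering. The helper names, the crude plural stripping and the 0.5 threshold are all illustrative assumptions.

```python
import re
from itertools import combinations


def tokens(heading):
    """Lowercase word tokens with a crude plural strip (illustrative only)."""
    words = re.findall(r"[a-z]+", heading.lower())
    return {w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words}


def overlap(a, b):
    """Overlap coefficient between the token sets of two headings."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def cluster_headings(headings, threshold=0.5):
    """Single-link clustering: merge headings linked by any pair above threshold."""
    parent = {h: h for h in headings}

    def find(h):
        # Union-find lookup with path halving.
        while parent[h] != h:
            parent[h] = parent[parent[h]]
            h = parent[h]
        return h

    for a, b in combinations(headings, 2):
        if overlap(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for h in headings:
        clusters.setdefault(find(h), []).append(h)
    # Order clusters by the first heading they contain.
    return sorted(clusters.values(), key=lambda c: headings.index(c[0]))


headings = ["See & Do", "See", "Do", "Things to do",
            "Eat & Drink", "Drinks", "Eat", "Buy"]
clusters = cluster_headings(headings)
print(clusters)
```

A purely lexical measure like this cannot put “Buy” and “Pricing” in the same group, so assigning clusters to aspects still needs a manual step, as the slide describes.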
● Total no. of locations: 6382
● No. of locations with all aspects present: 4902
ASPECT      No. of locations
ACTIVITY    6367
FOOD        6346
SHOPPING    4925
CLIMATE     6382
● CLIMATE data on WikiVoyage is extremely limited
● It was collected by linking WikiVoyage articles to Wikipedia
I. ii. TripAdvisor Discussion Forums
● Some statistics from the dataset:
○ Total no. of locations: 187
○ Total no. of users: 149779
II. Seed word extraction
● What are seed words?
○ Describing aspects is hard. It is easier to “define by example”.
■ E.g. ACTIVITY: “cross country skiing”, “trekking”, “museums”, “art gallery”, etc.
● Seed words are important w.r.t. travel for most locations
○ Seed words are mostly key phrases, but not every key phrase is a seed word.
■ E.g. “Louvre” is a keyphrase for Paris, but not a seed word for ACTIVITY, as it is exclusive to Paris.
Seed word extraction pipeline (diagram, repeated for every ASPECT):
● Input: WikiVoyage articles, with content grouped under ACTIVITY, FOOD, SHOPPING and CLIMATE.
● For every ASPECT and every article: run YAKE* on the ASPECT content to get keyphrases kp1, kp2, kp3, ...
● Collect the keyphrases into ASPECT Vocabularies, one each for ACTIVITY, FOOD, SHOPPING and CLIMATE.
● Compute TF-IDF with the ASPECT Vocabulary over the ASPECT content from every WikiVoyage article, giving a TF-IDF matrix.
● Rank by mean TF-IDF score across articles to get the final ranked list of seed words (e.g. 1. kp3, 2. kp1, 3. kp2).
* YAKE is an unsupervised keyphrase extraction algorithm (Campos et al., ECIR 2018).
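The last two pipeline steps (TF-IDF restricted to an ASPECT vocabulary, then ranking by mean score across articles) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `rank_seed_words`, the raw-count term frequency, the smoothed IDF and the substring matching are all choices made here, and the vocabulary is assumed to have been produced already (by YAKE in the slides).

```python
import math


def rank_seed_words(documents, vocabulary):
    """Rank vocabulary phrases by mean TF-IDF over documents, highest first.

    Assumptions (illustrative): phrases are lowercase, term frequency is a
    raw substring count (so "museum" also counts inside "museums"), and the
    IDF is smoothed so a phrase found in every document still scores > 0.
    """
    n_docs = len(documents)
    docs = [doc.lower() for doc in documents]
    # counts[d][phrase] = occurrences of the phrase in document d.
    counts = [{phrase: doc.count(phrase) for phrase in vocabulary} for doc in docs]

    scores = {}
    for phrase in vocabulary:
        doc_freq = sum(1 for c in counts if c[phrase] > 0)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        # Mean TF-IDF across all documents, including those with a zero count.
        scores[phrase] = sum(c[phrase] * idf for c in counts) / n_docs
    return sorted(vocabulary, key=lambda p: scores[p], reverse=True)


activity_docs = [
    "museum and art gallery, another museum",
    "museum tour and hiking",
    "hiking trail near the museum",
]
ranking = rank_seed_words(activity_docs, ["art gallery", "hiking", "museum"])
print(ranking)
```

Averaging over all articles favours phrases that recur across many locations, which matches the slides' point that a location-specific keyphrase such as “Louvre” should not become a seed word.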
ASPECT      No. of seed words   Representative Examples
ACTIVITY    44014               museum, art gallery, rock climbing, guided tours, unesco world heritage
FOOD        19098               restaurant, live music, ice cream, vegan restaurant, selection of beers
SHOPPING    7673                shopping, gift shop, variety of shops, homemade red wine
CLIMATE     2222                precipitation, annual rainfall, oceanic climate, humid subtropical climate, humid continental climate
III. Weakly supervised Aspect Extraction (WAE)
● Seed words give a good anecdotal definition for aspects, but are not exhaustive.
● Need: A framework that can:
○ Extract aspect-related words from new utterances.
○ Perform this task under minimal supervision.
● Experimenting with a framework based on Karamanolakis et al. [1].
● It uses iterative co-training: teacher-student model pairs are trained to provide feedback to each other.
[1] Leveraging Just a Few Keywords for Fine-Grained Aspect Detection Through Weakly Supervised Co-Training. Giannis Karamanolakis, Daniel Hsu, Luis Gravano. Proceedings of EMNLP-IJCNLP 2019
● Task
Let there be k aspects (including a general aspect) and, for every aspect i, a given set of seed words d_i.
Train a classifier that, for a fresh text segment s, predicts the aspect k_s that it belongs to.
No labelled segments are used for training; the seed words are the only supervision.
[Diagram: the i-th segment is passed to the Teacher Model (BoSW CLF) and to the Student Model (CLF); each model outputs ASPECT probabilities. Equations are from [1].]
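The teacher equations did not survive in the slide text, so the snippet below is only a sketch of what a seed-word-based teacher could look like. It reads "BoSW" as bag-of-seed-words, which is itself an assumption, and is not a reproduction of the equations in [1]. It assumes a softmax over per-aspect seed-word counts, single-word seed words, and a fallback to a general aspect when no seed word appears. The function name and the "GENERAL" label are hypothetical.

```python
import math
import re


def teacher_probabilities(segment, seed_words, general="GENERAL"):
    """Softmax over per-aspect seed-word counts for one text segment.

    Illustrative assumptions: seed words are single lowercase tokens, and a
    segment with no seed word is assigned entirely to the general aspect.
    """
    tokens = re.findall(r"[a-z]+", segment.lower())
    counts = {aspect: sum(tokens.count(word) for word in words)
              for aspect, words in seed_words.items()}

    if sum(counts.values()) == 0:
        # No seed word fired: all probability mass goes to the general aspect.
        probs = {aspect: 0.0 for aspect in seed_words}
        probs[general] = 1.0
        return probs

    normaliser = sum(math.exp(c) for c in counts.values())
    probs = {aspect: math.exp(c) / normaliser for aspect, c in counts.items()}
    probs[general] = 0.0
    return probs


seeds = {
    "ACTIVITY": {"hiking", "museum"},
    "FOOD": {"restaurant", "vegan"},
}
p = teacher_probabilities("We went hiking and visited a museum", seeds)
q = teacher_probabilities("The weather was fine", seeds)
print(max(p, key=p.get), max(q, key=q.get))
```

In a co-training setup, soft labels of this kind would serve as training targets for the student classifier. The feedback from student to teacher is not sketched here.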
● Advantages
○ Learn a student model that is better than the teacher
○ Get a quality score for each seed word
○ Modular framework
○ Unsupervised!
● Testing the feasibility of the framework
○ Initial results look promising.
○ Noisy seed words → Class imbalance → Bad Teacher
○ Student performance held back by a bad teacher
Goals for the future
● Short term
○ Evaluate performance on a hand-labelled test set
○ Improve the Student by replacing BERT embeddings with ALBERT
○ Replace the Teacher model with a Snorkel [2] model based on handcrafted labeling functions
○ Shift from using sentences to Elementary Discourse Units (EDUs) for input segments
■ E.g. The food looks nice but the restaurant was very costly → The food looks nice <s> but the restaurant was very costly
[2] Snorkel: Rapid Training Data Creation with Weak Supervision. Ratner et al. VLDB 2018
● Mid term
○ Extend this to a multi-label setting (an utterance can have multiple aspects)
■ E.g. I really like the warm weather, and the skydiving facilities were great too!
○ Extend this to a multi-task setting by incorporating sentiment and post-level saliency classification.
● Long term
○ Extend this framework to a sequence labelling task for granular ASPECT-related keyphrase extraction and sentiment classification.
■ E.g. given “The view from the top was amazing but the trek was not so great”, we want to get to:
The view from the top was amazing but the trek was not so great
*   A    *    *   *   *   O       *   *   A    *   O   O  O
*   +    *    *   *   *   *       *   *   -    *   *   *  *
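The target output above can be read as two tag sequences aligned token by token with the sentence. The slide does not define the tags, so this sketch assumes that `A` marks an aspect term, `O` an opinion word, `+`/`-` the sentiment attached to an aspect term, and `*` no label. The helper names are illustrative.

```python
sentence = "The view from the top was amazing but the trek was not so great"
aspect_tags = "* A * * * * O * * A * O O O".split()
sentiment_tags = "* + * * * * * * * - * * * *".split()


def labelled_tokens(sentence, aspect_tags, sentiment_tags):
    """Zip whitespace tokens with their two tag rows, checking the alignment."""
    tokens = sentence.split()
    if not (len(tokens) == len(aspect_tags) == len(sentiment_tags)):
        raise ValueError("every token needs exactly one tag per row")
    return list(zip(tokens, aspect_tags, sentiment_tags))


def aspect_terms_with_sentiment(rows):
    """Collect tokens tagged 'A' (assumed: aspect term) with their sentiment tag."""
    return [(token, sentiment) for token, aspect, sentiment in rows if aspect == "A"]


rows = labelled_tokens(sentence, aspect_tags, sentiment_tags)
pairs = aspect_terms_with_sentiment(rows)
print(pairs)  # [('view', '+'), ('trek', '-')]
```

Framing the output this way means one sequence labeller can emit both rows in a single pass, in keeping with the multi-task direction described under the mid-term goals.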
Thank you!
Questions?