Adrian Iftene, Alexandru Lucian Gînscă
ICCCC 2012, 8-12 May, Băile Felix, Oradea, Romania
“Al. I. Cuza” University of Iasi, Romania
Faculty of Computer Science
System overview
Data acquisition
Topic detection
Data processing
Identification of opinions
Results
Visualization
Conclusions
Scenario: Street protests in Romania (between 13 and 26 January, 2012)
Crawler component, RSS feeds
Scraping: removed links, photos, menus, special characters
Data locally stored
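The scraping step can be illustrated with a minimal sketch. It assumes the crawler has already downloaded an article's HTML; the class name `ArticleTextExtractor`, the function `clean_article`, and the list of skipped tags are hypothetical and are not taken from the original system.

```python
import re
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collects visible text and ignores elements that usually hold menus or scripts."""

    # Assumption: these tags approximate "menus" and other non-article content.
    SKIPPED = {"nav", "script", "style", "header", "footer", "figure"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than 0 while inside a skipped element
        self.parts = []  # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def clean_article(html: str) -> str:
    """Return plain text without tags (links, photos), URLs, or special characters."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    text = re.sub(r"https?://\S+", " ", text)      # drop bare URLs
    text = re.sub(r"[^\w\s.,;:!?()-]", " ", text)  # drop special characters
    return re.sub(r"\s+", " ", text).strip()       # normalize whitespace
```

Link and image tags disappear because only text nodes are kept; the anchor text itself is preserved so that place names inside links are not lost.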
The topic is very important in detecting articles referring to a crisis situation
Latent Dirichlet Allocation: state-of-the-art topic model
Problems:
• The number of topics needs to be specified from the start
• The results are lists of representative words for each topic, so human intervention is needed to interpret them
Solution: WordNet-based similarity measures
• Wu and Palmer
• Lin
• Resnik (best results)
Computing the similarity between 2 sets of words
T1, T2 = two sets of words.
sim(t1, t2) = one of the Wu and Palmer, Resnik or Lin similarity measures.
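The exact formula for combining the pairwise scores is not preserved in this text, so the sketch below is only one plausible aggregation, stated as an assumption: each word is matched with its most similar word in the other set, the best scores are averaged, and the two directions are averaged again. The names `set_similarity` and `toy_sim` are hypothetical; in practice `sim` would be a WordNet-based measure such as Wu and Palmer, Resnik, or Lin.

```python
from typing import Callable, Iterable


def set_similarity(t1: Iterable[str], t2: Iterable[str],
                   sim: Callable[[str, str], float]) -> float:
    """Symmetric best-match average between two word sets (assumed aggregation)."""
    t1, t2 = list(t1), list(t2)
    if not t1 or not t2:
        return 0.0

    def directed(a, b):
        # For every word in a, keep its best score against b, then average.
        return sum(max(sim(x, y) for y in b) for x in a) / len(a)

    return 0.5 * (directed(t1, t2) + directed(t2, t1))


def toy_sim(a: str, b: str) -> float:
    """Placeholder word similarity: 1.0 for identical words, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


score = set_similarity(["protest", "street"], ["protest", "crowd"], toy_sim)
```

Passing the word-level measure as a parameter keeps the aggregation independent of which of the three WordNet measures is used.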
LDA results for our street protests corpus when tracking 3 topics
Language-specific resources that contain cities (Iasi, Bucuresti, Ploiesti, etc.) and regions (Bucovina, Moldova, Transilvania, etc.) (Iftene et al., 2011)
Introducing a more localized approach: new resources and rules for identifying streets (Bulevardul Independentei in Iasi, Calea Victoriei in Bucuresti, etc.) and smaller inner-city regions (Pacurari district, center of Iasi, Arch of Triumph Square)
Examples of rules: to identify streets (Street + entity, Boulevard + entity, etc.), to identify small regions (the area between street A and street B or the area of building A)
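Rules of this kind can be approximated with regular expressions. The sketch below is an illustration only: the trigger words, the definition of an "entity" as a run of capitalized words, and the names `STREET_RULE`, `AREA_RULE` and `find_locations` are assumptions, not the resources used in the original system.

```python
import re

# Assumption: a small set of trigger words (Romanian and English).
TRIGGER = r"(?:Strada|Bulevardul|Calea|Street|Boulevard)"
# Assumption: an entity is one or more capitalized words.
NAME = r"[A-ZĂÂÎȘȚ][\w-]*(?:\s+[A-ZĂÂÎȘȚ][\w-]*)*"
STREET = rf"{TRIGGER}\s+{NAME}"

STREET_RULE = re.compile(rf"\b({STREET})")                               # Street + entity
AREA_RULE = re.compile(rf"\barea between\s+({STREET})\s+and\s+({STREET})")  # area between A and B


def find_locations(text: str) -> list[tuple[str, str]]:
    """Return (type, value) pairs for areas and streets matched by the rules."""
    found = []
    for m in AREA_RULE.finditer(text):
        found.append(("area", f"{m.group(1)} / {m.group(2)}"))
    for m in STREET_RULE.finditer(text):
        found.append(("street", m.group(1)))
    return found


hits = find_locations(
    "Clashes in the area between Calea Victoriei and Bulevardul Independentei."
)
```

A pattern this simple cannot tell a street name from a person name in ambiguous contexts, which is why rules are combined with dedicated resources.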
538 files with 2,806 entities of “street” and “area” types
The overall quality of the NE identification component is around 92% and the quality of the NE classification component is around 67%
Problems:
◦ incorrect spelling
◦ anaphora resolution
◦ ambiguous situations in which the context does not tell us whether the NE is a person name or a street name
Rule based opinion mining system (Gînscă et al., 2011)
Easily adaptable from one crisis scenario to another, in contrast to a statistical approach
Use of manually built resources to identify opinion keywords (good, bad, etc.), amplifiers (most, more, etc.), diminishers (less, etc.), negation (not, never, etc.)
Calculate the valences for groups of feelings and pair named entities with scores based on distance, punctuation and context
Use a dedicated vocabulary for a specific crisis situation with 21 initial words (protest, conflict, fight, etc.) + similar words from WordNet (synonyms, hypernyms, etc.)
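A minimal sketch of how such a rule-based scorer might work is given below. All word lists, weights, the two-token modifier window, and the inverse-distance pairing are assumptions made for illustration; punctuation and context, which the system also uses, are not modeled here.

```python
# Assumption: tiny hand-built lexicons with illustrative weights.
OPINION = {"good": 1.0, "bad": -1.0, "peaceful": 1.0, "violent": -1.0}
AMPLIFIERS = {"most": 2.0, "more": 1.5, "very": 1.5}
DIMINISHERS = {"less": 0.5}
NEGATIONS = {"not", "never"}


def score_tokens(tokens: list[str]) -> list[tuple[int, float]]:
    """Return (position, valence) for each opinion keyword, after applying modifiers."""
    scored = []
    for i, tok in enumerate(tokens):
        if tok not in OPINION:
            continue
        valence = OPINION[tok]
        # Assumption: modifiers act when they occur in the two preceding tokens.
        for prev in tokens[max(0, i - 2):i]:
            if prev in AMPLIFIERS:
                valence *= AMPLIFIERS[prev]
            elif prev in DIMINISHERS:
                valence *= DIMINISHERS[prev]
            elif prev in NEGATIONS:
                valence = -valence
        scored.append((i, valence))
    return scored


def entity_scores(tokens: list[str], entities: set[str]) -> dict[str, float]:
    """Pair each named entity with nearby valences, weighted by inverse token distance."""
    scored = score_tokens(tokens)
    result = {}
    for j, tok in enumerate(tokens):
        if tok in entities:
            total = sum(v / (1 + abs(i - j)) for i, v in scored)
            result[tok] = result.get(tok, 0.0) + total
    return result


tokens = "the protest in Iasi was not very peaceful".split()
scores = entity_scores(tokens, {"Iasi"})
```

Here "peaceful" is negated and amplified to a valence of -1.5, and the entity "Iasi", four tokens away, receives a fifth of that value.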
Greedy approach – iteratively adding intermediate green points to the current path until the solution cannot be improved
Advantages – we reduce the search space for optimal routes and the Greedy solution is obtained very fast
Disadvantages – the Greedy solution is only close to the optimal solution, not guaranteed to be optimal
[Figure: Cumulated sentiment values by day. x-axis: day of January 2012 (13 to 25); y-axis: cumulated sentiment value (-40 to 30)]
[Figure: Location-type entity mentions by day. x-axis: day of January 2012 (13 to 25); y-axis: number of mentions (0 to 250)]
GoogleMaps API
Our algorithm is able to find an alternative (longer) path that goes around the red islands and prefers routes near the green islands
Thus, at every step it is possible to apply penalties when the partial solution crosses red islands (with potential risks) and add bonuses when the partial solution crosses green islands (without potential risk)
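The penalty and bonus scheme, combined with the Greedy insertion of green points, can be sketched as follows. This is an illustration under stated assumptions: islands are reduced to single planar points, and the `radius`, `penalty` and `bonus` values, like the function names, are hypothetical. A real deployment would work on routes returned by the GoogleMaps API.

```python
import math


def seg_dist(p, a, b):
    """Distance from point p to the segment a-b (planar coordinates)."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def path_cost(path, red, green, radius=1.0, penalty=10.0, bonus=1.0):
    """Path length plus penalties near red islands minus bonuses near green ones."""
    cost = 0.0
    for a, b in zip(path, path[1:]):
        cost += math.hypot(b[0] - a[0], b[1] - a[1])
        cost += penalty * sum(seg_dist(r, a, b) <= radius for r in red)
        cost -= bonus * sum(seg_dist(g, a, b) <= radius for g in green)
    return cost


def greedy_route(start, end, red, green, **kw):
    """Insert green points one at a time while the cost keeps decreasing."""
    path = [start, end]
    best = path_cost(path, red, green, **kw)
    improved = True
    while improved:
        improved = False
        best_path = path
        for g in green:
            if g in path:
                continue
            for i in range(1, len(path)):  # try every insertion position
                cand = path[:i] + [g] + path[i:]
                c = path_cost(cand, red, green, **kw)
                if c < best:
                    best, best_path, improved = c, cand, True
        path = best_path
    return path, best


route, cost = greedy_route((0, 0), (10, 0), red=[(5, 0)], green=[(5, 3)])
```

In this toy case the direct route crosses the red island and costs 20, while the detour through the green point is geometrically longer but has a lower total cost, so the Greedy step accepts it.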
When there are no green islands, we must specify another method for selecting intermediate points in order to improve the quality of the current solution
The GoogleMaps API is able to place streets and boulevards on the map, but not specific squares and areas. For such cases we built an additional resource that specifies their GIS coordinates
We present a system that can be easily adapted from one crisis situation to another (by changing the dictionaries and the topics of interest)
Efficient topic identification using LDA
Suggestive visualization using GoogleAPI