PhD Report Svitlana Vakulenko TU Wien February 15, 2016

Vakulenko PhD Status Report - 16 February 2016

PhD Report

Svitlana Vakulenko

TU Wien

February 15, 2016

Overview

- Status of thesis

- Relation to other work

- Next steps and ideas

Status of thesis

So far so...

- 2014
  - Topic Modeling
  - Event Extraction

- 2015
  - Target-dependent Sentiment Analysis
  - Information Diffusion

- 2016
  - Breaking News Detection
  - ...

Topic Modeling

[Vakulenko et al., 2014, Herbst et al., 2014, Reuter et al., 2014]

- @ University of Liechtenstein

- Method: Latent Dirichlet Allocation (LDA) [Blei et al., 2003]

- Datasets: iTunes, case studies, sustainability reports
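To make the method concrete, here is a minimal sketch of LDA topic inference in pure Python. It uses collapsed Gibbs sampling, a common inference scheme for LDA, rather than the variational inference of [Blei et al., 2003]; the slides do not say which implementation was used for the iTunes, case study, or sustainability report datasets. The function name `lda_gibbs`, the hyperparameter defaults, and the toy documents are illustrative assumptions.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists.

    Returns (theta, topic_word, vocab): per-document topic distributions,
    per-topic word counts, and the sorted vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens of doc d in topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens of word v in topic k
    topic_total = [0] * n_topics                           # tokens in topic k
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word, vocab


# Toy corpus (made up for illustration, not one of the datasets above).
docs = [
    ["app", "game", "play"],
    ["game", "play", "fun"],
    ["report", "energy", "carbon"],
    ["energy", "carbon", "report"],
]
theta, topic_word, vocab = lda_gibbs(docs, n_topics=2, n_iter=50)
```

Each row of `theta` is a topic distribution for one document, which is the kind of output that can then be compared against existing categories.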

Topic Modeling: Results [Vakulenko et al., 2014] [1]

Figure: Correspondence chart showing the overlap of LDA topics and iTunes categories

[1] https://ai.wu.ac.at/~vakulenko/

Event Extraction

[Katsios et al., 2015]

- Summer School @ NCSR Demokritos

- Project: REVEAL EU-FP7 2013-2016

- Method: Relation Extraction (ClausIE)

- Datasets: FACup, SNOW, World Cup (tweets)

Event Extraction: Results [Katsios et al., 2015]

Figure: Relations extracted from the FACup dataset

Target-dependent Sentiment Analysis

- @ MODUL University Vienna

- Method: POS tagging, dependency parsing, ML classifier (LogisticRegression)

- Datasets: MPQA (news articles), JDPA (product reviews)

Target-dependent Sentiment Analysis: Results

Information Diffusion

- @ MODUL University Vienna

- Project: PHEME EU-FP7 2014-2017

- Method: Relation Extraction

- Dataset: news articles, tweets

Information Diffusion: Results

Figure: s: president barack obama – p: state D – o:

Breaking News Detection

- @ MODUL University Vienna

- Project: InVID EU-Horizon 2016-2019

- WP: Social Media Mining

- Task: Emergent Topic Detection

- Dataset: tweets

Status of thesis

Diagram: Topics, Events, Breaking News, Sentiment Analysis, Information Diffusion

Relation to other work

- State of the Art

- Requirements
  - Newsworthiness
  - Scalability

- Methodology
  - Data acquisition
  - Topic modeling
  - Event extraction
  - First story detection

State of the Art

The SNOW 2014 Data Challenge confirmed newsworthy topic detection to be a challenging task [Papadopoulos et al., 2014] [2]:

F-score: 0.4, Precision: 0.56, Recall: 0.36 [Ifrim et al., 2014]

The limitations of the current state-of-the-art approaches include

- early topic detection

- topic relevance

- topic representation

- performance evaluation of the topic detection methods.

The most recent results reported in the related work [Martin et al., 2015]

[2] [Van Canneyt et al., 2014, Martin and Goker, 2014, Burnside et al., 2014, Petkos et al., 2014]
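For reference, the F-score on this slide combines precision and recall. The sketch below assumes the standard F-beta definition with beta = 1 (the harmonic mean); the slide does not state how the challenge aggregated its scores. Under that assumption the quoted precision and recall give roughly 0.44, so the reported 0.4 may be rounded or averaged differently, for example per time slot. That last point is a guess, not something the slides confirm.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as quoted on the slide for [Ifrim et al., 2014].
print(round(f_score(0.56, 0.36), 2))
```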

Requirements: Newsworthiness

- a set of topics for a given time slot 'covered in mainstream news sites' [Papadopoulos et al., 2014]

- 'the combination of novelty and significance' [Martin et al., 2015]

One common method to find novel (emerging or recent trending) topics from a data stream is looking for bursts in frequent occurrences of keywords and phrases (n-grams) [Martin et al., 2015, Martin and Goker, 2014, Fujiki et al., 2004, Cataldi et al., 2010, Aiello et al., 2013].
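The burst idea can be sketched as follows: count n-grams per time slot and rank those whose frequency in the current slot is high relative to earlier slots. This is a simplified illustration, not the scoring used in any of the cited papers; the function names, the add-one smoothing, the `min_count` cutoff, and the toy messages are all assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def slot_counts(messages, max_n=2):
    """Count in how many messages of one time slot each n-gram occurs."""
    counts = Counter()
    for text in messages:
        tokens = text.lower().split()
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        counts.update(seen)
    return counts


def bursty_ngrams(current, history, min_count=2, top_k=5):
    """Rank n-grams of the current slot by frequency relative to past slots.

    score = (count in current slot + 1) / (mean count over past slots + 1)
    """
    now = slot_counts(current)
    past = [slot_counts(slot) for slot in history]
    scores = {}
    for gram, count in now.items():
        if count < min_count:
            continue  # ignore n-grams too rare to be called a burst
        mean_past = sum(p[gram] for p in past) / len(past) if past else 0.0
        scores[gram] = (count + 1) / (mean_past + 1)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Made-up messages: two earlier time slots and the current one.
history = [
    ["sunny day in vienna", "coffee in vienna"],
    ["vienna traffic is slow", "good coffee today"],
]
current = ["earthquake hits city", "strong earthquake reported", "earthquake in vienna"]
print(bursty_ngrams(current, history))
```

A real system would also need tokenization suited to tweets and stop-word handling, which this sketch leaves out.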

Requirements: Scalability

- an important requirement when dealing with high-volume, high-velocity data streams, e.g. Twitter

- BNgram approach [Martin and Goker, 2014]: 2 minutes per topic model for a 15-minute dataset of tweets

Methodology: Data acquisition

- Twitter is the major source of news stream data [Hu et al., 2012].

- Only a few studies focus on data sources other than the Twitter stream, e.g. Wikipedia [Osborne et al., 2012, Steiner et al., 2013].

- New: integration of other social media APIs and cross-media retrieval, e.g.: tweets → topics (events) → (YouTube) → videos

Methodology: Topic modeling

Topic detection approaches often involve

- topic clustering

- topic ranking

- topic labeling

[Petkos et al., 2014, Martin and Goker, 2014, Van Canneyt et al., 2014, Martin et al., 2015, Ifrim et al., 2014, Elbagoury et al., 2015].

Methodology: Event extraction

News is often centered around specific events (happenings), which provide a natural way to group news stories [Wu et al., 2015].

Several online services mine events from news articles in different languages:

- European Media Monitor [3] [Pouliquen et al., 2008];

- GDELT project [4] [Leetaru and Schrodt, 2013];

- Event Registry [5] [Leban et al., 2014, Rupnik et al., 2015].

A few approaches to extracting open-domain events from tweets have been proposed [Popescu et al., 2011, Ritter et al., 2012, Katsios et al., 2015], but none of them supports cross-lingual linking.

[3] http://emm.newsbrief.eu
[4] http://www.gdeltproject.org/
[5] http://eventregistry.org

Methodology: First story detection

The task of first story detection (FSD) was proposed to identify the first story about a certain event from a document stream [Petrovic et al., 2012]. State-of-the-art FSD approaches use similarity metrics over documents, such as TF-IDF vectors or Locality Sensitive Hashing (LSH) [Petrovic et al., 2012, Phuvipadawat and Murata, 2010], to determine whether candidate documents are close to existing documents or could constitute a new event.
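The TF-IDF variant of this idea can be sketched as follows: a document is flagged as a first story when its highest cosine similarity to any earlier document stays below a threshold. This is an illustrative sketch only. It compares each document against all earlier ones, which is exactly the cost that LSH-based approaches avoid, and the function names, the smoothed IDF formula, the threshold of 0.3, and the example stream are assumptions rather than details from the cited papers.

```python
import math
from collections import Counter


def tfidf(tokens, doc_freq, n_docs):
    """TF-IDF vector (as a dict) with a smoothed IDF that is always positive."""
    tf = Counter(tokens)
    return {
        t: c * (math.log((n_docs + 1) / (doc_freq[t] + 1)) + 1.0)
        for t, c in tf.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def first_stories(stream, threshold=0.3):
    """Return (index, text) for documents not similar to any earlier document."""
    doc_freq = Counter()
    seen = []  # token lists of the documents processed so far
    result = []
    for i, text in enumerate(stream):
        tokens = text.lower().split()
        doc_freq.update(set(tokens))
        n_docs = i + 1
        vec = tfidf(tokens, doc_freq, n_docs)
        # Exhaustive nearest-neighbour search over all earlier documents.
        best = max(
            (cosine(vec, tfidf(old, doc_freq, n_docs)) for old in seen),
            default=0.0,
        )
        if best < threshold:
            result.append((i, text))
        seen.append(tokens)
    return result


# Made-up document stream for illustration.
stream = [
    "earthquake hits northern italy",
    "strong earthquake hits italy",
    "new phone released today",
    "earthquake hits italy again",
]
print(first_stories(stream))
```

In this toy stream the first earthquake message and the unrelated phone message are flagged, while the two later earthquake messages are treated as follow-ups.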

Next steps and ideas

- Project: InVID EU-Horizon 2016-2019

- WP: Social Media Mining

- Deadline: June 2016 (deliverable)

- Agenda:
  - Data acquisition
  - Breaking news detection

- Evaluation framework: Twitter Trends, [Ifrim et al., 2014], [Martin et al., 2015]

- Methodology: topic modeling, event extraction, (semantic and cross-lingual) ontology-based integration (e.g. BabelNet)

- Progress: social media APIs integration proposal

Bibliography I

Aiello, L., Petkos, G., Martin, C., Corney, D., Papadopoulos, S., Skraba, R., Goker, A., Kompatsiaris, I., and Jaimes, A. (2013). Sensing Trending Topics in Twitter. IEEE Transactions on Multimedia, 15(6):1268–1282.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. The Journal of Machine Learning Research, 3:993–1022.

Burnside, G., Milioris, D., and Jacquet, P. (2014). One Day in Twitter: Topic Detection Via Joint Complexity.

Cataldi, M., Di Caro, L., and Schifanella, C. (2010). Emerging Topic Detection on Twitter Based on Temporal and Social Terms Evaluation. MDMKDD, pages 4:1–4:10.

Bibliography II

Elbagoury, A., Ibrahim, R., Farahat, A., Kamel, M., and Karray, F. (2015). Exemplar-Based Topic Detection in Twitter Streams. In International AAAI Conference on Web and Social Media.

Fujiki, T., Nanno, T., Suzuki, Y., and Okumura, M. (2004). Identification of bursts in a document stream. In International Workshop on Knowledge Discovery in Data Streams, pages 55–64.

Herbst, A., Simons, A., Brocke, J. v., Müller, O., Debortoli, S., and Vakulenko, S. (2014). Identifying and Characterizing Topics in Enterprise Content Management: a Latent Semantic Analysis of Vendor Case Studies. In 22nd European Conference on Information Systems, ECIS.

Bibliography III

Hu, M., Liu, S., Wei, F., Wu, Y., Stasko, J., and Ma, K.-L. (2012). Breaking news on twitter. In Conference on Human Factors in Computing Systems, pages 2751–2754.

Ifrim, G., Shi, B., and Brigadir, I. (2014). Event detection in twitter using aggressive filtering and hierarchical tweet clustering. In SNOW-DC@ WWW, pages 33–40.

Katsios, G., Vakulenko, S., Krithara, A., and Paliouras, G. (2015). Towards open domain event extraction from twitter: Revealing entity relations. In DeRiVE@ ESWC, pages 35–46.

Bibliography IV

Leban, G., Fortuna, B., Brank, J., and Grobelnik, M. (2014). Cross-lingual detection of world events from news articles. In Proceedings of the ISWC, pages 21–24.

Leetaru, K. and Schrodt, P. A. (2013). Gdelt: Global data on events, location, and tone, 1979–2012. In ISA Annual Convention, volume 2, page 4.

Martin, C., Corney, D., and Goker, A. (2015). Mining Newsworthy Topics from Social Media. In Advances in Social Media Analysis, pages 21–43.

Martin, C. and Goker, A. (2014). Real-time topic detection with bursty n-grams: RGU’s submission to the 2014 SNOW challenge. In SNOW-DC@ WWW.

Bibliography V

Osborne, M., Petrovic, S., McCreadie, R., Macdonald, C., and Ounis, I. (2012). Bieber no more: First story detection using Twitter and Wikipedia. In TAIA.

Papadopoulos, S., Corney, D., and Aiello, L. M. (2014). Snow 2014 data challenge: Assessing the performance of news topic detection methods in social media. In SNOW-DC@ WWW, pages 1–8.

Petkos, G., Papadopoulos, S., and Kompatsiaris, Y. (2014). Two-level message clustering for topic detection in twitter. In SNOW-DC@ WWW, pages 49–56.

Bibliography VI

Petrovic, S., Osborne, M., and Lavrenko, V. (2012). Using paraphrases for improving first story detection in news and Twitter. In Conference of the North American Chapter of the Association for Computational Linguistics, pages 338–346.

Phuvipadawat, S. and Murata, T. (2010). Breaking News Detection and Tracking in Twitter. In International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), volume 3, pages 120–123.

Popescu, A.-M., Pennacchiotti, M., and Paranjpe, D. (2011). Extracting events and event descriptions from twitter. In WWW, pages 105–106.

Bibliography VII

Pouliquen, B., Steinberger, R., and Deguernel, O. (2008). Story tracking: linking similar news over time and across languages. In Proceedings of the workshop on Multi-source Multilingual Information Extraction and Summarization, pages 49–56.

Reuter, N., Vakulenko, S., Brocke, J. v., Debortoli, S., and Müller, O. (2014). Identifying the Role of Information Systems in Achieving Energy-Related Environmental Sustainability using Text Mining. In 22nd European Conference on Information Systems, ECIS.

Ritter, A., Etzioni, O., Clark, S., and others (2012). Open domain event extraction from twitter. In SIGKDD, pages 1104–1112.

Bibliography VIII

Rupnik, J., Muhic, A., Leban, G., Skraba, P., Fortuna, B., and Grobelnik, M. (2015). News Across Languages - Cross-Lingual Document Similarity and Event Tracking. arXiv preprint arXiv:1512.07046.

Steiner, T., van Hooland, S., and Summers, E. (2013). MJ No More: Using Concurrent Wikipedia Edit Spikes with Social Network Plausibility Checks for Breaking News Detection. In WWW, pages 791–794.

Vakulenko, S., Müller, O., and Brocke, J. v. (2014). Enriching iTunes App Store Categories via Topic Modeling. In Proceedings of the International Conference on Information Systems, ICIS.

Bibliography IX

Van Canneyt, S., Feys, M., Schockaert, S., Demeester, T., Develder, C., and Dhoedt, B. (2014). Detecting newsworthy topics in Twitter. In SNOW-DC@ WWW, pages 1–8.

Wu, Z., Chen, L., and Giles, C. L. (2015). Storybase: Towards Building a Knowledge Base for News Events. In ACL, pages 133–138.