Vakulenko PhD Status Report - 16 February 2016

PhD Report

Svitlana Vakulenko

TU Wien

February 15, 2016

Overview

- Status of thesis
- Relation to other work
- Next steps and ideas

Status of thesis

So far so...

- 2014
  - Topic Modeling
  - Event Extraction
- 2015
  - Target-dependent Sentiment Analysis
  - Information Diffusion
- 2016
  - Breaking News Detection
  - ...

Topic Modeling

[Vakulenko et al., 2014, Herbst et al., 2014, Reuter et al., 2014]

- @ University of Liechtenstein
- Method: Latent Dirichlet Allocation (LDA) [Blei et al., 2003]
- Datasets: iTunes, case studies, sustainability reports

Topic Modeling: Results [Vakulenko et al., 2014] (1)

Figure: Correspondence chart showing the overlap of LDA topics and iTunes categories

(1) https://ai.wu.ac.at/~vakulenko/

Event Extraction

[Katsios et al., 2015]

- Summer School @ NCSR Demokritos
- Project: REVEAL EU-FP7 2013-2016
- Method: Relation Extraction (ClausIE)
- Datasets: FACup, SNOW, World Cup (tweets)

Event Extraction: Results [Katsios et al., 2015]

Figure: Relations extracted from FACup dataset

Target-dependent Sentiment Analysis

- @ MODUL University Vienna
- Method: POS tagging, dependency parsing, ML classifier (LogisticRegression)
- Datasets: MPQA (news articles), JDPA (product reviews)

Target-dependent Sentiment Analysis: Results

Information Diffusion


- Project: PHEME EU-FP7 2014-2017
- Method: Relation Extraction
- Dataset: news articles, tweets

Information Diffusion: Results

Figure: s: president barack obama – p: state D – o:

Breaking News Detection


- Project: InVID EU-Horizon 2016-2019
- WP: Social Media Mining
- Task: Emergent Topic Detection
- Dataset: tweets

Status of thesis

Diagram: Topics, Events, Breaking News, Sentiment Analysis, Information Diffusion

Relation to other work

- State of the Art
- Requirements
  - Newsworthiness
  - Scalability
- Methodology
  - Data acquisition
  - Topic modeling
  - Event extraction
  - First story detection

State of the Art

SNOW 2014 Data Challenge confirmed newsworthy topic detection to be a challenging task [Papadopoulos et al., 2014] (2): F-score: 0.4, Precision: 0.56, Recall: 0.36 [Ifrim et al., 2014].

The limitations of the current state-of-the-art approaches include

- early topic detection
- topic relevance
- topic representation
- performance evaluation of the topic detection methods.

The most recent results reported in the related work [Martin et al., 2015]

(2) [Van Canneyt et al., 2014, Martin and Goker, 2014, Burnside et al., 2014, Petkos et al., 2014]

Requirements: Newsworthiness

- a set of topics for a given time slot ‘covered in mainstream news sites’ [Papadopoulos et al., 2014]
- ‘the combination of novelty and significance’ [Martin et al., 2015]

One common method to find novel (emerging or recently trending) topics in a data stream is to look for bursts in the frequency of keywords and phrases (n-grams) [Martin et al., 2015, Martin and Goker, 2014, Fujiki et al., 2004, Cataldi et al., 2010, Aiello et al., 2013].
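The burst idea can be illustrated with a minimal sketch. It is not the BNgram method or any other cited approach: the function names, the add-one smoothing, and the `min_count` and `ratio` thresholds are assumptions chosen for illustration only. An n-gram counts as bursty when its count in the current time slot is several times larger than its smoothed mean count in earlier slots.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(messages, max_n=2):
    """Count n-grams (n = 1..max_n) over the messages of one time slot."""
    counts = Counter()
    for text in messages:
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts


def bursty_ngrams(history, current, min_count=2, ratio=3.0):
    """Return (n-gram, score) pairs, highest score first, for n-grams whose
    current count is at least `ratio` times the smoothed past mean.

    `history` is a list of earlier time slots, each a list of messages;
    `current` is the list of messages in the newest slot.
    Both thresholds are illustrative assumptions.
    """
    past = [count_ngrams(slot) for slot in history]
    now = count_ngrams(current)
    bursts = {}
    for gram, count in now.items():
        if count < min_count:
            continue  # ignore n-grams that are rare even in the current slot
        mean_past = sum(c[gram] for c in past) / max(len(past), 1)
        score = count / (mean_past + 1.0)  # add-one smoothing for unseen n-grams
        if score >= ratio:
            bursts[gram] = score
    return sorted(bursts.items(), key=lambda item: -item[1])


history = [
    ["rain in vienna today", "coffee time"],
    ["more rain today", "coffee again"],
]
current = [
    "earthquake hits city",
    "earthquake reported downtown",
    "strong earthquake now",
    "coffee time",
]
print(bursty_ngrams(history, current))
```

In this toy stream only the unigram `earthquake` passes both thresholds, while `coffee` is filtered out because it also appears in the earlier slots.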

Requirements: Scalability

- an important requirement when dealing with data streams of high volume and velocity, e.g. Twitter
- BNgram approach [Martin and Goker, 2014]: 2 minutes per topic model for a 15-minute dataset of tweets

Methodology: Data acquisition

- Twitter is the major source of news stream data [Hu et al., 2012].
- Only a few studies focus on data sources other than the Twitter stream, e.g. Wikipedia [Osborne et al., 2012, Steiner et al., 2013].
- New: integration of other social media APIs and cross-media retrieval, e.g.: tweets → topics (events) → (YouTube) → videos

Methodology: Topic modeling

Topic detection approaches often involve

- topic clustering
- topic ranking
- topic labeling

[Petkos et al., 2014, Martin and Goker, 2014, Van Canneyt et al., 2014, Martin et al., 2015, Ifrim et al., 2014, Elbagoury et al., 2015].
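The three steps can be sketched as a toy pipeline. This is an illustrative assumption rather than any of the cited systems: clustering is a greedy single pass over messages using Jaccard similarity against each cluster's first message, ranking is by cluster size, and labels are the most frequent terms. The stopword list, the threshold, and all function names are hypothetical.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "in", "on", "of", "to", "is", "and", "for", "at"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {t for t in text.lower().split() if t not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(messages, threshold=0.3):
    """Topic clustering: attach each message to the first cluster whose seed
    (its first message) is similar enough, otherwise open a new cluster."""
    clusters = []
    for text in messages:
        tokens = tokenize(text)
        for c in clusters:
            if jaccard(tokens, c["seed"]) >= threshold:
                c["members"].append(text)
                c["terms"].update(tokens)
                break
        else:
            clusters.append(
                {"seed": tokens, "members": [text], "terms": Counter(tokens)}
            )
    return clusters


def rank(clusters):
    """Topic ranking: larger clusters first."""
    return sorted(clusters, key=lambda c: len(c["members"]), reverse=True)


def label(c, k=3):
    """Topic labeling: the k most frequent terms of the cluster."""
    return [term for term, _ in c["terms"].most_common(k)]


messages = [
    "earthquake hits city center",
    "earthquake hits city",
    "strong earthquake in city center",
    "new phone released today",
    "phone released",
]
for topic in rank(cluster(messages)):
    print(len(topic["members"]), label(topic, k=2))
```

Comparing against a fixed seed keeps the sketch short; a centroid or a hierarchical scheme would be a natural refinement.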

Methodology: Event extraction

News is often centered around specific events (happenings), which provide a natural way to group news stories [Wu et al., 2015].

There exist several online services that mine events from news articles in different languages:

- European Media Monitor (3) [Pouliquen et al., 2008];
- GDELT project (4) [Leetaru and Schrodt, 2013];
- Event Registry (5) [Leban et al., 2014, Rupnik et al., 2015].

A few approaches to extracting open-domain events from tweets have been proposed [Popescu et al., 2011, Ritter et al., 2012, Katsios et al., 2015], but none of them supports cross-lingual linking.

(3) http://emm.newsbrief.eu
(4) http://www.gdeltproject.org/
(5) http://eventregistry.org

Methodology: First story detection

The task of first story detection (FSD) was proposed to identify the first story about a certain event from a document stream [Petrovic et al., 2012]. The state-of-the-art FSD approaches use similarity metrics over documents, such as TF-IDF vectors or Locality Sensitive Hashing (LSH) [Petrovic et al., 2012, Phuvipadawat and Murata, 2010], to determine if candidate documents are close to existing documents or could constitute a new event.
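The nearest-neighbour decision described above can be sketched in a few lines. This is a minimal illustration, not the cited systems: it compares each incoming document with every earlier one (the cited work uses LSH to avoid this exhaustive search), and the smoothed IDF formula, the `threshold` value, and the function names are assumptions.

```python
import math
from collections import Counter


def tfidf(counts, df, n_docs):
    """Weight raw term counts by a smoothed inverse document frequency."""
    return {
        t: tf * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
        for t, tf in counts.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def first_stories(stream, threshold=0.5):
    """Return the documents whose best similarity to any earlier document
    stays below `threshold`, i.e. candidate first stories."""
    seen = []        # term counts of all documents processed so far
    df = Counter()   # document frequency of each term
    firsts = []
    for doc in stream:
        counts = Counter(doc.lower().split())
        df.update(counts.keys())
        n_docs = len(seen) + 1
        vec = tfidf(counts, df, n_docs)
        best = max(
            (cosine(vec, tfidf(old, df, n_docs)) for old in seen),
            default=0.0,
        )
        if best < threshold:
            firsts.append(doc)
        seen.append(counts)
    return firsts


stream = [
    "earthquake hits city",
    "strong earthquake hits city center",
    "new phone released today",
    "phone released today",
]
print(first_stories(stream))
```

On this toy stream the first and third messages are flagged as first stories; the other two are close enough to an earlier message to count as follow-ups.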

Next steps and ideas

- Project: InVID EU-Horizon 2016-2019
- WP: Social Media Mining
- Deadline: June 2016 (deliverable)
- Agenda:
  - Data acquisition
  - Breaking news detection
- Evaluation framework: Twitter Trends, [Ifrim et al., 2014], [Martin et al., 2015]
- Methodology: topic modeling, event extraction, (semantic and cross-lingual) ontology-based integration (e.g. BabelNet)
- Progress: social media APIs integration proposal

Bibliography

Aiello, L., Petkos, G., Martin, C., Corney, D., Papadopoulos, S., Skraba, R., Goker, A., Kompatsiaris, I., and Jaimes, A. (2013). Sensing Trending Topics in Twitter. IEEE Transactions on Multimedia, 15(6):1268–1282.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. The Journal of Machine Learning Research, 3:993–1022.

Burnside, G., Milioris, D., and Jacquet, P. (2014). One Day in Twitter: Topic Detection Via Joint Complexity.

Cataldi, M., Di Caro, L., and Schifanella, C. (2010). Emerging Topic Detection on Twitter Based on Temporal and Social Terms Evaluation. MDMKDD, pages 4:1–4:10.

Elbagoury, A., Ibrahim, R., Farahat, A., Kamel, M., and Karray, F. (2015). Exemplar-Based Topic Detection in Twitter Streams. In International AAAI Conference on Web and Social Media.

Fujiki, T., Nanno, T., Suzuki, Y., and Okumura, M. (2004). Identification of bursts in a document stream. In International Workshop on Knowledge Discovery in Data Streams, pages 55–64.

Herbst, A., Simons, A., Brocke, J. v., Müller, O., Debortoli, S., and Vakulenko, S. (2014). Identifying and Characterizing Topics in Enterprise Content Management: a Latent Semantic Analysis of Vendor Case Studies. In 22nd European Conference on Information Systems, ECIS.

Hu, M., Liu, S., Wei, F., Wu, Y., Stasko, J., and Ma, K.-L. (2012). Breaking news on twitter. In Conference on Human Factors in Computing Systems, pages 2751–2754.

Ifrim, G., Shi, B., and Brigadir, I. (2014). Event detection in twitter using aggressive filtering and hierarchical tweet clustering. In SNOW-DC@WWW, pages 33–40.

Katsios, G., Vakulenko, S., Krithara, A., and Paliouras, G. (2015). Towards open domain event extraction from twitter: Revealing entity relations. In DeRiVE@ESWC, pages 35–46.

Leban, G., Fortuna, B., Brank, J., and Grobelnik, M. (2014). Cross-lingual detection of world events from news articles. In Proceedings of the ISWC, pages 21–24.

Leetaru, K. and Schrodt, P. A. (2013). GDELT: Global data on events, location, and tone, 1979–2012. In ISA Annual Convention, volume 2, page 4.

Martin, C., Corney, D., and Goker, A. (2015). Mining Newsworthy Topics from Social Media. In Advances in Social Media Analysis, pages 21–43.

Martin, C. and Goker, A. (2014). Real-time topic detection with bursty n-grams: RGU’s submission to the 2014 SNOW challenge. In SNOW-DC@WWW.

Osborne, M., Petrovic, S., McCreadie, R., Macdonald, C., and Ounis, I. (2012). Bieber no more: First story detection using Twitter and Wikipedia. In TAIA.

Papadopoulos, S., Corney, D., and Aiello, L. M. (2014). SNOW 2014 data challenge: Assessing the performance of news topic detection methods in social media. In SNOW-DC@WWW, pages 1–8.

Petkos, G., Papadopoulos, S., and Kompatsiaris, Y. (2014). Two-level message clustering for topic detection in twitter. In SNOW-DC@WWW, pages 49–56.

Petrovic, S., Osborne, M., and Lavrenko, V. (2012). Using paraphrases for improving first story detection in news and Twitter. In Conference of the North American Chapter of the Association for Computational Linguistics, pages 338–346.

Phuvipadawat, S. and Murata, T. (2010). Breaking News Detection and Tracking in Twitter. In International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), volume 3, pages 120–123.

Popescu, A.-M., Pennacchiotti, M., and Paranjpe, D. (2011). Extracting events and event descriptions from twitter. In WWW, pages 105–106.

Pouliquen, B., Steinberger, R., and Deguernel, O. (2008). Story tracking: linking similar news over time and across languages. In Proceedings of the workshop on Multi-source Multilingual Information Extraction and Summarization, pages 49–56.

Reuter, N., Vakulenko, S., Brocke, J. v., Debortoli, S., and Müller, O. (2014). Identifying the Role of Information Systems in Achieving Energy-Related Environmental Sustainability using Text Mining. In 22nd European Conference on Information Systems, ECIS.

Ritter, A., Etzioni, O., Clark, S., and others (2012). Open domain event extraction from twitter. In SIGKDD, pages 1104–1112.

Rupnik, J., Muhic, A., Leban, G., Skraba, P., Fortuna, B., and Grobelnik, M. (2015). News Across Languages - Cross-Lingual Document Similarity and Event Tracking. arXiv preprint arXiv:1512.07046.

Steiner, T., van Hooland, S., and Summers, E. (2013). MJ No More: Using Concurrent Wikipedia Edit Spikes with Social Network Plausibility Checks for Breaking News Detection. In WWW, pages 791–794.

Vakulenko, S., Müller, O., and Brocke, J. v. (2014). Enriching iTunes App Store Categories via Topic Modeling. In Proceedings of the International Conference on Information Systems ICIS.

Van Canneyt, S., Feys, M., Schockaert, S., Demeester, T., Develder, C., and Dhoedt, B. (2014). Detecting newsworthy topics in Twitter. In SNOW-DC@WWW, pages 1–8.

Wu, Z., Chen, L., and Giles, C. L. (2015). Storybase: Towards Building a Knowledge Base for News Events. In ACL, pages 133–138.
