News construction from microblogging posts using open data

Francisco Berrizbeitia Universidad Simón Bolívar

Caracas, Venezuela [email protected]

June, 2014

Abstract

Access to information can be limited when traditional media outlets cannot cover events because of geographical limitations or censorship, as happens during civil unrest, war or natural disasters. In these situations citizen journalism replaces or complements traditional media in documenting events. Microblogging services such as Twitter have become very useful in these scenarios due to their mobile nature and multimedia capabilities.

In this research we propose a method to create searchable, semantically annotated news articles from tweets in an automated way using the cloud of linked open data.

Keywords

Semantic web, News, Microblogging, Twitter, Automatic document generation, Data journalism, Citizen journalism.

1 Introduction

Citizen journalism has become a very common practice with the arrival of smartphones and microblogging services such as Twitter. Thanks to the multimedia capabilities of the devices and the mobile nature of the social network, people all over the world are documenting all sorts of events and publishing them on the Web in real time. This type of journalism is particularly important in situations that traditional media cannot cover, such as natural disasters, war and civil unrest, or where coverage is restricted by government or self-imposed censorship. Citizen journalism is protected by Article 19 of the Universal Declaration of Human Rights (United Nations):

“Everyone has the right to freedom of opinion and expression; this right includes freedom to hold opinions without interference and to seek, receive and impart information and ideas through any media and regardless of frontiers.”

This protection has had tremendous implications in the recent past, in situations where the only available information was found on social networks and in international media outlets with very limited coverage. We believe it’s of great importance to develop a technology that allows the creation of “fair” documents from all the contributions made by the users during such events.

The hope is that the automated documents created by this technology will be closer to what really happened and guarantee impartiality.

As a first step in this research we want to construct a news article from a single 140-character message using the open data cloud.

In the rest of the report we first describe the overall approach we took to the problem, then describe the system we developed for this task, and finally look at the results.

2 Related Work

Information extraction from Twitter and other microblogging platforms has been studied before. (David Laniado, 2010) explored the semantic value of hashtags as identifiers for the semantic web. (Shinavier, 2010) proposed creating a real-time semantic web using structured microblogging messages. (Ritter, 2012) applied natural language processing and information extraction techniques to a corpus of tweets to extract machine-readable information.

Sentiment analysis has also been a topic of research, for example in the work of (Alexander Pak, 2010), who propose a machine learning method to classify tweets as positive, negative or neutral.

3 Description

The main objective is to obtain the semantically meaningful concepts expressed in the micropost from the Open Data Cloud and then create a document that extends the original text with the retrieved concepts. If we succeed in this task we will end up with a news article in which the answers to the questions who, what, where, when and why (Wikipedia, 2014) are derived from the micropost and extended with the linked open data cloud.

Figure 1. Overall view of the process

Figure 1 shows the overall process of news creation. Since this is our first approach to the problem, we decided to limit the sources of information to Twitter as the only microblog input and DBpedia as our source of semantically annotated information.

The system was implemented as a web application written in PHP. In the next section we will describe each part of the system.

3.1 Information gathering and text preparation

The first task consists of gathering the information posted by a user of the social network; we collect not only the published text, but also the media when available and information about the author. We obtain all the information using the public API provided by Twitter. As shown in Figure 2, the only input the system needs is the tweet ID.

Figure 2. Input screen of the system

After the text is retrieved it must be “denoised” before any further processing. At this point all stop words are removed, as well as links and Twitter-specific words such as RT or FF. The hashtag character (#) is removed, leaving the word that follows it.
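The denoising step described above can be sketched as follows. This is a minimal illustration, not the PHP code of the system: the stop-word list, the regular expressions and the function name `denoise` are assumptions made for the example, since the report does not say which stop-word list or tokenizer was used.

```python
import re

# Illustrative stop-word list (assumption); the report does not name the list it used.
STOP_WORDS = {"a", "an", "the", "in", "on", "at", "of", "to", "and",
              "is", "are", "for", "by", "with", "during"}
# Twitter-specific tokens mentioned in the text.
TWITTER_TOKENS = {"rt", "ff"}

URL_PATTERN = re.compile(r"https?://\S+")


def denoise(text):
    """Return the remaining tokens of a tweet, in their original order."""
    text = URL_PATTERN.sub(" ", text)    # drop links
    text = text.replace("#", "")         # keep the hashtag word, drop the '#'
    tokens = re.findall(r"\w+", text)    # simple word tokenizer (assumption)
    noise = STOP_WORDS | TWITTER_TOKENS
    return [t for t in tokens if t.lower() not in noise]
```

For example, `denoise("RT Protests in #Brazil during the World Cup http://t.co/abc123 #FF")` keeps only the words Protests, Brazil, World and Cup.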

3.2 Candidate selection

Before querying the DBpedia endpoint we first run a local analysis using a version of the WordNet database. Each word is analyzed and a matrix of the senses of each word is created. Following a set of rules, we create a list of possible 2-word and 1-word candidates that may be relevant concepts, places or persons. By doing this we wanted to reduce the number of queries we need to make to the endpoint.
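A rough sketch of candidate selection is given below. The report does not list the actual rules, so both the rule used here (keep nouns, plus capitalised words unknown to the lexicon) and the small `TOY_LEXICON` dictionary standing in for the WordNet database are hypothetical and serve only to show the shape of the step.

```python
# Stand-in for a WordNet lookup (assumption): word -> parts of speech it can take.
# A real system would query the WordNet database instead.
TOY_LEXICON = {
    "protests": {"noun", "verb"},
    "world": {"noun"},
    "cup": {"noun", "verb"},
    "burning": {"adjective", "verb"},
}


def senses(word):
    """Return the known parts of speech for a word (empty set if unknown)."""
    return TOY_LEXICON.get(word.lower(), set())


def is_candidate(word):
    """Hypothetical rule: keep nouns, and capitalised words missing from the lexicon."""
    pos = senses(word)
    if "noun" in pos:
        return True
    return not pos and word[:1].isupper()


def select_candidates(tokens):
    """Return 2-word candidates first, then 1-word candidates, without duplicates."""
    two_word = [
        a + " " + b
        for a, b in zip(tokens, tokens[1:])
        if is_candidate(a) and is_candidate(b)
    ]
    one_word = [t for t in tokens if is_candidate(t)]
    return list(dict.fromkeys(two_word + one_word))
```

Listing 2-word candidates first means that a phrase such as "World Cup" is looked up before its individual words.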

Since Wikipedia and DBpedia are tightly related, we decided to query Wikipedia first, using its API to obtain the URL of the Wikipedia page of which the candidate is the main topic.
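One way to perform this lookup is through the MediaWiki Action API, sketched below. The report does not say which API module or parameters were used, so the choice of `action=query` with `prop=info` and `inprop=url` is an assumption, as are the function names. The sketch only builds the request URL and parses a response; sending the request (for example with `urllib.request.urlopen`) is left out.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_lookup_url(candidate):
    """Build a query URL asking for the page whose title is the candidate."""
    params = {
        "action": "query",
        "titles": candidate,
        "prop": "info",
        "inprop": "url",     # include the full page URL in the answer
        "redirects": 1,      # follow redirects to the main article
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def extract_page_url(response_text):
    """Return the page URL from a query response, or None if no page exists."""
    pages = json.loads(response_text).get("query", {}).get("pages", {})
    for page in pages.values():
        if "missing" not in page and "fullurl" in page:
            return page["fullurl"]
    return None
```

Candidates for which `extract_page_url` returns `None` have no Wikipedia page and can be dropped before DBpedia is queried.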

At the end of this process we have a list of candidates with known Wikipedia pages.

3.3 Semantically annotated information retrieval from the Open Data Cloud

The next step is to query DBpedia’s SPARQL endpoint to retrieve the semantically annotated information related to the tweet topics detected in the previous step. Once the information is received from the endpoint, it is put together with the author information from Twitter in a Turtle file in order to make it available via a SPARQL endpoint. We used a subset of the rNews Ontology (International Press Telecommunications Council, 2011) shown in Figure 3.
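The retrieval step can be illustrated with the sketch below. It assumes that a candidate's DBpedia resource shares its title with the English Wikipedia page, and it asks only for the label and abstract; the report does not state which properties the system retrieved, so the query and the function names are illustrative only.

```python
from urllib.parse import urlencode, urlparse

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def wikipedia_to_dbpedia(page_url):
    """Map an English Wikipedia page URL to the DBpedia resource with the same title."""
    path = urlparse(page_url).path
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError("not a Wikipedia article URL: " + page_url)
    return "http://dbpedia.org/resource/" + path[len(prefix):]


def build_query(resource_uri, lang="en"):
    """Build a SPARQL query for the label and abstract of one resource."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?label ?abstract WHERE {\n"
        f"  <{resource_uri}> rdfs:label ?label ;\n"
        "      dbo:abstract ?abstract .\n"
        f'  FILTER (lang(?label) = "{lang}" && lang(?abstract) = "{lang}")\n'
        "}"
    )


def build_request_url(query):
    """Encode the query as a GET request, as defined by the SPARQL protocol."""
    return DBPEDIA_ENDPOINT + "?" + urlencode({"query": query})
```

A resource with no DBpedia entry yields an empty result set, which corresponds to the undetected terms discussed in the results.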

Figure 3. Subset of the rNews Ontology used for the project

4 Results

To test the approach and the system we selected 90 tweets directly from the Twitter search on 3 subjects: the Brazilian riots during the 2014 World Cup, Barack Obama and Venezuela. To collect the microposts we ran the search through the API and collected the first 30 messages with an associated picture, repeating the process for each of the selected topics.

After the sample was obtained we manually tagged each tweet. This was done twice, by different people, to minimize human error. We then ran the automated process on each tweet and saved the results for each case. The results can be seen in Figure 4. We expected to find 415 terms across all tweets and found 433. Of those, 317 were an exact match to what was expected from the manual process, 63 returned information that is not wrong but adds no real value, and 53 were wrong concepts. This gives a precision of 76.36%, that is, the share of the expected terms that were detected automatically by the method, and an error rate of 12.24%.
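The percentages can be checked from the counts given above. The sketch below assumes that the first figure divides the exact matches by the 415 expected terms and that the error rate divides the wrong concepts by the 433 retrieved terms; the variable names are ours. Under these assumptions the error rate reproduces the reported 12.24%, while the first ratio comes out at about 76.4%, close to the reported 76.36%. Measured against the expected terms, this ratio is what is usually called recall.

```python
expected = 415                          # terms found by manual tagging
exact, no_value, wrong = 317, 63, 53    # outcome of the automated process
retrieved = exact + no_value + wrong    # all terms returned by the method


def pct(part, whole):
    """Percentage of part in whole, rounded to two decimals."""
    return round(100 * part / whole, 2)


match_rate = pct(exact, expected)        # share of expected terms detected
error_rate = pct(wrong, retrieved)       # share of retrieved terms that were wrong
no_value_rate = pct(no_value, retrieved) # share that was correct but added no value

print(retrieved, match_rate, error_rate, no_value_rate)
```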

Figure 4. Result of the test cases

Analyzing the errors, we noticed that the automatically retrieved concept carried the wrong meaning for the context. For example, in the context of the Brazilian riots, the concept “fire” was defined as in “a burning fire” instead of “to fire a gun”. Similar cases can be found in the other topics that were tested.

The terms that were not detected by the automated method were candidates with known Wikipedia pages that had no corresponding entry in DBpedia.

5 Future work

The results obtained encourage us to further develop the method and to include automated context detection as a way to maximize precision. Possible approaches to this problem are described in (Esther Villar Rodríguez, 2012) and (Nebhi, 2012).

We would also like to further develop the system so that it not only detects, retrieves and saves information from one message, but can also create complete documentation of an event over an extended period of time, based on several microblogging platforms and media outlets, both independent and corporate. The end result we hope to reach is a fully searchable, semantically annotated news stream that will serve as a neutral and centralized endpoint for data journalism.

6 Conclusions

In this research we proposed a method to automatically create a news article from a tweet using the cloud of linked open data. To do so, we implemented a web system that takes a tweet ID as input and generates a semantically annotated news article based on a subset of the rNews Ontology. To test our approach we collected 90 tweets on three subjects: the Brazilian riots during the 2014 World Cup, Barack Obama and Venezuela. The messages were tagged manually and then compared with the automatically found annotations. Our method was able to capture 76.36% of the manually detected terms, with an error rate of 12.24% due mostly to disambiguation problems.

These results encourage us to further develop the method and the system, first to solve the disambiguation problems and then to pursue a more ambitious approach: a semantically annotated news stream based not only on tweets but also on other microblogging services, independent blogs and corporate media outlets, which can serve as a centralized semantic endpoint for data journalism.

7 References

Alexander Pak, P. P. (2010). Twitter as a Corpus for Sentiment Analysis and Opinion Mining. Valletta, Malta: Proceedings of the Seventh International Conference on Language Resources and Evaluation.

David Laniado, P. M. (2010). Making sense of Twitter. Shanghai, China: ISWC 2010.

Esther Villar Rodríguez, A. I. (2012). Using Linked Open Data sources for Entity Disambiguation. Rome: CLEF Initiative.

International Press Telecommunications Council. (2011, 10 7). rNews. Retrieved 6 21, 2014, from IPTC site for developers: http://dev.iptc.org/rNews

Nebhi, K. (2012). Ontology-Based Information Extraction from Twitter. (pp. 17-22). Mumbai: Proceedings of the Workshop on Information Extraction and Entity Analytics on Social Media Data.

Ritter, A. (2012). Extracting Knowledge from Twitter and The Web. Doctorate Thesis. University of Washington.

Shinavier, J. (2010). Realtime #SemanticWeb in <= 140 Characters. WWW2010. Raleigh, North Carolina.

United Nations. (n.d.). United Nations. Retrieved 6 22, 2014, from The Universal Declaration of Human Rights: http://www.un.org/en/documents/udhr/index.shtml

Wikipedia. (2014, 6 11). Five Ws. Retrieved 6 20, 2014, from wikipedia.org: http://en.wikipedia.org/wiki/Five_Ws
