
A Research Plan including the Proposed Approaches and Results of Preliminary Experiments

Project Title

Sentiment Analysis and Opinion Mining of the Arabic Web (Digital Content)

Selected ITAC Program

Advanced Research Project (ARP)

Academic and ICT Industry Partners

Organization Name | Contact Name | Role
American University in Cairo | Ahmed Rafea | Professor at CSE Department
LINK Development | Hanan Abdel Meguid | Chief Executive Officer

AUC Research Team

Name | Contact Details | Role
Prof. Ahmed Rafea | [email protected] | Principal Investigator
Nada Ayman | [email protected] | Researcher A
Islam Elnabarawy | [email protected] | Researcher A
May Shalaby | [email protected] | Researcher A
Amira Shoukry | [email protected] | Researcher A

Link Development Team

Name | Contact Details | Role
Amira Thabet | [email protected] | Researcher A, Team Leader
Ashraf Hamed | [email protected] | Researcher and Developer
Mohamed El Sherif | [email protected] | Researcher and Developer

Introduction 

The goal of the project, as described in the project document, is to develop a prototype that can "feel" the pulse of Arabic users with regard to a certain hot topic. This involves:

- Extracting the most popular Arabic entities from online Arabic content
- Extracting user comments related to those entities
- Using the extracted popular entities to build semantically structured concepts
- Building relations between different concepts
- Analyzing concepts to get a sense of the most dominant sentiment, using online user feedback, and thus identifying the general opinion about the topic

In order to achieve this goal, the following milestones are proposed:

- Technical Report on Approaches for Information Extraction and Sentiment Mining
- A Research Plan including the Proposed Approaches and Results of Preliminary Experiments
- Initial SATA Prototype Requirements Specification and Conceptual Design
- Initial SATA Prototype Implementation
- Prototype Validation Technical Report with Benchmarking Results
- Full documentation of project and final prototype

This report describes the activities conducted in order to deliver the second milestone of the project, which is mainly a research plan that will be implemented in parallel with the SATA prototype development plan. In order to achieve this deliverable, tasks were assigned to members of the team to:

- identify the most promising technical approaches for topic detection and extraction, named entity recognition (NER), sentiment analysis and opinion mining, and detecting influential bloggers and opinion leaders;
- decide on the tools and language resources that will be used to process the Arabic language;
- develop a tool to assist the team in building annotated data from Twitter related to the 25 January Revolution.

The report is divided into six sections. The second section describes the identified technical approaches in the related research areas, namely: topic detection and extraction, named entity recognition (NER), sentiment analysis and opinion mining, and detecting influential bloggers and opinion leaders. The third section explains the tools and resources identified for processing the Arabic language. The fourth section describes the tool developed to collect and annotate data from Twitter. The fifth section introduces the guidelines developed for annotating the data. The last section concludes this technical report with a research plan for the rest of the project.

Identified technical approaches 

This part describes the preliminary research efforts in the four research topics of the project, namely: topic detection and extraction, named entity recognition, sentiment analysis and opinion mining, and detecting influential bloggers and opinion leaders.

Topic detection and extraction  

In light of the literature reviewed and discussed in the previous report, we started to collect data and run preliminary experiments to determine the strengths and weaknesses of the different approaches. The task of topic detection and extraction consists of several phases; the most important, which can be considered the core of the task, is clustering. The first subsection describes the experiment and its results, the second discusses the results, and the third concludes this part.

Experiment conducted on clustering Arabic Tweets 

There are many clustering algorithms for textual data; one of the most widely used is bisecting k-means. This section includes the experiment objective, the description of the data, the selected features, the methods and tools, and the obtained results.

Objective of the experiment 

The objective of the experiment is to investigate to what extent the repeated-bisection clustering method can group tweets on the same topic of sentiment together.

Data: 

We collected 110 tweets over a span of 4 days. The tweets were manually annotated so that the topic of sentiment of each is known beforehand. We have 12 topics of sentiment, all around the impact of the Jan 25th revolution in Egypt; the topics are the hottest events that happened during that period of time. We kept only topics containing more than two tweets, to ensure they are relevant.

Features: 

The feature used in this experiment is TF-IDF (term frequency-inverse document frequency). During preprocessing, each word is given an id and the number of times it appears in each object is counted. Each object is represented as a vector, with every word represented by two integers: one is the word id, and the other is the number of occurrences of that word in the object (tweet). Each term frequency is then multiplied by the word's IDF, derived from the ratio of the total number of objects to the number of objects containing the word (conventionally, the logarithm of this ratio). A word gets a high TF-IDF score when it occurs often in a given object but rarely across the whole collection.
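The weighting described above can be sketched as follows; this is a minimal illustration over hypothetical token lists, using the conventional logarithmic IDF:

```python
import math
from collections import Counter

def tfidf_vectors(tweets):
    """Turn tokenized tweets into sparse TF-IDF vectors.

    `tweets` is a list of token lists; the result is one
    {word: weight} dict per tweet."""
    n_docs = len(tweets)
    # document frequency: in how many tweets each word appears
    df = Counter()
    for tokens in tweets:
        df.update(set(tokens))
    vectors = []
    for tokens in tweets:
        tf = Counter(tokens)  # occurrences of each word in this tweet
        vectors.append({
            # idf = log(N / df): words rare across the collection score higher
            word: count * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return vectors

# three toy "tweets", already tokenized
docs = [["قانون", "الطوارئ"], ["قانون", "التحرير"], ["مبارك"]]
vecs = tfidf_vectors(docs)
```

A word such as قانون, which appears in two of the three tweets, receives a lower weight than a word unique to a single tweet.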

Method and tools: 

For data preprocessing we need to:

- tokenize words
- remove stop words
- stem words
- calculate term weights
- represent the items to be clustered

In this experiment, a Perl script suggested by the CLUTO documentation, called "doc2mat", is used to perform the five preprocessing steps. The problem was that the script did not work for Arabic letters, so we had to modify it to accept them. We developed another small script to remove the stop words only; it was easier that way than merging both into one script. We skipped the stemming of words for the time being. Balode & Tank (2009) implemented a tool for Twitter analysis using the same approach and tool we are using, but on English words only.
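The stop-word removal step can be sketched as below; the stop list here contains only a few sample entries for illustration (the actual lists are given in Appendix I):

```python
import re

# A few sample stop words for illustration; the real lists are in Appendix I.
STOP_WORDS = {"في", "من", "على", "عن", "ان"}

def remove_stop_words(line):
    """Strip delimiters and special characters, then drop stop words."""
    # keep runs of word characters plus # and @ (hashtags, mentions)
    tokens = re.findall(r"[\w#@]+", line)
    return " ".join(t for t in tokens if t not in STOP_WORDS)

cleaned = remove_stop_words("الشعب في الميدان ضد قانون الطوارئ!")
```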

The clustering method we are using is called repeated bisection. In this method, the desired k-way clustering solution is computed by performing a sequence of k − 1 repeated bisections. The matrix is first clustered into two groups; then one of these groups is selected and bisected further. This process continues until the desired number of clusters is found. During each step, the cluster is bisected so that the resulting 2-way clustering solution optimizes a particular clustering criterion function.
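The repeated-bisection procedure can be illustrated with a simplified sketch. CLUTO selects the cluster and split that optimize a criterion function; for brevity this sketch always bisects the largest cluster and seeds 2-means deterministically from the two mutually farthest points. Both are simplifying assumptions, not CLUTO's actual strategy:

```python
def squared_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def two_means(points, iters=20):
    """Plain 2-means, seeded with the two mutually farthest points."""
    centers = list(max(
        ((p, q) for p in points for q in points),
        key=lambda pq: squared_dist(*pq),
    ))
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            nearer = 0 if squared_dist(p, centers[0]) <= squared_dist(p, centers[1]) else 1
            groups[nearer].append(p)
        # move each center to the mean of its group
        centers = [
            tuple(sum(xs) / len(xs) for xs in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return [g for g in groups if g]

def repeated_bisection(points, k):
    """k-way clustering by k-1 bisections; here we always split the
    largest cluster (CLUTO instead optimizes a criterion function)."""
    clusters = [list(points)]
    while len(clusters) < k:
        biggest = max(clusters, key=len)
        clusters.remove(biggest)
        clusters.extend(two_means(biggest))
    return clusters

# three well-separated pairs of 2-D points stand in for tweet vectors
pts = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
clusters = repeated_bisection(pts, 3)
```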

So our experiment goes as follows:

- Start from the extracted tweets in a text file: the 110 collected tweets, each including the tweet text and hashtags. We removed the date, time, and author for simplicity for the time being.

- Run the stop-word removal script to remove the stop words, delimiters, and special characters, producing a new file. We used two stop-word lists: the first was obtained by simply translating the English list suggested by the tool (see Appendix I); the second by manually reviewing the results and the tweets and removing the words we considered irrelevant (see Appendix I).

- Run the doc2mat script over the new file to do the rest of the processing, without stemming the words. The result is a .mat file, the matrix format accepted by CLUTO. The matrix has n × m dimensions, where n is the number of rows, one per tweet (object to be clustered), and m is the number of columns. Another file produced, the "clabel" file, lists the unique words, which helps in determining the discriminating and descriptive features the clusters are based upon.

- Use the CLUTO "vcluster" application to cluster the data. It takes the .mat file to work on, the clabel file when using the option of showing the features, and the desired number of clusters. We also supplied the "rclass" file, which contains the predetermined class of each tweet and is mainly used to calculate the entropy and purity of the clustering; we obtained the predetermined classes through the annotation process using the "Ewzenha Tool". The result is two files: a clustering file with n rows, each row holding the id of the cluster the corresponding object (tweet) belongs to, and an output file containing the cluster results.

We are using CLUTO, a tool developed by the Computer Science Department at the University of Minnesota; we used the latest stable version, CLUTO-2.1.1. CLUTO is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the resulting clusters. The tool supports several clustering methods.
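The sparse .mat format that doc2mat produces and vcluster consumes is simple enough to generate directly: per the CLUTO manual, the first line holds the number of rows, columns, and non-zeros, and each subsequent line lists 1-based (column, value) pairs for one object. A minimal writer, with a hypothetical file name:

```python
def write_cluto_mat(rows, n_cols, path):
    """Write a sparse CLUTO .mat file.

    `rows` is one {column_index: value} dict per object, with 1-based
    column indices as the CLUTO manual specifies."""
    nnz = sum(len(r) for r in rows)
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{len(rows)} {n_cols} {nnz}\n")
        for r in rows:
            pairs = " ".join(f"{col} {val}" for col, val in sorted(r.items()))
            f.write(pairs + "\n")

# two tweets over a 3-word vocabulary: the first contains word 1 twice
# and word 3 once, the second contains word 2 once
write_cluto_mat([{1: 2, 3: 1}, {2: 1}], 3, "tweets.mat")
```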

Results: 

We performed 6 experiments, 3 with each stop-word list; the three experiments differ in the number of clusters. Since we have 12 predetermined labels for our clusters, we ran 6-, 12-, and 20-way clustering to compare results. Tables 1 to 6 and Figures 1 to 6 show the results of the six experiments. Each table contains the cluster id, the size of the cluster (i.e. the number of objects/tweets in it), the Internal Similarity (ISim), the External Similarity (ESim), and the Entropy and Purity measures of each cluster. On the right-hand side of each table are the 12 predetermined labels and the number of tweets belonging to each label in each cluster.
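The entropy and purity columns can be reproduced from the per-label counts shown on the right of each table. Following the CLUTO manual, a cluster's entropy is normalized by the log of the number of classes:

```python
import math

def entropy_purity(label_counts, n_classes):
    """Entropy (normalized by log of the class count, as CLUTO does)
    and purity of a single cluster, from its per-label counts."""
    n = sum(label_counts)
    probs = [c / n for c in label_counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs) / math.log(n_classes)
    purity = max(label_counts) / n
    return entropy, purity

# cluster 0 of Table-2: 6 tweets split 5/1 over two of the 12 topics;
# this reproduces the reported Entropy 0.181 and Purity 0.833
e, p = entropy_purity([0, 0, 0, 0, 0, 1, 0, 5, 0, 0, 0, 0], 12)
```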

Table-1: Results of applying clustering with k= 6 and removing the first stop-word list

cid Size ISim ESim Entropy Purity, followed by counts over the 12 topic labels (محاكمات عسكرية، مبارك، السفارة الاسرائيلية، التليفزيون، الثورة، المجلس العسكري، قانون الطوارئ، التحرير، شهادة طنطاوي، انتخابات، الاعلام، الاخوان)

0 12 0.113 0.005 0.454 0.5 4 0 0 0 0 1 1 6 0 0 0 0

1 14 0.102 0.006 0.4 0.714 0 0 0 0 0 10 0 0 1 1 1 1

2 15 0.093 0.006 0.585 0.4 1 0 5 0 6 1 0 1 1 0 0 0

3 15 0.085 0.006 0.676 0.4 1 1 6 1 1 0 0 0 2 0 3 0

4 26 0.055 0.005 0.783 0.269 1 4 5 1 4 1 2 0 0 7 0 1

5 28 0.047 0.004 0.785 0.286 1 2 8 1 6 3 0 0 0 1 3 3

Figure 1: Results of applying clustering with k= 6 and removing the first stop-word list


Table-2: Results of applying clustering with k= 12 and removing the first stop-word list

cid Size ISim ESim Entropy Purity, followed by counts over the 12 topic labels (محاكمات عسكرية، مبارك، السفارة الاسرائيلية، التليفزيون، الثورة، المجلس العسكري، قانون الطوارئ، التحرير، شهادة طنطاوي، انتخابات، الاعلام، الاخوان)

0 6 0.249 0.005 0.181 0.833 0 0 0 0 0 1 0 5 0 0 0 0

1 6 0.187 0.005 0.349 0.667 4 0 0 0 0 0 1 1 0 0 0 0

2 7 0.177 0.005 0.514 0.429 0 0 2 0 3 0 0 1 1 0 0 0

3 7 0.166 0.005 0.464 0.571 1 0 4 1 1 0 0 0 0 0 0 0

4 7 0.159 0.004 0.544 0.286 0 0 0 0 1 2 0 0 0 0 2 2

5 8 0.162 0.009 0.505 0.375 1 0 3 0 3 1 0 0 0 0 0 0

6 8 0.153 0.007 0.532 0.375 0 1 2 0 0 0 0 0 2 0 3 0

7 8 0.146 0.004 0.432 0.625 0 1 0 0 5 0 0 0 0 1 1 0

8 13 0.104 0.005 0.844 0.231 1 2 3 1 1 1 2 0 0 1 0 1

9 14 0.102 0.006 0.4 0.714 0 0 0 0 0 10 0 0 1 1 1 1

10 13 0.099 0.005 0.517 0.615 1 1 8 1 0 1 0 0 0 0 0 1

11 13 0.098 0.005 0.512 0.462 0 2 2 0 3 0 0 0 0 6 0 0

Figure 2: Results of applying clustering with k= 12 and removing the first stop-word list


Table-3: Results of applying clustering with k= 20 and removing the first stop-word list.

cid Size ISim ESim Entropy Purity, followed by counts over the 12 topic labels (محاكمات عسكرية، مبارك، السفارة الاسرائيلية، التليفزيون، الثورة، المجلس العسكري، قانون الطوارئ، التحرير، شهادة طنطاوي، انتخابات، الاعلام، الاخوان)

0 3 0.374 0.004 0.256 0.667 0 1 0 0 0 0 0 0 2 0 0 0

1 3 0.358 0.004 0.256 0.667 0 2 0 0 1 0 0 0 0 0 0 0

2 3 0.346 0.005 0.442 0.333 1 0 1 0 1 0 0 0 0 0 0 0

3 4 0.305 0.004 0.226 0.75 0 0 3 1 0 0 0 0 0 0 0 0

4 4 0.307 0.007 0.418 0.5 0 0 1 0 2 0 0 0 0 1 0 0

5 4 0.278 0.004 0 1 0 0 0 0 4 0 0 0 0 0 0 0

6 4 0.276 0.005 0.558 0.25 0 1 0 0 1 0 0 0 0 1 1 0

7 6 0.249 0.005 0.181 0.833 0 0 0 0 0 1 0 5 0 0 0 0

8 5 0.239 0.009 0.271 0.6 0 0 2 0 0 0 0 0 0 0 3 0

9 6 0.207 0.005 0.181 0.833 0 0 1 0 0 0 0 0 0 5 0 0

10 6 0.205 0.007 0.181 0.833 0 0 5 0 0 1 0 0 0 0 0 0

11 6 0.199 0.006 0.628 0.333 1 2 0 0 0 1 1 0 0 0 0 1

12 6 0.187 0.005 0.349 0.667 4 0 0 0 0 0 1 1 0 0 0 0

13 7 0.184 0.006 0.594 0.429 0 0 3 1 1 0 1 0 0 1 0 0

14 6 0.185 0.007 0.349 0.667 0 0 0 0 0 4 0 0 0 1 0 1

15 7 0.177 0.005 0.514 0.429 0 0 2 0 3 0 0 1 1 0 0 0

16 8 0.174 0.007 0.296 0.75 0 0 0 0 0 6 0 0 1 0 1 0

17 7 0.169 0.004 0.594 0.429 1 1 3 1 0 0 0 0 0 0 0 1

18 7 0.159 0.004 0.544 0.286 0 0 0 0 1 2 0 0 0 0 2 2

19 8 0.162 0.009 0.505 0.375 1 0 3 0 3 1 0 0 0 0 0 0

Figure 3: Results of applying clustering with k= 20 and removing the first stop-word list


Table-4: Results of applying clustering with k= 6 and removing the second stop-word list.

cid Size ISim ESim Entropy Purity, followed by counts over the 12 topic labels (محاكمات عسكرية، مبارك، السفارة الاسرائيلية، التليفزيون، الثورة، المجلس العسكري، قانون الطوارئ، التحرير، شهادة طنطاوي، انتخابات، الاعلام، الاخوان)

0 12 0.128 0.004 0.357 0.583 0 0 4 0 0 1 0 7 0 0 0 0

1 13 0.091 0.002 0.673 0.308 2 3 4 1 0 1 0 0 2 0 0 0

2 14 0.087 0.002 0.494 0.643 1 0 0 1 9 0 0 0 0 1 1 1

3 16 0.073 0.003 0.693 0.375 2 0 6 0 1 1 0 0 1 0 2 3

4 28 0.055 0.004 0.708 0.393 2 1 6 0 1 11 1 0 1 4 1 0

5 27 0.051 0.003 0.861 0.222 1 3 4 1 6 2 2 0 0 4 3 1

Figure 4: Results of applying clustering with k= 6 and removing the second stop-word list.

Table-5: Results of applying clustering with k= 12 and removing the second stop-word list.

cid Size ISim ESim Entropy Purity, followed by counts over the 12 topic labels (محاكمات عسكرية، مبارك، السفارة الاسرائيلية، التليفزيون، الثورة، المجلس العسكري، قانون الطوارئ، التحرير، شهادة طنطاوي، انتخابات، الاعلام، الاخوان)

0 6 0.192 0.002 0.628 0.333 0 0 0 1 2 0 0 0 0 1 1 1

1 7 0.179 0.006 0.624 0.286 1 2 0 0 1 1 0 0 0 2 0 0

2 7 0.174 0.006 0.514 0.429 1 0 2 0 0 0 1 0 0 3 0 0

3 7 0.164 0.004 0.594 0.429 1 1 3 0 1 1 0 0 0 0 0 0

4 8 0.158 0.004 0.697 0.25 0 0 2 1 0 1 2 0 0 1 0 1

5 8 0.149 0.003 0.152 0.875 1 0 0 0 7 0 0 0 0 0 0 0

6 8 0.144 0.004 0.362 0.625 2 0 5 0 0 0 0 0 0 0 0 1

7 8 0.139 0.002 0.697 0.25 0 0 1 0 1 1 0 0 1 0 2 2

8 12 0.128 0.004 0.357 0.583 0 0 4 0 0 1 0 7 0 0 0 0

9 14 0.108 0.004 0.4 0.714 0 0 1 0 0 10 0 0 1 1 1 0

10 12 0.009 0.003 0.573 0.417 0 1 2 0 5 0 0 0 0 1 3 0

11 13 0.091 0.002 0.673 0.308 2 3 4 1 0 1 0 0 2 0 0 0


Figure 5: Results of applying clustering with k= 12 and removing the second stop-word list.

Table 6: Results of applying clustering with k= 20 and removing the second stop-word list.

cid Size ISim ESim Entropy Purity, followed by counts over the 12 topic labels (محاكمات عسكرية، مبارك، السفارة الاسرائيلية، التليفزيون، الثورة، المجلس العسكري، قانون الطوارئ، التحرير، شهادة طنطاوي، انتخابات، الاعلام، الاخوان)

0 4 0.308 0.006 0.418 0.5 0 0 2 1 0 0 0 0 0 1 0 0

1 4 0.284 0.003 0.418 0.5 0 0 1 0 0 1 0 0 0 0 0 2

2 4 0.284 0.004 0.418 0.5 1 0 2 0 0 0 0 0 0 0 0 1

3 4 0.283 0.004 0 1 0 0 0 0 4 0 0 0 0 0 0 0

4 4 0.278 0.004 0.418 0.5 0 0 0 0 0 1 2 0 0 0 0 1

5 4 0.274 0.003 0.226 0.75 1 0 0 0 3 0 0 0 0 0 0 0

6 4 0.264 0.001 0.418 0.5 0 0 0 0 1 0 0 0 1 0 2 0

7 4 0.26 0.005 0.226 0.75 1 0 3 0 0 0 0 0 0 0 0 0

8 5 0.237 0.005 0.201 0.8 0 0 4 0 0 0 0 1 0 0 0 0

9 5 0.227 0.003 0.382 0.6 0 1 0 0 0 0 0 0 0 1 3 0

10 7 0.224 0.005 0.165 0.875 0 0 0 0 0 1 0 6 0 0 0 0

11 6 0.201 0.007 0.5 0.5 0 0 1 0 0 3 0 0 0 1 1 0

12 6 0.192 0.002 0.628 0.333 0 0 0 1 2 0 0 0 0 1 1 1

13 6 0.186 0.002 0.535 0.333 0 2 1 0 0 1 0 0 2 0 0 0

14 7 0.179 0.006 0.624 0.286 1 2 0 0 1 1 0 0 0 2 0 0

15 7 0.174 0.003 0.241 0.714 0 0 2 0 5 0 0 0 0 0 0 0

16 7 0.174 0.006 0.514 0.429 1 0 2 0 0 0 1 0 0 3 0 0

17 8 0.172 0.006 0.152 0.875 0 0 0 0 0 7 0 0 1 0 0 0

18 7 0.166 0.003 0.514 0.429 2 1 3 1 0 0 0 0 0 0 0 0

19 7 0.164 0.004 0.594 0.429 1 1 3 0 1 1 0 0 0 0 0 0


Figure 6: Results of applying clustering with k= 20 and removing the second stop-word list.

Table-7 Comparison between all results

Result No. | K | Stop-Word List | Entropy | Purity | ISim | ESim

1 6 List-1 0.657 0.391 0.0825 0.005333

1-1 6 List-2 0.674 0.391 0.080833 0.003

2 12 List-1 0.505 0.509 0.150167 0.005417

2-1 12 List-2 0.515 0.473 0.13625 0.003667

3 20 List-1 0.385 0.573 0.237 0.0056

3-1 20 List-2 0.39 0.573 0.22655 0.0041

Figure 7: Comparison between all results


The following table shows a sample of the clustering with k=12 after removing the first stop-word list (clusters 0 and 9). Each row contains the cluster id and an original tweet belonging to that cluster.

Table-8: Sample of the clusters with the highest purity measures from Table-2

Discussion:  

Analyzing the above results revealed the following:

- The entropy decreases as purity increases, which is expected.
- On average, ISim is directly proportional to purity, as can be noticed in Figure 7. This means ISim is a good measure of cluster quality, since it can be used in real applications where no labels are available for the clusters. However, in some of the experiments the purity increased without a corresponding increase in ISim.
- Adding or removing words from the stop-word list slightly changes the clustering results. In fact, using the second list, which includes more stop words, decreased the quality of the clustering.
- Increasing the number of clusters increased the average clustering quality, but it can also produce more than one cluster containing tweets related to the same topic.
- From Table 8 it is clear that in cluster 0 the sentiment topic of 5 out of the 6 tweets is "قانون الطوارئ", which is meaningful and matches the human annotation. Still, other clusters contain objects that are not strongly related to each other.
- ESim is the external similarity between the objects; its values in this experiment do not vary much, due to the small number of objects.

0  قانون الطوارئ ينتهي ٣٠ سبتمبر والمجلس عليه أن يقوم بإستفتاء شعبي لمد القانون، وذلك طبقا للإعلان الدستوري الذي أصدره المجلس

0  لا لقانون الطوارئ، ويكفي القانون العادي -- ولماذا لا يطبق قانون الطوارئ على مبارك الذي سيفلت من المحاكمة أصلا لعدم وجود أدلة جيدة أصلا

0  بس بصراحة.. يسقط قانون الطوارئ

0  خطة المجلس بقصد تعميم الفوضى فى الشارع حتى يطبق قانون الطوارئ تذكرنى تماما بما فعله العادلى عند فتح السجون #EGYPT #noscaf #sep9

0  يمنع قانون الطوارئ التجمع لأكثر من 7 اشخاص الا بتصريح من الحكومه .. هذا هو قانون الطوارئ .. #Egypt #Tahrir #Noscaf #Scaf

0  RT @WaelElebrashy: دكتور محمد سليم العوا: أرفض العمل بقانون الطوارئ.. وأطالب بالاعتذار لقناة الجزيرة http://t.co/0rUYH6z

Conclusion: 

Topic detection and extraction is an important phase in this project, as it is the phase that gives information about the collected data and what they are talking about. Using an unsupervised technique like clustering gives us freedom in collecting data without the need for training data. However, during the experimentation phase we need annotated data to compare the results of different clustering algorithms and different features and find the best setting.

We intend to use a stemmer for Arabic words so that nouns and verbs sharing the same root can be matched. This also matters because Arabic has different forms of the same verb for masculine and feminine and for each tense, which makes it very difficult to list all of them in a stop-word list.
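A very rough illustration of the kind of light stemming we have in mind, stripping one common prefix and one common suffix; the affix lists are illustrative only, and production light stemmers handle many more cases:

```python
# Illustrative affix lists only; a real Arabic light stemmer uses
# longer, carefully ordered lists.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ية", "ة"]

def light_stem(word):
    """Strip at most one prefix and one suffix, keeping a stem of
    at least three letters."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

# "the elections", "the council", "election" reduce toward shared stems
stems = [light_stem(w) for w in ["الانتخابات", "المجلس", "انتخاب"]]
```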

Annotating tweets is not an easy task. We are working on manually determining the topic of sentiment of each tweet so we can assess the clustering results.

Finding the best clustering method completes the first stage of our goal and paves the way to extracting the topic of each cluster. The next stage, labeling the generated clusters, will be achieved using NLP techniques.

In addition, the experiments have made us familiar with applying one of the well-known clustering methods to Arabic tweets and with preprocessing Arabic text.

References 

Balode, Amit; Tank, Chintan. Twitter Analytics – User Recommendation System. Project at Indiana University, February 2009.

CLUTO manual, http://www.msi.umn.edu/software/cluto/, 2003.

Named entity recognition (NER) 

Two different approaches were identified for named entity recognition. The first, the rule-based approach, depends on the syntactic and semantic structure of sentences. An example of this approach is available through the General Architecture for Text Engineering (GATE) tool, which includes support for the Arabic language. The other approach is statistical machine learning, which trains a classifier using a machine learning algorithm and a set of annotated data. One instance of this approach, using the Conditional Random Fields (CRF) machine learning algorithm, was implemented by Stanford University in their NER tool. Although the algorithm itself is language-independent, parts of the preprocessing done within the tool are language-dependent and needed to be modified to support Arabic.

The rule‐based NER approach experiment: 

Preliminary experiments for the first approach, rule-based NER, were done using the GATE tool, which is publicly available under an open source license. Since this approach is based on grammar rules and gazetteer lists, no prior annotation or training was required.

Objective 

The objective of this experiment is to examine the capability of the GATE tool to recognize Arabic named entities, and to investigate the possibility of extending it.

Data Description 

The GATE experiment was done using data from the Twitter social platform. A set of 100 Arabic tweets was extracted from Twitter, and the named entities in these tweets were manually annotated into three classes: Person, Location, and Organization. The following table shows the total number of words in these tweets and the number of words belonging to each named entity class.

Table 1: Annotated tweets named entity statistics

Tweets | Words | Person | Location | Organization | Total Named Entities
100 | 1827 | 87 | 72 | 127 | 286

Methods and tools 

The GATE experiment was conducted using version 6.1 of the GATE framework. The following modules are pre-packaged as part of the tool and were used in the experiment:

Arabic Tokenizer

Arabic Gazetteer

Arabic Inferred Gazetteer

Arabic Main Grammar

Arabic OrthoMatcher

ANNIE NE Transducer

The experiment was conducted using the following steps:

Configure GATE to use the correct modules

Create a new dataset in GATE and import the tweets into it

Run the GATE NER on the tweets dataset

For each of the named entity classes, count the following metrics manually:

o The number of entities that were correctly recognized

o The number of entities that were wrongly recognized as a different class

o The number of entities that were not recognized at all

Compute the accuracy for each named entity class

Analyze the obtained results

Discussion 

The named entity recognition module in GATE relies on grammar rules and gazetteer lists to find named entities in documents. However, this approach resulted in somewhat low accuracy in this experiment, which may be due to the unstructured nature of tweets and to the use of the local dialect and of new names that are not present in the gazetteer lists. Since GATE is open source, the implementation details, grammar rules, and gazetteer lists are available in the tool's source code. These components can be modified to improve accuracy within the target domain, but given that the rule-based approach is significantly more complex to implement, this might be a rather expensive goal to achieve in practice.
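The per-class accuracy from the manual counting steps can be computed as follows. The report does not state its exact formula, so we assume correct / (correct + wrongly classified + not recognized), and the counts below are hypothetical:

```python
def class_accuracy(correct, wrong_class, missed):
    """One plausible reading of the manual counts: the fraction of
    annotated entities of a class that were recognized correctly."""
    total = correct + wrong_class + missed
    return correct / total if total else 0.0

# hypothetical counts for the Person class (87 annotated entities)
acc = class_accuracy(correct=60, wrong_class=12, missed=15)
```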

Examples of the output from this tool when run on data extracted from Twitter are shown below.

Figure 7: An example of words classified as a person in the GATE tool

Figure 8: An example of detected named entities and the type of each entity

The machine learning NER approach experiment: 

The second approach was prototyped using the Stanford NER tool. This tool is part of a larger framework for natural language processing published by Stanford University under an open source license. The Stanford NER tool uses a CRF classification algorithm, which is a statistical machine learning algorithm, to perform the task of named entity recognition.

Objective 

The objective of this experiment is to examine the capabilities of the Stanford NER tool, and determine the potential benefit of using the statistical machine learning approach as compared to the rule-based NER approach.

Data Description 

This experiment uses the same set of 100 annotated Arabic tweets that were used in the previous GATE experiment. Since this is a supervised learning approach, the dataset was divided into 75 training instances and 25 evaluation instances. As with the GATE experiment, the named entities were annotated with one of three classes: Person, Location, and Organization.

Methods and Tools 

The Stanford NER tool experiment was conducted using version 1.1.1 of the tool. This tool does not support the processing of Arabic text out of the box, so it had to be modified to process Arabic correctly. More specifically, the tokenizer contained in the tool was tailored to the English language, so another tokenizer, more suitable for Arabic, was implemented, and its output replaced that of the English tokenizer in the remaining parts of the algorithm. In addition, since the tool operates through a command-line interface, some scripts were written in Python and as Windows batch files to facilitate pre-processing the data and running the experiment.
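The report does not detail the replacement tokenizer; a minimal sketch of an Arabic-aware tokenizer, keeping runs of Arabic letters, Latin letters, digits, hashtags, and mentions, might look like this (the character ranges are an assumption and omit some rarer Arabic code points):

```python
import re

# Core Arabic letters occupy U+0621–U+064A (plus Arabic-Indic digits
# at U+0660–U+0669); everything else except Latin/digits/#/@ is dropped.
TOKEN_RE = re.compile(r"[\u0621-\u064A\u0660-\u0669A-Za-z0-9_#@]+")

def tokenize_arabic(text):
    return TOKEN_RE.findall(text)

tokens = tokenize_arabic("يمنع قانون الطوارئ التجمع، #Tahrir!")
```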

The experiment was conducted using the following steps:

Pre-process the data to the format used by the tool

Create a configuration file for the tool to specify the various configuration parameters

Run the tool in training mode using the first 75 annotated tweets as input, thus creating a classifier file to use in later classification

Run the tool in recognition mode, using the created classifier file and the remaining 25 annotated tweets as input

For each of the named entity classes, count the following metrics manually:

o The number of entities that were correctly recognized

o The number of entities that were wrongly recognized as a different class

o The number of entities that were not recognized at all

Compute the accuracy for each named entity class

Analyze the obtained results

Discussion 

The Stanford NER tool takes the machine learning approach to named entity recognition. The algorithm used for training and classification is conditional random fields (CRF), a statistical sequence-labeling model closely related to the hidden Markov model. This approach is widely recognized for achieving high classification accuracy within a single domain, which was evident to a certain extent in the results of this experiment. However, the dataset used here is relatively small, and the algorithm can be expected to perform much better once it is trained on a sufficiently large dataset.

Conclusion: 

The named entity recognition task is not an objective in itself, but rather an intermediate step whose output is provided as additional input to other phases of the tool. It therefore needs a moderately accurate approach, with an amount of effort proportional to the degree to which this output contributes to the final product. Based on the research and experiments conducted to select a suitable approach for this task, it is evident that the statistical machine learning approach can be expected to achieve better end results without requiring too much effort to adapt to the target domain. In particular, the CRF NER classifier published by Stanford University is a good example of such an algorithm, which can be used in its entirety, adapted into our final product with some modifications, or used as an example if a different implementation is needed.

Sentiment analysis and opinion mining  

Sentiment analysis or opinion mining has been currently considered to be one of the most emerging research fields caused by the great opinionated web contents coming from blogs and social network websites. There are mainly two approaches for sentiment classification: machine learning (ML) and semantic orientation (SO). The ML approach is typically a supervised approach in which a set of data labelled with its class such as “positive” or “negative” are represented by feature vectors. Then, these vectors are used by the classifier as a training data inferring that a combination of specific features yields a specific class (Morsy, 2011) employing one of the supervised categorization algorithm. Examples of categorization algorithms are Support Vector Machine (SVM), Naïve Bayesian Classifier, Maximum Entropy, etc… On the other hand, the SO approach is an unsupervised approach in which a sentiment lexicon is created with each word having its semantic intensity as a number indicating its class. Then, this lexicon is used to extract all sentiment words from the sentence and sum up their polarities to determine if the sentence has an overall positive or negative sentiment in addition its intensity whether they hold strong or weak intensity (Morsy, 2011). The SO approach is domain-independent, since one lexicon is built for all domains. The approach we have chosen for sentiment classification is the ML approach because we do not have a lexicon for Arabic sentiment word. This approach is based on selecting a set of features to build feature vectors and train a classifier. We have chosen to start with English language because of the sentiment benchmarks availability which will help us to evaluate the set of features selected and then use these features as a guideline for building a classifier for Arabic language using Arabic tweets from twitter. The first section presents the work done on English sentiment data set. 
The second section describes the experiment conducted on Arabic tweets. The third section concludes this part.
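To make the contrast between the two approaches concrete, the semantic-orientation (SO) side described above can be sketched in a few lines: look up each word in a polarity lexicon and sum the scores. The lexicon below is a tiny hypothetical example for illustration only, not a real resource.

```python
# Minimal sketch of the SO (lexicon-based) approach: sum the polarities of
# known sentiment words to classify a sentence. LEXICON is hypothetical.

LEXICON = {"good": 1.0, "amazing": 2.0, "bad": -1.0, "terrible": -2.0}

def so_classify(sentence: str) -> str:
    score = sum(LEXICON.get(w, 0.0) for w in sentence.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(so_classify("an amazing and good film"))  # positive
print(so_classify("a terrible plot"))           # negative
```

The magnitude of the summed score also gives the intensity mentioned above, which a real system could threshold into "strong" and "weak".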

Overview of the Previous Work Done in Sentiment Analysis for English 

Sentiment analysis for the English language has interested several researchers, who have proposed different sets of features for either document-level or sentence-level sentiment, such as n-grams and sentiment features. However, these traditional sentiment feature extraction methods do not take the linguistic context into account, such as negation and intensification. Thus, in the first part of our project, a study was performed to improve the performance of sentiment classification at the document level. The study proposed a document-level sentiment classifier "using the machine learning approach by proposing new feature sets that refine the traditional sentiment feature extraction method and take contextual valence shifters into consideration from a different perspective than the earlier research" (Morsy, 2011). The results of several experiments employing these new feature sets showed a significant improvement in the classifier's performance in terms of accuracy, precision, and recall, with an overall accuracy increase of 7%, from 78% to almost 85%, indicating that the new proposed feature sets are effective in document-level sentiment classification. The classifier used was SVM.

Features Used in English 

A major part of ML approaches to text processing is converting the text into a feature vector or other representation. This conversion focuses on the more important and salient features present in the text. In this section, we describe the types of features that were used in the proposed method for English.

Term Presence vs. Frequency

In information retrieval (IR), a piece of text is usually represented by a feature vector whose entries correspond to the individual terms in the text. In standard IR, term frequencies are used extensively, as reflected in the popularity of TF-IDF weighting. However, better results were obtained using term presence (Pang et al., 2002) rather than term frequencies. Term presence yields binary-valued feature vectors whose entries simply indicate whether a term appears (value 1) or not (value 0). The latter approach proved more effective for review polarity classification than real-valued feature vectors (Liu, 2010). This finding reflects the fact that, while some topics are likely to be highlighted by repeated occurrences of certain terms, the overall sentiment may not be emphasized by frequent use of the same terms. "On a related note, hapax legomena, or words that appear a single time in a given corpus, have been found to be high-precision indicators of subjectivity" (Wiebe et al., 2004). We used this feature in our baseline experiment: for all the words present in a tweet we calculated their frequencies, thus giving higher weights to frequent words.
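The distinction above between the two representations can be sketched as follows; this is an illustrative snippet, not the project's actual feature extraction code.

```python
from collections import Counter

def frequency_vector(tokens):
    """Term-frequency representation: word -> count in the text."""
    return dict(Counter(tokens))

def presence_vector(tokens):
    """Term-presence representation: word -> 1 if it appears at all."""
    return {w: 1 for w in set(tokens)}

tokens = "great plot great cast".split()
freq = frequency_vector(tokens)
pres = presence_vector(tokens)
print(freq)  # {'great': 2, 'plot': 1, 'cast': 1}
print(pres)
```

Note how the repeated word "great" gets weight 2 in the frequency vector but, like every other word, only weight 1 in the presence vector.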

N-grams

N-grams, contiguous sequences of n words from the text, are among the features most frequently employed in text classification. There has been much discussion of the appropriate size of the grams to use. Unigrams (single words, like "film") were found to perform better than bigrams (two consecutive words, like "movie star") when categorizing movie reviews by sentiment polarity, whereas bigrams and trigrams (three consecutive words, like "science fiction film") improved product review polarity classification (Liu, 2010). We used this feature in our baseline experiment by extracting all the unigrams in the corpus with their corresponding frequencies.
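Extracting n-grams of any order is a one-liner; the following illustrative sketch shows the unigrams and bigrams of the trigram example used above.

```python
def ngrams(tokens, n):
    """Contiguous n-word sequences: n=1 unigrams, n=2 bigrams, n=3 trigrams."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "science fiction film".split()
uni = ngrams(tokens, 1)
bi = ngrams(tokens, 2)
print(uni)  # [('science',), ('fiction',), ('film',)]
print(bi)   # [('science', 'fiction'), ('fiction', 'film')]
```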

Parts of Speech tag

Part-of-speech tagging is recognized as a simple form of word sense disambiguation (Pang & Lee, 2008). For example, adjectives have been treated as a special kind of feature by several researchers. Adjectives have been used for sentiment analysis, as well as a guide for feature selection in sentiment classification, since some methods rely on the presence or polarity of adjectives to determine the polarity or subjectivity of a piece of text, particularly in unsupervised machine learning. Other parts of speech, like nouns and verbs, also contribute to extracting sentiments or opinions. A study by Pang et al. (2002) on polarity classification of movie reviews employing only adjectives as features concluded that adjectives "perform much worse than using the same number of most frequent unigrams" (Pang & Lee, 2008). Consequently, researchers concluded that nouns (e.g., "gem") and verbs (e.g., "love") can also be strong sentiment indicators. Comparisons have been performed on the effectiveness of adjectives, verbs, and adverbs in sentiment analysis, sometimes with further sub-categorization. This feature will not be used in our first experiment on Arabic: we work on the Egyptian dialect, and most POS taggers work on Modern Standard Arabic, so using this feature would require adapting a POS tagger to the Egyptian dialect.

Opinion words and phrases

Some words are used to express positive or negative sentiments; these are called opinion words. Examples of positive opinion words are "wonderful", "beautiful", "amazing", and "good"; examples of negative opinion words are "poor", "bad", and "terrible". Many opinion words are adjectives or adverbs; however, nouns like "rubbish", "junk", and "crap" and verbs like "hate" and "like" can also reveal opinions (Liu, 2010). There are also phrases and idioms which, like individual words, can express opinions. An example is the idiom "cost someone an arm and a leg", which usually reflects a negative sentiment. This is why many researchers believe that opinion words and phrases play a major role in sentiment analysis. We intend to use this feature in our project, but it requires a comprehensive set of opinion words, which is still being prepared.

Dependency Relations

Dependency relations within feature sets have also received considerable attention from researchers. This kind of linguistic analysis is especially applicable to short textual units. For example, a subtree-based boosting algorithm using word-dependency features (analogous to higher-order n-grams) yields better results than the bag-of-words baseline. Parsing, which identifies the syntactic structure of the text, can also be used to represent valence shifters such as negation, intensifiers, and diminishers (Kennedy & Inkpen, 2006). We will investigate this feature for our project because it is very important, although the unavailability of such a parser for our dialect may prevent us from doing so.

Negation

In sentiment analysis, dealing with negation words is very important, as their presence usually alters the orientation of the opinion. For example, the sentences "I like this camera" and "I don't like this camera" are considered very similar by the most frequently used similarity measures, since the only differing word is the negation term, yet that term puts the two sentences into opposite classes. Negation can be handled in two ways: directly and indirectly. In the direct way, negation words are encoded directly into the initial feature definitions. In the indirect way, a second-order feature of a text unit is used: the feature vector for the initial representation is built essentially ignoring the negation words, and is then changed into a different, negation-aware representation (Pang & Lee, 2008). In an attempt to represent negation words more precisely, certain part-of-speech tag patterns are searched for, and then the complete phrase is tagged as a negation phrase (Na et al., 2004). This approach has been applied to a dataset of electronics reviews, and this negation modeling yielded an improvement of about 3% in accuracy. Further improvements can possibly be reached by deeper (syntactic) analysis of the sentence (Liu, 2010). Moreover, negations are sometimes expressed in subtle ways, such as irony or sarcasm, which are often very difficult to detect. For example, in the sentence "[it] avoids all clichés and predictability found in Hollywood movies", the word "avoids" acts as an unexpected polarity reverser. Negation words must also be handled carefully because not all occurrences of such words express negation; for example, "not" in "not only … but also" does not change the orientation. This will be among the important features used in our project, since negation can greatly shift the meaning of a sentence; however, it is not used in this experiment.
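A common instance of the "direct" encoding described above (used by Pang et al., 2002) is to prefix every token after a negation word with NOT_ until the next punctuation mark, so that "like" and "NOT_like" become distinct features. The sketch below illustrates this; the word lists are small hypothetical examples, and note that it deliberately does not handle the "not only … but also" caveat mentioned above.

```python
# Sketch of direct negation handling: after a negation word, prefix
# subsequent tokens with NOT_ until a punctuation mark is reached.

NEGATIONS = {"not", "don't", "no", "never"}   # illustrative, not exhaustive
PUNCT = {".", ",", "!", "?", ";"}

def mark_negation(tokens):
    out, negated = [], False
    for tok in tokens:
        if tok.lower() in NEGATIONS:
            negated = True
            out.append(tok)
        elif tok in PUNCT:
            negated = False          # punctuation ends the negation scope
            out.append(tok)
        else:
            out.append("NOT_" + tok if negated else tok)
    return out

print(mark_negation("i don't like this camera .".split()))
# ['i', "don't", 'NOT_like', 'NOT_this', 'NOT_camera', '.']
```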

Experiment conducted on Sentiment Analysis for Arabic Tweets 

As for the Arabic language, its complexity with regard to both morphology and structure has created many challenges, resulting in very limited tools currently available for sentiment analysis and opinion mining. For our project, we chose to start with the ML approach, since semantic dictionaries and lexicons for Arabic sentiment mining are very scarce, and since we are dealing with the Egyptian dialect rather than Modern Standard Arabic (MSA), for which virtually all existing tools were developed (Farra et al., 2010).

The following diagram summarizes the ML process of sentence-level sentiment analysis in the Arabic language using Arabic tweets from Twitter for our project. The process starts by getting the tweets from Twitter. We then go through each tweet and label it as positive, negative, or neutral. After that, the features in each tweet are extracted and represented in a feature vector, as explained later in this paper. These feature vectors are then used in the training phase of the classifier, in which the classifier uses them to deduce that combinations of specific features correspond to a certain class. Afterwards, the classifier passes through a testing phase, in which the accuracy of the trained classifier is evaluated by applying the feature vectors of new, unseen sentences and observing into which classes the classifier places them. If the accuracy of the classifier reaches an accepted level, the classifier moves to the working phase, which is our main target; otherwise the classifier goes back through the training and testing phases to improve its accuracy until it reaches an accepted level.

Fig 1: The process of Arabic sentiment analysis on Arabic tweets from Twitter (Getting Tweets from Twitter → Pre-Processing Tweets → Extracting Features & Building Feature Vectors → Training Classifier on the training data → Testing on the testing data → Evaluation & Testing → Real Application)

We are going to experiment with the most widely used categorization algorithms in sentiment analysis, namely the Naïve Bayes classifier and the Support Vector Machine.

Objective: 

The objective of this experiment is to build an Arabic language classifier to serve as our baseline system using simple and primitive feature vectors. We can then add more features and experiment to find the best set of features which could help in improving the accuracy of the classifier.

Data Description 

The tweets used in this experiment were obtained from Twitter; we collected about 1410 tweets. We then went through the process of manually annotating each tweet as positive, negative, or neutral. The tweets consisted of almost 558 positive tweets and 851 negative tweets, and all the neutral tweets were removed. These tweets were divided into two parts with the ratio of 75% to 25%, one for the training phase (1057 tweets) and one for the testing phase (353 tweets). The training part consisted of 661 negative tweets and 396 positive tweets. The testing part consisted of 190 negative tweets and 162 positive tweets. The tweets used are found in Appendix II.

Data Pre-Processing 

After getting the tweets, we then went into the process of putting them into a format understandable by the classifier for maximum throughput. This process involved removing all of the following:

- removing the user-name

- removing the picture

- removing the hash tags

- removing the URLs

- removing all non-Arabic words

Feature Selection and Feature Extraction 

The feature vectors given to the classifier consisted of the term frequency feature, where each word in the tweet was counted throughout the corpus, thus assigning high weights to frequent words, and the term presence feature, where each word present in the tweet is assigned the value 1 and each absent word the value 0. This was done on the unigrams.

For each tweet: (“polarity”, {word1:frequency1, word2:frequency2 …})

For each tweet: (“polarity”, {word1: presence, word2: presence …})

These Feature Vectors were applied to two classification algorithms: 1) Naïve Bayes algorithm, 2) SVM Algorithm.
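To illustrate how a classifier consumes the (polarity, {word: frequency}) tuples shown above, here is a compact hand-rolled multinomial Naïve Bayes with Laplace smoothing. This is an illustrative sketch with made-up training examples, not the implementation used in the experiment.

```python
import math
from collections import defaultdict

def train_nb(examples):
    """examples: list of (label, {word: frequency}) tuples."""
    class_counts = defaultdict(int)                      # documents per class
    word_counts = defaultdict(lambda: defaultdict(int))  # class -> word -> count
    vocab = set()
    for label, feats in examples:
        class_counts[label] += 1
        for w, f in feats.items():
            word_counts[label][w] += f
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify_nb(model, feats):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        lp = math.log(class_counts[label] / total_docs)  # log prior
        for w, f in feats.items():
            # Laplace smoothing so unseen words do not zero out a class
            p = (word_counts[label][w] + 1) / (total_words + len(vocab))
            lp += f * math.log(p)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

train = [("positive", {"great": 2, "film": 1}),
         ("negative", {"bad": 1, "film": 1})]
model = train_nb(train)
print(classify_nb(model, {"great": 1}))  # positive
```

An SVM would consume the same vectors, but learns a separating hyperplane instead of per-class word probabilities.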

Training the classifiers 

Two classifiers were trained, first without removing the stop words, and second with the stop words removed. The models were tested on a dataset divided into two parts with the ratio of 75% training to 25% testing.

Naïve Bayes algorithm: 

The results were as follows:

Measure    Word Frequency                         Term Presence
           with stop words  without stop words    with stop words  without stop words
Accuracy   51.84%           48.44%                53.82%           50.42%

Support Vector Machine 

The results were as follows:

Measure    Word Frequency                              Term Presence
           with stop words      without stop words     with stop words      without stop words
           (170 correct,        (182 correct,          (183 correct,        (186 correct,
            183 incorrect)       171 incorrect)         170 incorrect)       167 incorrect)
Accuracy   48.16%               51.56%                 51.84%               52.69%

In both settings the feature vectors cover all words in the tweet: their frequency in the tweet for word frequency, and whether they are present in the tweet for term presence.

Discussion 

The same set of annotated tweets was applied to the two supervised categorization algorithms, so the performance results obtained were more or less close to each other. Comparing the two algorithms using the word frequency feature vectors, SVM produced less accurate results than Naïve Bayes: the accuracy of Naïve Bayes exceeded that of SVM by almost 3%. All the stop words were then removed and the experiment was repeated. Comparing the results obtained after removing the stop words with the previous results, there were no big changes for either Naïve Bayes or SVM. Thus, the presence of stop words had little effect on either algorithm.

Moreover, the two algorithms were compared using the word presence feature vectors, once with the stop words present and once after removing them. The two algorithms produced almost the same results, with very minor differences of about 2-3%. Even after removing the stop words, there were no big changes for either algorithm: for Naïve Bayes the accuracy decreased, whereas for SVM it increased, and both changes were very small compared to the accuracy before removing the stop words.

Looking at examples of the tweets that were classified correctly or misclassified, we infer that a more comprehensive list of the most commonly used Arabic sentiment words needs to be built. This would help greatly when extracting the feature vectors: capturing the presence and frequencies of the sentiment words in a tweet rather than of all the words. This is especially true because tweets are limited to only 140 characters, which means that few words are repeated more than two or three times at most. Thus, focusing on the frequency and presence of sentiment words could be more beneficial.

Conclusion 

The next stage will start by building a more accurate and comprehensive corpus consisting of more than 1000 positive tweets and more than 1000 negative tweets, because as the size of the training data increases, more accurate results will be produced. Given the limited research done in this field for the Arabic language, whether at the sentence level or the document level, much effort is needed to explore it, in addition to handling the difficulties that arise from the nature of the language itself. The next step will therefore focus on selecting the features which will improve the performance of sentence-level sentiment classification, by proposing a combination of the available features or even introducing new features where possible. Once we have settled on the features we will be using, we will apply them to all the annotated tweets we obtained from Twitter to produce the feature vectors. The problem is then reduced to choosing a supervised machine learning classification algorithm; the algorithm with the best performance will be used to produce our final classifier.

References: 

Farra, Noura; Challita, Elie; Abou Assi, Rawad; Hajj, Hazem. Sentence-level and Document-level Sentiment Mining for Arabic Texts. IEEE International Conference on Data Mining Workshops, 2010.

Kennedy, Alistair; Inkpen, Diana. “Sentiment classification of movie reviews using contextual valence shifters,” Computational Intelligence, vol. 22, pp. 110–125, 2006.

Liu, Bing. Sentiment Analysis and Subjectivity. In Handbook of Natural Language Processing, Second Edition, 2010.

Morsy, Sara. Recognizing Contextual Valence Shifters in Document-Level Sentiment Classification. Department of Computer Science and Engineering, The American University in Cairo (AUC). 2011

Na, Jin-Cheon; Sui, Haiyang; Khoo, Christopher; Chan, Syin; Zhou, Yunyun. “Effectiveness of simple linguistic processing in automatic sentiment classification of product reviews,” in Conference of the International Society for Knowledge Organization (ISKO), pp. 49–54, 2004

Pang, Bo; Lee, Lillian. “Opinion Mining and Sentiment Analysis.” Foundations and Trends in Information Retrieval 2(1-2), pp. 1–135, 2008.

Pang, Bo; Lee, Lillian; Vaithyanathan, Shivakumar. “Thumbs up? Sentiment classification using machine learning techniques,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 79–86, 2002.

Wiebe, Janyce; Wilson, Theresa; Bruce, Rebecca; Bell, Matthew; Martin, Melanie. “Learning subjective language,” Computational Linguistics, vol. 30, pp. 277–308, September 2004.

Detecting influential bloggers and opinion leaders   

With the recent rise in the popularity and size of social media, there is a growing need for systems that can extract useful information from this volume of data. The micro-blogging service Twitter has become a very popular tool for expressing opinions, broadcasting news, and simply communicating with friends. People often comment on events in real time, with several hundred micro-blogs (tweets) posted each second for significant events. Twitter is interesting not only because of this real-time response, but also because it is sometimes ahead of the newswire, and because of the ease of using its API to retrieve the necessary data for study. We address the problem of detecting influential bloggers, or in our case, micro-bloggers. Considering the literature reviewed in the previous report, automated analysis is necessary given the large number of virtual communities with huge numbers of users and posts. The first section describes the experiment conducted to get acquainted with the methods and tools used in this area. The second section discusses the results we obtained. The third section concludes this part.

Experiment conducted to determine an influential person on Twitter 

This section includes the experiment objective, the description of the data, the methods and tools, and the obtained results.

Objective 

The objective of this experiment was to use data collected from Twitter to determine the influential persons in a network with some of the approaches discussed in the literature reviewed, using social network analysis centrality measures.

Data Collection using the Twitter API 

The Twitter REST API methods allow developers to access core Twitter data, including update timelines, status data, and user information. The Search API methods let developers interact with Twitter Search data. The concern for developers, given this separation, is the effect on rate limiting and output format. The rate limits for the Search API are not the same as for the REST API: when using the Search API you are not restricted to a certain number of API requests per hour, but instead by the complexity and frequency of your queries.

The Twitter Search API is a dedicated API for running searches against the real-time index of recent tweets. However, the Search API is not a complete index of all tweets, but an index of recent tweets covering between 6 and 9 days. It cannot be used to find tweets older than about a week.

One of the issues I faced was that the user IDs returned from the Search API did not match user IDs in the REST API. However, as of November 7, 2011, the Search API returns Twitter user IDs that match the Twitter REST API. The issue of needing to maintain multiple IDs for the same user, and of matching returned IDs to users' screen names, therefore no longer needs to be handled. This development encouraged me to go through a new round of data collection after November 7.

Using the Twitter Search API, the search result for a word (for example, "Egypt") returned around 150 tweets (due to rate limiting). For each tweet I have the tweet ID, the author's ID (fromuserid) and screen name (fromuserscreenname), the created date, and the tweet text itself.

Having done several rounds of data collection, each time the search results returned between 100 and 150 tweets, posted by 35 to 84 users. This is probably due to waves of user activity.

For each of the users I collected a list of followers (with the limitation that the Twitter API returns a maximum of 5000 followers, even though the actual number of followers may well exceed that). I also collected lists of the followers of each of these followers, for possible further analysis at a later phase. Following someone on Twitter means subscribing to their tweets as a follower: their updates will appear in your timeline, and that person has permission to send you, the follower, private tweets, called direct messages.

A tweet reply is any update posted by clicking the "Reply" button on another tweet. Any reply always begins with @username (the username of the person being replied to). A mention is any Twitter update that contains @username anywhere in the body of the tweet; this means that replies are also considered mentions.

I collected mentions the same way I collected followers. For each of the originally collected users (the authors of my search results), I searched for their mentions using @userscreenname. This returned the recent tweets (only as far back as a week) containing "@userscreenname", and from each tweet I was able to get the author's ID (fromuserid) and screen name (fromuserscreenname), for which I also searched for mentions, again for possible further analysis at a later phase.

Methodology and Tool 

Social Network Analysis approach

A social network is a social structure between actors, mostly individuals or organizations. It indicates the ways in which they are connected through various social familiarities ranging from casual acquaintance to close familiar bonds. Social network analysis is the mapping and measuring of relationships and flows between people, groups, organizations, animals, computers or other information/knowledge processing entities. The nodes in the network are the people and groups, while the links show relationships or flows between the nodes (Jamali & Abolhassani, 2006).

Some properties of social networks are very important, such as size, density, degree, reachability, distance, diameter, and geodesic distance. All sociologists would agree that power is a fundamental property of social structures, but there is much less agreement about what power is and how we can describe and analyze its causes and consequences. Centrality analysis is the main approach that social network analysis has developed to study power and the closely related concept of centrality: degree centrality, betweenness centrality, and closeness centrality (Ya-ting & Jing-min, 2011).

If an actor has direct associations with many other actors, then the actor is in a central position. An actor's degree centrality can be calculated from the number of points that have a direct relationship with that point; for directed data, the degree centrality of a point combines its out-degree and in-degree.

If an actor lies on the shortest paths between many other actors, that actor is in an important position; following this line of thinking, betweenness centrality can be used to measure the brokerage resources of the actor.

If an actor is less dependent on others in the contacting process, that actor has higher closeness centrality: it occupies an important bridging position in the network and plays an important role in network transmission.

Table 2 - The three aspects of power in sociograms (Degree, Closeness and Betweenness)

Power Aspect  Definition                                       Influences
Degree        Number of ties for an actor                      Having more opportunities and alternatives
Closeness     Length of ties to other actors                   Direct bargaining and exchange with other actors
Betweenness   Lying in the path between other pairs of actors  Brokering contacts among actors to isolate them or prevent or control connections
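The degree measure above can be illustrated on a toy directed "mentions" graph, where an edge points from the mentioning user to the mentioned user. The user names are invented for illustration.

```python
# Degree centrality on a small directed graph: an actor's total degree is
# its in-degree plus its out-degree, as described above.

edges = [("ann", "bob"), ("carl", "bob"), ("dina", "bob"), ("bob", "ann")]

def degree_centrality(edges):
    in_deg, out_deg = {}, {}
    for src, dst in edges:
        out_deg[src] = out_deg.get(src, 0) + 1
        in_deg[dst] = in_deg.get(dst, 0) + 1
    nodes = {n for e in edges for n in e}
    return {n: in_deg.get(n, 0) + out_deg.get(n, 0) for n in nodes}

centrality = degree_centrality(edges)
print(centrality)  # bob, mentioned by three users, has the highest degree
```

Here bob (three mentions received, one made) ends up with degree 4, matching the intuition that the most-mentioned actor is the most central by this measure.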

Some more properties, mentioned by Jamali & Abolhassani, 2006, which may also be used in social network analysis are:

Maximum Flow, where the approach suggests that the strength of a tie connecting two actors is no stronger than the weakest link in the chain of connections, where weakness means a lack of alternatives. It focuses on the vulnerability or redundancy of connections between pairs of actors.

The Hubbell and Katz approaches, where the strengths of all the links are considered. These approaches count the total connections between actors, each connection given a weight according to its length: the greater the length, the weaker the connection.
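The Katz idea above, counting walks of every length but attenuating longer ones, can be sketched by summing powers of the adjacency matrix with a geometric weight. This is a toy pure-Python illustration on an invented 3-actor chain graph, truncated to a fixed maximum walk length rather than computed in closed form.

```python
# Katz-style scores: weight a walk of length k by beta**k, so longer
# (weaker) connections contribute less. Converges when beta is small enough.

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def katz_scores(adj, beta=0.25, max_len=10):
    n = len(adj)
    total = [[0.0] * n for _ in range(n)]
    power, weight = adj, beta          # walks of length 1, weighted beta
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                total[i][j] += weight * power[i][j]
        power, weight = matmul(power, adj), weight * beta
    return total

adj = [[0, 1, 0],    # actors 0-1-2 form a chain; 0 and 2 are not adjacent
       [1, 0, 1],
       [0, 1, 0]]
scores = katz_scores(adj)
```

Actors 0 and 2 get a nonzero score despite having no direct tie, but it is weaker than the score of the directly connected pair 0 and 1, exactly the length-based weakening described above.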

The social network analysis program UCINET

UCINET is a social network analysis program developed by Steve Borgatti, Martin Everett and Lin Freeman (Borgatti et al., 2002). The program is distributed by Analytic Technologies, which publishes software for social network analysis and cultural domain analysis.

UCINET is a menu-driven Windows program. It is a comprehensive package for the analysis of social network data as well as other 1-mode and 2-mode data. It can read and write a multitude of differently formatted text files, as well as Excel files. It can handle a maximum of 32,767 nodes (with some exceptions), although in practice many procedures become too slow at around 5,000-10,000 nodes. Its social network analysis methods include centrality measures, subgroup identification, role analysis, elementary graph theory, and permutation-based statistical analysis. In addition, the package has strong matrix analysis routines, such as matrix algebra and multivariate statistics.

UCINET can be downloaded at http://www.analytictech.com/ucinet and used free for 60 days. For longer use, individual students pay $40, faculty, schools & government pay $150, and corporations pay $250. For the time being, I am using the free trial for my experiments at this phase.

The UCINET datasets are collections of one or more matrices. It doesn't matter whether the data is a graph (i.e., a set of vertices and a set of edges or links), a relation (i.e., a set of ordered pairs), a hypergraph (i.e., a set of subsets), or anything else: as far as UCINET is concerned, the data are a collection of matrices. This does not mean that UCINET cannot read data that are not in matrix form: it can. It means that once the data are in the system, it is thought of as a matrix.

Network analysts commonly think of their data as graphs. The information can be represented by a matrix known as the adjacency matrix.
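Building the adjacency-matrix representation described above from a valued edge list (who mentioned whom, and how many times) is straightforward; the sketch below uses invented user labels and values.

```python
# Build the adjacency matrix form that UCINET works with, from a valued
# edge list: matrix[i][j] = value of the tie from node i to node j.

def to_adjacency(nodes, valued_edges):
    index = {n: i for i, n in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for src, dst, value in valued_edges:
        matrix[index[src]][index[dst]] = value
    return matrix

nodes = ["u1", "u2", "u3"]
edges = [("u1", "u2", 36), ("u1", "u3", 18), ("u3", "u2", 5)]
matrix = to_adjacency(nodes, edges)
print(matrix)  # [[0, 36, 18], [0, 0, 0], [0, 5, 0]]
```

Missing ties simply stay 0, which is exactly how sparse mention data appears in the matrix form.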

Experiment Results and Discussion 

Having retrieved a collection of Tweets, I was able to identify their authors as a group of users, and get for each user the number of followers, the number of recent mentions, and rank the users according to each.

The number of followers a user has directly indicates the size of that user's audience. The following three tables (Tables 3, 4, and 5) are samples listing the top 10 ranking users from three of the several rounds of data collection I made (some rounds were made to check that my data collection code and method were working; the three groups included in Tables 3, 4, and 5 are the last three I collected). The ranking is based on the number of followers.

Table 3: Group 1, 10 users ordered according to their popularity (i.e., the number of followers)

User ID     User Screen Name  Followers  Description
211848704   ahramonline       5000       News website of Egypt's largest news organization
49744860    annagueye         2684       Person
132566178   AymanASU          1934       Person
204480516   medoweeedo        1627       Person
286170803   hateVodafoneEG    740        Common interest group
271364487   Dina_Mohammad     585        Person
277989395   Sarab_1949        527        Person
285887396   ElectionsEgypt    447        Egyptian elections news 2011/2012
72302280    OFFICIALMAGDI     403        Person
353949611   shakhura_news     388        News of Bahrain and developments in Shakhura, Abu Saiba, the neighboring villages, and the revolution

Table 4: Group 2, 10 users ordered according to their popularity (i.e., the number of followers)

User ID     User Screen Name  Followers  Description
222471144   exiledsurfer      5000       Person
608583      Zeinobia          5000       Person
310871351   M6april           5000       The April 6 news network, live on Twitter
222837035   alwafdwebsite     5000       The Al-Wafd electronic portal
285083952   Bahrani_News      5000       News network dedicated to the Bahraini Pearl Revolution
248268173   Rana2ElDardiry    3076       Person
218679758   7usfahmy          2846       Person
251504822   egyptbusiness     2303       Directory of Egyptian businesses with company profiles, press releases, tenders, jobs and management news
150845688   MiralBrinjy       2055       Person
141316530   ruby_hanem        1353       Person

Table 5: Group 3, 10 users ordered according to their popularity (i.e., the number of followers)

User ID     User Screen Name  Followers  Description
260171076   nsfadala          5000       Head of the Palestine Support Society, former member of the Bahraini parliament, founding member of the National Unity Gathering (public figure)
222471144   exiledsurfer      5000       Person
285735127   kooora            5000       The leading Arabic sports website
166134102   alnahar_egypt     5000       Independent Egyptian weekly newspaper
271881612   cairotoday        5000       TV show
16090877    JamalDajani       4110       Peabody Award-winning producer, HuffPost blogger & media expert (public figure)
263672809   umhouda           2479       Person
190277650   kordy90           1687       Person
243445322   MohHKamel         1660       Person
235505872   sahmnewscom       1134       Egyptian stock exchange news

Cha et al. (2010) observed that the most-followed users span a wide variety of public figures and news sources, which is also true in our case. Among the most followed in our lists are AhramOnline, M6april, AlwafdWebsite, Bahrani_News, Alnahar_Egypt, CairoToday, and nsfadala. However, this shows that the most connected users are not necessarily the most influential: news sites basically relay news and should not be considered influential bloggers.

The number of mentions containing a user's username indicates that user's ability to engage others in conversation. I constructed a 513x513 Mentions adjacency matrix for 513 users, a snippet of which is shown in Table 6. Each cell holds the number of times the row user was mentioned in a post by the column user. For example, 417268072 was mentioned in tweets 36 times by 168327248 and 18 times by 212625851, and 37786267 was mentioned 135 times by 187604649, and so on.

Table 6: Adjacency Matrix for the Mentions of 10 Twitter users (each cell is the number of times the row user was mentioned by the column user)

User ID     417268072  168327248  212625851  166678607  37786267  187604649  234208076  405566048  100539613  265275487
417268072           0         36         18          9        27          9          9          9          9          6
168327248           0         56          0          0         0          0          0          0          0          0
212625851           0          0          0          0         0          0          0          0          0          0
166678607           0          0          0          0         0          0          0          0          0          0
37786267            0          0          0          0         0        135          0          0          0          0
187604649           0          0          0          0         0          0          0          0          0          0
234208076           0          0          0          0         0          0          0          0          0          0
405566048           0          0          0          0         0          0          0          0          0          0
100539613           0          0          0          0         0          0          0          0         18          0
265275487           0          0          0          0         0          0          0          0          0          0
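Building such a mentions matrix from raw tweets can be sketched as follows. This is a minimal illustration, not the project's actual extraction code: the tweets, screen names and user IDs are invented placeholders, and the @-mention regex is a simplifying assumption.

```python
import re
from collections import defaultdict

# Hypothetical sample tweets as (author_id, text) pairs; in the project
# these would come from the Twitter crawl described above.
TWEETS = [
    ("168327248", "@alice great analysis! cc @bob"),
    ("168327248", "@alice thanks again"),
    ("212625851", "@alice see this"),
]

def build_mentions_matrix(tweets, screen_to_id):
    """Return matrix[mentioned][mentioner] = mention count,
    mirroring the row/column convention of Table 6."""
    matrix = defaultdict(lambda: defaultdict(int))
    for author, text in tweets:
        for name in re.findall(r"@(\w+)", text):
            mentioned = screen_to_id.get(name)
            if mentioned is not None:
                matrix[mentioned][author] += 1
    return matrix

# Hypothetical screen-name -> user-id lookup built from the crawl.
ids = {"alice": "417268072", "bob": "100539613"}
m = build_mentions_matrix(TWEETS, ids)
print(m["417268072"]["168327248"])  # prints 2: alice mentioned twice by 168327248
```

The full 513x513 matrix is then just this dictionary written out over the fixed user list.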

I imported the data in Table 6 into the UCINET tool, using the DLEditor to convert it to a UCINET dataset, and carried out the Degree Centrality test. Degree centrality is a measure of network activity: it calculates the degree and normalized degree centrality of each vertex and gives the overall network degree centralization.

Figure 9: The DLEditor to import the Mentions matrix to be read by the UCINET tool

The degree of a vertex in a symmetric graph is the number of vertices adjacent to it. For non-symmetric data, the in-degree of a vertex u is the number of ties received by u, and the out-degree is the number of ties initiated by u. If the data is valued, the in- and out-degrees are the sums of the values of the ties. In our case, the tie between two users is the number of times one mentions the other. The normalized degree centrality is the degree divided by the maximum possible degree, expressed as a percentage; the normalized values are ignored here, since they should only be used for binary data. The routine calculates these measures and some descriptive statistics based on them. Directed graphs may be symmetrised and analysed as above, or the in- and out-degrees can be analysed separately, as was done on the presented data.
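For a valued matrix laid out as in Table 6 (rows = mentioned user, columns = mentioning user), the valued degrees described above reduce to row and column sums. A minimal sketch using the Table 6 values:

```python
# Valued degree centrality for the directed Mentions matrix of Table 6:
# cell M[i][j] = times user i was mentioned by user j, so the row sum is
# the mentions a user received and the column sum is the mentions made.
USERS = ["417268072", "168327248", "212625851", "166678607", "37786267",
         "187604649", "234208076", "405566048", "100539613", "265275487"]
M = [
    [0, 36, 18, 9, 27, 9, 9, 9, 9, 6],
    [0, 56, 0, 0, 0, 0, 0, 0, 0, 0],
    [0] * 10,
    [0] * 10,
    [0, 0, 0, 0, 0, 135, 0, 0, 0, 0],
    [0] * 10,
    [0] * 10,
    [0] * 10,
    [0, 0, 0, 0, 0, 0, 0, 0, 18, 0],
    [0] * 10,
]

received = {u: sum(row) for u, row in zip(USERS, M)}                     # row sums
made = {u: sum(M[i][j] for i in range(len(M))) for j, u in enumerate(USERS)}  # column sums

# Users ranked by mentions received, most-mentioned first
ranking = sorted(received, key=received.get, reverse=True)
print(ranking[0], received[ranking[0]])  # prints: 37786267 135
```

This reproduces the figures discussed below: 37786267 received 135 mentions (and made 27), and 417268072 received 132.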

When the analysis is run, the program writes the output to the Log File and displays its contents on screen, as shown in Figure 10. The file contains a table listing the degree and normalized degree (nDegree) centralities, expressed as a percentage, for each vertex, together with the share. The share is an actor's centrality divided by the sum of all actor centralities in the network. The actors are ordered so that the one with the highest centrality appears first. Descriptive statistics give the mean, standard deviation, variance, minimum and maximum for each list generated. This is followed by the degree network centralization index expressed as a percentage, which again applies only to binary networks: the UCINET reference documentation (Borgatti et al., 2002) states that for valued data the non-normalized values should be used and the degree centralization ignored.

Figure 10: The Log File, the UCINET routine output for the Degree Centrality analysis

For influence detection, we are more interested in the in-degree centrality of the Mentions matrix: the number of times an actor was mentioned by others. The output in Figure 10, however, shows actor 37786267 with the highest out-degree centrality; because the matrix rows hold mentions received, swapping the "in" and "out" headers of the generated table in the log file lists the actors in order of mentions received, with the most-mentioned actor first.

The first table in Figure 10 shows that 37786267 was mentioned 135 times, the most of anyone on the list, while mentioning others 27 times, and that 417268072 was mentioned 132 times, and so on.

Conclusion 

First of all, from what we observed in Tables 1, 2 and 3, the most connected users, in terms of the number of followers, are not necessarily the most influential. The number of followers may therefore not be relied on as an indication of influence, but it may be used as a weighting factor for popular users, or to direct our search.

Secondly, one of the approaches discussed in our previous report for detecting influential bloggers assumed that a blogger is influential if he or she has any influential posts. However, given a tweet's very short length (140 characters at most), it is difficult to assign an influence score to each individual tweet. The length is insufficient to extract some, if not all, of the properties mentioned in the previous report, so I was unable to try out the models presented by Agarwal et al. (2008) and Akritidis et al. (2009) with the data I currently have.

Even though extracting properties to score individual posts might be feasible, it could be challenging with the data collected from Twitter. Such properties include:

Activity Generation: a post's capacity to generate activity, i.e. the number of comments it receives and the amount of discussion it initiates. However, mentions (including replies) in Twitter do not straightforwardly relate one post to another. In my opinion this is doable, but it would require a significant amount of processing, probably including some natural language processing, to do what may be called conversation construction.

The Time Factor: taking into consideration the age of the references or mentions of a certain post, which faces the same challenge as the previous point of relating one post to another.

Novelty: novel ideas are more likely to exert more influence. Besides the fact that this property is difficult to measure in general, the length of a tweet leaves little room for a generally applicable way of measuring post novelty.

Eloquence: an influential person is often eloquent. Many measures can be used to quantify the quality of a post, such as fluency, rhetorical skill, vocabulary usage, and content analysis; some may be easily applicable but would require significant processing.

Despite the challenges, I would still like to experiment with mathematical models, and relate properties with the data available from Twitter, or any other social network further ahead.

Thirdly, the UCINET tool proved too slow: I could not get it to work on the larger networks of around 400 users I had constructed from Twitter data, so I had to reduce the network substantially, to just the users in Table 6. The tool needs further examination on my part. Even so, on a small scale it shows considerable Social Network Analysis potential, in addition to a simple network visualization component I would like to try on different network structures. It includes several measures I would like to try, such as Hubbell and Katz influence, closeness centrality, reach centrality and flow centrality, but these would require a larger and more complex social network than the sample I currently have.

On the other hand, I would still like to look into other approaches for new ideas and keep track of new developments in the field. It would also support this work to look into defining communities, subgroups and cliques in the social network.

Finally, for the target tool, whose core purpose is sentiment analysis and opinion mining, this part of our work, detecting influential bloggers and opinion leaders, will help direct which authors to track and whose posts to analyze in a community of users, instead of randomly selecting posts from the social network: we analyze and study the posts that carry weight and attract community members' interest.

So far, the work on detecting influential bloggers requires a study of the network, which in turn requires collecting information from the network and some data processing, which I have found time-consuming and impractical for real-time analysis. I was therefore thinking along the lines of an off-line network analysis running in the background of the tool, periodically updating our list(s) of currently influential bloggers and opinion leaders, taking into consideration that a user's influence varies by topic genre (Cha et al., 2010). That would probably require that, for each group of current and popular news topics defined by the Topic Detection part of our work, the interested community, a subgroup of the social network, be swept and its leaders detected.

References 

Borgatti, S.P., Everett, M.G. and Freeman, L.C. 2002. Ucinet for Windows: Software for Social Network Analysis. Harvard, MA: Analytic Technologies.

Cha, M., Haddadi, H., Benevenuto, F. and Gummadi, K.P. 2010. Measuring User Influence in Twitter: The Million Follower Fallacy. In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM).

Jamali, M. and Abolhassani, H. 2006. Different Aspects of Social Network Analysis. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence.

Liang, Y. and Chen, J. 2011. The Social Network Analysis of Political Blogs in People: Based on Centrality. In Proceedings of the International Conference on Consumer Electronics, Communications and Networks (CECNet).

Tools and resources identified for processing Arabic language  

This part covers the tools and resources identified for processing Arabic in the project and for engineering the development of the intended tool. The first section presents a brief background on part-of-speech tagging and the available tools for tagging Arabic. The second section describes the resources of the Link Development company. The third section introduces a framework the project team could use to develop the SATA project.

Part of speech tagging tool 

Part-of-speech (POS) tagging is a preprocessing step for natural language processing tools and systems. An NLP pipeline typically starts with normalization and stop-word removal, followed by part-of-speech tagging and stemming.

POS tagging annotates each word of a sentence with a tag describing its grammatical role or origin, such as noun, verb or adjective. This helps further processing steps such as parsing and semantic analysis. Figure 11 shows the NLP process.

Figure 11: NLP process

Many tools and algorithms are available for POS tagging of English. The research challenge is POS tagging of Arabic text, mainly because Arabic differs from European languages in its structure, complexity and variety of dialects.

This section briefly describes three of the most recent approaches to POS tagging of Arabic text and gives a quick overview of different implemented POS taggers.

Arabic Parts Of Speech Tagging 

There are various methods that can be used to apply POS tagging to Arabic text. This part briefly discusses three main methods used by different researchers.

The Hidden Markov Model 

The first method combines morphological analysis with a Hidden Markov Model (HMM) and relies on Arabic sentence structure [1]. It is a two-level approach combining statistical and linguistic methods:

First, the text is normalized, tokenized into words, and morphologically analyzed. Second, the statistical model is used to recognize the morphological characteristics of the words.

The tagging system categorizes words into three classes: noun, verb and particle, with 95% accuracy.
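The statistical level of such an approach can be illustrated with a tiny HMM decoded by the Viterbi algorithm over the three classes. All probabilities and the toy English words below are invented for illustration only; in the cited work the probabilities come from a morphologically analyzed Arabic training corpus.

```python
# Illustrative HMM POS tagging sketch over the classes used in [1].
TAGS = ["noun", "verb", "particle"]
start = {"noun": 0.5, "verb": 0.3, "particle": 0.2}          # invented
trans = {                                                     # invented
    "noun":     {"noun": 0.4, "verb": 0.3, "particle": 0.3},
    "verb":     {"noun": 0.6, "verb": 0.1, "particle": 0.3},
    "particle": {"noun": 0.5, "verb": 0.4, "particle": 0.1},
}
emit = {                                                      # invented
    ("verb", "went"): 0.6, ("noun", "went"): 0.1, ("particle", "went"): 0.01,
    ("noun", "student"): 0.7, ("verb", "student"): 0.05, ("particle", "student"): 0.01,
    ("particle", "to"): 0.8, ("noun", "to"): 0.01, ("verb", "to"): 0.01,
}

def viterbi(words):
    """Most probable tag sequence under the toy HMM above."""
    V = [{t: start[t] * emit.get((t, words[0]), 1e-6) for t in TAGS}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in TAGS:
            p, best = max((V[-1][s] * trans[s][t], s) for s in TAGS)
            col[t] = p * emit.get((t, w), 1e-6)
            ptr[t] = best
        V.append(col)
        back.append(ptr)
    tag = max(V[-1], key=V[-1].get)
    path = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        path.append(tag)
    return list(reversed(path))

print(viterbi(["student", "went", "to"]))  # ['noun', 'verb', 'particle']
```

A real system would also feed the morphological analyzer's output into the emission probabilities, which is the "two-level" aspect of the method.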

Statistical –rule based technique 

The second method uses a statistical rule-based technique, a two-step process [2]:

First, initial tagging is attempted using a predefined lexicon. If the word is not found there, the next step is stemming, which can help identify the POS, since the prefix or suffix may determine the word class.
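The two steps can be sketched as a lexicon lookup with an affix-based fallback. The lexicon entries and affix rules below are tiny invented stand-ins, not the cited tagger's actual resources:

```python
# Minimal sketch of the two-step statistical/rule-based tagger in [2]:
# look the word up in a lexicon; if unknown, guess the tag from affixes.
LEXICON = {"كتاب": "NOUN", "كتب": "VERB", "في": "PARTICLE"}  # illustrative

PREFIX_RULES = [("ال", "NOUN"), ("سي", "VERB")]  # definite article, future marker
SUFFIX_RULES = [("ون", "NOUN"), ("ات", "NOUN")]  # common plural suffixes

def tag_word(word):
    if word in LEXICON:                 # step 1: direct lexicon lookup
        return LEXICON[word]
    for prefix, tag in PREFIX_RULES:    # step 2: affix-based heuristics
        if word.startswith(prefix):
            return tag
    for suffix, tag in SUFFIX_RULES:
        if word.endswith(suffix):
            return tag
    return "NOUN"                       # default guess for unknown words

print(tag_word("الكتاب"))  # definite-article prefix -> NOUN
```

The real tagger combines such rules with statistics learned from annotated text rather than the fixed lists shown here.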

Morphological analyzer 

The third method uses a morphological analyzer to tokenize and morphologically tag Arabic words in a single process [3]:

First, obtain from the morphological analyzer a list of all possible analyses of the words.

Then apply classifiers for ten morphological features to the words of the text, covering the possible values of those features and the word classes (POS) that can express them.

Finally, choose among the analyses returned by the morphological analyzer using the output of the classifiers.
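The selection step can be sketched as picking the candidate analysis that agrees with the most per-feature classifier predictions. The candidate analyses and "predictions" below are invented placeholders, not output of the cited system:

```python
# Sketch of the selection step in [3]: each candidate analysis from the
# morphological analyzer is a bundle of feature values; the analysis that
# agrees with the most per-feature classifier predictions wins.
def pick_analysis(candidates, predicted):
    """candidates: list of {feature: value} dicts from the analyzer;
    predicted: {feature: value} output of the per-feature classifiers."""
    def agreement(analysis):
        return sum(1 for f, v in predicted.items() if analysis.get(f) == v)
    return max(candidates, key=agreement)

candidates = [
    {"pos": "noun", "gender": "masc", "number": "sing"},
    {"pos": "verb", "gender": "masc", "number": "sing"},
]
predicted = {"pos": "noun", "gender": "masc", "number": "plur"}
print(pick_analysis(candidates, predicted)["pos"])  # noun
```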

Gate 

GATE is an open source tool that performs text analysis and processing of all shapes and sizes. It applies the part-of-speech tagger as an annotation on each word. The tagger uses a default lexicon originally trained on a large corpus taken from the Wall Street Journal.

It has plugins for different languages, including Arabic: an Arabic Tokenizer, Gazetteer, OrthoMatcher and the Arabic Main Grammar. These help perform entity extraction and part-of-speech tagging on Arabic articles. It is based on the same part-of-speech tagger originally used by the Stanford NLP package.

Testing the tool 

The GATE package is offered as source files or as a standalone desktop application. When testing the application, it did not produce detailed part-of-speech tags; it only used the tagger to support the entity extraction phase, as shown in Figure 12.

Conclusion 

The outcome showed that the tool is more useful as an entity extraction tool than as a POS tagging tool, and that it needs more training on Arabic data.

Figure 12: Gate example for Arabic processing

Stanford NLP‐ parts of speech tagger 

The Stanford POS Tagger is an open source tagger written in Java, with source code available for tagging different languages, including Arabic. Its Arabic support is based on Buckwalter's techniques [4].

Testing the tool: 

Since the code runs on English by default, some modifications are needed to make it run on Arabic sentences. The code is written in Java and available for customization, so it can take a file and return it with each word tagged.

First, we create a new project, include the downloaded code, and add the trained Arabic model file called "Arabic-accurate.tagger". Since we use Eclipse to compile and run the project, we include the jar file that reads tagger files. Finally, we create a new class that sets the tagger name and path and the sentences to be tagged. [4]

The example used is: 

سيارة حمولتها يا دوب 5 راكب ولكنها تحمل 45 راكب، خارجة من عمرة وأقصى سرعة 50 كم في الساعة والركاب يطالبون بسرعة 250 كم في الساعة

(A car whose capacity is barely 5 passengers but which is carrying 45, coming back from an Umrah trip, with a top speed of 50 km/h, and the passengers are demanding 250 km/h.)

The result was: 

NN ،/NN/راكب VBP 45/CD/تحمل NN/ولكنھا NN/راكب NNP 5/CD/دوب RP/يا NN/حمولتھا NN/سيارة CC/و DTNN/الساعة IN/في NN/كم NN 50/CD/سرعة VBD/أقصى CC/و NN/عمرة IN/من JJ/خارجة DTNN/الساعة IN/في NN/كم NN 250/CD/بسرعة VBP/يطالبون DTNN/الركاب

The definitions of the tags can be found in Appendix III.

Conclusion: 

The tool exports good results; it is not as accurate as commercial tools, but it is acceptable. It was compared against demos of some commercial tools, and its results contain some errors, mostly in the more detailed tags: for example, the first two "NN" tags are correct, but where "NNPS" or "NNP" appears, the likely errors are in distinguishing singular from plural words.

Previous “Link Development” products reuse 

Link Development has several already-implemented tools, from previous web development projects, that can be useful here. Reusing them saves the time and cost of implementing or buying new ones.

Onkosh 

Onkosh is an Arabic search engine implemented within Link Development between 2007 and 2010. It was one of the first search engines allowing people to search in Arabic using Roman letters, offering not only search but also transliteration from Arabic written in Roman letters to Arabic script. Its tools can be used in different processes, mainly in the preprocessing phases [6]. The tools include:

Indexer: creates indices for all the data extracted by the crawler and the link analyzer, and allows queries to be performed on those indices.

Crawler: a program for downloading pages from the web. It starts from a list of "seed" URLs, fetches the corresponding pages, finds the links contained therein, and adds those links to the list of URLs to fetch.

Transliterator (bel3araby): a tool that transliterates Arabic written in Roman letters into Arabic script.

Dictionary: the Aspell Arabic dictionary, used as a reference to check whether words exist.

The tools that fit within our data-preprocessing pipeline are the transliterator (bel3araby) and the Arabic corpus used within the dictionary.

Bel3araby Tool 

The Bel3araby tool transliterates Arabic words written in Roman letters into Arabic script. It relies on the Aspell dictionary to score the words produced by transliteration and decide which is most likely the correct one. The transliteration algorithm is based on various Arabic grammar rules embedded in the algorithm itself.

Bel3araby produces a set of candidate words for each original word, around twelve combinations of possible letters that could form the word. It then iterates over its rule set and offers the best four to six matching words. The user can choose the intended word, or the tool automatically selects the one with the highest match.
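The candidate-generation-and-scoring idea can be illustrated as follows. The letter mapping and dictionary below are tiny invented stand-ins; Bel3araby's actual rules are proprietary and embedded in its algorithm:

```python
from itertools import product

# Illustrative sketch of transliteration candidate generation: each Roman
# letter can map to several Arabic letters; candidates are enumerated and
# ranked, dictionary words first.
ROMAN_TO_ARABIC = {          # invented, partial mapping
    "k": ["ك", "ق"],
    "t": ["ت", "ط"],
    "a": ["ا", ""],          # short vowels often have no written letter
    "b": ["ب"],
    "3": ["ع"],              # Franco-Arabic digit for 'ayn
}
DICTIONARY = {"كتاب", "عتب"}  # invented toy dictionary

def transliterate(word, top=4):
    options = [ROMAN_TO_ARABIC.get(ch, [ch]) for ch in word]
    candidates = {"".join(c) for c in product(*options)}
    # Dictionary words rank first, then everything else alphabetically.
    ranked = sorted(candidates, key=lambda w: (w not in DICTIONARY, w))
    return ranked[:top]

print(transliterate("ktab")[0])  # كتاب
```

The real tool additionally applies Arabic grammar rules before the dictionary scoring, which this sketch omits.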

This tool will allow us to include data written in Franco-Arabic format, adding a huge dataset to the tool, as many users in the Arab region type this way instead of in Arabic or English. The tool is placed after the data selection and collection process, to transliterate all Franco-Arabic sentences into Arabic.

Arabic Corpus 

Onkosh uses a dictionary file from the Aspell dictionary. These files were categorized and stored in various encodings, with internal tags for the dictionary's own use. They were recovered and re-encoded into a format readable by any user. A nouns dictionary file holding more than 80 thousand words was extracted; these words can later serve as an Arabic noun corpus to identify locations, places, people and other non-verbal entities.

The file still needs modification before entering the project database, as it contains some unfamiliar words that would only add overhead to the tool. Also, some words were merged into one by the processing that removed the dictionary's encoding and tags. Cleaning may reduce the number of words stored in the tool, but it is necessary to ensure data quality.

Helper Tools 

During development of the whole project, many developers from different backgrounds, focusing on different topics and parts of the code, will work together to produce the best implementation. The amount of code and files in such a project becomes very large and hard for anyone other than its authors to understand. For these reasons, helper tools are needed so the project can be maintained, redesigned and extended for as long as possible.

Symfony is a PHP MVC (model-view-controller) framework, known as the most popular MVC framework for PHP development. MVC is the most widely used pattern among web applications: the idea is to separate the code from the view, and the database transactions from the application logic, making it easy to work on different parts of the project without affecting unrelated parts.

Symfony has two main versions; the one released in August 2011 is considered a huge improvement to the framework. Symfony is preferred for its large community and documentation base. It offers features such as automatic code generation, database deployment, filters, routing, database schema generation and database frontend design. These help the developer save effort, reduce the amount of code needed for common tasks, and prevent code redundancy across different parts. It also offers a well-designed security architecture that prevents most attacks on the system.

References 

[1] "Arabic Part-of-Speech Tagging Using the Sentence Structure", Mohamed El Hadj, Al-Sughayeir & Al-Ansari, Imam University, KSA.

[2] “Arabic Part-of-speech Tagger”, Shereen Khoja, Lancaster University, UK.

[3] “Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop”, Nizar Habash and Owen Rambow, Columbia University, USA.

[4] Blog post by Galal Ali, http://www.galalaly.me/index.php/2011/05/tagging-text-with-stanford-pos-tagger-in-java-applications/

[5] Tagging definitions: http://www.computing.dcu.ie/~acahill/tagset.html

[6] Onkosh web servers’ documentation files.

Annotation tool  

The main purpose of the annotation tool is to build up sentiment annotated tweets data sets. The main annotation functions include:

Assigning positive/negative tags, a category, and a sentiment topic to each tweet.

Tagging the named entities in the tweets as person, organization, place, or other.

Assigning polarity to sentiment words.

The data used are crawled from Twitter. These data sets will be very helpful for the SATA prototype, providing training data to train and test different modules of the tool (Named Entity Recognition, Sentiment Classifiers, and Topic Extraction). The data set could also serve as a benchmark for any Arabic sentiment analysis tool.

All the data sets and lists produced by the annotation tool are exported in XML format, which will facilitate their use in developing the sentiment analysis tool modules. Appendix IV describes the annotation tool, which we called ewzenha, in some detail. The tool can be accessed at http://ewzenha.linkdev.com/ .
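The kind of XML export described above can be sketched as follows. The element and attribute names are illustrative assumptions, not ewzenha's actual schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated tweets; in the tool these come from the
# annotators' work on the crawled data.
annotations = [
    {"id": "1", "category": "politics", "topic": "elections",
     "polarity": "positive", "text": "example tweet text"},
]

# Serialize each annotated tweet as an element with its sentiment
# attributes, under an assumed <tweets> root.
root = ET.Element("tweets")
for a in annotations:
    tweet = ET.SubElement(root, "tweet", id=a["id"], category=a["category"],
                          topic=a["topic"], polarity=a["polarity"])
    tweet.text = a["text"]

xml_bytes = ET.tostring(root, encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```

Keeping the export well-formed like this is what lets the other SATA modules load the annotations directly as training data.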

Guidelines developed for annotating tweets      

This part includes guidelines for annotating tweets with their categories, sentiment topics, and polarities, and for annotating the named entities in the tweets.

Sentiment annotation 

For our task of annotating texts (tweets, posts, documents, ...), the following points should be followed so our work is consistent.

General guidelines:

Define the category/topic of sentiment of the text, or the topic it relates to. If the category is tricky, or the text can relate to more than one category, three people shall agree on the closest topic. If it is still hard to decide, the category is left to the supervisor, or a vote is held among the team members.

Define the polarity of the text with respect to the topic it relates to. For example, if the text is about electing ElBaradei as president, its polarity is positive if the author supports the election, negative if he opposes it, and neutral if he has no opinion.

The polarity of a text is defined only when it expresses an opinion about some issue or event. If the text is a news feed or a plain statement, its polarity is neutral. If the polarity is hard to decide, three people shall agree on the same polarity; if it is still hard to decide, it is left to the supervisor, or a vote is held among the team members.

If a team member has a different opinion about the annotation of a text, he can explain his view to the annotating team; if three of them agree with him, the annotation is changed, otherwise it is not.

Guidelines for the current (first) stage:

Every user will be assigned 500 tweets to annotate; only the supervisor has access to all tweets.

We will rely on each user's sole judgment for his or her assigned tweets. No user other than the supervisor can change another user's annotation.

In addition to the tweet's category, the sentiment topic the tweet is about must also be stated, so the polarity relates to it.

An additional flag can be added so that in upcoming iterations only flagged tweets are re-annotated, saving time and effort. The supervisor flags the tweets he thinks may deserve a different annotation, or per another agreed scenario.

When these guidelines were applied, some difficulties were encountered:

The presence of retweets can affect the accuracy of the results: since the tweets are divided among different people, retweets are distributed among the group members, so more than one annotation can exist for the same tweet with regard to its category, topic, or even sentiment. This creates the need for some kind of similarity measure so that retweets can be eliminated.
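One possible similarity measure for collapsing retweets is to normalize away the "RT @user:" prefix, mentions, links and punctuation, then compare word sets with Jaccard similarity. This is a sketch of one option, not a decision taken in the project; the 0.8 threshold is an illustrative assumption:

```python
import re

def normalize(tweet):
    """Strip RT markers, @-mentions, links and punctuation; return word set."""
    tweet = re.sub(r"\bRT\b|@\w+|https?://\S+", " ", tweet)
    tweet = re.sub(r"[^\w\s]", " ", tweet, flags=re.UNICODE)
    return set(tweet.lower().split())

def is_retweet_of(a, b, threshold=0.8):
    """True if the two tweets' normalized word sets are nearly identical."""
    wa, wb = normalize(a), normalize(b)
    if not wa or not wb:
        return False
    return len(wa & wb) / len(wa | wb) >= threshold

original = "Protesters gather in Tahrir Square tonight"
retweet = "RT @zeinobia: Protesters gather in Tahrir Square tonight"
print(is_retweet_of(retweet, original))  # True
```

Grouping tweets by such near-duplicate clusters would let one annotation cover a retweet chain instead of producing conflicting ones.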

Also, the fact that each tweet's annotation is submitted individually makes the annotation process slow; there has to be some mechanism by which a group of annotated tweets can be submitted together to speed up the process. Even the annotation of each individual word in a tweet needs to be made faster, as that task in particular involves much more data and will take much more time than annotating the tweets themselves.

These difficulties will be addressed in the next version of the annotation tool.

Named Entities annotations 

For the task of annotating named entities, a few guidelines must be observed to facilitate the use of the annotated data in the training and classification process.

Each word in a sentence is annotated as belonging to one of the following classes:

Person
Location
Organization
Date
Other

The default class for any word in a sentence is "Other".

If an entity spans multiple words, such as the first and last name of a person, each of the words comprising the named entity is marked with the class of that entity. For example, the entity "John Doe" is a person, so to annotate it, each of the two words "John" and "Doe" is annotated as belonging to the "Person" class.

If an entity contains within it another entity of a different type, the entire entity, including the words of the inner entity, is annotated with the class of the outer entity. For example, "The John Doe Institute" is an organization, so in that instance all of its words, including "John" and "Doe", are annotated as belonging to the "Organization" class. Another example is "رئيس اتحاد البرلمانيين الإسلاميين" (head of the Union of Islamic Parliamentarians), which is annotated as a person: all four words are annotated as a person even though the inner entity "اتحاد البرلمانيين الإسلاميين" (the Union of Islamic Parliamentarians) is an organization.

If a word would normally indicate a named entity, like the word "دولة" (state), but is followed by an adjective, it is not considered a named entity. For example, "دولة هشة" (a fragile state) is not considered a named entity, and the same goes for "حكومة انتقالية" (a transitional government), since there is no actual organization having that name.
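The guidelines above can be applied mechanically: given entity spans over a tokenized sentence, every word of an outer entity gets the outer entity's class and everything else defaults to "Other". The sentence and spans below are illustrative:

```python
def label_words(words, spans):
    """spans: list of (start, end, cls) token ranges, end exclusive.
    Outer (longer) spans override inner ones, per the guidelines."""
    labels = ["Other"] * len(words)           # default class
    for start, end, cls in sorted(spans, key=lambda s: s[1] - s[0]):
        for i in range(start, end):
            labels[i] = cls                   # longer spans applied last, so they win
    return labels

words = ["The", "John", "Doe", "Institute", "opened"]
spans = [(1, 3, "Person"), (0, 4, "Organization")]
print(label_words(words, spans))
# ['Organization', 'Organization', 'Organization', 'Organization', 'Other']
```

This per-word labeling is exactly the form the Named Entity Recognition training data needs.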

Conclusion 

The approaches and plans for the four research topics have been identified. The proposed approaches are:

1. Clustering followed by text processing to label each cluster, the label being in effect the sentiment topic. The clustering algorithm used, bisecting k-means, is a good candidate, but more algorithms will be investigated.

2. Naïve Bayes and SVM are candidates for sentiment classification of tweets. The final choice between them will be made after conducting a series of experiments.

3. Rule-based and Conditional Random Field (CRF) approaches have been investigated; the choice between them will be finalized after conducting a series of experiments.

4. The influential-post method and mathematical models of social networks will be investigated to identify influential bloggers. So far no specific results have been obtained, and more work will be done on these two approaches.

The generic plan for the research in the next milestone is as follows:

1. Enhance the annotation tool developed so far.
2. Collect and annotate more data: 2000 tweets annotated with their categories, sentiment topics, and polarity; 1000 tweets with positive polarity and 1000 with negative polarity.
3. Conduct more experiments on the annotated data to decide on the algorithm to be used for each specific task.
4. Specify the requirements of each module in the SATA tool.
5. Implement an on-line version of each module to refine the requirement specification.
6. Write a requirement specification document.

 

Appendix I: Stop Words and Output of the Clustering Tool 

Here is the list of stop words used so far, separated by spaces. This first list also includes some English tokens, such as http, to remove hyperlinks from the tweets.

على مكان عن عند قليال جديد قديم اقدم التالي ال ال احد ابدا احد لم ال شئ االن مختلف ضروري يحتاج احتاج احتياج اجديرى ضمن دائما من يعبر يكون و اخر يسال مفتوح فوق ضد وحيد بعد كمان رغم مرة واحد فقط يفتح اخرى ضمن سابقل تقريبا بعد بدا ظھر يدافعجاء ك يسال سال تراجع بعيدا قبل يجب عرض ليس احسن احسن كان ممكن الن يمكن كان مع خلف ھي ھو ھم رقم ارقام حد بعض يجد وجد واضح نھي كال ينتھي انتھى انھى رد يستطيع اليستطيع بين كبير عرض انھىارتفاعھا ارتفاع اخذلماذا ھل تواصل اعطى حقيقة بعد احس يحس اول بعدين ھناك من استطاع ھؤالء فعل ھنا اللذي ياخذ يجمع مجموعات تخسر يخسر يحصل مؤكد تؤيد يؤيد يحكم تحكم مجموعة جمع يعطي يضرب اذھب ذھب يعتصم يطلق محددتتفرغ اعلى يواجه تواجه واجه كيف واجھت تفرغت يتفرغ خسر خسرت كلنا كنا نزل ينزل نزلت تحصل حصل حصلت

ساعدةتفرغ يھتم تھتم اھتمت اھتم مساعدة لم الوقف الفوري حاال فورا يبدو يساعد تساعد كن منطقة مناطق حول مثل وقت كسبت كسب شروط منتصف تربح يربح ربح ربحت يكسب وجد وجدت كفاية حتى مصرع يقتل قتل مقتل بدا يوجد توجد تكسب يشير شاھد يشاھد برنامج

ر تزور غياب يضع تضع وضعسوف وضعت خالل فترة مالبس زيارة زار يزو يقدم تقدم تقديم استاذ سيادة قناة انا نفسياستدعى منذ ذلك تلك اصغر جدا بعض خسارة مستشفى مريض نقل ينقل نقلت تنقل جانب جوانب بيضاء غرفة غرفعشان تاجيل قاضي قضاة غير معناه ماقدمناش كده شارع سرقة رصاص نار ھذا ھذه ما اذا يستدعي استدعت مدن مدينة قرية

فھم فھم قرر يقرر قرار قررت تقرري عايز عايزين يريد تريد احنا قبض يقبض محضر يطالب تطالب نطالب يطالبون طالب قالت يقول تقول طريق طريقة قطع قطعت تقطع يرتدي ترتدي قال استخدم يستخدم الفوة القمع يقمع قمع بعدم نحن كلنا استخدامحاجة فقد فقدتوضع وال ھما ليه شوية سور تضع يضع وضعت عملت يفقد تفقد شرعية مطالب صحية الذاكرة يعمل تعمل عرف عارف عارفة عارفين معروف ربما احتمال يستند تستند يفتقد تفتقد ليه يعرف يعملون عملوا يبقي ابقى نفذ نفذت ينفذبحب يحبون احب يحب تحبغدا امس بكرة يكره تكره هللا حلوة وحشة احلى نلس ناسھا رشوة قبل غد معروفة كبيرة كبير سيد السيد مھمة مھمين اھتم او اختالس ترتيب اعضاء مين تحيا كمان مھم مھتم يھتم تھتم يشرف اشراف ينفذ نفذ نفذت عضوعدم مادة ھجوم يھجم اختالف رأي اراء اختالفات وجھة نظر محتمل احتمال مترتب مرتب يقرر قرر مختلف يختلف مختلفة

وظيفةيھاجم بدون اسندت يسند اسناد استعان استعانة يستعين تستعين يوظف وظف وظفت توظيف تعيين عين عينت اسند http bit ly to it dlvr ow co fb me توقيت ميعاد وقت

The following words are added to form the second list.

فى متابعات في التعامالت ينتقد بقاء الفاعل مجھوالً حذرت إال أنه بناسھا بس أوي يكتب انتقلنا تحكمھا إلى الكثيرين وأنا منھميقتنع غضب شعبي واسع بسبب أمام مقر مطلب فتح باب الحالي جديدة الحقوق العاملون يواصلون استدعاء كثيرين اعرف ان

اح ده يرفضون طلب أستاذ الھدف الرئيسي تبدأ دراسة تفعيل يعني ايه حبس لسه ناس مؤامرة فيه اي سواء عيب له دور ر عملنا زي زى

The following are the parameters used in the experiment.

CLMethod=RB this is the clustering method.

CRfun = I2 this is the criterion function, which abides by the following equation:

    maximize  sum_{i=1..k} sqrt( sum_{v,u in S_i} sim(v, u) )

Where k is the total number of clusters, S is the total set of objects to be clustered, S_i is the set of objects assigned to the ith cluster, v and u represent two objects, and sim(v, u) is the similarity between two objects.
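The criterion can be made concrete with a short sketch. This is not the clustering tool's actual implementation, just a minimal Python illustration of the formula, assuming documents are represented as sparse {term: weight} dictionaries:

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse {term: weight} vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def i2(clusters):
    # I2 = sum over clusters of sqrt(sum of pairwise similarities sim(v, u)
    # over all v, u in the cluster). Self-pairs are included, matching the
    # double sum over v, u in S_i.
    return sum(math.sqrt(sum(cosine(v, u) for v in objs for u in objs))
               for objs in clusters)
```

Grouping similar objects into the same cluster increases the within-cluster similarity sums, so a better clustering scores higher under I2.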

SimFun = Cosine this is the similarity function, which is cosine similarity.

#Clusters = 10 the number of clusters.

RowModel = None this is an option to select the model used to scale the columns of each row. Here it is set to its default, so each row is used as it is in the input file.

ColModel = IDF this is an option to select the model used to scale the columns globally across all the rows. Here it is set to IDF, which corresponds to inverse-document frequency.

GrModel = SY-DIR this is a parameter that controls the type of nearest-neighbor graph. It is set to symmetric-direct, where the graph is constructed with an edge between two objects u and v if and only if both of them are in the nearest-neighbor lists of each other.

NNbrs = 40 this is the nearest-neighbor parameter. It is set to 40, which is the default value.
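The ColModel=IDF scaling and the symmetric-direct graph construction described above can be sketched in Python. This is a simplified illustration of those two options under the stated definitions, not the clustering tool's actual code:

```python
import math

def idf_scale(rows):
    # ColModel=IDF: scale each column (term) by inverse-document frequency
    # across all rows, i.e. weight *= log(N / df(term)).
    n = len(rows)
    df = {}
    for row in rows:
        for term in row:
            df[term] = df.get(term, 0) + 1
    return [{t: w * math.log(n / df[t]) for t, w in row.items()} for row in rows]

def symmetric_direct_graph(sims, nnbrs):
    # GrModel=SY-DIR: keep an edge (u, v) only if each object is in the
    # other's nnbrs nearest-neighbor list. `sims` is a full similarity matrix.
    n = len(sims)
    knn = []
    for u in range(n):
        order = sorted((v for v in range(n) if v != u),
                       key=lambda v: sims[u][v], reverse=True)
        knn.append(set(order[:nnbrs]))
    return {(u, v) for u in range(n) for v in knn[u]
            if u < v and u in knn[v]}
```

Note how the symmetric requirement prunes one-sided neighbor relations: an object that considers another "near" gains no edge unless the relation is mutual.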

Colprune = 1.00 this parameter specifies the factor by which the columns of the matrix will be pruned before performing the clustering. The range value is from 0 to 1, 1 indicates no pruning which is the default value.

EdgePrune = -1.00 this parameter controls how the edges in the graph partitioning clustering algorithms will be pruned based on the link activity of their incident vertices. The value of -1 suppresses edge pruning which is the default value.

VtxPrune = -1.00 this parameter controls how outlier vertices in the graph partitioning algorithms will be pruned. The value of -1 suppresses vertex pruning which is the default value.

MinComponent = 5 this parameter is used to eliminate small connected components from the nearest neighbor graph prior to clustering. It’s used only with graph partitioning algorithms. Its default value is 5.

CSType = Best this parameter specifies the method used for selecting the next cluster to be bisected by the repeated-bisecting algorithms. It is set to Best, which bisects the cluster that will lead to the best value of the criterion function that guides the clustering process.

AggloFrom = 0 this parameter instructs the clustering programs to compute a clustering by combining both the partitional and agglomerative methods. It is set when using an agglomerative algorithm.

AggloCRFun = I2 this parameter controls the criterion function used during the agglomerative algorithm when the AggloFrom parameter is set.

NTrials = 10 this parameter selects the number of different clustering solutions to be computed by the various partitional algorithms. The default value is 10.

Niter = 10 this parameter selects the maximum number of refinement iterations to be performed within each clustering step. Its range is from 5 to 20. It is set to the default value, which is 10.

The following part of the results shows the use of the option -showsummaries=itemsets. This option shows the percentage of co-occurrence of more than one feature in the tweets belonging to each cluster. The following shows the descriptive and discriminating features of result 2:

--------------------------------------------------------------------------------
12-way clustering solution - Descriptive & Discriminating Features...
--------------------------------------------------------------------------------

Cluster 0, Size: 6, ISim: 0.249, ESim: 0.005
Descriptive: 2.6أصال , %2.7يسقط , %3.7بصراحة , %12.7قانون , %25.4الطوارئ %
Discriminating: 1.4اللي , %1.6الثورة , %2.0بصراحة , %5.6قانون , %13.8الطوارئ %

Cluster 1, Size: 6, ISim: 0.187, ESim: 0.005
Descriptive: 4.0الدنيا , %4.0ميدان , %4.7مبارك , %6.0إال , %6.4تحرير %
Discriminating: 1.4اللي , %2.2الدنيا , %2.2ميدان , %2.8إال , %3.5تحرير %

Cluster 2, Size: 7, ISim: 0.177, ESim: 0.005
Descriptive: 2.3وبدون , %3.8الزم , %4.4المصريين , %4.8اليوم , %11.6إسرائيل %
Discriminating: 1.3وبدون , %2.1الزم , %2.1المصريين , %2.6اليوم , %6.4إسرائيل %

Cluster 3, Size: 7, ISim: 0.166, ESim: 0.005
Descriptive: 10.5الثوره %, belalfadl 9.1%, 2.1باقى , %2.7اسرائيل , %3.0ان %
Discriminating: 5.8الثوره %, belalfadl 5.0%, 1.2باقى , %1.5اللي , %1.5اسرائيل %

Cluster 4, Size: 7, ISim: 0.159, ESim: 0.004
Descriptive: 4.0اللى %, bothainakamel1 3.4%, Salafi 3.3%, 3.1صوت , %3.1بيقول %
Discriminating: 1.8اللى %, Salafi 1.7%, 1.7بيقول , %1.7العقل , %1.7صوت %

Cluster 5, Size: 8, ISim: 0.162, ESim: 0.009
Descriptive: 1.9الناس , %2.4موقف , %2.7بيحصل , %5.4في , %16.9اللي %
Discriminating: 1.2في , %1.4بيحصل , %1.4موقف , %7.1اللي %, Egypt 1.2%

Cluster 6, Size: 8, ISim: 0.153, ESim: 0.007
Descriptive: ikhwan 7.0%, 2.3ناس , %2.4الفردي , %2.6الجزيرة , %3.9االنتخابات %
Discriminating: ikhwan 4.0%, 1.3فعال , %1.3ناس , %1.4الفردي , %2.2االنتخابات %

Cluster 7, Size: 8, ISim: 0.146, ESim: 0.004
Descriptive: 2.5داخل , %2.5لو , %3.2مكنش , %4.2الشباب , %7.8السفارة %
Discriminating: 1.5اللي , %1.6الثورة , %1.8مكنش , %1.9الشباب , %2.7السفارة %

Cluster 8, Size: 13, ISim: 0.104, ESim: 0.005
Descriptive: tahrir 6.9%, noscaf 5.3%, egypt 4.5%, sep9 3.7%, jan25 2.7%
Discriminating: tahrir 3.6%, noscaf 2.5%, egypt 1.7%, 1.4ثوره %, jan25 1.3%

Cluster 9, Size: 14, ISim: 0.102, ESim: 0.006
Descriptive: 10.1العسكري %, SCAF 8.8%, 7.1 المجلس %, Shorouk 2.0%, News 2.0%
Discriminating: 5.8العسكري %, SCAF 4.7%, 1.2اللي , %3.0المجلس %, News 1.1%

Cluster 10, Size: 13, ISim: 0.099, ESim: 0.005
Descriptive: 2.1العسكر , %2.1ھيالقوا , %3.3مش , %6.9علي , %7.0ثورة %
Discriminating: 1.1اللي , %1.2ھيالقوا , %1.2في , %3.6علي , %4.0ثورة %

Cluster 11, Size: 13, ISim: 0.098, ESim: 0.005
Descriptive: 4.4طنطاوي , %5.2المشير %, Egypt 3.3%, IsraeliEmbassy 2.9%, الشھادة2.5%
Discriminating: 2.2طنطاوي , %2.6المشير %, IsraeliEmbassy 1.7%, الشھادة , %1.6اللي1.4%

Appendix II: Sample of Sentiment Tweet Corpus 

من شھداء الثورة فى البحيرة لبيان سبب وتاريخ الوفاة 5المحكمة تأمر باستخراج جثث atsp

صابي ثورة يناير مجانًا في مصر أو ألمانيامش مكسوفين يا مجلس بالليص نقابة األطباء األلمان تعرض عالج م atsp يالجعاوه التويتر للتغريد فقط اما النباح على قناتي العالم والمنارatsn نتنياھو مستعدون أن نكون أسخياء بشأن الحدود الفلسطينيةعلى اساس انھا ارضatsn

شباب مصر وربيع العرب احب أقول لنتينياھو واالمريكان اللي بيسقفوله حمراقابلوا بقى من atsn يسقط يحيى الجمل المنافق مھندس الثورة المضادةatsn رغم ان البرادعي عليه عالمات تعجب رھيبه الفتره الحاليه اال اني سعيد انه ماانضمش للمجلس الوطنيatsp

سنة ياخد اية 30لشعب من واحد في المظاھرات خد اعدام طيب واللى بيقتل في ا 18أمين الشرطة الھربان اللى قتل atsp مصر إحالة مبارك ونجليه عالء وجمال إلى محكمة الجنايات بتھمة قتل المتظاھرينatsp مسجل خطر يضع في بيته أسد وصقر جارح لمقاومة قوة الشرطةatsn مضى حسني مبارك وال أحد يبالي بمصيره فھل يتعظ البقيةatsn

ن تعرض عالج مصابي ثورة يناير مجانًا في مصر أو ألمانيشكلنا ايه نقابة األطباء األلما atsp

ده المجلس العسكري ھو كمان مسھل أمورة مع فوق الشعب يريد تكييف الميدان 39دا احنا ھانطظط درجة الحرارة atsn بسبب ائتالفات خطف الثورة سينتفض الشعب لثورة الجياع الحذار من خطف الثورةatsp

لن الجيش عن مالبسات مقتل رامي فخري وال تم تحويل القضية للنائب العامحتى االن لم يع atsn انفجار بمنشأة نفطية أثناء زيارة نجادatsn

من المقبوض عليھم بأحداث إمبابة والنيابة تفرض على المحامي عدم اإلطالع على 20إخالء سبيل atsp

خاص ـ الصقر يؤكد مفاوضات األھلي القطري كورابيا -خاص ـ الصقر يؤكد مفاوضات األھلي القطري - atsp البرادعى فى مقابلة مع سي إن إن مصر تتفكك اجتماعيا وتُفلس اقتصاديا وال ندرى إلى أين نتجهatsp عمال وخونة كالب وسايدة اسود خلو بالكو االسود جاعتatsn

القضية للنائب العام حتى االن لم يعلن الجيش عن مالبسات مقتل رامي فخري وال تم تحويل atsn معتز عبدالفتاح دا بياخد فيھا من فترة طويلة ريتويت لو موافقatsn

إحالة الرئيس المخلوع للجنايات كرت المجلس قبل جمعة الغضب الثانيةatsp

ضافه إلى وابل أخبار كويسه على شوية شائعات على بعض من القرارات المتخبطه باإل_وزى ما إتعودنا قبل كل جمعه مليونيه من التصريحاتatsp الشعب يريد الورد فى البساتين الورد يريد الشعب فى الميادينatsp معھد أمريكي طريقة إدارة مصر لم تختلف كثيراً بعد الثورة والديمقراطية لم تأت بعدatsn شاھد لماذا االعتراف ليس ھو دليال قاطعا في البحرين عترافات بالتعذيبatsn

طنا ديكتاتور ولم نُسقط بعد ديكتاتور بداخلنا كلنا يعتقد أن رأيه عنوان للصواب وال يقبل المختلف معه في الرأي وال يقبل أسق نقدهatsn ثوار مصر يدعون لمليونية إنقاذatsp

منھم احرار وھو في اسرائيل ال تعليق% 05مليون عربي في الوطن العربي 300اكثر تعليق مؤلم في الخطاب atsn شاھد لماذا االعتراف ليس ھو دليال قاطعا في البحرين اعترافات بالتعذيبatsn االئتالف جاى يركب على الثورة التانيه قبلھا بيومين كنتو فين من شھرين يا بقرatsn حتى االن لم يعلن الجيش عن مالبسات مقتل رامي فخري وال تم تحويل القضية للنائب العامatsn

نيفة حول عمرو موسى تنھي مؤتمر مصر األول والبرادعي يعتذر عن االنضماممشادات ع atsn احب أقول لنتينياھو واالمريكان اللي بيسقفوله حمراقابلوا بقى من شباب مصر وربيع العرatsn شاھد لماذا االعتراف ليس ھو دليال قاطعا في البحرين اعترافات بالتعذيبatsn

فينك يا قيصر يا شماتة ابلة طاظا فيّهثم لطمت كليوبترا وقالت atsp ثورة مصر الرد على صبحى صالح كشف أكذوبة التسع دول التى تشتatsp الفيفا ھذا الموسم األنجح لـزيزو في حياته الكرويةatsp الشعب يريد الورد فى البساتين الورد يريد الشعب فى الميادينatsp

بوابة األھرام قنديل ينتھي من فتاة الشمس والقمرالوفد جمع - الوفد قنديل ينتھي من فتاة الشمس والقمر atsp

من المقبوض عليھم بأحداث إمبابة والنيابة تفرض على المحامي عدم اإلطالع على 20إخالء سبيل atsp شاھد لماذا االعتراف ليس ھو دليال قاطعا في البحرين اعترافات بالتعذيبatsn

مي يارب مصر من المتسلقين ومّدعي البطولة الشعب ده لو خد فرصته سيمأل العالم باالبداع المصري يارب النصر لمصر اح األصيل

Appendix III: Part of  Speech Tags 

The following tag definitions are those used by the Stanford NLP part-of-speech tagger [5].

CC Coordinating conjunction e.g. and,but,or...

CD Cardinal Number

DT Determiner

EX Existential there

FW Foreign Word

IN Preposition or subordinating conjunction

JJ Adjective

JJR Adjective, comparative

JJS Adjective, superlative

LS List Item Marker

MD Modal e.g. can, could, might, may...

NN Noun, singular or mass

NNP Proper Noun, singular

NNPS Proper Noun, plural

NNS Noun, plural

PDT Predeterminer e.g. all, both ... when they precede an article

POS Possessive Ending e.g. Nouns ending in 's

PRP Personal Pronoun e.g. I, me, you, he...

PRP$ Possessive Pronoun e.g. my, your, mine, yours...

RB Adverb Most words that end in -ly as well as degree words like quite, too and very

RBR Adverb, comparative Adverbs with the comparative ending -er, with a strictly comparative meaning.

RBS Adverb, superlative

RP Particle

SYM Symbol Should be used for mathematical, scientific or technical symbols

TO to

UH Interjection e.g. uh, well, yes, my...

VB Verb, base form subsumes imperatives, infinitives and subjunctives

VBD Verb, past tense includes the conditional form of the verb to be

VBG Verb, gerund or present participle

VBN Verb, past participle

VBP Verb, non-3rd person singular present

VBZ Verb, 3rd person singular present

WDT Wh-determiner e.g. which, and that when it is used as a relative pronoun

WP Wh-pronoun e.g. what, who, whom...

WP$ Possessive wh-pronoun e.g. whose

WRB Wh-adverb e.g. how, where, why
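One common use of such a tagset in sentiment analysis is to filter tagged tokens down to the word classes that tend to carry opinions (adjectives, adverbs, interjections). A minimal sketch, assuming the input is already a list of (token, tag) pairs; the sample sentence below is hypothetical:

```python
# Hypothetical sample of already-tagged (token, Penn Treebank tag) pairs.
TAGGED = [("the", "DT"), ("service", "NN"), ("was", "VBD"),
          ("really", "RB"), ("terrible", "JJ"), ("yesterday", "NN")]

# Adjective, adverb, and interjection tags from the list above.
SENTIMENT_TAGS = {"JJ", "JJR", "JJS", "RB", "RBR", "RBS", "UH"}

def sentiment_candidates(tagged):
    # Keep only the tokens whose tags belong to the opinion-bearing classes.
    return [tok for tok, tag in tagged if tag in SENTIMENT_TAGS]
```

Here the filter keeps "really" (RB) and "terrible" (JJ) as candidate sentiment features while discarding the determiner, nouns, and verb.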

Appendix IV:  

 

      Manual Annotation Tool for SATA project 

(ewzenha) 

This report introduces the concept of annotations and their uses. It supplements a specific project with detailed guidelines describing the goals, benefits, design, plans, implementation, and instructions for this annotation tool, which is a sub-project of the sentiment analysis tool for Arabic prototype. The report also provides detailed instructions for annotators and supervisors, including annotation guidelines and examples of some annotations.

Table of Contents 

1. Introduction
2. Purpose
3. Requirements
3.1 Features and Specifications
3.2 General Principles
3.3 User Requirements
3.4 System Requirements
3.5 Project Phases
3.5.1 Phase 1
3.5.2 Phase 2
3.5.3 Phase 3
3.6 Test Plans
3.6.1 General Test
3.6.2 System Test
3.6.3 Strategy Test
3.6.4 Functionality Test
3.6.5 Deliverables Test
3.7 Completion
4. Planning
4.1 Rules of Interpretation
4.1.1 Aim
4.2 Annotation Validation
4.3 Quality Control
4.4 Strategy
4.5 Data Sources
5. Organizational Behavior
5.1 Roles and Policies
5.1.1 Administrator
5.1.2 Supervisor
5.1.3 Annotators
6. Design and Implementation
7. Visualization capabilities
7.1 Pie Chart
7.2 Tag Clouds
7.3 Tweets Tab
7.4 Archive Tab
7.5 Add Tab
7.6 Editing Page
7.7 Downloads
8. Current ewzenha
8.1 Topics
8.1.1 General Political Hashtags
8.1.2 Presidential Candidates
8.1.3 News Websites
8.1.4 Services and Companies
8.2 Data Sources
9. References

List of Figures

Figure 1: Diagram of ewzenha
Figure 2: Roles of each participant
Figure 3: Work Flow of ewzenha Tool
Figure 4: Twitter API Data
Figure 5: ewzenha Pie Chart
Figure 6: ewzenha tag clouds
Figure 7: ewzenha Tweets Table
Figure 8: ewzenha archive positive words list
Figure 9: ewzenha adding Tweets/Text Page
Figure 10: ewzenha editing page
Figure 11: ewzenha Download Page

1. Introduction 

Annotation is the methodology of collecting, creating, and adding information about a word, phrase, comment, paragraph, section, chapter, or an entire document. This information is called metadata, which is data about data.

The difference between annotation and other forms of metadata is that annotation focuses on a specific point in a specific region of data; the concept is applied to specific ranges of text. This is done repeatedly for several reasons: to speed up retrieval by providing a dataset of words drawn from these texts, or to add predefined words to the text in order to classify and categorize it.

Although there are many methods of annotation, the manual method was chosen over automatic or semi-automatic annotation: it is slower, but it produces more precise results.

The following diagram visualizes the levels that can be derived from the annotation.

Figure 13: Diagram of ewzenha

2. Purpose 

The main purpose of ewzenha is to build high-precision data sets providing annotated tweets; lists of positive, negative, and affirmation words; and named entities. The data used are acquired only from tweets on Twitter. These data sets will be very helpful in the SATA prototype, as they will provide training data to train and test other developed tools (Named Entity Recognition, Sentiment Classifiers, Topic Extraction, Opinion Leaders, Opinion Influence, etc.), and they could also be used as a benchmark for any Arabic sentiment analysis tool.

All the data and the different lists are exported, each in a separate XML file, so that they can be imported into the various developed tools in order to compare their results, and so that modules can be built to extract named entities, tweet topics, and influential persons from the tweets.
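Such an export can be sketched with Python's standard xml.etree.ElementTree. The element and attribute names below are illustrative assumptions, not the tool's actual schema:

```python
import xml.etree.ElementTree as ET

def tweets_to_xml(tweets):
    # Serialize annotated tweets into a per-table XML file of the kind
    # ewzenha exports; tag and attribute names here are assumed for
    # illustration only.
    root = ET.Element("tweets")
    for t in tweets:
        node = ET.SubElement(root, "tweet", sentiment=t["sentiment"],
                             category=t["category"])
        ET.SubElement(node, "text").text = t["text"]
        ET.SubElement(node, "topic").text = t["topic"]
    return ET.tostring(root, encoding="unicode")
```

Because each list lives in its own file with a fixed root element, a consuming tool can load just the table it needs without parsing the rest.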

3. Requirements 

The project requirements describe the behavior of the tool to be developed. They include a set of use cases describing all the interactions users will have with the tool, in addition to what is to be created, implemented, and accomplished by the tool, as well as its usability, efficiency, accuracy, quality, and user interface. The main requirement is to focus on and clarify the purposes of developing such a tool.

3.1 Features and Specifications 

The developed product is a web-based tool in order to be accessible.

The tool must provide a reliable storage database.

The tool must be Verifiable, Modifiable and Traceable.

The system should never crash.

Accuracy is critical.

Every keystroke should provide a user response within 100 milliseconds.

Clearly documented tweet annotation and named entity annotation guidelines.

Evaluation of tweets must depend on the project’s guidelines.

Good visualization capabilities and friendly user interface.

Enumerate the data so that it is easier to use.

Secure servers for the database and for launching the tool.

Applying regular backups.

Manually classify each tweet as positive, negative, neutral, or sarcastic.

Ability to search and save new tweets.

Ability to edit tweets in terms of sentiment and category.

Ability to delete tweets or words.

Ability to extract topics for each tweet.

“Edit button” for each tweet to annotate each word entity separately in terms of Positive Words, Negative Words, Arabic Corpus, Slang Corpus, Affirmation Words and Entities (Location, Person, Organization, Date and other).

Generate and export files in XML format separately.

Specific data source.

The output of the tool whether the XML files or the displaying ability must include the following features:

1. Informative: Among thousands of tweets and words, an informative and well-ordered viewing style helps annotation to be correct and easier for annotators.

2. Descriptive: Provides a description of the annotated tweets and words by viewing information regarding each annotator and his/her assigned tweets.

3. Evaluative: In addition to the previous information, distinguishing between the evaluated tweets and the new unevaluated ones.

3.2 General Principles  

Annotation is not about trying to attach a label to every word.

There will be a designated phase of the annotation process for the discussion and resolution of differences between the work of multiple annotators.

Justify every annotation against the guidelines.

 

3.3 User Requirements 

Internet Connection: Cable or DSL.
Operating System: XP, Vista, or Windows 7.
Browser: Google Chrome.

 

3.4 System Requirements 

A server to launch the web-based tool.
A server for the MySQL database.
PHP 5.2 or above.
MySQL 3.2 or above.
Apache 1.3 or above.
A developer with good knowledge of web development (HTML, PHP, MySQL).

 

3.5 Project Phases 

The following are the three phases of the project:

3.5.1 Phase 1: Planning, designing and implementing the tool
3.5.2 Phase 2: Testing and validating the tool
3.5.3 Phase 3: Populating data into the tool

 

3.6 Test Plans 

The developer and then the team members are responsible for performing the following tests in order to verify the tool.

3.6.1 General Test 

Gathering Data test

Data Entry test

Visualizing Data test

3.6.2 System Test 

Syntax errors test

Runtime errors test

Logic errors test

Threading errors test

3.6.3 Strategy Test 

Performance test

Security test

Backups test

Recovery test

User Acceptance test

User Responsibilities test

3.6.4 Functionality Test 

This could be done by exhaustive use of the tool’s functionalities.

3.6.5 Deliverables Test 

Generating XML Files

XML files are error free

Downloading files

 

3.7 Completion 

The project is considered complete and successful upon execution of the following:

Features: When the project successfully meets all the requirements.

Testing: When the different test scenarios have been evaluated.

Cross-Browsing: When it is perfectly compatible with all browsers.

Functionalities: When annotators have tested and agreed with every function developed in the tool.

4. Planning  

4.1 Rules of Interpretation   

To increase the consistency of judgments about what constitutes a directly related event or activity, annotators refer to a set of tweet annotation guidelines and named entity annotation guidelines. These direct the annotators' judgments about the topics, categories, and sentiments of each tweet.

4.1.1 Aim  

Increase the uniformity of the annotation process.

Increase the accuracy of the tool.

4.2 Annotation Validation 

This is an important process for determining how well three annotators agree on a given annotation of the same tweet. The ewzenha tool will calculate this agreement automatically. Tweets will be considered correctly annotated only if approved by this agreement.
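The agreement rule can be sketched as a simple majority vote over the three annotators' labels. This illustrates the idea of the check, not necessarily the tool's exact calculation:

```python
from collections import Counter

def majority_label(labels, min_agree=2):
    # A tweet counts as correctly annotated only if at least `min_agree`
    # annotators chose the same label; otherwise it is flagged for the
    # discussion phase (returned as None here).
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None
```

With three annotators and min_agree=2, any label chosen by at least two of them wins, and a three-way split is sent back for discussion.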

4.3 Quality Control  

The basic goal of quality control is to ensure that the tool meets the requirements and is dependable and satisfactory. Essentially, it involves continuous examination of the tool by all the participants to ensure a certain level of quality of operation.

Supervisor: The main role is to ensure the evaluation of the annotation process.

Administrator: The main role is to keep the tool error free and fix any bugs.

Annotators: Their role is to test and evaluate the tool in addition to the annotation process.

4.4 Strategy 

The selected topics should be well-known keywords, hashtags, persons, companies, services, organizations, websites, etc.; thus a nonrandom sample of tweets, in Arabic only, is drawn from Twitter. This nonrandom selection method yields a number of tweets to be stored in the database at each search. Unifying the data source and the topics will contribute to a better evaluation and thus to better training data sets, although no great effort is required to build up a big dataset of tweets. Annotators should be assigned a number of tweets to annotate, as well as annotating words to build up the Arabic corpora.

4.5 Data Sources 

Twitter
Text file

 

5.   Organizational Behavior 

5.1 Roles and Policies 

Everyone who will use the tool, including the admin himself, must first register in order to view and use it. Not all participants have the same access to all functions of the tool; each role has different privileges.

5.1.1 Administrator 

Keep making sure that all the transactions performed by the tool work correctly and efficiently.

Handling all errors and troubleshooting. Regularly updating the database with new tweets. Assigning the new tweets to annotators and to seniors as well. Updating and modifying the tool when needed. Solving any problem that arises for supervisors, senior annotators, or junior annotators.

5.1.2 Supervisor 

The role of the Supervisor is to make sure the tool's benefits are fully used: revising the tweet and word annotations, writing comments, and holding the only privileges to delete or update annotated tweets.

5.1.3 Annotators 

Principles to keep in mind when annotating:

1. Tweet event: a good understanding of the tweet's sentences.
2. What: what happened during the event.
3. Who: who (person, organization) was involved in the event, and who wrote the tweet.
4. When: when the event occurred.
5. Where: where the event occurred.

Actions to be done

1. Topic extraction: a brief phrase written to describe the meaning or the title of the tweet.

2. Sentiment: to extract the suitable judgment for each tweet.
3. Category: to choose a suitable predefined category from a drop-down menu.
4. Repetition: to find repeated tweets and delete them.
5. Addition: after following the previous instructions, a final step of adding the tweet is taken.

The following figure shows the roles of each user participating in the tool; there is free and direct communication between all team members.

Figure 14: Roles of each participant

Additionally, each user above might have different annotation facilities & privileges.

6. Design and Implementation 

The following figure illustrates the work flow of the tool.

Figure 15: Work Flow of ewzenha Tool

The following figure illustrates the data returned from the Twitter API [2].

Figure 16: Twitter API Data

7. Visualization capabilities 

7.1 Pie Chart 

There is a pie chart that shows the total number of tweets and the percentages and numbers of the positive, negative, neutral and sarcastic tweets, in addition to the total number of annotated tweets, as shown in the following figure.

Figure 17: ewzenha Pie Chart

7.2 Tag Clouds 

There is also a tag cloud for predefined keywords showing their total number in the Ewzenha database. It calculates the number of occurrences of each word and presents the words in ascending order of occurrence in terms of word size, as shown in the figure.

Figure 18: ewzenha tag clouds

7.3 Tweets Tab 

The Tweets tab is to view information about the tweets as follows:

o Number of tweets.
o Classification of the tweet (Positive, Negative, Neutral, Sarcasm).
o Category of the tweet (السياسة, محاكمات, الثورة, الاخوان المسلمين, الاقباط, مبارك, المجلس العسكري, etc.).

o Time and Date of the tweet.
o Topics extracted from the tweets.
o User names who annotated the shown tweets.
o Tables to show tweets in the following order:

All the saved tweets.
Positive tweets.
Negative tweets.
Neutral tweets.
Sarcastic tweets.

o Go to Page option to jump to any page. The figure below shows the view of the tweets tab.

Figure 19: ewzenha Tweets Table

7.4 Archive Tab 

The Archive tab is to view the word lists built from the tweets:

o Positive words.
o Negative words.
o Arabic corpus.
o Slang corpus.
o Affirmation words.
o Word entities (Location, Person, Organization, Date and Other).

Figure 20: ewzenha archive positive words list

7.5 Add Tab 

Figure 21: ewzenha adding Tweets/Text Page

7.6 Editing Page 

Figure 22: ewzenha editing page

7.7 Downloads  

The following page allows users to choose the table they want to download. All files are in XML format.

Output files:
1. Tweets: Contains all the tweets with their categories, topics, and the date and time of each tweet.
2. Positive Tweets: Contains all the information in the Tweets file, but only for the positive tweets.
3. Negative Tweets: Contains all the information in the Tweets file, but only for the negative tweets.
4. Neutral Tweets: Contains all the information in the Tweets file, but only for the neutral tweets.
5. Sarcastic Tweets: Contains all the information in the Tweets file, but only for the sarcastic tweets.
6. Positive Words: Contains a list of positive words.
7. Negative Words: Contains a list of negative words.
8. Arabic Corpus: Contains a list of Arabic words.
9. Slang Corpus: Contains a list of Arabic slang words.
10. Affirmation Words: Contains a list of Arabic affirmation words.
11. Entities: Contains a list of named entities (Location, Person, Date, Organization, Other).
12. Full Tweets: Contains a full representation of each tweet: in addition to all the information included in the Tweets file, this file has the named entities and the sentiment words tagged as positive or negative in each tweet.

Figure 23: ewzenha Download Page
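Since all download files are XML, consumers can load them with a standard XML parser. The sketch below reads a hypothetical Tweets file with Python's `xml.etree.ElementTree`; the element and attribute names (`tweets`, `tweet`, `classification`, `category`, `text`, `date`, `time`) are assumptions for illustration, not the tool's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout for the downloaded Tweets file; the real
# ewzenha schema may differ.
sample_xml = """
<tweets>
  <tweet id="1" classification="Positive" category="Politics">
    <text>sample tweet text</text>
    <date>2011-09-24</date>
    <time>12:00</time>
  </tweet>
</tweets>
"""

root = ET.fromstring(sample_xml)
for tweet in root.findall("tweet"):
    # Print the annotated class alongside the tweet body.
    print(tweet.get("classification"), tweet.findtext("text"))
```

The same pattern applies to the word-list files, which would simply carry one element per word.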

8. Current ewzenha  

This part describes the current status of ewzenha. The target now is to provide different types of data sets, including but not limited to 1500 positive annotated tweets and 1500 negative annotated tweets, in addition to neutral and sarcastic tweets, and lists of 1000 words each for positive words, negative words, affirmation words and entities. We may also assign Link's trainees or AUC's junior students to use the tool under complete supervision.

8.1 Topics 

Topics are selected from well-known political hashtags, news websites, presidential candidates, and services and companies, with tweets drawn from Twitter in Arabic only. Each search returns fifteen tweets, since we currently use the search API method rather than the streaming API. Annotators will be assigned a number of tweets to annotate, as well as words to annotate in order to build up the Arabic corpora. The following are the 38 selected topics used to gather tweets:

8.1.1 General Political Hashtags 

1. mubaraktrial
2. Tahrir
3. Jan25
4. Ikhwan
5. Sep9
6. camelbattletrial
7. camelbattle
8. IsraeliEmbassy
9. SCAF
10. NOSCAF
11. ArabSpring
12. Copts
13. Maspero
14. EgyTV
15. Elmosher

8.1.2 Presidential Candidates 

1. ElBaradei
2. HamdeenSabahy
3. amremousa
4. DrEssamSharaf
5. Bastawisi2011
6. Ayman_Nour
7. el3wwa

8.1.3 News Websites 

1. AlArabiya
2. alwafdwebsite
3. eahram
4. BBCArabicNews
5. AlMasryAlYoum_A
6. Shorouk_News
7. Tahrir_News
8. Dostor
9. youm7
10. aljazeera

8.1.4 Services and Companies 

1. VodafoneEgypt
2. Mobinil
3. EtisalatMisr
4. TEDataEgypt
5. CocaColaEgypt
6. LINKDSL
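The gathering step described above can be sketched as a loop that builds one search API query per topic. The endpoint URL and parameter names (`rpp` for results per page, `lang`) follow the public Twitter search API of that era and should be treated as an assumption rather than a tested call; the topic list here is a small subset of the 38 topics listed above.

```python
from urllib.parse import urlencode

# Subset of the topics listed above; each becomes one search query.
topics = ["mubaraktrial", "Tahrir", "Jan25", "VodafoneEgypt"]

def build_search_url(topic, count=15, lang="ar"):
    """Build a search API URL asking for `count` Arabic tweets on `topic`.
    Endpoint and parameter names are assumptions based on the old
    public search API, not verified calls."""
    params = urlencode({"q": "#" + topic, "rpp": count, "lang": lang})
    return "http://search.twitter.com/search.json?" + params

for topic in topics:
    print(build_search_url(topic))
```

The `count=15` default mirrors the fifteen tweets per search mentioned in the text.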

8.2 Data Sources 

Only Twitter


9. References 

[1] http://ewzenha.linkdev.com/

[2] http://www.scoop.it/t/datavisualization/p/62758360/map-of-a-twitter-status-object