




Social Network Analysis and Mining (2019) 9:12 https://doi.org/10.1007/s13278-019-0557-y

ORIGINAL ARTICLE

Character level embedding with deep convolutional neural network for text normalization of unstructured data for Twitter sentiment analysis

Monika Arora¹ · Vineet Kansal¹

Received: 30 August 2018 / Revised: 5 March 2019 / Accepted: 6 March 2019 © Springer-Verlag GmbH Austria, part of Springer Nature 2019

Abstract
On social media platforms such as Twitter and Facebook, people express their views, arguments, and emotions about many events in daily life. Twitter is an international microblogging service featuring short messages called "tweets" in different languages. These texts often contain noise in the form of incorrect grammar, abbreviations, freestyle writing, and typographical errors. Sentiment analysis (SA) aims to predict the actual emotions expressed by people in raw text through the field of natural language processing (NLP). The main aim of our work is to process raw sentences from the Twitter dataset and find the actual polarity of the message. This paper proposes text normalization with a deep convolutional character-level embedding (Conv-char-Emb) neural network model for SA of unstructured data. This model can tackle the following problems: (1) processing noisy sentences for sentiment detection, (2) operating in a small memory space compared with word-level embedded learning, and (3) accurate sentiment analysis of unstructured data. The initial preprocessing stage for text normalization includes the following steps: tokenization, out-of-vocabulary (OOV) detection and replacement, lemmatization, and stemming. Character-based embedding in a convolutional neural network (CNN) is an effective and efficient technique for SA that uses fewer learnable parameters in feature representation. Thus, the proposed method performs both normalization and classification of sentiments for unstructured sentences. The experimental results are evaluated on the Twitter dataset with three-point polarity (positive, negative, and neutral). As a result, our model performs well in normalization and sentiment analysis of raw Twitter data enriched with hidden information.

Keywords Opinion mining · Convolutional neural network · Phonetic algorithm · Soundex · SemEval dataset

1 Introduction

Nowadays, technology plays a key role in the development of real-world business, entertainment, and economic sectors. People express their views on products and political issues on social networking platforms like Twitter, Facebook, Instagram, etc. (Hanafiah et al. 2017). To improve their services, companies require accurate and compatible feedback from their customers. Usually, a post obtained from these sites is a short text expressed in structured, semi-structured, or unstructured form (Vinodhini and Chandrasekaran 2012). An informal way of writing leads to syntactically incorrect sentences, which usually contain incorrect grammar and dropped vowels or punctuation marks. This is the primary challenge we face when attempting to analyze or normalize such informally written short text (Singh and Kumari 2016a, b). In the healthcare sector, patients are likely to express and share their feedback about diseases and compare it with the views of other patients. Moreover, many people look for medical advice and remedies when they experience injury or health issues. However, online posts related to medical information do not guarantee quality information about particular health problems (Roccetti et al. 2016, 2017).

Opinion extraction from these sources can help to understand customer/patient feedback on a specific disease, product, policy, issue, or service (Allahyari et al. 2017). Normalizing unstructured data can enhance learning processes, leading to efficient polarity determination. But normalizing unstructured and semi-structured text is more challenging because the information is expressed in different forms of emotions, unusual words, incorrect grammar (freestyle text), and short forms from all over the world in different languages. Twitter is one of the blogging sites whose posts are written with a length of 140–200 characters (Go et al. 2009; Martínez-Cámara et al. 2014). However, processing data from Twitter is not an easy task, as it has several characteristics that distinguish it from other social networking sites. In the past few years, researchers have found great scope in sentiment analysis for finalizing the opinion of a sentence or text as positive, negative, or neutral polarity.

* Monika Arora [email protected]

1 IET, AKTU, Lucknow, Uttar Pradesh, India

Text normalization and sentiment analysis help to establish the overall opinion of customers towards products. The sentiment of noisy data is not clear when extracting the opinion of the user. Text mining is carried out using machine learning and lexicon-based techniques. Lexicon-based learning requires a large dictionary annotated with semantic strength, which is then used to estimate the sentiment score (Hailong et al. 2014). The biggest challenge of sentiment analysis in the field of NLP is inadequate labeled data. To overcome this problem, deep learning is merged with sentiment analysis for extracting the polarity of unstructured sentences obtained from social platforms (Vateekul and Koomsubha 2016). This combination extracts sentiment by employing training algorithms and achieves good results in identifying the polarity of a given file from trained labeled data.

The deep learning model is quite efficient in handling sentiment analysis for both supervised and unsupervised learning. It is a powerful tool that overcomes the problem of unlabeled data, learns deep representations of the features, and determines the effective identity of the reviews. Neural networks are employed in the SA task due to the following factors: (a) the availability of training labels and (b) flexibility in effectively learning the hidden information of the features. Machine learning-based SA approaches fall into three categories: supervised, semi-supervised, and unsupervised. Supervised and unsupervised learning cover the prediction and classification of sentiments expressed by users into positive, negative, and neutral class labels, whereas semi-supervised learning predicts the output for unlabeled target data from a labeled information set.

Researchers have designed many deep learning methods for machine translation strategies. The Recursive Neural Network (RNN) performs structure prediction, using a Treebank to estimate different levels of sentiment (Zhang and Zong 2015). Deep neural networks (DNN) address sentiment on microblogging content with short text, especially with images (Baecchi et al. 2016). The deep belief network (DBN) uses restricted Boltzmann machines (RBM) to learn information from hidden layers; however, these can be applied only to small-scale applications (Ruangkanokmas et al. 2016). The recurrent neural network (RNN) is restricted to characters at a fixed length, which limits coverage of dictionary words when processing large datasets (Zhang 2015). Compared to other neural networks, CNN performs better in recognizing patterns with high accuracy.

The initial learning process of a neural network starts with embedding, which turns sentences into a distributed vector representation. In word-based embedding, the word2vec tool is used to convert the input sentence to a vector representation, and the resulting features are used as inputs to the neural network for sentiment determination (Ouyang et al. 2015; Mikolov et al. 2013). However, this approach requires a large vocabulary for vector conversion. Besides, word-level embedding is not suitable for handling multilingual documents spanning a number of languages. Multilingual documents are those expressed in multiple languages, i.e., users around the world write comments in different languages on social media. Most sentiment analysis approaches deal with a single language, which leads to missing essential information expressed in other languages. Thus, multilingual sentiment prediction is employed to handle documents written in many languages. As the number of languages increases, the vocabulary size also increases, and the larger vocabulary degrades the performance of multilingual sentiment prediction (Dashtipour et al. 2016).

Character-level embedding uses character-level features as input to the neural network for natural language processing. CNN is widely used for sentiment classification and does not require knowledge about the semantic or syntactic structure of a particular language. This provides a significant advantage in handling messages in different languages (Hwang and Sung 2017). But noisy text input to the neural network causes incorrect predictions of polarity. Hence, effective normalization is applied, with preprocessing steps including tokenization (Singh and Kumari 2016), filtering (Saif et al. 2014; Silva and Ribeiro 2003), and lemmatization and stemming (Jianqiang and Xiaolin 2017; Nicolai and Kondrak 2016). Besides, the character-level embedding learning step requires only a small alphabet memory instead of an extensive feature set. Thus, effective preprocessing of the raw tweets and embedding allows better classification of sentences (reviews) with different opinions.
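To illustrate why character-level input needs only a small alphabet memory, the sketch below one-hot encodes a tweet over a fixed character set. The alphabet and maximum length here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

# illustrative alphabet: letters, digits, and a few punctuation symbols
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,!?'"
CHAR_IDX = {c: i for i, c in enumerate(ALPHABET)}
MAX_LEN = 140  # illustrative maximum tweet length

def quantize(text):
    # one-hot matrix of shape (MAX_LEN, len(ALPHABET)); unknown chars stay all-zero
    m = np.zeros((MAX_LEN, len(ALPHABET)), dtype=np.float32)
    for i, ch in enumerate(text.lower()[:MAX_LEN]):
        j = CHAR_IDX.get(ch)
        if j is not None:
            m[i, j] = 1.0
    return m
```

Whatever the exact alphabet, the memory footprint is bounded by the alphabet size (here 42 symbols) rather than by a vocabulary of hundreds of thousands of words, which is the advantage the text describes.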

This research has the following objectives:

1. To build an effective normalization module to structure the sentence and make it understandable to the machine.

2. To achieve an effective learning text representation with embedding models with limited memory consumption.


3. To design a sentiment analysis architecture that determines the actual polarity of customer feedback about products.

4. To evaluate the performance of our approach by estimating the accuracy and F-score and comparing with state-of-the-art approaches.

The proposed character-level embedding in the ConvNet architecture requires fewer parameters to be learned compared to word-based models. This technique relies only on character-level input and is capable of dealing with tweets written in distinct languages. In this paper, text normalization is performed with preprocessing stages such as tokenization, out-of-vocabulary word replacement, lemmatization, and stemming. Hence, the raw input Twitter text (reviews) is normalized, and finally the polarity is identified by the deep learning technique. The deep convolutional architecture of the CNN is trained to analyze the sentiment of a sentence based on the character-level embedding technique. Thus, the proposed approach identifies the polarity of a sentence without relying on a lookup table or a word2vec task to explore the sentiment of the sentences.

The contributions of this paper include:

• Extracting knowledge/opinions from Twitter data, which contain unstructured and semi-structured data.

• A Soundex algorithm for text normalization that normalizes noisy text to dictionary words and enhances the accuracy of sentiment prediction in deep learning models.

• A CNN assigned a high learning rate, with a character-level embedding model for feature text representation in sentiment analysis.

• A flexible embedding model (Conv-char-Emb) to learn fewer learnable parameters of characters.

• A comparative analysis with state-of-the-art methods to show the effectiveness of the proposed approach on different datasets for better prediction of sentiments.

The remainder of this paper is organized as follows. Sect. 2 gives an overview of existing work. Sect. 3 explains our proposed method in two parts: normalization and character-based learning. Sect. 4 presents the experimental analysis of the proposed method. Finally, Sect. 5 concludes the work based on the obtained results.

2 Related work

Several deep learning models, including DNN, RNN, CNN, and DBN, have been applied earlier for SA with data collected from different social sites. A neural network system is modeled with hidden layers to map text onto predefined output classes for text analysis. The final opinion for a given statement is established as positive, negative, or neutral. This helps to gauge public opinion on products, politics, brands, social events, etc.

Zare et al. (Zare and Rohatgi 2017) presented an RNN-based text normalization with a classifier that distinguishes nonstandard from standard tokens to remove noisy data. The classification algorithm with a sequence-to-sequence model predicts the amount of data actually needed for the input token. An XGBoost-based context-aware classifier module determines the semiotic class of the normalized token.

Asghar et al. (2018) proposed a framework for Twitter sentiment classification into negative, positive, or neutral. After preprocessing, collected tweets were passed through four modules: an Emotion Classifier (EC), SentiWordNet (SWN), a Slang Classifier (SC), and a domain-specific classifier. In the SC module, slang terms were detected and matched against a dictionary to find words relevant to the given text. In EC, textual expressions were identified as positive or negative emotions expressed in tweets. The general-purpose sentiment classifier (GPSC) classifies the Twitter text and assigns polarity scores to each word as determined from SWN. Performance metrics such as accuracy, F-measure, precision, and recall were calculated to determine the result of the classification scheme.

Wehrmann et al. (2017) proposed a language-agnostic, translation-free method based on CNN to determine the sentiment of sentences (reviews). In their method, tweets were collected in four languages (English, Portuguese, German, and Spanish) to perform character-level embedding for finding the actual polarity of the input data. Their Conv-char-S overcomes the problems of word-level and n-gram-based techniques, and the convolution layer learns the feature space representation for words, characters, and emotions present in the unstructured text. The performance was estimated in terms of accuracy and F-measure and compared with other baseline approaches.

Zhou et al. (2016) proposed Weakly Shared Deep Neural Networks (WSDNN) to convert multiple languages into a target language and predict positive or negative polarities from the translated language. The cross-lingual information was passed to two deep neural networks with multiple weakly shared networks at the top.


This network minimizes the loss caused during the transfer of the source to a target language.

Yuvaraj et al. (Yuvaraj and Sabari 2017) described a Twitter classification algorithm that uses feature vectors represented by binary coding and a frog meta-heuristic algorithm. Initially, preprocessing is carried out; features are extracted using the Term Frequency-Inverse Document Frequency (TF-IDF) method, and an optimal feature subset is selected with the frog algorithm. The extracted features were compared across the classifiers Naive Bayes (NB), K-Nearest Neighbor (KNN), Radial Basis Function (RBF), and Logistic Model Tree (LMT) network with Information Gain (IG) and Genetic Algorithm (GA), in terms of accuracy and recall.

Agarwal et al. (2018) proposed a deep neural network model for paraphrase detection in clean and noisy texts. Initially, sentence modeling was performed by a joint RNN-CNN architecture, which takes word embeddings as the source for the convolution network. The CNN model learns local features, while the RNN determines long-term dependencies of the texts. A pair-wise similarity model was then computed to analyze the similarity over the text and extract the important part of the sentence.

Dragoni et al. (2018) proposed an opinion mining service to detect the actual polarity within texts. First, reviews were sent to a data manager module consisting of a document analyzer and an enricher pipeline for extracting the polarity from the text; finally the information was attached to the product. This type of opinion monitoring can deal with aspect detection and polarity estimation of documents in real-time scenarios. The system supports different kinds of users, such as buyers, managers, and customers, by analyzing the information in the form of multi-facet analysis.

Vechtomova et al. (Vechtomova 2017) presented PolaritySim, an approach for disambiguating contextual sentiment using an Information Retrieval (IR) method to support word polarity in the sentence. This technique does not rely on manually created corpora for sentence- or word-level polarity estimation. At the initial stage, the system creates positive and negative vectors for each word and for the content of the sentence (EvalV). Next, it computes the pairwise similarity between the word vectors in the sentence and the positive and negative vectors. Finally, polarity is assigned depending on whether EvalV is more similar to the positive or the negative vectors.

The existing approaches perform the sentiment analysis task without normalization and embed with word-level models. Such embeddings map semantically similar words to a dimensional vector representation. However, word-based models have several drawbacks: (a) they require more memory and vocabulary words, and (b) accurate sentiment identification is not possible when the input sentence is written with several non-dictionary words and informal patterns. The existing works focus only on sentiment analysis without a normalization procedure; hence, accurate identification of the polarity class from noisy text is not possible. In addition, it is essential to analyze noisy words, which helps to refine the raw data for determining the actual sentiment expressed over a product in a digital marketing strategy. To overcome these complications, we employ character-level embedding, which requires less memory instead of learning more vocabulary words. In our work, preprocessing is applied initially to make polarity identification much more accurate from the raw input sentences. Moreover, existing SA processes mainly focus on word-based embedding to train neural networks, which makes the architecture too complex when dealing with multiple languages. Deep learning with sentiment labels is efficient in handling a huge Twitter corpus for actual polarity estimation. Hence, we combine the normalization task with deep learning for effective and accurate polarity detection from raw input blogs (sentences).

3 Methodology

Sentiment analysis/opinion mining from unstructured data is essential to obtain the opinion of a statement. Tweets without normalization pose several issues; hence preprocessing is applied before learning the labels in sentiment identification. Therefore, tweets are first preprocessed to replace all noisy features, such as misspellings, abbreviations, and incorrect grammar, with a structured form of information.

Fig. 1 shows the process diagram of the proposed method. The raw input data are first preprocessed into a normalized form. As noisy data from Twitter are given as input to our sentiment analysis task, preprocessing is an essential stage for accurate polarity detection. Thus the unstructured data are normalized, and then the sentiment is analyzed by a character-level CNN learning model. The preprocessing step is divided into three phases: (a) tokenization, (b) OOV detection and replacement, and (c) lemmatization and stemming. Finally, sentiment analysis is carried out through the CNN-based deep learning technique.

3.1 Preprocessing

This involves the following processes to structure the sentences and make them more understandable to the machine.

3.1.1 Tokenization

The raw input is taken from the Twitter dataset, which contains a large number of unwanted characters such as punctuation marks: comma, period, semicolon, exclamation point, dash, brackets, braces, question mark, apostrophe, quotation marks, and ellipsis. Hence the input sequence is divided into pieces of words called tokens, and the low-priority features are removed. For example, consider the sentence "HELLO world How is the day for you?□". The output after tokenization is <HELLO> <world> <How> <is> <the> <day> <for> <you> <□>. Here the question mark (?) is cleared, and the remaining list of tokens is sent for further processing.
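The tokenization step above can be sketched with a simple regular expression; this is an illustrative sketch, as the paper does not specify its tokenizer implementation:

```python
import re

def tokenize(text):
    # split on whitespace and drop low-priority punctuation characters
    return re.findall(r"[^\s?!.,;:'\"()\[\]{}-]+", text)

tokens = tokenize("HELLO world How is the day for you?")
# tokens: ['HELLO', 'world', 'How', 'is', 'the', 'day', 'for', 'you']
```

As in the paper's example, the question mark is removed and the remaining words survive as tokens.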

3.1.2 OOV word detection and its replacement

After tokenization, words in non-standard form need to be normalized into their original format. For this purpose, the chopped words are compared against three types of dictionaries: a Microsoft Word-based dictionary, a Short Message Service (SMS) dictionary, and a Soundex dictionary, to correct the OOV words; words with irregularities are replaced with a normalized form.

Microsoft word-based dictionary This is used for spell checking and suggests synonyms for the corresponding word. In our approach, it checks whether the given word is an SMS word or not. If the sentence contains an SMS-type word, it is forwarded to the SMS dictionary; otherwise it goes to the Soundex dictionary for normalization.
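The dictionary cascade described above might be sketched as follows. This is a minimal sketch: the vocabulary, SMS entries, and Soundex lookup are stand-ins, not the paper's actual resources:

```python
def normalize_token(token, std_vocab, sms_dict, soundex_lookup):
    # cascade: standard spell-check vocabulary -> SMS dictionary -> Soundex fallback
    t = token.lower()
    if t in std_vocab:
        return t                    # already a dictionary word
    if t in sms_dict:
        return sms_dict[t]          # expand SMS abbreviation
    return soundex_lookup(t)        # phonetic normalization of remaining OOV words

# toy usage with stand-in resources
std_vocab = {"hello", "world"}
sms_dict = {"omg": "oh my god"}
print(normalize_token("OMG", std_vocab, sms_dict, lambda t: t))
```

The key design point is ordering: only words that fail both dictionary lookups reach the more expensive phonetic stage.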

SMS dictionary Here, SMS words are searched throughout the dictionary and replaced with their corresponding correct words. For example, ACK, OMG, and ITT are transformed into their extended forms "Acknowledgement", "Oh My God", and "In This Thread".
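A minimal replacement table in this spirit; the entries below are just the examples from the text, not the full SMS resource used in the paper:

```python
# tiny illustrative SMS dictionary built from the examples in the text
SMS_DICT = {
    "ack": "acknowledgement",
    "omg": "oh my god",
    "itt": "in this thread",
}

def expand_sms(tokens):
    # replace known SMS abbreviations; leave other tokens untouched
    return [SMS_DICT.get(t.lower(), t) for t in tokens]
```

A real deployment would load a much larger abbreviation list, but the lookup itself stays a constant-time dictionary access per token.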

Phonetic algorithm A phonetic algorithm matches a given keyword to a set of strings with a similar sound. Strings can be spelled differently due to the informal and varied writing styles of users on the web, but such variation can be resolved phonetically. These algorithms are somewhat complex because English spelling and pronunciation borrow from various languages. Soundex is the phonetic algorithm explored here for normalizing noisy text. If the name "TEJ" is the keyword, an alphanumeric code is generated to assess possible similar words from the table; thus similar words associated with the keyword, such as "TEGI", "TAJA", "TAJO", "TEJI", "TOK", and "TAKO", are determined. The basic Soundex model makes some inappropriate matches, which increases mismatching and errors and leads to inaccurate replacement of words and incorrect polarity detection in the sentence. To solve this issue, the Soundex algorithm is modified with the inverse edit term frequency model, which enhances accurate matching of the word and reduces the error rate. Hence this algorithm searches for the correct-sounding word irrespective of the misspelled or unstructured words obtained from social media.

Fig. 1 Process flow of the proposed framework: raw Twitter tweets pass through preprocessing (tokenization; OOV detection and word replacement via a standard spell-check dictionary, an SMS dictionary, and a Soundex dictionary; lemmatization; stemming) to produce normalized tweets, which feed the CNN-based deep learning model (Conv-char-Emb) that outputs the polarity of the input tweet.

3.1.3 Soundex with inverse edit term frequency approach

In this approach, the Soundex code of the given word is matched against the keys present in the key/value pair dictionary. The Soundex code strings generated for similar-sounding words retrieved from the database are compared with that of the given keyword, so that the word can be replaced with a formal word from the dictionary model.

Soundex dictionary Words coming out of the SMS dictionary that are still not resolved are decoded using the Soundex algorithm. For decoding such words, we generated a Soundex dictionary, a self-generated key/value pair dictionary. Here a code string is generated for all English words, and words having the same code are clustered into a group. We make use of this dictionary to retrieve all words similar to the given keyword. Soundex with inverse edit term frequency identifies the correct dictionary word for a misspelled or abbreviated word, based on a bag of words and a minimum edit term frequency technique. This helps to effectively correct the short forms used in reviews. The Soundex code consists of the first letter followed by three digits describing the phonetic sounds of the word. For example, the code for the word RUPERT is R163. The code is generated in the Soundex algorithm based on the following rules.

1. Retain the first letter of the word, and change the subsequent letters a, e, i, o, u, h, w, and y to zero.

2. Replace the other letters with the digits as follows:

• b, f, p, v ⇒ 1
• c, g, j, k, q, s, x, z ⇒ 2
• d, t ⇒ 3
• l ⇒ 4
• m, n ⇒ 5
• r ⇒ 6

3. A consecutive double letter is treated as a single letter (e.g., gutierrez is coded as G-362).

4. If the code has fewer than three digits, append zeros; if it has more, retain only the first three digits.

5. If a vowel separates two letters having the same code, the letter to the right of the vowel is also coded (e.g., tymczak is T-522, where the k is coded); when h or w separates such consonants, only the letter to the left is coded (e.g., ashcraft is A-261).
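Rules 1–5 can be collected into a compact implementation; this is a sketch of standard American Soundex, consistent with the examples above:

```python
def soundex(word):
    # digit codes from rule 2
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit
    word = word.upper()
    first = word[0]
    digits = []
    prev = codes.get(first, "")
    for ch in word[1:]:
        if ch in "HW":
            continue                  # h/w do not separate same-coded letters (rule 5)
        code = codes.get(ch, "")
        if code:
            if code != prev:          # rule 3: adjacent letters with the same code collapse
                digits.append(code)
            prev = code
        else:
            prev = ""                 # a vowel separates, so the next letter is coded again
    return (first + "".join(digits) + "000")[:4]  # rule 4: pad with zeros or truncate
```

Running it on the paper's examples yields R163 for RUPERT, T522 for TYMCZAK, A261 for ASHCRAFT, and G362 for GUTIERREZ.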

The code generated by the above rules is compared across a designated library to find words phonetically similar to the given word. Based on the edit distance rule, the word having the nearest code is retained. To find the correct match for the code, an inverse edit term frequency technique is applied.

Then, from the set of obtained Soundex codes, we calculate a custom weight, the inverse edit term frequency weight, defined as the product of the inverse of the edit distance and the frequency probability of the word. We then choose the word with the maximum custom weight according to Eq. (1),

where editdist is the edit (Levenshtein) distance and Fd is the frequency probability of the word calculated from the corpus. The best match is found by the following process.

1. Generate the Soundex code of the words found in neither the standard dictionary nor the SMS dictionary.

2. Compare the Soundex code with keys present in key/value pair dictionary.

3. Train on the distinct available corpus and count the frequency of occurrence of the different words in the corpus.

4. Calculate the weight of each word in the matched bag of words: the weight is the product of the inverse of the edit distance and the frequency probability of the word.

5. Select the word with the maximum calculated weight (inverse edit term frequency weight).
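The five steps can be sketched as follows. The candidate bag of words and frequency table are illustrative; the edit distance is the standard Levenshtein dynamic programme, and Eq. (1) supplies the ranking weight.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (classic row-wise DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def best_match(noisy, bag_of_words, freq):
    """Steps 4-5: weight = editdist^-1 * Fd (Eq. 1); keep the maximum."""
    def weight(word):
        d = edit_distance(noisy, word)
        return float("inf") if d == 0 else freq.get(word, 0.0) / d
    return max(bag_of_words, key=weight)
```

With hypothetical corpus frequencies `{"good": 0.6, "god": 0.1, "guide": 0.3}`, the noisy token "gud" resolves to "good", since its higher frequency outweighs its slightly larger edit distance.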

Thus the informally written (noisy) words are normalized by the above three stages, which bring the sentences into a structured form that can be easily understood by the machine. The Soundex algorithm, in turn, is built to decode phonetically similar words in the raw Twitter data; the word to be normalized is replaced by means of the Soundex dictionary.

3.2 Lemmatization

After OOV word replacement, lemmatization is performed to convert different forms of words to their base word. Before this, part-of-speech (POS) tagging is performed on the input sentence, mapping each word of the given text to its part of speech. POS tagging

(1)  weight = editdist⁻¹ · Fd


algorithm is based on stochastic and rule-based techniques (e.g., E. Brill's tagger). Next, lemmatization is performed to group inflected forms of the same word into a single item, converting each informal word into its root word. For example, the words "better", "best" and "good" are all converted to "good", as all three words share the same base meaning.

3.3 Stemming

After lemmatization, stemming is performed to find the base word in the Twitter sequence. The difference between stemming and lemmatization is that lemmatization reduces the words in sentences based on a dictionary, while stemming reduces them to a root word form. For example, the words fishing, fisher and fished are converted into the root word "fish", while "beautifully" and "beautiful" are changed to "beauti". This is done by the Porter stemmer algorithm, which reduces a word to its root form by cutting prefixes or suffixes, even when the resulting word has no meaning among English dictionary words.
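The contrast between the two stages can be shown with a toy sketch; the lemma table and suffix list below are invented for illustration and are far cruder than a dictionary-backed lemmatizer or the actual Porter rule cascade.

```python
# Hypothetical lemma lookup and suffix list; real systems use a full
# dictionary (lemmatization) and the Porter rules (stemming).
LEMMAS = {"better": "good", "best": "good"}
SUFFIXES = ("fully", "ful", "ing", "er", "ed", "ly")

def lemmatize(word):
    """Dictionary lookup: map an inflected form to its headword (a real word)."""
    return LEMMAS.get(word, word)

def stem(word):
    """Suffix stripping: the result need not be a dictionary word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word
```

Here `stem("fishing")`, `stem("fisher")` and `stem("fished")` all return "fish", while `stem("beautifully")` returns the non-word "beauti", mirroring the examples above.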

After preprocessing, the next step is the SA of the input tweets. Word embedding is a process in NLP that converts an input word/phrase sequence to real numbers (i.e., into vector form). Here, CNN based learning is performed at the character level due to its simple structure with learnable parameter tuning, and it achieves good results compared to other networks.

3.4 Character level embedding (Conv‑char‑Emb) with CNN architecture

The proposed Conv-char-Emb model is based on a CNN with character level embedding of the words to train the network for sentiment analysis on a distinct supervised corpus. The embedding is obtained from scratch by analyzing all the characters (Zhang and LeCun 2015). Character level embedding is designed to overcome both the bag-of-words and word level embedding approaches: it embeds words at the character level without knowledge of the syntactic and semantic information of the input sentences. The preprocessed output combined with the deep ConvNet architecture performs effective sentiment classification of the raw tweets. This provides a great advantage in handling tweets (messages) with typos and noisy sentences, compared with the complexity of requiring new vectors in word-based models. An example of the character level representation of a sentence is presented in Table 1. The input to the network is the words (tweets) obtained after preprocessing, and the output is a three or five scale polarity: strongly positive, weakly positive, neutral, weakly negative and strongly negative (i.e., 2, 1, 0, −1 and −2).
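The quantization of Table 1 can be sketched as follows; the alphabet is in the spirit of the 70-symbol set of Zhang and LeCun (2015), though the exact symbol list here is an assumption.

```python
# Assumed character alphabet (letters, digits, punctuation); blanks and
# out-of-alphabet characters deliberately have no row of their own.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}"
INDEX = {c: i for i, c in enumerate(ALPHABET)}

def quantize(text, max_len=1014):
    """One-hot encode text as a len(ALPHABET) x max_len matrix; characters
    outside the alphabet (including blanks) remain all-zero columns."""
    mat = [[0.0] * max_len for _ in ALPHABET]
    for j, ch in enumerate(text.lower()[:max_len]):
        i = INDEX.get(ch)
        if i is not None:
            mat[i][j] = 1.0
    return mat
```

For "red rose", the column for the blank between the two words stays all zero, exactly as in Table 1.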

The CNN network, composed of a convolutional layer, a max pooling layer, and a fully connected softmax classification layer, is described in Fig. 2. ConvNets can be applied directly to a sentence without knowledge of the semantic features of a language. Hence character level embedding in CNN allows working with different languages without knowing the precise meaning of each word from a dictionary. This has the great advantage of saving memory space by storing only a small alphabet instead of a large vocabulary. In character-based representation, all input characters are mapped into multidimensional space based on a 2D input tensor that encodes the text, so the convolution layer reads the input at the character level. Here the input is represented by the characters f ∈ {f1, f2, f3, …, fn} in the tweet sequence, which are mapped to a matrix of size n × ℓ, where ℓ is the alphabet size and n the number of letters in the sentence. Thus the sequence of characters is transformed into vectors of fixed length. Characters not in the alphabet, as well as blank spaces, are represented as all-zero vectors.

The network structure for the proposed SA task is nine layers deep: six convolutional layers and two fully connected layers, with two dropout layers of probability 0.9. The proposed character model uses an input feature of 70 characters and an input feature length of 1014. The convolutional layer

Table 1 Character level representation of the sentence “RED ROSE”

Alphabet   r  e  d  –  r  o  s  e
a          0  0  0  0  0  0  0  0
b          0  0  0  0  0  0  0  0
c          0  0  0  0  0  0  0  0
d          0  0  1  0  0  0  0  0
e          0  1  0  0  0  0  0  1
⋮
r          1  0  0  0  1  0  0  0
⋮
(blank)    0  0  0  0  0  0  0  0


uses receptive fields of size 7, the max pooling layer selects the salient features, and the fully connected layer performs a linear mapping of the obtained values to class polarity values.
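With an input feature length of 1014, the shrinking of the sequence through the stack can be traced arithmetically. The kernel and pooling layout below follows the configuration of Zhang and LeCun (2015) and is an assumption, since the exact layer sizes are not listed here.

```python
def output_length(length=1014, kernels=(7, 7, 3, 3, 3, 3), pool_after=(0, 1, 5)):
    """Trace the sequence length through six narrow convolutions, with
    non-overlapping max pooling of width 3 after layers 1, 2 and 6
    (assumed Zhang-LeCun layout)."""
    for idx, k in enumerate(kernels):
        length = length - k + 1          # narrow convolution: n -> n - k + 1
        if idx in pool_after:
            length //= 3                 # max pooling with stride 3
    return length
```

Under this layout an input of length 1014 is reduced to 34 frames before the fully connected layers.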

The input to our model is the sequence of characters, which is mapped to the character matrix C ∈ R^(d×|V|), where the sentence matrix for the input tweet is defined as S ∈ R^(d×|S|). It is sent to the convolutional layer, which performs the convolution operation with the filter F ∈ R^(d×m). The convolution layer operation is expressed by Eq. (2),

where S[:, j−m+1 : j] is the m-column slice of the sentence matrix and ⊗ denotes element-wise matrix multiplication. Thus the output b_j is the element-wise product of the filter F with a column window of the sentence matrix. The equation describes a single filter convolved with the sentence matrix; using a set of filters, the feature matrix is F ∈ R^(n×(|S|−m+1)). Next, a rectified activation function is applied after the convolution operation: ReLU is performed to obtain feature maps with positive values, which helps to train the network to produce accurate results.
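Equation (2) can be sketched as a direct loop; the toy sentence matrix and filter used below are illustrative only.

```python
def conv1d(S, F):
    """Narrow convolution of a d x m filter F over a d x n sentence
    matrix S (Eq. 2): each output b_j sums the element-wise product of
    F with the window S[:, j-m+1 : j]."""
    d, m = len(F), len(F[0])
    n = len(S[0])
    return [sum(S[k][j - m + 1 + i] * F[k][i]
                for k in range(d) for i in range(m))
            for j in range(m - 1, n)]

def relu(feature_map):
    """Rectification keeps only the positive activations."""
    return [max(0.0, v) for v in feature_map]
```

A set of n filters applied this way yields the n × (|S| − m + 1) feature matrix described above.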

After convolution, ReLU activation is carried out using the weight initialization of He et al. (2015), and a pooling operation aggregates the values from the convolution layer. The pooling operation is defined by Eq. (3),

(2)  b_j = (S ∗ F)_j = Σ_{k,i} (S[:, j−m+1 : j] ⊗ F)_{k,i}

where b_i is the i-th convolution feature map, g_i its bias value, e a unit vector of the same size as b_i, and σ(·) the activation unit. The max pooling operation extracts only the maximum value in the matrix set, acting on the columns of the feature map in the form pool(b_i): R^(1×(|S|+m−1)) → R. The activation and pooling layers in the convolution act as a non-linear feature extractor. Finally, the result from the pooling layer is sent to the fully connected softmax layer, whose probability distribution function is defined by Eq. (4),

where w_k and b_k define the weight vector and the bias value of the matrix. The last layer thus learns its parameters by minimizing the vector values and labels the input sentence with the three point scale polarity.
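The pooling and softmax stages of Eqs. (3) and (4) reduce, for a single feature map and score vector, to the following sketch.

```python
import math

def max_pool(feature_map):
    """Eq. (3): keep only the maximum activation of a feature map."""
    return max(feature_map)

def softmax(scores):
    """Eq. (4): turn final-layer scores into class probabilities."""
    m = max(scores)                         # shift for numerical stability
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The index of the largest softmax probability is then reported as the polarity class of the input sentence.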

(3)  b_pool = [ pool(σ(b₁ + g₁ ∗ e)), … , pool(σ(bₙ + gₙ ∗ e)) ]ᵀ

(4)  P(y = j | x, w, b) = softmax_j(xᵀw + b) = exp(xᵀw_j + b_j) / Σ_{k=1}^{K} exp(xᵀw_k + b_k)

Fig. 2 Character level embedding (Conv-char-Emb) with CNN: input text → character quantization (feature length) → convolution layers → max pooling → fully connected layers with dropout


4 Experimental results and analysis

Dataset description The implementation of our approach is carried out in MATLAB 2017b. We make use of the following eight datasets.

The Kaggle dataset (Zare and Rohatgi 2017) originally consists of 3750 movie reviews, of which 1820 are annotated as positive, 1920 as negative and the remainder as neutral statements. The positive, negative and neutral classes are assigned from the review scores: greater than 7 is positive, less than 5 is negative, and scores in between are neutral. The data is a subset of English movie reviews written in an informal style of long sentences.

The Twitter dataset (Asghar et al. 2018) contains tweets collected from the automobile, health and laptop domains: 3,500 tweets from the automobile domain, 2,737 messages from the laptop domain and 4,127 tweets from the health domain, assigned positive, negative and neutral polarity. Non-English tweets are ignored, and only 1726, 1943 and 3467 informally written English statements are considered from the automobile, laptop and health domains, respectively.

The multilingual Twitter dataset (Mozetič et al. 2016) consists of 1.6 million tweets in 13 European languages, manually annotated with three point polarity as positive, negative and neutral. From this, we consider only the publicly available English language subset for our evaluation. In it, a total of 11,115 negative and 11,055 positive posts are identified among 22,170 informally written tweets collected from the Twitter corpus.

The Prettenhofer and Stein dataset (Prettenhofer and Stein 2010) consists of Amazon product reviews collected during November 2009 in four languages from three product pages: DVDs, books, and music. The languages are English, Japanese, French, and German, where the English reviews are sampled from the multi-domain sentiment dataset. Here, English product reviews are sampled for sentiment analysis; the corpus consists of 1000 positive and 1000 negative reviews.

The Microsoft Research Paraphrase (MSRP) dataset (Dolan et al. 2004) consists of 5800 pairs of sentences extracted from informally written English news sources. The messages, collected from a web source, average 21 words per sentence. The dataset is split into 4075 training and 1725 testing samples with positive, negative and neutral classes.

The SemEval datasets considered for evaluation comprise the following categories: SemEval ABAS task 12 and SemEval-2017.

SemEval ABAS task 12 (Pontiki et al. 2015) contains three categories of English reviews from the restaurant, laptop and mp3 domains, with an average of 64.7 words per review. This dataset is annotated with three-point sentiment polarity as positive, negative and neutral. The restaurant reviews consist of 63,516 positive and 18,705 negative statements, with the remainder neutral, out of 157,865 reviews. The mp3 corpus consists of 64,006 positive and 31,792 negative posts, with the remainder neutral, out of 135,943 review statements.

SemEval-2017 task 4 is a sentiment analysis in Twitter task containing tweet classification and tweet quantification subtasks on two point and five point scales. The tweets are in English and associated with current topics such as Donald Trump and the iPhone, across the five publicly available subtasks A, B, C, D and E. Here the subtasks C and E are selected for evaluating our CNN architecture, in which all tweets are annotated with five labels: −2, −1, 0, 1 and 2, meaning strongly negative, weakly negative, neutral, weakly positive and strongly positive respectively. The reviews comprise positive (16,405), neutral (19,187) and negative (7419) statements from a total of 43,011 informally written posts.

For training the neural network, we use the Adam optimization rule to obtain the optimal value in sentiment estimation. We divide our dataset into training and testing with a ratio of 70:30: the training set consists of 200 topics with 30,632 labels, and the test set of 12,379 raw Twitter messages with five labels.

Performance metrics We evaluate the performance of the proposed approach with two metrics, accuracy and F-measure. Accuracy is defined as the proportion of correctly classified labels, and F-measure is the harmonic mean of precision and recall. Accuracy, precision, recall and F-measure are defined by the formulas below,

where Tp, Tn, Fp and Fn are the true positive, true negative, false positive and false negative classification counts. Both the accuracy and F-score measures range between 0 and 1, and a higher value indicates better performance in Twitter data sentiment analysis. For training the neural

(5)  Accuracy = (Tp + Tn) / (Tp + Fp + Tn + Fn)

(6)  Precision (P) = Tp / (Tp + Fp)

(7)  Recall (R) = Tp / (Tp + Fn)

(8)  F-measure = 2PR / (P + R)
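Equations (5)-(8) translate directly into code; the confusion-matrix counts below are hypothetical.

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure from confusion counts,
    per Eqs. (5)-(8)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure
```

For a balanced toy case such as `metrics(tp=2, tn=2, fp=1, fn=1)`, all four values equal 2/3, since precision and recall coincide there.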


network, we use the Adam optimization rule to obtain the optimal value in sentiment analysis.

Our proposed approach has the advantage of a strong preprocessing module for normalizing unstructured text. Effective normalization leads to accurate identification of the customer's/user's sentiment, so a high accuracy rate in sentiment estimation is ensured by our technique. Furthermore, our method does not rely on word level embedding, which demands a large memory for text representation; instead, we use character level translation with fewer learnable parameters as the input to the neural network. The deep CNN achieves a better accuracy rate by using nine layers of ConvNets to analyze the polarity. As Tables 2 and 3 show, the system outperforms traditional supervised learning in Twitter polarity estimation.

Table 2 compares the accuracy, precision, recall and F-measure values with and without the normalization process. We perform two modes of operation to estimate the accuracy and F-score in sentiment analysis: first, polarity determination without the preprocessing module, and second, with the normalization module. The results show a variation in the accuracy and F-score values caused by the informally written words in the tweets (input sentences). An accuracy of 0.981 is achieved by preprocessing the raw (noisy) sentences using word replacement. This demonstrates the significant gain in polarity identification of the proposed normalization based sentiment analysis method.

Table 2 Accuracy, precision, recall, and F-measure values with and without normalization, obtained on the SemEval 2017 benchmark

                       Accuracy  Precision              Recall                 F-measure
                                 +ve    Neutral  −ve    +ve    Neutral  −ve    +ve    Neutral  −ve
With normalization     0.981     0.965  0.89     0.91   0.94   0.9      0.933  0.95   0.89     0.92
Without normalization  0.962     0.95   0.87     0.89   0.932  0.88     0.91   0.94   0.87     0.89

Table 3 Classification accuracy comparison

Dataset                       Method                                         Accuracy
Kaggle dataset                Shaurya et al. (Silva and Ribeiro 2003)        0.9762
                              Ours                                           0.9782
Twitter dataset               Muhammad et al. (Jianqiang and Xiaolin 2017)   0.85
                              Ours                                           0.87
Multilingual twitter dataset  Jonatas et al. (Nicolai and Kondrak 2016)      0.720
                              Ours                                           0.75
Prettenhofer and Stein        Guangyou et al. (Zare and Rohatgi 2017)        0.810
                              Ours                                           0.82
SemEval                       N. Yuvaraj et al. (Asghar et al. 2018)         0.925
                              Ours                                           0.96
MSRP                          Agarwal et al. (Wehrmann et al. 2017)          0.77
                              Ours                                           0.79
SemEval task 12               Dragoni et al. (Zhou et al. 2016)              0.85
                              Ours                                           0.861
SemEval ABAS                  Vechtomova et al. (Yuvaraj and Sabari 2017)    0.89
                              Ours                                           0.901

Table 4 Comparison of precision and recall

Dataset          Method                                         Precision       Recall
                                                                +ve    −ve      +ve    −ve
Twitter dataset  Muhammad et al. (Jianqiang and Xiaolin 2017)   0.89   0.84     0.79   0.74
                 Proposed                                       0.91   0.86     0.82   0.77
SemEval dataset  Yuvaraj et al. (Asghar et al. 2018)            0.95   0.89     0.93   0.925
                 Proposed                                       0.96   0.92     0.94   0.931


Table 3 compares the accuracy of the proposed (Conv-char-Emb) method with existing works. We perform the classification task on the above-mentioned datasets and compare with the existing approaches. The results show a better accuracy than the other models, owing to the normalization and the learning based sentiment analysis.

Table 4 compares the precision and recall of the proposed technique with the existing methods. The results show better performance on both positive and negative statements, achieved by the processing of the informal text and the character level embedding associated with the ConvNet architecture.

Table 5 shows the precision and recall performance of SA using an RBF classifier combined with different optimization algorithms: GA, IG and the binary shuffled frog algorithm (BSFA). This comparison uses the results reported in Asghar et al. (2018), which achieve good performance in the classification task. The proposed Conv-char-Emb architecture for polarity prediction achieves a greater accuracy score due to its deeper representation and understanding of the features by means of character-based networks.

Table 6 shows the comparative analysis of the proposed method against the SemEval-2017 task 4 competitors in terms of accuracy, F-measure and recall. In the existing works, several approaches are used to extract the sentiment presented in the SemEval-2017 task 4 dataset. The results show that the proposed method attains better accuracy, F1-score and recall than the other approaches.

The comparison of the proposed Conv-char-Emb embedding with the existing models shows a significant performance gain in both accuracy and F-measure. The comparison results are shown in Table 7. The Conv-Emb and Conv-char models (Kim 2014; Zhang et al. 2015) need more learnable parameters, leading to greater memory consumption and lower accuracy. Similarly, Santos and Gatti (2014) use two convolutional layers to extract the features that determine the polarity of the sentences. In contrast, our proposed deep CNN with the Conv-char-Emb model comprises several convolutional layers with fewer learnable parameters. In addition, the proposed system is robust to variation of the hyperparameter ranges with a low learning rate, which enhances the sentiment prediction task with fewer learnable parameters and lower memory consumption for both the model and the data. In Baziotis et al. (2017), Yang et al. (2017) and Ma et al. (2018), LSTM architectures are used; however, these require explicit knowledge to train the network for sentiment estimation. In Chen et al. (2017), a recurrent attention network is used to examine the sentiment features in the sentence structure, which requires more memory space for reading the information. The proposed embedding model performs well in both accuracy and F-measure due to its fewer learnable parameters compared with the existing approaches. This shows the proposed model can deal with a large Twitter set and separate the sentences into three different classes. Hence

Table 5 Combination of various algorithms and classifiers: precision and recall comparison on the SemEval-2017 task 4 dataset

           Classification  Precision       Recall          F-measure
           accuracy        +ve    −ve      +ve    −ve      +ve    −ve
IG-RBF     0.86            0.928  0.77     0.84   0.9      0.88   0.82
GA-RBF     0.91            0.94   0.86     0.9    0.915    0.919  0.88
BSFA-RBF   0.93            0.95   0.89     0.93   0.925    0.93   0.90
Proposed   0.981           0.965  0.91     0.94   0.933    0.95   0.92

Table 6 Comparative results with the competitors of SemEval 2017 Task 4

                                   Accuracy  F1-score  Recall
Corrêa et al. (2017)               0.617     0.595     0.612
Onyibe and Habash (2017)           0.686     0.596     0.623
Dovdon and Saias (2017)            0.582     0.539     0.571
Jabreel and Moreno (2017)          0.643     0.628     0.645
Proposed                           0.750     0.79      0.673

Table 7 Accuracy and F-measure scores of deep neural architectures in sentiment analysis on the SemEval-2017 task 4 dataset

Reference                                Model used                            Accuracy  F-measure
Jonatas et al. (Nicolai and Kondrak 2016) Conv-char-S                          0.720     0.756
Kim (2014)                               Conv-Emb                              0.715     0.747
Zhang et al. (2015)                      Conv-char                             0.696     0.721
dos Santos and Gatti (2014)              CharSCNN                              0.723     0.76
Baziotis et al. (2017)                   Deep LSTM                             0.682     0.675
Chen et al. (2017)                       Recurrent Attention on Memory (RAM)   0.7389    0.7385
Yang et al. (2017)                       LSTM + VAE                            0.59      0.623
Ma et al. (2018)                         Senti LSTM                            0.69      0.7382
Proposed                                 Conv-char-Emb                         0.750     0.79


this model can be most useful in predicting customer opinion of newly launched products.

4.1 Statistical analysis

We performed Student’s t testing to measure the statistical significance of the result obtained. In this, we calculated the average mean (M) and standard mean (SD) on the result for each dataset with the confidence result of 95% in the statisti-cal significance test.

From each dataset, seven features are extracted; the mean and standard deviation values for each subset are shown in Table 8. The mean value describes the central tendency of each dataset, and the standard deviation describes the variation of the tweets with respect to each feature. The table shows that the datasets vary in the types of sentences and words they contain, which affects the accuracy for each set.

The statistical analysis estimates the classification accuracy by conducting a Student's t test between the proposed and the existing methods at a confidence level of 95% on the experimental results. In

Table 8 Extracted features for all datasets

S. no  Features              Kaggle           Twitter          Multilingual     Prettenhofer     MSRP             SemEval
                                                               twitter          and Stein
                             M       S.D      M       S.D      M       S.D      M       S.D      M       S.D      M       S.D
1      Total characters      0.4545  0.3201   0.432   0.256    0.345   0.561    0.743   0.732    0.432   0.345    0.453   0.687
2      Positive words        0.2253  0.2341   0.0691  0.1205   0.0406  0.675    0.1206  0.0234   0.0245  0.031    0.0321  0.0543
3      Negative words        0.0543  0.0732   0.0424  0.0541   0.1776  0.056    0.0765  0.052    0.1499  0.086    0.083   0.087
4      Neutral words         0.1152  0.2105   0.2215  0.113    0.0050  0.1176   0.054   0.056    0.035   0.0387   0.016   0.032
5      Intense words         0.0221  0.0653   0.0453  0.1255   0.625   0.246    0.456   0.1225   0.0456  0.234    0.754   0.2467
6      Positive exclamation  0.0034  0.0021   0.0023  0.0021   0.0011  0.0054   0.0032  0.0052   0.0062  0.0032   0.0076  0.0224
7      Negative exclamation  0.00565 0.0043   0.0032  0.0034   0.0015  0.0171   0.0132  0.0493   0.0294  0.0034   0.1433  0.0266

Table 9 Comparative results of accuracy from a paired t test in which the proposed approach is paired with each existing method

S. no  Dataset                 Method       t       Two-tailed p value  Significance
1      Kaggle dataset          Conv-char-S  134.54  6.96E-72            Extremely significant
                               Conv-Emb     321.76  2.74E-67            Extremely significant
                               Conv-char    234.95  6.78E-54            Extremely significant
                               CharSCNN     455.68  2.10E-57            Extremely significant
2      Twitter dataset         Conv-char-S  249.4   2.74E-132           Extremely significant
                               Conv-Emb     674.5   1.05E-124           Extremely significant
                               Conv-char    489.5   1.43E-154           Extremely significant
                               CharSCNN     476.6   2.67E-119           Extremely significant
3      Multilingual twitter    Conv-char-S  765.4   2.01E-111           Extremely significant
                               Conv-Emb     456.76  1.20E-106           Extremely significant
                               Conv-char    378.4   1.35E-76            Extremely significant
                               CharSCNN     354.5   2.76E-143           Extremely significant
4      Prettenhofer and Stein  Conv-char-S  678.5   3.20E-63            Extremely significant
                               Conv-Emb     484.97  1.15E-67            Extremely significant
                               Conv-char    298.94  2.13E-69            Extremely significant
                               CharSCNN     456.93  2.56E-87            Extremely significant
5      MSRP                    Conv-char-S  786.81  3.06E-50            Extremely significant
                               Conv-Emb     396.32  1.35E-62            Extremely significant
                               Conv-char    685.43  2.54E-47            Extremely significant
                               CharSCNN     843.86  2.11E-117           Extremely significant
6      SemEval                 Conv-char-S  870.01  1.41E-121           Extremely significant
                               Conv-Emb     567.76  2.10E-124           Extremely significant
                               Conv-char    743.65  7.32E-142           Extremely significant
                               CharSCNN     943.76  2.31E-111           Extremely significant


this, the t test is performed under the null hypothesis of no significant variation, over 30 runs, between the proposed and the existing approaches. From Table 9, it is observed that there are significant differences for all the parameters; all the null hypotheses are rejected, which demonstrates the better accuracy of the proposed approach.
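The paired t statistic itself can be computed with a few lines of standard-library code; the per-run accuracy series below are hypothetical, and the two-tailed p value would then be read from the t distribution with n − 1 degrees of freedom (e.g., via scipy.stats).

```python
import math
from statistics import mean, stdev

def paired_t(xs, ys):
    """Paired Student's t statistic over per-run accuracy pairs:
    t = mean(d) / (stdev(d) / sqrt(n)) for the differences d."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
```

Large |t| values with 29 degrees of freedom (30 runs) yield the vanishingly small p values reported in Table 9.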

5 Conclusion

In this work, normalization and deep learning based sentiment classification are implemented for the automatic extraction of sentiments from tweets expressed in noisy (unstructured and semi-structured) form. This task has received considerable attention from researchers seeking to determine the exact mindset of users about products. Initially, the informally written texts are normalized using the Soundex phonetic algorithm with an inverse edit term frequency method to retrieve the hidden information. The normalized data is then processed with deep learning to analyze the sentiment of each sentence. Deep learning, with more hidden layers to extract the language labels, performs better than SVM and traditional architectures. The method uses a character-based rather than a word-based model, with fewer parameters to train the CNN architecture. Hence, the proposed approach is capable of dealing with distinct languages, with the advantages of a low-level translation strategy and small memory storage instead of handling a large vocabulary. The experimental results show that the model can be extended to different languages with good accuracy and F-measure values.

References

Agarwal B, Ramampiaro H, Langseth H, Ruocco M (2018) A deep network model for paraphrase detection in short text messages. Inf Process Manag 54(6):922–937

Allahyari M, Pouriyeh S, Assefi M, Safaei S, Trippe ED, Gutierrez JB, Kochut K (2017) A brief survey of text mining: classification, clustering and extraction techniques. arXiv:1707.02919 (arXiv preprint)

Asghar MZ, Kundi FM, Ahmad S, Khan A, Khan F (2018) T-SAF: Twitter sentiment analysis framework using a hybrid classification scheme. Expert Syst 35(1):e12233

Baecchi C, Uricchio T, Bertini M, Del Bimbo A (2016) A multimodal feature learning approach for sentiment analysis of social network multimedia. Multimed Tools Appl 75(5):2507–2525

Baziotis C, Pelekis N, Doulkeridis C (2017) DataStories at SemEval-2017 task 4: deep LSTM with attention for message-level and topic-based sentiment analysis. In: Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017), pp 747–754

Chen P, Sun Z, Bing L, Yang W (2017) Recurrent attention network on memory for aspect sentiment analysis. In: Proceedings of the 2017 conference on empirical methods in natural language processing, pp 452–461

Corrêa EA Jr, Marinho VQ, Santos LB (2017) NILC-USP at SemEval-2017 task 4: a multi-view ensemble for Twitter sentiment analysis. arXiv:1704.02263 (arXiv preprint)

Dashtipour K, Poria S, Hussain A, Cambria E, Hawalah AY, Gelbukh A, Zhou Q (2016) Multilingual sentiment analysis: state of the art and independent comparison of techniques. Cognitive Computation 8(4):757–771

Dolan B, Quirk C, Brockett C (2004) Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In: Proceedings of the 20th international conference on Computational Linguistics, Association for Computational Linguistics, pp 350

dos Santos C, Gatti M (2014) Deep convolutional neural networks for sentiment analysis of short texts. In: Proceedings of COLING 2014, the 25th international conference on computational linguistics: technical papers, pp 69–78

Dovdon E, Saias J (2017) ej-sa-2017 at SemEval-2017 task 4: experiments for target oriented sentiment analysis in Twitter. ACL. http://www.aclweb.org/anthology/S/S17/S17-2106.pdf

Dragoni M, Federici M, Rexha A (2018) An unsupervised aspect extraction strategy for monitoring real-time reviews stream. Inf Process Manage 56(3):1103–1118

Go A, Bhayani R, Huang L (2009) Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(12)

Hailong Z, Wenyan G, Bo J (2014) Machine learning and lexicon based methods for sentiment classification: a survey. In: 11th web information system and application conference (WISA 2014), IEEE, pp 262–265

Hanafiah N, Kevin A, Sutanto C, Arifin Y, Hartanto J (2017) Text normalization algorithm on Twitter in complaint category. Procedia Comput Sci 116:20–26

He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE international conference on computer vision, pp 1026–1034

Hwang K, Sung W (2017) Character-level language modeling with hierarchical recurrent neural networks. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), 2017, pp 5720–5724

Jabreel M, Moreno A (2017) SiTAKA at SemEval-2017 task 4: sentiment analysis in Twitter based on a rich set of features. In: Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017), pp 694–699

Jianqiang Z, Xiaolin G (2017) Comparison research on text pre-processing methods on twitter sentiment analysis. IEEE Access 5:2870–2879

Kim Y (2014) Convolutional neural networks for sentence classification. arXiv:1408.5882 (arXiv preprint)

Ma Y, Peng H, Khan T, Cambria E, Hussain A (2018) Sentic LSTM: a hybrid network for targeted aspect-based sentiment analysis. Cogn Comput 10(4):639–650

Martínez-Cámara E, Martín-Valdivia MT, Urena-López LA, Montejo-Ráez AR (2014) Sentiment analysis in Twitter. Nat Lang Eng 20(1):1–28

Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: NIPS’13 proceedings of the 26th international conference on neural information processing systems, vol 2, pp 3111–3119

Mozetič I, Grčar M, Smailović J (2016) Multilingual Twitter sentiment classification: the role of human annotators. PLoS ONE 11(5):e0155036

Nicolai G, Kondrak G (2016) Leveraging inflection tables for stemming and lemmatization. In: Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: long papers), pp 1138–1147

Onyibe C, Habash N (2017) OMAM at SemEval-2017 Task 4: English sentiment analysis with conditional random fields. In: Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017), pp 670–674

Ouyang X, Zhou P, Li CH, Liu L (2015) Sentiment analysis using convolutional neural network. In: IEEE international conference on computer and information technology; ubiquitous computing and communications; dependable, autonomic and secure computing; pervasive intelligence and computing (CIT/IUCC/DASC/PICOM), 2015, pp 2359–2364

Pontiki M, Galanis D, Papageorgiou H, Manandhar S, Androutsopoulos I (2015) SemEval-2015 task 12: aspect based sentiment analysis. In: Proceedings of the 9th international workshop on semantic evaluation, pp 486–495

Prettenhofer P, Stein B (2010) Cross-language text classification using structural correspondence learning. In: Proceedings of the 48th annual meeting of the association for computational linguistics, Association for Computational Linguistics, pp 1118–1127

Roccetti M, Prandi C, Salomoni P, Marfia G (2016) Unleashing the true potential of social networks: confirming infliximab medical trials through Facebook posts. Netw Model Anal Health Inf Bioinform 5(1):15

Roccetti M, Salomoni P, Prandi C, Marfia G, Mirri S (2017) On the interpretation of the effects of the Infliximab treatment on Crohn’s disease patients from Facebook posts: a human vs. machine comparison. Netw Model Anal Health Inf Bioinform 6(1):11

Rosenthal S, Farra N, Nakov P (2017) SemEval-2017 task 4: sentiment analysis in Twitter. In: Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017), pp 502–518

Ruangkanokmas P, Achalakul T, Akkarajitsakul K (2016) Deep belief networks with feature selection for sentiment classification. In: 7th International conference on intelligent systems, modelling and simulation (ISMS), 2016, pp 9–14

Saif H, Fernández M, He Y, Alani H (2014) On stopwords, filtering and data sparsity for sentiment analysis of Twitter. In: LREC 2014, ninth international conference on language resources and evaluation. Proceedings, pp 810–817

Silva C, Ribeiro B (2003) The importance of stop word removal on recall values in text categorization. In: Proceedings of the international joint conference on neural networks, vol 3, pp 1661–1666

Singh T, Kumari M (2016) Role of text pre-processing in Twitter sentiment analysis. Procedia Comput Sci 89:549–554

Vateekul P, Koomsubha T (2016) A study of sentiment analysis using deep learning techniques on Thai Twitter data. In: 13th international joint conference on computer science and software engineering (JCSSE), 2016, pp 1–6

Vechtomova O (2017) Disambiguating context-dependent polarity of words: an information retrieval approach. Inf Process Manage 53(5):1062–1079

Vinodhini G, Chandrasekaran RM (2012) Sentiment analysis and opinion mining: a survey. Int J 2(6):282–292

Wehrmann J, Becker W, Cagnini HE, Barros RC (2017) A character-based convolutional neural network for language-agnostic Twitter sentiment analysis. In: International joint conference on neural networks (IJCNN), 2017, pp 2384–2391

Yang Z, Hu Z, Salakhutdinov R, Berg-Kirkpatrick T (2017) Improved variational autoencoders for text modeling using dilated convolutions. In: Proceedings of the 34th international conference on machine learning, vol 70, pp 3881–3890

Yuvaraj N, Sabari A (2017) Twitter sentiment classification using binary shuffled frog algorithm. Intell Autom Soft Comput 23(2):373–381

Zare M, Rohatgi S (2017) DeepNorm—a deep learning approach to text normalization. arXiv:1712.06994 (arXiv preprint)

Zhang X, LeCun Y (2015) Text understanding from scratch. arXiv:1502.01710 (arXiv preprint)

Zhang J, Zong C (2015a) Neural networks in machine translation: an overview. IEEE Intell Syst, pp 1724–1734

Zhang J, Zong C (2015b) Deep neural networks in machine translation: an overview. IEEE Intell Syst 30(5):16–25

Zhang X, Zhao J, LeCun Y (2015) Character-level convolutional networks for text classification. In: Advances in neural information processing systems, pp 649–657

Zhou G, Zeng Z, Huang JX, He T (2016) Transfer learning for cross-lingual sentiment classification with weakly shared deep neural networks. In: Proceedings of the 39th international ACM SIGIR conference on research and development in information retrieval, ACM, pp 245–254

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.