
Page 1:

INFORMATION EXTRACTION ENGINE FOR SENTIMENT-TOPIC MATCHING IN PRODUCT INTELLIGENCE APPLICATIONS

CORNELIA FERNER | INTERNATIONAL DATA SCIENCE CONFERENCE | SALZBURG 2017

WERNER POMWENGER | MARTIN SCHNÖLL | VERONIKA HAAF | ARNOLD KELLER | STEFAN WEGENKITTL

Page 2:

MOTIVATION

http://www.pcworld.com/article/2138145/ultrabooks/lenovo-thinkpad-x1-carbon-review-slightly-overdone-but-plenty-tasty.html

Page 3:

MOTIVATION

Page 4:

AGENDA

• ARIE – article and review information extraction engine
• Topic classification
• Sentiment analysis
• Recap

Page 5:

ARIE

Page 6:

WEBPAGE PIPELINE

• Content extraction:
  • Boilerplate removal (comments, ads, teasers, etc.)
  • Raw text extraction (without HTML tags)
  • Store metadata
• Review validation:
  • Only expert reviews are needed
  • Sort out ads, comparisons, etc.
  • Latent Dirichlet Allocation + SVM
• Product type recognition:
  • Only laptops (e.g. ultrabooks, convertibles) are needed
  • Sort out reviews on displays, speakers, etc.
  • Maximum Entropy (logistic regression for multiclass problems); a rough sketch of both classifiers follows the diagram below

[Diagram: Webpage Pipeline: Content Extraction → Review Validation → Product Type Recognition]
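The review validation and product type stages map onto standard text classification components. The following is a rough sketch assuming scikit-learn; the pipeline names, hyperparameters (e.g. the LDA topic count) and label variables are illustrative placeholders, not the authors' exact configuration.

```python
# Rough sketch of the two classifier stages, assuming scikit-learn.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# Review validation: LDA topic features fed into a linear SVM
# (expert review vs. ad, comparison, ...)
review_validator = Pipeline([
    ("counts", CountVectorizer(lowercase=True)),
    ("lda", LatentDirichletAllocation(n_components=50)),  # topic count is a placeholder
    ("svm", LinearSVC()),
])

# Product type recognition: Maximum Entropy = multinomial logistic regression
# (laptop / ultrabook / convertible vs. display, speaker, ...)
product_type_clf = Pipeline([
    ("counts", CountVectorizer(lowercase=True)),
    ("maxent", LogisticRegression(multi_class="multinomial", solver="lbfgs")),
])

# Hypothetical usage:
# review_validator.fit(page_texts, is_expert_review)
# product_type_clf.fit(review_texts, product_type_labels)
```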

Page 7:

ARTICLE PIPELINE

[Diagram: Article Pipeline: Tokenization → Topic Classification → Sentiment Analysis]

Page 8:

ARTICLE PIPELINE

• Tokenization:
  • Words and sentences
  • Sentence-level annotations for topic classification and sentiment analysis
• Preprocessing:
  • Lowercasing
  • No stopwords, no stemming, no removal of infrequent/frequent words
• Features (a minimal extraction sketch follows below):
  • Bag-of-words (also with bigrams)
  • Word2Vec

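A minimal sketch of the bag-of-words feature extraction with the preprocessing choices listed above, assuming scikit-learn; the example sentences are invented.

```python
# Bag-of-words (plus bigrams) with the slide's preprocessing:
# lowercasing only, no stop-word removal, no stemming, no frequency pruning.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "The keyboard is comfortable to type on.",   # invented example sentences
    "The display is bright and sharp.",
]

vectorizer = CountVectorizer(
    lowercase=True,
    ngram_range=(1, 2),    # unigrams and bigrams
    stop_words=None,       # keep stop words
    min_df=1, max_df=1.0,  # keep infrequent and frequent words
)
X = vectorizer.fit_transform(sentences)   # sparse count matrix, one row per sentence
```

The Word2Vec variant would replace the count matrix with (for example, averaged) word vectors; that variant is not shown here.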

Page 9:

ARTICLE PIPELINE

• W = {1, …, V} … dictionary; set of all words
• T = {1, …, K} … set of topics
• t ∈ T and w = (w_1, …, w_n), w_i ∈ W … topic and sequence of words, respectively
• MaxEnt (logistic regression, softmax):
  • x = (x_1, …, x_V) … count vector (absolute word counts in a sequence) with x_v = Σ_i 1(w_i = v)
  • P(t = k | x) = exp(λ_k · x) / Σ_{j=1}^{K} exp(λ_j · x) = P(t = k | w)

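The softmax formula above can be evaluated directly; this is a minimal NumPy sketch, where `lam` (the K x V weight matrix) and `x` (the count vector of one sentence) are placeholder names.

```python
import numpy as np

def maxent_posterior(x, lam):
    """P(t = k | x) for all K topics: softmax over the linear scores lam_k . x."""
    scores = lam @ x          # one score per topic
    scores -= scores.max()    # numerical stability; does not change the softmax
    e = np.exp(scores)
    return e / e.sum()
```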

Page 10:

ARTICLE PIPELINE

• Hidden Markov Model (HMM):
  • Decoding: find the most probable sequence of hidden states (topics) given the model and a sequence of observations (words); see the Viterbi sketch below
  • A = {a_ij} with a_ij = P(t_{s+1} = j | t_s = i) … transition probabilities
  • B = {b_j(w)} with b_j(w) = P(w_s = w | t_s = j) … emission probabilities
  • π_i = P(t_1 = i) … initial state probabilities
  • λ = (T, W, A, B, π)
• Combining MaxEnt and HMM (Bayes):
  • b_j(w_s) = P(w_s | t_s = j) = P(t_s = j | w_s) · P(w_s) / P(t_s = j)


[Illustration: two topic states, Keyboard and Display, emitting words such as "key" and "bright"]
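Decoding as described above is the standard Viterbi algorithm. A compact sketch follows; `log_A`, `log_pi` and the per-sentence log emission scores are assumed as inputs (how the emission scores are obtained from the MaxEnt output is shown on the following slides).

```python
import numpy as np

def viterbi(log_A, log_pi, emission_logp):
    """Most probable topic sequence.
    log_A: (K, K) log transition probabilities a_ij = P(t_{s+1}=j | t_s=i)
    log_pi: (K,) log initial probabilities
    emission_logp: (S, K) log emission score of sentence s under topic j"""
    S, K = emission_logp.shape
    delta = np.empty((S, K))            # best log score of a path ending in topic j at step s
    psi = np.zeros((S, K), dtype=int)   # backpointers
    delta[0] = log_pi + emission_logp[0]
    for s in range(1, S):
        trans = delta[s - 1][:, None] + log_A     # (K, K): previous topic i -> current topic j
        psi[s] = trans.argmax(axis=0)
        delta[s] = trans.max(axis=0) + emission_logp[s]
    path = np.empty(S, dtype=int)
    path[-1] = delta[-1].argmax()
    for s in range(S - 2, -1, -1):      # backtrack along the stored pointers
        path[s] = psi[s + 1, path[s + 1]]
    return path
```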

Page 11:

ARTICLE PIPELINE

• Combining MaxEnt and HMM (Bayes):
  • b_j(w_s) = P(t_s = j | w_s) · P(w_s) / P(t_s = j)


Page 12:

ARTICLE PIPELINE

• Combining MaxEnt and HMM (Bayes):
  • b_j(w_s) = P(t_s = j | w_s) · P(w_s) / p̂_j, with p̂_j ≅ P(t_s = j)


Page 13:

ARTICLE PIPELINE

• Combining MaxEnt and HMM (Bayes):
  • b_j(w_s) ∝ P(t_s = j | w_s) / p̂_j, with p̂_j ≅ P(t_s = j); the word prior P(w_s) is the same for every topic j and can be dropped for decoding


Page 14:

ARTICLE PIPELINE

• Combining MaxEnt and HMM (Bayes):
  • b_j(w_s) ∝ P(t_s = j | w_s) / p̂_j
  • Substituting the MaxEnt model: b_j(w_s) ∝ (1 / p̂_j) · exp(λ_j · x_s) / Σ_{k=1}^{K} exp(λ_k · x_s)

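Putting the last steps together: the emission score fed to the Viterbi decoder is the MaxEnt posterior scaled by the estimated topic prior. A minimal sketch, assuming the priors p̂_j are estimated as relative topic frequencies in the labelled data; function and variable names are illustrative.

```python
import numpy as np

def scaled_emission_logp(posteriors, topic_priors, eps=1e-12):
    """posteriors: (S, K) MaxEnt outputs P(t = j | w_s) for each sentence,
    topic_priors: (K,) estimates p_hat_j of P(t = j).
    Returns (S, K) log emission scores proportional to log P(w_s | t = j)."""
    return np.log(posteriors + eps) - np.log(topic_priors + eps)

# Hypothetical usage with the Viterbi sketch from above:
# emission_logp = scaled_emission_logp(maxent_posteriors, topic_priors)
# best_topic_sequence = viterbi(np.log(A), np.log(pi), emission_logp)
```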

Page 15:

ARTICLE PIPELINE

• Combining MaxEnt and HMM (Bayes):
  • Resulting emission score used for decoding: b_j(w_s) ∝ (1 / p̂_j) · exp(λ_j · x_s) / Σ_{k=1}^{K} exp(λ_k · x_s)



Page 17:

ARTICLE PIPELINE

• 3,152 reviews
• 240,220 manually labelled sentences
• 17 predefined topics


Page 18:

ARTICLE PIPELINE

• Distance between the topics' MaxEnt distributions (bottom left) and the topics' word frequencies (top right)
• Hellinger distance:
  H(P, Q) = (1/√2) · √( Σ_i (√p_i − √q_i)² )

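A one-line sketch of the Hellinger distance from the formula above; the inputs are assumed to be probability vectors that each sum to one.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions p and q."""
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)
```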

Page 19:

ARTICLE PIPELINE


Page 20:

ARTICLE PIPELINE


[Figure: Classifier Accuracy for Topic Detection; bar chart comparing MaxEnt and HMM, y-axis 0-100%]

Page 21:

ARTICLE PIPELINE


Page 22:

ARTICLE PIPELINE

• MaxEnt
  • Baseline
• Recurrent neural network (RNN)
  • Best performance on sentence level
• Recursive neural tensor network (RNTN)
  • Requires a parsed syntax tree
  • Good on word level


Image credit: Socher et al. https://www.slideshare.net/jiessiecao/parsing-natural-scenes-and-natural-language-with-recursive-neural-networks
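For the RNN variant, a sentence-level sentiment classifier can be sketched as an LSTM over word indices. This is only an illustrative Keras layout: the vocabulary size, dimensions and training setup are assumptions, and [RNN] describes the LSTM cell itself, not this exact architecture.

```python
import tensorflow as tf

vocab_size, embedding_dim, num_classes = 20000, 100, 5   # illustrative sizes

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),       # word indices -> vectors
    tf.keras.layers.LSTM(64),                                    # sentence representation
    tf.keras.layers.Dense(num_classes, activation="softmax"),    # 5-class sentiment
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```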

Page 23:

ARTICLE PIPELINE

• 21,695 manually labelled sentences
• 5-class analysis:
  • very positive
  • positive
  • neutral
  • negative
  • very negative
• 3-class analysis:
  • positive
  • neutral
  • negative

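The 3-class analysis can be read as collapsing the outer labels of the 5-class scheme; a small sketch of that mapping follows (the collapse rule itself is an assumption, not stated on the slide).

```python
# Hypothetical mapping from the 5-class to the 3-class sentiment scheme.
FIVE_TO_THREE = {
    "very positive": "positive",
    "positive": "positive",
    "neutral": "neutral",
    "negative": "negative",
    "very negative": "negative",
}

labels_5 = ["very positive", "neutral", "very negative"]   # invented examples
labels_3 = [FIVE_TO_THREE[label] for label in labels_5]    # -> positive, neutral, negative
```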

Page 24:

ARTICLE PIPELINE


Classifier Accuracy for Sentiment Analysis:

             5-class    3-class
  RNTN       56.71%     66.90%
  RNN        59.19%     68.34%
  MaxEnt     55.62%     65.82%

Page 25:

LESSONS LEARNT

• Language is ambiguous.
• Lack of standardized features or off-the-shelf (preprocessing) methods.
• There isn't only English.

Page 26:

ONGOING RESEARCH

• Representation learning with CNNs (deep neural networks)
• Change the fully connected layers to adapt the net to new tasks
• Multiple languages: reuse the network architecture
• Character-wise approach for reusability of representations

Image credit: http://www.wildml.com
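A rough sketch of a character-level CNN text classifier along these lines, in Keras; the alphabet size, sequence length and layer sizes are illustrative assumptions, and only the final dense layers would be exchanged per task or language.

```python
import tensorflow as tf

num_chars, num_classes = 70, 17   # illustrative alphabet and class counts

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(num_chars, 16),                    # character indices -> vectors
    tf.keras.layers.Conv1D(256, 7, activation="relu"),
    tf.keras.layers.MaxPooling1D(3),
    tf.keras.layers.Conv1D(256, 7, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),                        # reusable representation
    # task-specific part: exchange these layers to adapt the net to a new task
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
```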

Page 27:

LITERATURE

[MaxEnt] A. Berger, S. Della Pietra, and V. Della Pietra, "A maximum entropy approach to natural language processing," Computational Linguistics, vol. 22(1), pp. 39-71, 1996.

[HMM] O. Cappé, E. Moulines, and T. Rydén, "Inference in Hidden Markov Models," New York: Springer, 2005.

[RNN] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9(8), pp. 1735-1780, 1997.

[RNTN] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng, and C. Potts, "Recursive deep models for semantic compositionality over a sentiment treebank," Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.

[UIMA] D. Ferrucci and A. Lally, "UIMA: An architectural approach to unstructured information processing in the corporate research environment," Natural Language Engineering, vol. 10(3-4), pp. 327-348, 2004.