58
Mining Social Text Data ZHAO Xin [email protected] Renmin University of China

Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

  • Upload
    others

  • View
    12

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Mining Social Text Data

ZHAO Xin [email protected]

Renmin University of China

Page 2: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Overview

• Introduction to social text data

• Basic techniques to manipulate text data

• Typical text mining applications for social science

• Hot trend and future direction

• Conclusion

Page 3: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Overview

• Introduction to social text data

• Basic techniques to manipulate text data

• Typical text mining applications for social science

• Hot trend and future direction

• Conclusion

Page 4: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Social Text Data

• Various kinds of social media

– Text data is one of the most important data types

• Tweets

• Blogs

• Comments

• Profiles

• Messages

• ……

Page 5: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Characteristics of social text data

• User-oriented

– Social text data is created by users themselves

• A social networking account corresponds to an offline user in real life

• Users can generate text contents as they like

• It provides a valuable data resource to understand social users

Page 6: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Characteristics of social text data

• Opinionated text

– Not only including facts

Page 7: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Characteristics of social text data

• Informal and noisy text

– New terms were created by users

– A traditional vocabulary cannot work

Page 8: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Characteristics of social text data

• Rich-link text

– On Twitter, we can simply connect ourselves with any celebrity account by using two typical ways:

• mention (@) and retweet (RT)

Page 9: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Characteristics of social text data

• Rich-type text

– Multiple types of information are often embedded in the text

• E.g., – Hashtag

– Check-ins (Location and time)

– Phone type

Page 10: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Overview

• Introduction to social text data

• Basic techniques to manipulate text data

• Typical text mining applications for social science

• Hot trend and future direction

• Conclusion

Page 11: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Fundamental natural language processing techniques

• Turn text into semantic units

• Provide semantic units with meaningful annotations

Page 12: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Fundamental natural language processing techniques

• Turn text into semantic units

• Provide semantic units with meaningful annotations

Page 13: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Tokenization

• For English

– Delimiter-based tokenization method

– Is it a simple task?

Page 14: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Chinese Segmentation

• For Chinese

– No space characters help

• 他明天去香港 ->他--明天--去--香港 – (He is going to HK tomorrow.)

– Nearly a well-done task

• On standard text collections, the accuracy is very high For social text data, it needs more improving

• Challenge: informal writing styles and out-of-vocabulary terms – E.g., new terms created by social users

Page 15: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Fundamental natural language processing techniques

• Turn text into semantic units

• Provide semantic units with meaningful annotations

Page 16: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Part-of-speech tagging

• Word-category tagging/labeling/classification

Page 17: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Part-of-speech tagging (POS tagging)

• Word-category tagging

Page 18: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Named Entity Recognition (NER)

• A very important task: find and classify entity mentions in text, for example:

– The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.

Page 19: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

• A very important task: find and classify entity mentions in text, for example:

– The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.

Named Entity Recognition (NER)

Page 20: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

• A very important task: find and classify entity mentions in text, for example:

– The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.

Named Entity Recognition (NER)

Person Date Location Organization

Page 21: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Entity Linking

• President Obama won the 2009 Nobel Peace Prize

Page 22: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Semantic Analysis Model

• Statistical topic models

• Word embedding models

Page 23: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Semantic Analysis Model

• Statistical topic models

• Word embedding models

Page 24: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Statistical topic models

• Question:

– Given a large text dataset, how do we know the major topics in it?

We need a way to summarize and compress information

Page 25: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Statistical topic models

• Output I:

– Topics

– Can be understood as “clusters of words”

Page 26: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Statistical topic models

• Output II:

– Per-document topic distribution

– Can be understood as “document representation”

Page 27: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Statistical topic models

• Put together

Page 28: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Key ideas in topic models

• Capturing word co-occurrence

– If two words frequently co-occur in documents, these two words tend to be captured in top positions of the same topics

– The number of training documents are supposed to be large

• A few hundred documents will not work well

Page 29: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Semantic Analysis Model

• Statistical topic models

• Word embedding model

Page 30: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Word embedding

• Distributional semantics

Page 31: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Word embedding

• Distributional semantics

Page 32: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Word2Vec

• Input: a sequence of word tokens from a vocabulary

– Requiring a large amount of documents

• Output: a fixed-length embedding vector for each term in the vocabulary

– The embedding vector usually has several hundred dimensions with dense values

Page 33: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Word embedding

• Usage: finding similar words

– E.g., query word=“Sweden”

Page 34: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Interesting Observations

• China-Beijing = Japan - Tokyo

Page 35: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Overview

• Introduction to social text data

• Basic techniques to manipulate text data

• Typical text mining applications for social science

• Hot trend and future direction

• Conclusion

Page 36: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Typical text mining applications for social science

• Data cleaning

• Sentiment analysis

• User profiling

• User interest modeling

Page 37: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Typical text mining applications for social science

• Data cleaning

• Sentiment analysis

• User profiling

• User interest modeling

Page 38: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Data cleaning

• Cleaning social text data

– Very challenging to be perfect in practice

• E.g., word normalization – SG => Singapore

– HK => Hong Kong

– But fortunately standard techniques usually work moderately well (with slight modifications) in many cases

• E.g., word2vec and topic models can capture the above abbreviation patterns to some extent

Page 39: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Data cleaning

• Spam detection

– Machine learning or rule-based approach

– The key is to select effective features based on user behaviors

• Multiple postings in a short period

• Redundant or nearly redundant postings

• Irrelevant postings

Page 40: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Typical text mining applications for social science

• Data cleaning

• Sentiment analysis

• User profiling

• User interest modeling

Page 41: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Key technical components

• Opinion extraction

– “I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too.”

• What do we see?

– Opinion holders: persons who make the comments

– Opinion targets: entities and their features/aspects – Sentiments: opinions towards opinion targets

Page 42: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Key technical components

• Polarity identification

– Simplest form:

• Is the attitude of this text positive or negative?

• A classification-based approach – Feature selection

– Opinion lexicon

– More complex forms:

• Five-star rating prediction

• A regression-based approach

Page 43: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

For ecommerce: Feature-based opinion summary

43

For each extracted aspect, we can derive the corresponding opinion score.

Page 44: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

44

Using Twitter sentiment for stock price prediction

Do

w J

on

es • The CALM

mood predicts DJIA 3 days later

CA

LM

Bollen et al. (2011)

Page 45: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Twitter sentiment versus Gallup Poll of Consumer Confidence

Brendan O'Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. 2010. From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. In ICWSM-2010

Page 46: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Typical text mining applications for social science

• Data cleaning and preprocessing

• Sentiment analysis

• User profiling

• User interest modeling

Page 47: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

User profiling

• Direct extraction of demographics from social profiles

– Users fill in the profiles

• Sex

• Age

• Job

• Interests

• …….

Page 48: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

User profiling

• Implicit inference from social text data

– E.g., Amazon online reviews

“I bought my wife a new phone” => SEX = MALE

“I worked in an IT company......” => JOB = IT

Page 49: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Typical text mining applications for social science

• Data cleaning and preprocessing

• Sentiment analysis

• User profiling

• User interest modeling

Page 50: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

User interest modeling

• Topical expert ranking

– Given a topic, we need to identify influential experts for that topic

– Two major components

• Interest identification and Influence ranking

• (Individual) Interest identification – Find out what topics is a celebrity interested in

• (Overall) Influence ranking – Decide how influential a celebrity is in a given topic

Page 51: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

User interest modeling

• Topical expert ranking

Topical word cloud Topical celebrity ranking

Page 52: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

User interest modeling

• Collective interest or attention modeling

– Event detection

Page 53: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Overview

• Introduction to social text data

• Basic techniques to manipulate text data

• Typical text mining applications for social science

• Hot trend and future direction

• Conclusion

Page 54: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Deep learning for text mining

• Deep neural models are promising

– But relying on tedious parameter tuning and large training datasets

• Shallow neural models are more practical and easier to use by non-experts

– E.g., word2vec

Page 55: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Future directions for text mining

• Designing more effective information extraction models

• Joint utilizing multi-source information

• Developing large-scale and efficient algorithms

Page 56: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Toolkit list

• Toolkits

– NLP

• http://nlp.stanford.edu/software/lex-parser.shtml

• http://cogcomp.cs.illinois.edu/page/software/

– Topic models

• https:// gibbslda.sourceforge.net

– Word2vec

• https://code.google.com/p/word2vec

Page 57: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Thanks!

• Acknowledgement

– Thanks for Prof. Zhu’s invitation and organization

– Thanks for your attention

Some slides are modified based on the Stanford NLP course slides.

Page 58: Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

References • https://www.coursera.org/course/nlp • https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html • Wei Shen, Jianyong Wang, and Jiawei Han. Entity Linking with a Knowledge Base: Issues,

Techniques, and Solutions. Transactions on Knowledge & Data Engineering 27(2):443--460 (2015) • Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). Distributed

representations of words and phrases and their compositionality. NIPS 2013. • D. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012. • Ee-Peng Lim, Viet-An Nguyen, Nitin Jindal, Bing Liu, Hady Wirawan Lauw. Detecting product review

spammers using rating behaviors. CIKM 2010: 939-948 • Yoshua Bengio, Aaron Courville, Pascal Vincent. Representation Learning: A Review and New

Perspectives. http://arxiv.org/abs/1206.5538 • Jinpeng Wang, Xin Zhao and Xiaoming Li. Mining New Business Opportunities: Identifying Trend

related Products by Leveraging Commercial Intents from Microblogs. In EMNLP’13. • Xin Zhao, YanweiGuo, Yulan He, Han Jiang, Yuexin Wu and Xiaoming Li. A Demographic-based

System for Product Recommendation On Microblogs. In KDD 2014. • Jinpeng Wang, Gao Cong, Xin Zhao, Xiaoming Li. Mining User Intents in Twitter: A Semi-Supervised

Approach to Inferring Intent Categories for Tweets. In AAAI 2015. • Xin Zhao, Jinpeng Wang, Yulan He, Ji-Rong Wen, Edward Y. Chang, Xiaoming Li. Mining Product

Adopter Information from Online Reviews for Improving Product Recommendation. TKDD 10(3): 29 (2016)

• Xin Zhao, Yanwei Guo, Rui Yan, Yulan He, Xiaoming Li. Timeline generation with social attention. SIGIR 2013: 1061-1064