Sentiment in German-language News and Blogs, and the DAX

Robert Remus (1,2), Khurshid Ahmad (2), Gerhard Heyer (1)

(1) Fakultät für Mathematik und Informatik, Universität Leipzig, Germany

(2) School of Computer Science and Statistics, Trinity College Dublin, Ireland

Text Mining Services, 2009

Outline

Preamble

A German Case Study
• Our Corpus of News and Blogs
• Our German Dictionary of Affect
• DAX 30

Results
• Stylised Variables
• Non-Normal Distribution

Summary

Preamble: Assumptions

1. Our approach is data-driven and relies on the assumption that sentiment, as a human quality, is expressed in text and can be identified by a machine, approximately, through a frequency analysis of words

2. Moreover, we assume that there may be a relation between publications on economics and finance and movements in financial markets
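
The first assumption can be illustrated with a minimal sketch of a frequency analysis. The two word sets below are hypothetical mini-dictionaries built from a handful of terms that appear later in these slides, the tokenizer is a simple assumption, and the sample sentence is invented; none of this is the authors' actual pipeline.

```python
import re
from collections import Counter

# Hypothetical mini-dictionaries (assumption); the study's dictionary is far larger.
POSITIVE = {"gut", "viel", "viele", "angebot"}
NEGATIVE = {"krise", "streik", "finanzkrise", "kosten"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-zäöüß]+", text.lower())

def affect_shares(text):
    """Return (positive share, negative share) of all tokens in the text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0, 0.0
    counts = Counter(tokens)
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    return pos / len(tokens), neg / len(tokens)

# Invented sample sentence, for illustration only.
sample = "Die Finanzkrise und der Streik treiben die Kosten, aber das Angebot bleibt gut"
pos_share, neg_share = affect_shares(sample)
print(f"positive: {pos_share:.2%}, negative: {neg_share:.2%}")
```

Computed per day or per year over a diachronic corpus, such shares would give one time series for each polarity.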

Our Corpus of News and Blogs

• The corpus is diachronically organised

• The news articles were published, and the blog posts posted, between 2006 and 2008

Corpus   Items   Word tokens   Word types
News     8,812   3,911,104     137,343
Blogs    1,719     431,722      33,325

Our German Dictionary of Affect I

• Our study mainly uses the terms categorized as Pos or Neg in Harvard University’s General Inquirer lexicon

• These terms were translated into German by a mixture of human and machine translation, then manually revised and extended with inflections, resulting in a German dictionary of the following size:

Polarity   Words
Positive    9,301
Negative   10,697

Our German Dictionary of Affect II

• The frequency of occurrence of negative and positive terms follows a Zipf-like distribution

• Viewed annually, their overall contribution to the news corpus remains constant at around 4% for positive terms and between 2% and 3% for negative terms
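
One common way to check for Zipf-like behaviour is to fit a straight line to log frequency against log rank. The sketch below does this with ordinary least squares; the function name `zipf_exponent` and the synthetic counts are assumptions made for illustration, and the slides do not state how the Zipf-like shape was established.

```python
import math

def zipf_exponent(frequencies):
    """Estimate s in f(r) ~ C / r**s by least squares on log rank vs. log frequency.

    All frequencies must be strictly positive, since their logarithm is taken.
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx  # the fitted slope is negative, so negate it

# Synthetic counts following f(r) = 1000 / r exactly (illustrative, not study data).
synthetic = [1000 / r for r in range(1, 51)]
print(round(zipf_exponent(synthetic), 3))
```

An exponent near 1 on real term counts would indicate the classic Zipf shape; larger deviations would suggest a different power law or none at all.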

Our German Dictionary of Affect III

The 10 most frequent “positive” terms (News corpus, 2006–2008)

Word       GI equivalent   f_rel
viele      plenty          0.08%
viel       plenty          0.08%
gut        good            0.07%
grossen    great           0.05%
macht      to create       0.05%
grosse     great           0.05%
geben      to give         0.04%
angebot    offer           0.04%
teil       deal            0.04%
erhalten   to obtain       0.03%
Total                      0.53%

Our German Dictionary of Affect IV

The 10 most frequent “negative” terms (News corpus, 2006–2008)

Word          GI equivalent   f_rel
gegen         against         0.15%
ende          —               0.09%
fall          fall            0.05%
streik        strike          0.04%
krise         crisis          0.04%
kosten        cost            0.04%
finanzkrise   —               0.04%
knapp         short           0.03%
streiks       strike          0.03%
trotz         defiance        0.03%
Total                         0.54%

DAX 30

• Our study uses the DAX 30 (Deutscher Aktien IndeX 30), which comprises the 30 largest and most actively traded German companies listed on the Frankfurt Stock Exchange

Results: Stylised Variables I

Time series      Min     Max     10^4 × Mean
DAX 30           -0.09   0.11    -2.03
News Positive    -1.82   1.66     1.65
News Negative    -1.54   1.69     8.29
Blogs Positive   -2.2    2.08     5.34
Blogs Negative   -2.18   2.48    16.18

Results: Stylised Variables II

Time series      10^2 × Std Dev   Skewness   Kurtosis
DAX 30            1.61             0.15      11.14
News Positive    28.21            -0.07       5.45
News Negative    35.44            -0.03       1.56
Blogs Positive   45.02            -0.02       1.81
Blogs Negative   68.96             0.04       0.24
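
The statistics in the two tables above can be computed from any return series. The sketch below assumes log returns, r_t = ln(x_t / x_{t-1}), population (biased) moment estimators, and excess kurtosis (0 for a normal distribution); the slides do not say which definitions were used, and the input series is invented.

```python
import math

def log_returns(series):
    """Log returns r_t = ln(x_t / x_{t-1}) of a strictly positive series (assumed definition)."""
    return [math.log(b / a) for a, b in zip(series, series[1:])]

def stylised_statistics(xs):
    """Min, max, mean, std dev, skewness and excess kurtosis (population estimators)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n  # fourth central moment
    return {
        "min": min(xs),
        "max": max(xs),
        "mean": mean,
        "std": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }

# Invented daily values, for illustration only.
values = [100.0, 102.0, 99.0, 101.0, 105.0, 104.0]
for name, value in stylised_statistics(log_returns(values)).items():
    print(f"{name}: {value:.4f}")
```

The same two functions would apply unchanged to an index level series and to a daily affect-frequency series, which is what makes the rows of the tables comparable.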

Results: Non-Normal Distribution I

Probability distribution between (St. Dev.)   Normal   DAX
0 to 0.25                                     19.74%   33.46%
0.25 to 0.5                                   18.55%   22.31%
0.5 to 1                                      29.98%   27.17%
1 to 1.5                                      18.37%   10.24%
1.5 to 2                                       8.81%    3.02%
2 to 3                                         4.28%    1.44%
3+                                             0.27%    2.36%
Total                                         100%     100%
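
The Normal column follows from the standard normal distribution, where P(|Z| < a) = erf(a / √2). The sketch below computes these band probabilities and, for comparison, the empirical share of observations in each band. Measuring the distance from the sample mean in units of the sample standard deviation is an assumption, as are the function names; the slides do not spell out the procedure.

```python
import math

# Bands of absolute distance from the mean, in standard deviations.
BANDS = [(0.0, 0.25), (0.25, 0.5), (0.5, 1.0), (1.0, 1.5),
         (1.5, 2.0), (2.0, 3.0), (3.0, math.inf)]

def normal_band_probability(lo, hi):
    """P(lo <= |Z| < hi) for a standard normal Z."""
    upper = 1.0 if math.isinf(hi) else math.erf(hi / math.sqrt(2))
    return upper - math.erf(lo / math.sqrt(2))

def empirical_band_shares(returns):
    """Share of observations whose standardised absolute deviation falls in each band."""
    n = len(returns)
    mean = sum(returns) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in returns) / n)
    zs = [abs(r - mean) / std for r in returns]
    return [sum(lo <= z < hi for z in zs) / n for lo, hi in BANDS]

for lo, hi in BANDS:
    print(f"{lo} to {hi}: {normal_band_probability(lo, hi):.2%}")
```

A heavier share than the normal benchmark both near zero and beyond three standard deviations, as in the DAX column, is the usual signature of a peaked, fat-tailed distribution.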

Results: Non-Normal Distribution II

Probability distribution between (St. Dev.)   Normal   News Positive   News Negative
0 to 0.25                                     19.74%   24.74%          22.77%
0.25 to 0.5                                   18.55%   23.61%          20.41%
0.5 to 1                                      29.98%   30.01%          29.82%
1 to 1.5                                      18.37%   11.67%          14.77%
1.5 to 2                                       8.81%    4.70%           7.15%
2 to 3                                         4.28%    3.57%           4.14%
3+                                             0.27%    1.69%           0.94%
Total                                         100%     100%            100%

Results: Non-Normal Distribution III

Probability distribution between (St. Dev.)   Normal   Blogs Positive   Blogs Negative
0 to 0.25                                     19.74%   19.53%           20.12%
0.25 to 0.5                                   18.55%   19.88%           19.41%
0.5 to 1                                      29.98%   33.14%           31.12%
1 to 1.5                                      18.37%   15.74%           15.74%
1.5 to 2                                       8.81%    7.34%            8.76%
2 to 3                                         4.28%    3.55%            4.38%
3+                                             0.27%    0.83%            0.47%
Total                                         100%     100%             100%

Summary

It has been shown that

• the distributions of returns of affect content in news and blogs are not normal

• the returns, i.e. the changes, of affect content in German-language
  ◦ news are higher than in the DAX
  ◦ blogs are higher than in German-language news

• the volatility, i.e. the fluctuation, of affect content in German-language
  ◦ news is much higher than in the DAX
  ◦ blogs is much higher than in German-language news
