Luigi Curini - @Curini VOICES from the Blogs & University of Milan #Spoletta @ Mashable Social Media Day 30-Jun-2015 Separating the wheat from the chaff: Signal, Noise & Other stories in the Big (Data) World http://voicesfromtheblogs.com




Big (or organic) data

They are:
• Big (in volume)
• Many (per unit of time)
• Unstructured (messy, not ready to be processed)

Sources:
• Administrative repositories
• Transaction data
• Social media & social networks


Can we really ignore this?


Do they work at all?

The blue side of Big data


Big data “believers”

Wired Magazine (2008): “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”.


Big data “detractors”

Financial Times (2014): “Big Data: Are We Making a Big Mistake?”


Time (2014): “Google’s Flu Project Shows the Failing of Big Data”

Big Data: the Big Fail?


S&P500, butter production in Bangladesh and causality
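The butter-production point can be made concrete with a small simulation. This is a minimal sketch, assuming purely synthetic random walks (no real S&P500 or Bangladesh data); the seed, the series length, the number of candidates and all function names are illustrative choices. It searches many unrelated series for the one that best "explains" a target series.

```python
import math
import random
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Plain Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def random_walk(rng: random.Random, steps: int) -> List[float]:
    """Cumulative sum of independent Gaussian shocks."""
    level, path = 0.0, []
    for _ in range(steps):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


rng = random.Random(2015)  # fixed seed so the run is repeatable
target = random_walk(rng, 50)  # synthetic stand-in for a stock index
candidates = [random_walk(rng, 50) for _ in range(500)]  # unrelated "indicators"

# None of the candidates has any causal link to the target by construction.
correlations = [abs(pearson(target, c)) for c in candidates]
best = max(correlations)
print(f"strongest |correlation| among 500 unrelated series: {best:.2f}")
```

Searching enough unrelated series almost always turns up a strong correlation, which is why a high correlation found by data mining says nothing about causality.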


Believer or Detractor?

“BIG Data” are today’s “data”

Answer: “there are Big Data & small data scientists”

A good piece of advice: change the data scientist, not the data!


Believer or Detractor?

Data Scientist is the answer!


Let us focus on big data coming from Social Media


Pros
• geo-localized data
• retrospective analysis (capture opinions when they are expressed)
• real-time analysis (continuous monitoring and/or alerting)
• speed of data analysis (if you know how to do it)
• gathering of unsolicited opinions
• census-type analysis: analyze the entire population of texts, not just a sample

Cons
• the population on social media is not necessarily representative of the demographic population
• you can’t ask questions, just listen to people: if people don’t discuss a topic, you don’t have the data
• textual analysis: language evolves continuously and changes according to topic, media, etc.


Three simple ideas

“Romance should never begin with sentiment. It should begin with science and end with a settlement.”

Oscar Wilde, An Ideal Husband


How to analyze Social Media data (1)

NO: mentions, likes or retweets. Computers are good at counting these, but humans can do better!

Obama: 16.8M followers

Romney: 0.6M followers

Final result: Obama +4.0%!
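The gap between follower counts and the final result is easy to quantify. The sketch below uses only the three figures on the slide; reading follower share as if it were vote share is the naive assumption being illustrated, not a method anyone proposes, and the variable names are made up for the example.

```python
followers = {"Obama": 16_800_000, "Romney": 600_000}  # figures from the slide
actual_margin_pts = 4.0  # "Final result: Obama +4.0%"

total = sum(followers.values())
follower_share = {name: count / total for name, count in followers.items()}

# Naive assumption: follower share behaves like vote share.
implied_margin_pts = 100 * (follower_share["Obama"] - follower_share["Romney"])

print(f"Obama's share of followers:  {100 * follower_share['Obama']:.1f}%")
print(f"Margin implied by followers: {implied_margin_pts:.1f} points")
print(f"Actual margin:               {actual_margin_pts:.1f} points")
```

Follower counts would suggest a margin of about 93 points, against an actual margin of 4: counting attention is not the same as measuring opinion.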


How to analyze Social Media data (2)

NO: ontological dictionaries, nor NLP rules

“This movie has good premises. Looks like it has a nice plot, an exceptional cast, first class actors and Stallone gives his best. But it sucks”

"Ibis redibis numquam peribis in bello" can be translated as “will go, will come back, will not die in war", but also the opposite way: “will go, will not come back, will die in war"

“ragazza stufa scappa di casa… i genitori muoiono di freddo” (Italian: “stufa” means both “fed up” and “stove”, so the line reads as “fed-up girl runs away from home… her parents die of cold”)

“There is no favorable wind for the mariner who doesn’t know where to go” (Seneca)
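The movie review quoted above shows why counting dictionary words misleads. Here is a minimal sketch, assuming a tiny hypothetical word list: `POSITIVE`, `NEGATIVE` and `lexicon_score` are invented for this illustration and do not come from any real sentiment dictionary.

```python
import re

# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"good", "nice", "exceptional", "best"}
NEGATIVE = {"sucks", "bad", "awful"}

REVIEW = (
    "This movie has good premises. Looks like it has a nice plot, an "
    "exceptional cast, first class actors and Stallone gives his best. "
    "But it sucks"
)


def lexicon_score(text: str) -> int:
    """Count positive words minus negative words, ignoring context."""
    tokens = re.findall(r"[a-z]+", text.lower())
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


score = lexicon_score(REVIEW)
print(score)  # four positive hits against one negative hit
```

The score comes out positive, although any human reader sees that the final three words overturn everything before them.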


Look at the data Look into the data


Switch to Supervised Techniques!

The advantages of human beings…

• Always in sync with linguistic expressions

[dictionaries are static]

• Completely language-independent

• Moreover…


Beyond sentiment…

there is more information out there!!!


Opinions, reasons, attitudes, tones…

see the colours!

How to analyze Social Media data (3)

NO: individual classification followed by aggregation. Estimate the aggregate distribution of opinions directly!

We don’t care about the needle in the haystack...

...we care about the haystack! (G. King)
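The "haystack" idea can be sketched in a few lines. This is a generic illustration of estimating category shares directly, without labelling single texts. It is not the speaker's own algorithm, and it assumes just two categories, a handful of hand-coded training texts and a small vocabulary, all invented here. The observed word frequencies in the unlabeled texts are modelled as a mixture of the frequencies seen in each hand-coded category, and the mixing weight is solved for by least squares.

```python
from typing import List


def feature_rates(docs: List[str], vocab: List[str]) -> List[float]:
    """Share of documents that contain each vocabulary word."""
    token_sets = [set(doc.lower().split()) for doc in docs]
    return [sum(word in tokens for tokens in token_sets) / len(docs) for word in vocab]


def estimate_positive_share(
    train_pos: List[str], train_neg: List[str], unlabeled: List[str], vocab: List[str]
) -> float:
    """Solve t = share * a + (1 - share) * n for `share` by least squares."""
    a = feature_rates(train_pos, vocab)  # P(word | positive), from hand-coded texts
    n = feature_rates(train_neg, vocab)  # P(word | negative), from hand-coded texts
    t = feature_rates(unlabeled, vocab)  # P(word) in the texts to be analyzed
    numerator = sum((ai - ni) * (ti - ni) for ai, ni, ti in zip(a, n, t))
    denominator = sum((ai - ni) ** 2 for ai, ni in zip(a, n))
    if denominator == 0:
        raise ValueError("vocabulary does not separate the two categories")
    return min(1.0, max(0.0, numerator / denominator))


# Hypothetical hand-coded training set.
TRAIN_POS = ["love this brand", "great jacket love it", "great quality", "nice"]
TRAIN_NEG = ["hate this brand", "awful quality", "hate it awful", "never again"]
VOCAB = ["love", "great", "hate", "awful"]

# Synthetic unlabeled set built so that exactly 75% of texts are positive.
UNLABELED = TRAIN_POS * 3 + TRAIN_NEG

share = estimate_positive_share(TRAIN_POS, TRAIN_NEG, UNLABELED, VOCAB)
print(f"estimated positive share: {share:.2f}")
```

No single text is ever classified: the estimate targets the distribution itself, so individual misclassifications cannot accumulate into a biased aggregate.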


The iSA® innovation


What people don’t say if asked, but discuss on social media!


About 2.5M Arabic texts, July-October 2014


Charlie Hebdo effect?


Short Memory Effect!

Positive support   October 2014   7 & 8 January   Next week
World              21.2%          18.1%           21.9%
Europe             21.9%          17.5%           20%
France             20.8%          3%              17%

See the big picture: the Moncler case study

www.voicesfromtheblogs.com | we capture the sentiment of the net

Monday, Nov. 3rd, 2014: the day after the TV show Report aired a negative report on the Moncler company, online mentions of the brand (across Twitter, Facebook, Instagram, blogs, forums and other social channels) rose by about 450% compared to the average level.

That peak corresponded to a 22-point fall in social brand reputation in just 24 hours (positive sentiment went from 75% to 53%, and to 43% on Twitter alone).

The shares on the stock exchange fell as well, by 5%.

Was this due to the Social Media?


Obviously not: the negative trend was entirely predictable and independent of the social media sentiment.


MIXING OFFICIAL STATISTICS AND SOCIAL MEDIA DATA: NOWCASTING


Wired Next Index


Official statistics: “cold data”, backward-looking, low frequency, slow (GDP, imports/exports, number of companies, labour force statistics, etc.)

Surveys: “cold data”, slow, forward-looking (e.g. consumers’ or entrepreneurs’ expectations, etc.)

SM sentiment: “hot data”, nowcasting, forward-looking

INDEX.WIRED.IT


Period: 1 January - 31 March 2014

Hot indicators: self-expectation (VOICES), expectation in the country (VOICES)

Cold indicators: consumers’ expectation (Istat), entrepreneurs’ expectation (Istat)

Leads shown in the chart: -15 days, -10 days, -18 days

nowcasting = anticipation
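One common way to measure how many days a "hot" indicator anticipates a "cold" one is a lead-lag correlation scan. The sketch below assumes synthetic series, not the Wired Next Index or Istat data: the cold series is built as an exact 15-day-delayed copy of the hot one, and `best_lead` and the wave function are hypothetical names for this illustration.

```python
import math
from typing import List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Plain Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def best_lead(hot: List[float], cold: List[float], max_lead: int) -> Tuple[int, float]:
    """Find the lead k (in days) at which hot[t] best matches cold[t + k]."""
    best_k, best_r = 0, -2.0
    for k in range(max_lead + 1):
        x = hot[: len(hot) - k]  # hot values that still have a partner k days later
        y = cold[k:]             # cold values observed k days later
        r = pearson(x, y)
        if r > best_r:
            best_k, best_r = k, r
    return best_k, best_r


def wave(t: int) -> float:
    """Arbitrary smooth signal used to build the synthetic indicators."""
    return math.sin(t / 7.0) + 0.5 * math.sin(t / 3.0)


DAYS = 90
hot_series = [wave(t) for t in range(DAYS)]        # synthetic "hot" indicator
cold_series = [wave(t - 15) for t in range(DAYS)]  # same signal, 15 days late

lead, r = best_lead(hot_series, cold_series, max_lead=30)
print(f"hot indicator leads by {lead} days (correlation {r:.3f})")
```

With real indicators the peak correlation would be lower and noisier, so the estimated lead should be read as approximate.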


Wired Next Index vs. MIB Index

About us

VOICES from the Blogs was born in October 2010 as a scientific project to capture opinions expressed on the Web (social media, blogs, forums, web)

On 12/12/12 VOICES became a spin-off of the University of Milan, Italy, and started operations as an independent company

Up to January 2015 VOICES has analyzed more than half a billion posts written in Italian, English, French, Spanish, German, Russian, Arabic, Portuguese, Chinese and Japanese

In December 2014 VOICES was among the winners of the contest “Produrre Statistica ufficiale con i Big Data” (“Producing official statistics with Big Data”) promoted by [logo] & [logo]

www.voicesfromtheblogs.com | we look into the data, not at the data

Since March 2015 SWG has been a partner of VOICES

Thanks to this partnership, the first integrated group in data science and business intelligence was born in Italy


But remember…

Big Data is likely to contribute so long as the desired qualities of the data are not negatively correlated with the quantity of data

In a nutshell…

Methods DO MATTER!


Thanks!

For more information, analyses and white papers about the project, visit us at http://voicesfromtheblogs.com

On Twitter: @blogsvoices
