
Randomness


Page 1: Randomness

Data randomness, variation, coincidences, populations and estimation and the use and abuse of statistics

www.linkedin.com/in/sureshsood

http://www.slideshare.net/ssood/randomness-28944785

Page 2: Randomness
Page 3: Randomness

http://datafication.com.au/instagram/

Page 4: Randomness

September 9/11 Coincidences

• 911 is the emergency number

• The twin towers looked like the number 11 so perhaps all 9/11 things relate to 11

• 9 + 1 + 1 = 11, the first flight to hit the twin towers was flight 11

• There were 92 people on board flight 11, 9 + 2 = 11

• September 11 is the 254th day of the year (2 + 5 + 4 = 11 and 365 – 254 = 111)

• 11 letters each in “New York City”, “Afghanistan”, “the Pentagon”, and “George W. Bush”

• New York was the 11th state admitted to the union

• 119 (1 + 1 + 9 = 11) used to be the area code to both Iraq and Iran

• Flight 77, which crashed into the Pentagon, had 65 people on board, 6 + 5 = 11

• March 11 (2004) attack in Spain. There are exactly 911 days between this and the September 11 (2001) attack.
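
Claims like these are cheap to check, which is part of the point about coincidences. The following is a minimal sketch using only Python's standard library; the helper name `digit_sum` is illustrative and not from the slides. Note that the size of the Madrid gap depends on whether the two end dates themselves are counted.

```python
from datetime import date

def digit_sum(n: int) -> int:
    """Add up the decimal digits of a non-negative integer."""
    return sum(int(ch) for ch in str(n))

sept_11 = date(2001, 9, 11)
madrid = date(2004, 3, 11)

# Position of September 11 within a non-leap year, and days remaining after it.
day_of_year = sept_11.timetuple().tm_yday
days_left = (date(2001, 12, 31) - sept_11).days

# Calendar difference between the two attacks, and the number of
# whole days lying strictly between the two dates.
gap = (madrid - sept_11).days
days_strictly_between = gap - 1

print(day_of_year, digit_sum(day_of_year), days_left)
print(gap, days_strictly_between)
```

With enough freedom in how numbers are added, subtracted or counted, some arrangement that yields 11 or 911 is almost always available.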

Page 5: Randomness

The Strange Coincidence of the Girl from Petrovka

http://listverse.com/2007/11/12/top-15-amazing-coincidences/

Page 6: Randomness

Key Research Finding - Serendipity is essential

"An executive from the association addressed my class for volunteers"

"I knew before coming to university that I wanted to do something different, I wanted to take advantage of all opportunities"

"I randomly received an email for the Young Australian Entrepreneurship competition event and decided to enter it"

"You want to be lucky, not right"

"You make your luck happen"

"I found the idea of my business while doing an assignment"

(Sood and Marchand 2012)

Page 7: Randomness

High Calibre Analytics Graduates

Page 8: Randomness

Data Scientist Job Roles (LinkedIn, 16 September 2012)

Notes: Word count shown next to each word

Exclusion words: ability area bay com experience francisco job linkedin preferred san
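
A word cloud with counts like this one can be approximated in a few lines. This is a minimal sketch, not the tool used for the slide: the exclusion list is copied from the note above, while `sample_ad`, the `min_length` cut-off and the tokenising rule are assumptions made for illustration.

```python
import re
from collections import Counter

# Exclusion words as listed on the slide.
EXCLUDED = {
    "ability", "area", "bay", "com", "experience",
    "francisco", "job", "linkedin", "preferred", "san",
}

def word_counts(text, excluded=EXCLUDED, min_length=3):
    """Count lower-cased alphabetic words, skipping short and excluded ones."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(
        w for w in words if len(w) >= min_length and w not in excluded
    )

# Hypothetical job-ad text, invented for this example.
sample_ad = (
    "Data scientist job in San Francisco: machine learning experience "
    "preferred, strong data analysis and data visualisation skills."
)

counts = word_counts(sample_ad)
for word, n in counts.most_common(5):
    print(word, n)
```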

Page 9: Randomness

Statistics as Hypothesis Driven Process

Adapted from and source: paul.kennedy@uts.edu.au

Page 10: Randomness

Leading Questions: Yes Prime Minister (video: bit.ly/yes_stats)

The risk with data mining is the discovery of meaningless patterns: given enough data and time, you can support almost anything

Sir Humphrey Appleby demonstrates the use of leading questions to skew an opinion survey to either support or oppose National Service (Military Conscription)

Taken from Yes Prime Minister, Season 1, Episode 2, The Ministerial Broadcast.

Yes Prime Minister is a British political satire/comedy that aired in the 1980s

Page 11: Randomness

http://www.kdnuggets.com/2013/07/kdnuggets-cartoon-nsa-cat-videos-ufo-reports-pizza-connection.html

Page 12: Randomness
Page 13: Randomness

ATLAS: The observed (full line) and expected (dashed line) 95% CL combined upper limits on the SM Higgs boson production cross section divided by the Standard Model expectation as a function of mH in the full mass range considered in this analysis (a) and in the low mass range (b). The dashed curves show the median expected limit in the absence of a signal and the green and yellow bands indicate the corresponding 68% and 95% intervals.

Page 14: Randomness

Particle physics has an accepted definition for a “discovery”: a five-sigma level of certainty

The number of standard deviations, or sigmas, is a measure of how unlikely it is that an experimental result is simply down to chance rather than a real effect

Similarly, tossing a coin and getting a number of heads in a row may just be chance, rather than a sign of a “loaded” coin

The “three sigma” level represents about the same likelihood of tossing more than eight heads in a row

Five sigma, on the other hand, would correspond to tossing more than 20 in a row

About 68% of the data lie within one standard deviation of the center (~1 in 3 fall outside)

About 95.5% of the data lie within two standard deviations (~1 in 22 fall outside)

About 99.7% lie within three standard deviations (~1 in 370 fall outside)

Four standard deviation events occur 1 in 15,787 times

Five standard deviation events occur 1 in every 1,744,278 times

So a five sigma effect, which two experiments now have, means that such a thing would be observed by chance with a probability of 1/1,744,278 = 5.7 × 10^-7.

This is so unlikely that this is the criterion for accepting an effect as real in particle physics, when it is corroborated by another experiment as in this case.
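
The "1 in N" figures above can be reproduced from the normal distribution. The sketch below assumes a two-sided tail (deviations in either direction count), which is the convention that matches the numbers quoted on this slide; the function name `two_sided_tail` is illustrative. It also converts each probability into an equivalent run of heads from a fair coin, using the fact that n heads in a row has probability 2^-n.

```python
import math

def two_sided_tail(k: float) -> float:
    """Probability that a normal variable lies more than k standard
    deviations from its mean, in either direction."""
    return math.erfc(k / math.sqrt(2))

for k in range(1, 6):
    p = two_sided_tail(k)
    heads = math.log2(1 / p)  # run length of heads with the same probability
    print(
        f"{k} sigma: {1 - p:.4%} inside, "
        f"1 in {1 / p:,.0f} outside, "
        f"about {heads:.1f} heads in a row"
    )
```

Particle physicists often quote the one-sided tail instead, which halves these probabilities; the choice is a convention and should be stated alongside any sigma value.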

5σ Statistics of the Higgs Boson

Page 15: Randomness

24104 Emerging Marketing Issues and Social Media

Assessment item 1: Project (Group)

Objective(s): This addresses Subject Learning Objective/s 1-4

Weighting: 30%

Due: The group report is due by start of lecture in Week 14.

Length: The final report must be long enough to document:

1. The acquisition of the social data and supporting process

2. Visualisation of the network data and key measures

3. Description of models built from social data

4. Conclusion highlighting any useful insights

Page 16: Randomness

24104 Emerging Marketing Issues and Social Media

Task: Groups of students (4-5) participate in a practical project to data mine social media data.

Completion of this task requires the group to provide a report documenting its experience in acquiring and exploring the social data using visualisation, setting up the data mining environment, describing the findings from the models built from the data, and concluding insights. The data is mined in 2 stages:

1. Visualise a social network of data freely available to the group, e.g. LinkedIn, Twitter, YouTube, Facebook, email, Flickr. Identify and describe key network measures

2. Mine the data to build models from the social data

This project uses the sophisticated REVOLUTION R ENTERPRISE software as a platform for data mining. The software is free for academic use. The Rattle (R Analytical Tool To Learn Easily) package provides a graphical user interface specifically for data mining using R and removes the need for heavy programming.

The following resources help to bootstrap the project and amuse the group project members:

Kaggle – Kaggle.com/competitions

AnalyticBridge: a social network for analytics professionals - analyticbridge.com

The R Inferno “If you are using R and you think you’re in hell, this is a map for you”

http://www.burns-stat.com/documents/books/the-r-inferno/

Furnas, Alexander (2012) Everything You Wanted to Know About Data Mining but Were Afraid to Ask, The Atlantic, 3 April

http://www.theatlantic.com/technology/archive/2012/04/everything-you-wanted-to-know-about-data-mining-but-were-afraid-to-ask/255388/

Page 17: Randomness

How? Train of Thought Analysis

• A bottom-up approach

• Perceptual process of discovery to uncover structure

• Distinguish patterns, structure, relationships and anomalies

• Reveals indirect links

• Knowledge is colour coded

• Marketing Analyst can spot irregularities

• Not sure why but where does this lead

• Harnesses the power of the human mind

Data Information Knowledge

Page 18: Randomness

How to Find a Killer using Visualisation

• In the 1990s Ivan Milat killed 7 backpackers, making him Australia's most notorious serial killer

• Everyone in Australia was a suspect

• Enormous volumes of data from multiple sources

– RTA vehicle records

– Gym memberships

– Gun licensing records

– Internal police records

• Police applied visualisation techniques (NetMap) to the data

• Reduced the suspect list from 18 million to 230

• Further analysis with the use of additional information reduced this to 32

Page 19: Randomness

Key Network Measures

• Degree Centrality

• Betweenness Centrality

• Closeness Centrality

• Eigenvector Centrality

krackkite.##h (modified labels)

Connector (hub)

Diana’s Clique

Broker

Boundary spanners

Contractor ? Vendor
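
Three of the four measures can be computed by hand on a small graph. The sketch below is an illustration in plain Python, not the NodeXL or UCINET calculation: it builds the ten-node Krackhardt kite using its commonly published node names (an assumption, since the slide uses modified labels) and computes degree, closeness, and unnormalised betweenness (via Brandes' algorithm). Eigenvector centrality is left out to keep the example short.

```python
from collections import deque

NAMES = ["Andre", "Beverly", "Carol", "Diane", "Ed",
         "Fernando", "Garth", "Heather", "Ike", "Jane"]
EDGES = [(0, 1), (0, 2), (0, 3), (0, 5), (1, 3), (1, 4), (1, 6), (2, 3),
         (2, 5), (3, 4), (3, 5), (3, 6), (4, 6), (5, 6), (5, 7), (6, 7),
         (7, 8), (8, 9)]

# Undirected adjacency lists keyed by node name.
adj = {name: [] for name in NAMES}
for a, b in EDGES:
    adj[NAMES[a]].append(NAMES[b])
    adj[NAMES[b]].append(NAMES[a])

def bfs_distances(source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def betweenness_all():
    """Brandes' algorithm: shortest paths passing through each node."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both ends.
    return {v: b / 2 for v, b in score.items()}

degree = {v: len(neighbours) for v, neighbours in adj.items()}
closeness = {v: (len(adj) - 1) / sum(bfs_distances(v).values()) for v in adj}
betweenness = betweenness_all()

for v in NAMES:
    print(f"{v:9s} degree={degree[v]} closeness={closeness[v]:.3f} "
          f"betweenness={betweenness[v]:.2f}")
```

The kite is a useful teaching graph because the three measures disagree about who is most central: the hub, the best-placed node and the broker are different people.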

Page 20: Randomness

NodeXL - Excel 2007/10/13 workbook template for viewing and analyzing network graphs

http://nodexl.codeplex.com/releases/view/108288

Page 21: Randomness

Import ego, fan page and group networks from Facebook using Social Network Importer for NodeXL

http://socialnetimporter.codeplex.com/

Page 22: Randomness

Aquarius, Aries, Cancer, Capricorn, Gemini, Leo, Libra, Pisces, Sagittarius, Scorpio, Taurus, Virgo

Ambivalent, Employee, Opposer, Reporter, Supporter

11. Committed Partnerships, 12. Compartmentalised Friendship, 13. Childhood friendship, 14. Courtship, 15. Fling, 16. Secret-Affair, 17. Enslavement, 2. Marriages of Convenience, 3. Best Friendships, 4. Kinships, 5. Rebounds/Avoidance-Driven, 6. Courtships, 7. Dependencies, 8. Enmities, 9. Love-Hate (Sweeney and Chew)

Africa, Argentina, Australia, Australia/Hong Kong, Austria, California, Canada, China, Egypt, England, Finland, France, Germany, Guernsey, Holland, India, Indonesia, Ireland, Israel, Italy, Japan, Kuwait, Malaysia, Nepal, Paraguay, Philippines, Phillipines, Portugual, Saudi Arabia, Singapore, South Africa, Spain, Sweden, Taiwan, Thailand, UK, USA

A&F, Beijing, Gucci, LVMH, New York, Old Navy, Paris, Sydney, Tiffany, Tokyo, Tommy, Versace

An-Verb,An-Vis,Hol-Verb,Hol-Vis

Depriv/Enhance,Enhance/Depriv

Page 23: Randomness

Page 24: Randomness

Model Comparison By Variables/Predictors

Page 25: Randomness

Elaboration of Trip to Paris Blog Story (Means-End & Heider)

Woodside, Sood & Miller (2008), When Consumers and Brands Talk, Psychology & Marketing

1. Gayle

2. Paige

3. Paris

4. "The occasion was my cousin Paige’s 16th"

5. "I am a Canadian and get by in French."

6. "All I can say is WOW! We rented a 2 bedroom, 1 ½ bath apartment (two showers), "Merlot" from ParisPerfect http://www.parisperfect.com/ and boy was it ever perfect!"

7. "We had a full view of the Eiffel from our charming little terrace. ....We were within walking distance to two metro stops (Pont d'Alma or Ecole Militaire)"

8. "We were walkable to many good bistros, cafes and bakeries and only a few blocks from the wonderful market street Rue Cler."

9. "I bought a Paris Pratique pocket-sized book at a Metro station. This handy guide has detailed maps of each arrondisement, as well as the metro lines, the bus lines, the RER and the SCNF (trains). I'll never be without this again."

10. "Six months before our trip, I gave Paige a couple of good guide books on Paris and suggested she let me know what her interests were since after all, this was to be her trip."

11. Sites

• The Marais

• Notre Dame

• L'Arc de Triomphe - 248 steps up and 248 steps down...

• Champs Elysee

• Jacquemart Museum

• Louvre Lite

• Musee D'Orsay

• Les Invalides, Napoleon's Tomb and the Napoleon Museum

• Sacre Coeur

• Monmartre

• Rodin Museum

• Pompidou Museum

• Train to Vernon, bike to Giverny with Fat Tire Bike Tours http://www.fattirebiketoursparis.com/

• Eiffel Tower

12. Unforgettable Memories: "This trip had so many memories, but here are a few choice highlights........On our very first night, knowing that the Eiffel Tower light show started at 10:00 p.m.... she [Paige] dropped her camera…down 6 flights…we were stunned…Spanish family below standing below [with pieces of the camera]"

13. "The father stretched out his cupped hands which held all of the pieces they were able to recover, including the memory stick and he very solemnly said, "El muerto..."."

14. "They had decide to come to Paris to find the Harley Davidson store so they could buy Harley Paris t-shirts."

15. "Michael Osman is an American artists living in Paris." "He supplements his income by being a tour guide." "I found out about him on Fodors" "So I engaged Michael for two days."

16. "On our trip to Giverny, we met a young woman from Brisbane, Australia who was traveling on her own and we invited her to join us. Three of us enjoyed delicious and innovative soufflés, while Paige had the rack of lamb. We shared two dessert soufflés, one chocolate and the other cherry/almond. Yum"

17. "I wanted Paige to get a feel for shopping experiences that she would not have at home (aka the ubiquitous mall)."

18. "We went on Fat Tire's day trip to Monet's gardens and house in Giverny, about an hour outside Paris."

19. ....."I know Paige will treasure the memory of this girl's trip for many years to come."

Page 26: Randomness

Tag Cloud of Paige’s Story About Travel to Paris

Created from Daniel Steinbock’s TagCrowd under Creative Commons ©

Page 27: Randomness

Linguistic Inquiry and Word Count (LIWC) Text Analysis: The Psychological Power of Words

LIWC dimension                “I love Paris” (Paige’s Story)   Personal texts   Formal texts
Self-references (I, me, my)   6.12                             11.4             4.2
Social words                  10.55                            9.5              8.0
Positive emotions             3.04                             2.7              2.6
Negative emotions             0.54                             2.6              1.6
Overall cognitive words       4.12                             7.8              5.4
Articles (a, an, the)         7.74                             5.0              7.2
Big words (> 6 letters)       18.40                            13.1             19.6

Pennebaker, J. W., Francis, M. E. and Booth, R. J. (2001). Linguistic Inquiry and Word Count (LIWC): LIWC2001. Mahwah: Lawrence Erlbaum Associates.
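
Each figure in the table is the percentage of all words in a text that fall into a category. The sketch below illustrates that calculation for the three dimensions whose membership is spelled out in the table (self-references, articles, big words). It is not the LIWC software: the tokeniser is an assumption, and the remaining dimensions are omitted because they need the LIWC dictionaries. The sample sentence is item 10 of Paige's story.

```python
import re

# Category word lists taken from the row labels in the table above.
SELF_REFERENCES = {"i", "me", "my"}
ARTICLES = {"a", "an", "the"}

def category_rates(text):
    """Return each category's share of all words, as a percentage."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        raise ValueError("text contains no words")
    return {
        "self_references": 100 * sum(w in SELF_REFERENCES for w in words) / total,
        "articles": 100 * sum(w in ARTICLES for w in words) / total,
        "big_words": 100 * sum(len(w) > 6 for w in words) / total,
    }

sentence = (
    "Six months before our trip, I gave Paige a couple of good guide books "
    "on Paris and suggested she let me know what her interests were since "
    "after all, this was to be her trip."
)

rates = category_rates(sentence)
for name, pct in rates.items():
    print(f"{name}: {pct:.2f}%")
```

A single sentence gives noisy percentages; the table's values are computed over whole texts.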

Page 28: Randomness

Page 29: Randomness

Page 30: Randomness
Page 31: Randomness

Which Pattern is Random ?

http://www.wired.com/wiredscience/2012/12/what-does-randomness-look-like/

Page 32: Randomness

Ceiling of the Waitomo cave in New Zealand.

http://www.waitomo.com/SiteCollectionImages/glowworms/Waitomo-Glowworm-Caves-New-Zealand-boat-group.jpg

Page 33: Randomness

Which Pattern is Random ?

HTTHTTHTHHTTHTHTHTTHHTHTT

HTTHHHTTHTTHTHTHTHHTTHTTH

THTHTHTHHHTTHTHTHTHHTHTTT

HTHHTHTHTHTHHTTHTHTHTTHHT

THHHTHTTTTHTTHTTTHHTHTTHT

HHHTHTHHTHTTHHTTTTHTTTHTH

TTHHTTTTTTTTHTHHHHHTHTHTH

THTHTHHHHHTHHTTTTTHTTHHTH

http://www.wired.com/wiredscience/2012/12/what-does-randomness-look-like/
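
One way to probe sequences like these is to look at streaks. People inventing "random" tosses tend to switch between H and T too often and avoid long runs, whereas a fair coin switches on only about half of its tosses. The following is a minimal sketch; the two strings are rows copied from the slide, and the function names are illustrative.

```python
from itertools import groupby

def longest_run(tosses: str) -> int:
    """Length of the longest streak of identical outcomes."""
    return max(len(list(group)) for _, group in groupby(tosses))

def switches(tosses: str) -> int:
    """Number of positions where the outcome differs from the previous one."""
    return sum(a != b for a, b in zip(tosses, tosses[1:]))

rows = [
    "HTTHTTHTHHTTHTHTHTTHHTHTT",  # first row on the slide
    "TTHHTTTTTTTTHTHHHHHTHTHTH",  # seventh row on the slide
]

for row in rows:
    expected_switches = (len(row) - 1) / 2  # fair-coin expectation
    print(
        f"{row}: longest run {longest_run(row)}, "
        f"{switches(row)} switches (about {expected_switches:.0f} expected)"
    )
```

Neither statistic proves anything about a single short row, but across many rows a consistent shortage of streaks is a sign of human invention.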

Page 34: Randomness

http://www.bombsight.org/

Journal of the Institute of Actuaries 72 (1946) 0481

http://madvis.blogspot.com.au/2010/09/flying-bombs-on-london-summer-of-1944.html

Page 35: Randomness

Newcomb Discovery (1881)

• American mathematician/astronomer Simon Newcomb noticed that the first few pages of a logarithmic table, corresponding to the lower significant digits (typically those below 5), were comparatively dirtier than the later pages corresponding to the higher significant digits (typically those above 5)

• Newcomb attributed the greater wear to users looking up numbers that started with digit 1 more often than numbers starting with, say, digit 5

• This implies that the probability distribution of a user accessing any of the pages at any given time was skewed in favour of the earlier pages corresponding to the lower significant digits!

• This contrasts directly with the usual assumption that the probability of randomly picking any digit from one to nine is the same: 1/9, or roughly 11.11%

Page 36: Randomness

If the leading (first) digit is d, then the frequency of occurrence (probability) of the leading digit is

P(d) = log10(1 + 1/d)

Leading digit (d)           1     2     3     4     5     6     7     8     9
Probability of occurrence   30%   18%   12%   10%   8%    7%    6%    5%    < 5%

Number   Leading (first) digit
350      3
42057    4
0.64     6
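
The formula and both tables can be reproduced directly. This is a minimal sketch using the standard library only; the function names `benford_probability` and `leading_digit` are illustrative choices, and the digit extraction relies on Python's default text representation of numbers.

```python
import math

def benford_probability(d: int) -> float:
    """Benford's Law: probability that the leading digit equals d (1 to 9)."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)

def leading_digit(x) -> int:
    """First non-zero digit of a non-zero number, e.g. 0.64 -> 6."""
    for ch in repr(abs(x)):
        if ch in "123456789":
            return int(ch)
    raise ValueError("zero has no leading digit")

for d in range(1, 10):
    print(d, f"{benford_probability(d):.1%}")

for number in (350, 42057, 0.64):
    print(number, leading_digit(number))
```

The nine probabilities sum to exactly 1 because the terms log10((d + 1)/d) telescope to log10(10).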

Page 37: Randomness

Benford Stumbles Over Newcomb Finding

• In 1938, more than half a century after Newcomb, Frank Benford was going through a large collection of numerical data from disparate sources when he stumbled upon a similar finding

• Benford used a huge volume of data to empirically support his finding including areas of rivers, street addresses of “American Men of Science” and numbers appearing in front-page newspaper stories. He went on to publish his findings in a number of papers including the 1938 “The Law of Anomalous Numbers”. Thus the ‘principle’ came to be known as “Benford’s Law”

Page 38: Randomness

Benford Utility

• Human choices are not random, so invented numbers are unlikely to follow Benford’s Law

• Only works with naturally occurring numbers (those not assigned according to a particular numbering scheme)

• When people invent numbers, their digit patterns (which have been artificially added to a list of true numbers) will cause the data set to appear unnatural

– See Durtschi, Hillison and Pacini (2004), The Effective Use of Benford’s Law to Assist in Detecting Fraud in Accounting Data

• Does not work with lottery numbers!

• Formally proven in 1996

• Corpus of over 650 papers available at

– http://www.benfordonline.net/list/chronological

• Tools: Benford’s Law plug-in for Kirix Strata, the R package “BenfordTests”, or visualise in Tableau
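
A basic screening check compares observed first-digit counts with the Benford expectation using Pearson's chi-square statistic. The sketch below is an illustration only, not the method of any of the tools listed above. Both data sets are stand-ins chosen for this example: powers of 2 play the role of naturally growing figures, and the integers 100 to 999 play the role of numbers whose first digits are spread evenly. The threshold 15.507 is the 5% critical value of chi-square with 8 degrees of freedom.

```python
import math
from collections import Counter

CRITICAL_5_PERCENT = 15.507  # chi-square, 8 degrees of freedom

def first_digit(n: int) -> int:
    """Leading digit of a non-zero integer."""
    return int(str(abs(n))[0])

def benford_chi_square(values) -> float:
    """Pearson chi-square distance between observed first-digit counts
    and the counts Benford's Law would predict."""
    observed = Counter(first_digit(v) for v in values)
    n = sum(observed.values())
    statistic = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        statistic += (observed[d] - expected) ** 2 / expected
    return statistic

growing = [2 ** k for k in range(1, 301)]   # stand-in for natural data
evenly_spread = list(range(100, 1000))      # stand-in for invented data

for label, data in (("powers of 2", growing), ("100..999", evenly_spread)):
    stat = benford_chi_square(data)
    verdict = "consistent with" if stat < CRITICAL_5_PERCENT else "departs from"
    print(f"{label}: chi-square = {stat:.1f}, {verdict} Benford's Law")
```

A large statistic flags a data set for closer inspection; it is not evidence of fraud on its own, since numbering schemes and bounded ranges also break the law.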

Page 39: Randomness

Smartphone, Google Glass or Apple Watch will Know What you Want before you do

“…from 2014 your phone [glasses or watch] will anticipate your needs, do the research, tell you what you want to know – sometimes before the question even occurs to you…”

Chapman, Jake (2013), The Wired World in 2014

Page 40: Randomness

Useful References Informing our Thinking

There is a potential 93% average predictability in user mobility, an exceptionally high value rooted in the inherent regularity of human behavior. Yet it is not the 93% predictability that we find the most surprising. Rather, it is the lack of variability in predictability across the population.

Scellato et al. (2011), NextPlace: A Spatio-temporal Prediction Framework for Pervasive Systems. Proceedings of the 9th International Conference on Pervasive Computing (Pervasive'11)

Daily and weekly routines => Few significant places every day => Regularity in human activities => Regularity leads to predictability

Page 41: Randomness

De Domenico, M., Lima, A. and Musolesi, M. (2012) Interdependence and Predictability of Human Mobility and Social Interactions. Proceedings of the Nokia Mobile Data Challenge Workshop.

we have shown that it is possible to exploit the correlation between movement data and social interactions in order to improve the accuracy of forecasting of the future geographic position of a user. In particular, mobility correlation, measured by means of mutual information, and the presence of social ties can be used to improve movement forecasting by exploiting mobility data of friends. Moreover, this correlation can be used as indicator of potential existence of physical or distant social interactions and vice versa.
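
Mutual information measures how much knowing one person's location reduces uncertainty about another's. The sketch below computes it for two equally long sequences of discrete place labels. It is a plain empirical estimate written for illustration, not the estimator used in the paper, and the place labels and traces are invented.

```python
import math
from collections import Counter

def mutual_information(xs, ys) -> float:
    """Empirical mutual information, in bits, between two aligned
    sequences of discrete observations."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x, count_y = Counter(xs), Counter(ys)
    total = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        total += p_xy * math.log2(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return total

# Hypothetical hourly location traces for two friends.
alice = ["home", "work", "work", "gym", "home", "work", "work", "home"]
bob = ["home", "work", "work", "bar", "home", "work", "cafe", "home"]

print(f"{mutual_information(alice, bob):.3f} bits")
```

A value near zero means the two traces move independently; a value near the entropy of one trace means that trace is almost fully determined by the other.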

Sadilek, A. and Krumm, J. (2012) Far Out: Predicting Long-Term Human Mobility

Where are you going to be 285 days from now at 2pm? …we show that it is possible to predict location of a wide variety of hundreds of subjects even years into the future and with high accuracy.

Useful References Informing our Thinking

Page 42: Randomness

Caution!

“Children never put off till tomorrow what will keep them from going to bed tonight”

ADVERTISING AGE
