Concepts More Messy Correlation Datafication Value Implications Break Risks Control Assignment Conclusion References
CS-495/595 Big Data processing concepts (part 2)
Lecture #2
Dr. Chuck Cartledge
28 Jan. 2015
Table of contents
1 Concepts
2 More
3 Messy
4 Correlation
5 Datafication
6 Value
7 Implications
8 Break
9 Risks
10 Control
11 Assignment
12 Conclusion
13 References
Another view of Big Data
A nexus
Things that can only be done at large scale (can’t do them at small scale)
Extract new insights
Create new forms of value
Change relationships between markets, organizations, people, and governments
Coles vs. Woolworths in Australia: an analytics firm changed everything [10].
Another view of Big Data
A nexus
Data munging, data wrangling
Cleansing, normalization
Various sources
Various formats
Various qualities
Another view of Big Data
Programming
Amdahl’s law
Partition: C/C++ fork(), Java thread(), C# fork(), MPI_Init(), MPI_Comm_size(), MPI . . .
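Amdahl's law can be made concrete with a few lines of arithmetic. The sketch below is a minimal Python illustration and is not part of the original slides; the function name `amdahl_speedup` and the 95% parallel fraction are assumptions chosen for the example. If a fraction p of the work can be partitioned across N workers, the theoretical speedup is 1 / ((1 - p) + p / N), which can never exceed 1 / (1 - p).

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Theoretical speedup when `parallel_fraction` of the work is split over `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Assumed workload: 95% of the run time can be partitioned.
    for n in (1, 2, 8, 64, 1024):
        print(f"{n:5d} workers -> speedup {amdahl_speedup(0.95, n):6.2f}")
```

Under this assumption, adding workers helps less and less: the serial 5% caps the speedup below 20 regardless of cluster size.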
Another view of Big Data
Distillation of data into information
Intended user
Data visualization
Intended message
Medium
Edward Tufte
Figure: Napoleon’s march into and out of Russia. Charles Joseph Minard (1781–1870) [13]
Another view of Big Data
A Good Read
“Big Data,” by Mayer-Schonberger and Cukier [5]
National bestseller
Popular non-fiction
Easy read
No pictures
No math
Just prose
The basis of today’s lecture. A good book to refer to; not a required text.
Another view of Big Data
Rituals. We all have them.
How many samples is enough?
n=1,000 vs. n=all
Classical (frequentism)
Bayesian
Difference between conventional digital cameras and a Lytro camera
1 Focusing on one plane, vs.
2 Capturing it all and post-processing
Collect all the data and see where it takes you.
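The classical case for n=1,000 can be sketched numerically. The snippet below is an illustration only, assuming a simple random sample, the normal approximation for a proportion, and a 95% confidence level (z = 1.96); the function name `margin_of_error` is made up for this example. It also shows why a fixed sample limits drill-down: the margin widens quickly once only a small subgroup of the sample is examined.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate margin of error for a proportion p estimated from n random samples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return z * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Whole sample versus ever-smaller subgroups inside the same sample.
    for n in (1000, 100, 25):
        print(f"n = {n:5d}: margin of error is about +/- {100 * margin_of_error(n):.1f} points")
```

With n=all there is no sampling error to widen, so small subgroups can still be examined.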
How many samples is enough?
Albert-Laszlo Barabasi and cell phones
Looking at how cell phones connect to each other
n = one-fifth of a European country’s population over a 4-month period
From a graph theory perspective, the connections created a “small world” graph
Image from [7]
Unexpected insights found by following the data.
Precision vs. accuracy
Precision – closeness of the measurements to each other
Accuracy – closeness to the actual underlying value
Small datasets have to be both accurate and precise. Large datasets are more accurate because they contain all of life’s messiness.
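The two terms can be told apart numerically. The sketch below is a made-up illustration (the readings, the true value of 100.0, and the helper names are all assumptions): accuracy is measured as the distance of the mean from the true value, and precision as the spread of the readings around their own mean.

```python
import statistics


def accuracy_error(measurements, true_value):
    """Distance of the mean measurement from the true value (smaller = more accurate)."""
    return abs(statistics.mean(measurements) - true_value)


def precision_spread(measurements):
    """Sample standard deviation of the measurements (smaller = more precise)."""
    return statistics.stdev(measurements)


TRUE_VALUE = 100.0  # assumed ground truth for the illustration
precise_but_inaccurate = [104.9, 105.0, 105.1, 105.0]  # tightly grouped, but off target
accurate_but_imprecise = [92.0, 108.0, 97.0, 103.0]    # scattered, but centered on target

if __name__ == "__main__":
    for name, data in (("precise but inaccurate", precise_but_inaccurate),
                       ("accurate but imprecise", accurate_but_imprecise)):
        print(f"{name}: error of mean = {accuracy_error(data, TRUE_VALUE):.2f}, "
              f"spread = {precision_spread(data):.2f}")
```

The second list is the "messy" case: each reading is rough, yet the aggregate lands on the underlying value.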
Precision vs. accuracy
Natural language processing
Brown Corpus – 1960 (≈ 1,000,000 curated words)
Google words – 2006 (≈ 1,000,000,000,000 uncurated words)
“Data in the wild.” [3]
“. . . simple models and a lot of data trump more elaborate models based on less data.” [3]
Precision vs. accuracy
Neat and tidy vs. messy
SQL databases (only 5% of data is structured) – neat and tidy
NoSQL databases (data modeled other than in a table) – messy
1 Graph
2 Column
3 Document
4 Persistent key-value pair
5 Volatile key-value pair
The world does not live by SQL databases alone. There are some data that are more efficiently modeled by things other than tables.
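The data models listed above can be compared side by side using plain Python structures. This is only a sketch: no real database is involved, and the customer record, key names, and the `neighbors` helper are invented for the illustration.

```python
# Relational (SQL-style): fixed columns, one row per entity.
customer_columns = ("id", "name", "city")
customer_row = (42, "Ada", "Norfolk")

# Document: self-describing and nested; fields may vary from record to record.
customer_document = {
    "id": 42,
    "name": "Ada",
    "address": {"city": "Norfolk"},
    "orders": [{"sku": "A-1", "qty": 2}],
}

# Key-value: an opaque value looked up by a single key.
key_value_store = {"customer:42": '{"name": "Ada", "city": "Norfolk"}'}

# Column (wide-column): each row key maps to its own sparse set of columns.
column_store = {"42": {"profile:name": "Ada", "profile:city": "Norfolk"}}

# Graph: nodes plus labeled edges; relationships are first-class data.
graph_edges = [
    ("customer:42", "LIVES_IN", "city:Norfolk"),
    ("customer:42", "BOUGHT", "sku:A-1"),
]


def neighbors(edges, node):
    """Return (label, target) pairs for every edge leaving `node`."""
    return [(label, target) for source, label, target in edges if source == node]


if __name__ == "__main__":
    print(dict(zip(customer_columns, customer_row)))
    print(neighbors(graph_edges, "customer:42"))
```

The same fact (a customer in a city) appears in every form; what changes is which questions are cheap to ask, such as following relationships in the graph form or reading a whole nested record in the document form.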
Predictive analytics
Correlations show what, not why.
How New York City got a handle on exploding manhole covers.
94,000 miles of underground cable
5% laid before 1930
51,000 manhole covers and service boxes
Records from 1880s (messy, messy data)
Look for correlations – 106 found
Trained with data up to 2009
Tested with data from 2009 – the top 10% of manholes accounted for 44% of problems
Primary indicators – (1) age of the cables, (2) previous problems with that manhole [9].
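The "10% of manholes accounted for 44% of problems" style of evaluation can be sketched as follows. Everything here is hypothetical: the ten toy manholes, the hand-made score (cable age plus ten times the number of prior problems), and the function name `capture_rate`. It is not the model from [9], only an illustration of ranking by a risk score and checking how many later incidents fall in the top-ranked slice.

```python
def capture_rate(scores, incidents, top_fraction=0.10):
    """Share of observed incidents that fall in the top-scored fraction of items.

    scores:    dict of item id -> predicted risk (higher = riskier)
    incidents: set of item ids that actually had a problem in the test period
    """
    if not incidents:
        raise ValueError("need at least one incident to evaluate")
    ranked = sorted(scores, key=scores.get, reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))
    flagged = set(ranked[:cutoff])
    return len(flagged & incidents) / len(incidents)


# Toy data (entirely made up): manhole id -> (cable age in years, prior problems).
toy = {
    "M01": (80, 3), "M02": (75, 2), "M03": (10, 0), "M04": (15, 0), "M05": (60, 1),
    "M06": (5, 0), "M07": (20, 0), "M08": (30, 0), "M09": (12, 0), "M10": (8, 0),
}
# Assumed scoring rule built from the two primary indicators named on the slide.
scores = {manhole: age + 10 * prior for manhole, (age, prior) in toy.items()}
incidents_next_year = {"M01", "M05", "M08"}  # hypothetical held-out outcomes

if __name__ == "__main__":
    for fraction in (0.10, 0.50):
        rate = capture_rate(scores, incidents_next_year, fraction)
        print(f"top {fraction:.0%} of manholes capture {rate:.0%} of incidents")
```

The score is built from correlations alone: it says which manholes to inspect first, not why they fail.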
Predictive analytics
Is this the end of theory as we know it?
“There is now a better way. Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.” [1]
Figure: The Iowa agriculture landscape: Green areas are more productive for soy, corn, and wheat; red are least.
Correlation will tell us what is. Theories will tell us why.
Quantifying the world
When words become data.
Google starts to scan all the books it can (2004)
Scans are turned into words
Google scanned approx. 20 million books (15% of all books) by 2012
Words over time (https://books.google.com/ngrams)
New science of “culturomics” [6]
Undefined words
Some companies have been able to capitalize on the value ofwords, others haven’t.
Quantifying the world
When location becomes data.
A few lessons from UPS [12].
Fitted vehicles with sensors, GPS, and wireless modules
Applied analytics to the data
Reduced drivers’ routes by 30,000,000 miles
Reduced fuel by 3,000,000 gallons
Reduced carbon-dioxide emissions by 30,000 metric tons
Gave preference to right-hand turns over turns that cross traffic
Improved efficiency and safety
Application of location data and graph theory to compute shortest paths.
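A shortest-path computation with a turn preference can be sketched with Dijkstra's algorithm. This is not UPS's method, which the slides do not describe; the road network, the `crossing_penalty` parameter, and the simplification of tagging each road segment as "entered by a crossing turn" are assumptions made for the illustration.

```python
import heapq


def shortest_path(graph, start, goal, crossing_penalty=0.0):
    """Dijkstra's algorithm.

    graph: dict of node -> list of (neighbor, miles, is_crossing_turn)
    Returns (path, cost); path is None if the goal is unreachable.
    """
    best = {start: 0.0}
    previous = {}
    queue = [(0.0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for neighbor, miles, is_crossing in graph.get(node, []):
            new_cost = cost + miles + (crossing_penalty if is_crossing else 0.0)
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if goal not in best:
        return None, float("inf")
    path = [goal]
    while path[-1] != start:
        path.append(previous[path[-1]])
    return path[::-1], best[goal]


# Hypothetical road network: the route via A is shorter but starts with a crossing turn.
roads = {
    "depot": [("A", 1.0, True), ("B", 1.5, False)],
    "A": [("customer", 1.0, False)],
    "B": [("customer", 1.0, False)],
}

if __name__ == "__main__":
    print(shortest_path(roads, "depot", "customer"))                        # distance only
    print(shortest_path(roads, "depot", "customer", crossing_penalty=1.0))  # avoid crossing turns
```

Treating a crossing turn as extra cost lets the same algorithm trade a slightly longer route for fewer risky or slow turns.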
Luis von Ahn and killing spambots
Wanted to thwart spambots, but allow humans
Something that was HARD for computers, but easy for humans
At 22 came up with the idea for the Completely Automated Public Turing Test to Tell Computers and Humans Apart (CAPTCHA)
At 27, as a PhD, awarded the MacArthur Foundation “genius” award
Designed reCAPTCHA – crowd-sourced character recognition
Minimal human cost, aggregated to very large worth ($750,000,000 per year)
One man’s trash is another man’s treasure.
The “option value” of data.
Information generated for one purpose can be reused. Data moves from primary to secondary uses.
Reuse of search terms – predictive search hints
Reuse of 1890s electrical cable information in NYC
Ways to unleash data’s value:
1 Basic reuse
2 Merging datasets
3 Finding “twofers”
Figure: Image from http://www.cognizant.com/latest-thinking/perspectives/dealing-with-big-data
The ultimate value of data is what one can gain from all the possible ways it can be employed.
The value of open data.
Big companies have the resources and need to address big data. Government is the biggest company, and it can compel others to give it data.
US Government – President Obama directed government to provide data – http://www.data.gov
UK Open Data Initiative – http://data.gov.uk/
European Union – EU Open Data Portal – https://open-data.europa.eu/
Other governmental levels as well
With government acting as the data broker, others can act as visionaries and scientists to provide services, provide products, and create wealth.
What happens when we start to use the crystal ball?
Too few data scientists
We need more of them now [4]
Data is widely available and vitally important
Rise of data brokers, data visionaries, data scientists
Figure: Image from http://blog.sqlauthority.com/2013/10/25/big-data-how-to-become-a-data-scientist-and-learn-data-science-day-19-of-21/
Demise of the “expert” because we let the data lead us. Contradiction.
What happens when we start to use the crystal ball?
Unexpected places where Big Data is used:
Baseball – The book Moneyball
Online education – Coursera reviews where students replay the most
Neonatal care – predictions based on what worked in the past
Ship tracking – Marine Traffic website
Car tracking – Inrix geo-location of 100 million cars in the US
We are awash in data. People are using it for all sorts of things.
Break time.
Take about 10 minutes.
Big Brother(s) is watching you.
Each of us has a penumbra of data.
Memphis, TN – Blue Crime Reduction Utilizing Statistical History (CRUSH) [8]
Richmond, VA – correlates types of arrests with when and where [11]
DHS – Future Attribute Screening Technology (FAST)/Passive Methods for Precision Behavioral Screening “. . . develop physiological and behavioral screening technologies . . . ”
“I’m placing you under arrest for the future murder of Sarah Marks, that was to take place today . . . ” [2]
Big Brother(s) is watching you.
The dark side of Big Data.
“. . . big data allows for more surveillance of our lives while it makes some of the legal means for protecting privacy largely obsolete. It also renders ineffective the core technical method of preserving anonymity. . . . ” [5]
Big data is powerful. It is seductive. “. . . the possession of great power necessarily implies great responsibility.”
. . . what’s past is prologue . . . – Shakespeare, “The Tempest”
What is past?
Gutenberg is credited with creating movable type
Created an explosion of books and literacy
Gave rise to copyright and protections
Big Data is an explosion
Need to create protections
America Online (AOL) released “anonymized” search logs
Data available 4 Aug 2006, removed 7 Aug 2006
Big Data techniques (not by that name) identified users
A single thread is weak. A rope made from many threads is strong.
. . . what’s past is prologue . . . – Shakespeare, “The Tempest”
Protections
The internet “never forgets”
Interest in the “right to forget”
How to mandate forgetfulness?
How to ensure forgetfulness?
People evolve and change. They should be judged by what they do, not by what their past predicts they will do.
A “Hello World” level problem.
With a little license.
A simply stated problem: Count the number of unique words in Shakespeare’s Macbeth.
A few Java classes
A Hadoop environment
Process strings from a file
Summarize the results
Grad students have a little more to do.
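The shape of the problem can be previewed locally before moving to Hadoop. The sketch below is a single-process Python illustration of the map and reduce steps; it is not the Java/Hadoop solution the assignment asks for. The tokenization rule (lowercase runs of letters and apostrophes) and the function names are assumptions, and the two sample lines stand in for the full text of the play.

```python
import re
from collections import Counter
from itertools import chain

# Assumed tokenization: lowercase runs of letters and apostrophes count as words.
WORD = re.compile(r"[a-z']+")


def map_line(line):
    """Map step: emit (word, 1) for every word in one line of text."""
    return [(word, 1) for word in WORD.findall(line.lower())]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


def unique_word_count(lines):
    """Return (number of distinct words, per-word totals) for an iterable of lines."""
    totals = reduce_counts(chain.from_iterable(map_line(line) for line in lines))
    return len(totals), totals


sample = [
    "Tomorrow, and tomorrow, and tomorrow,",
    "Creeps in this petty pace from day to day,",
]

if __name__ == "__main__":
    distinct, totals = unique_word_count(sample)
    print(distinct, totals.most_common(3))
    # For the whole play, pass an open file instead of `sample`
    # (the file name "macbeth.txt" is hypothetical):
    #     with open("macbeth.txt", encoding="utf-8") as play:
    #         print(unique_word_count(play)[0])
```

In a Hadoop job the map step runs on many file splits in parallel and the framework groups the emitted pairs by word before the reduce step; the answer to the assignment is then the number of keys the reducers emit.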
What have we covered?
“Big data is about what, not why. We don’t always need to know the cause of a phenomenon; rather, we can let the data speak for itself.”
“Yet the most important reason for the program’s success was that it dispensed with reliance on causation in favor of correlation.”
Let the data guide you, rather than “knowing” what the answer is and finding data to support your assumptions.
Next lecture: Hadoop book, Chapters 1, 2, and 5
References I
[1] Chris Anderson, The end of theory, Wired magazine 16 (2008), no. 7, 16–07.
[2] Steven Spielberg (director), Minority Report, Twentieth Century Fox Film Corporation, 2002.
[3] Alon Halevy, Peter Norvig, and Fernando Pereira, The unreasonable effectiveness of data, Intelligent Systems, IEEE 24 (2009), no. 2, 8–12.
[4] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela H Byers, Big data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute (2011).
References II
[5] Viktor Mayer-Schonberger and Kenneth Cukier, Big data: A revolution that will transform how we live, work, and think, Houghton Mifflin Harcourt, 2013.
[6] Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K Gray, Joseph P Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, et al., Quantitative analysis of culture using millions of digitized books, Science 331 (2011), no. 6014, 176–182.
[7] J-P Onnela, Jari Saramaki, Jorkki Hyvonen, Gyorgy Szabo, David Lazer, Kimmo Kaski, Janos Kertesz, and A-L Barabasi, Structure and tie strengths in mobile communication networks, Proceedings of the National Academy of Sciences 104 (2007), no. 18, 7332–7336.
References III
[8] Chris Peck, Blue crush controversy, http://www.commercialappeal.com/opinion/blue-crush-controversy, 2013.
[9] Cynthia Rudin, David Waltz, Roger N Anderson, Albert Boulanger, Ansaf Salleb-Aouissi, Maggie Chow, Haimonti Dutta, Philip N Gross, Bert Huang, Steve Ierome, et al., Machine learning for the New York City power grid, Pattern Analysis and Machine Intelligence, IEEE Transactions on 34 (2012), no. 2, 328–345.
[10] Mercedes Ruehl, Coles, Woolies and the big data arms race, http://www.brw.com.au/p/tech-gadgets/coles_woolies_and_the_big_data_arms_4I2P2oieDKZGdev5aY778H, 2013.
References IV
[11] Informs Staff, Police department wins Gartner’s 2007 BI excellence award, http://www.informationweek.com/software/information-management/police-department-wins-gartners-2007-bi-excellence-award/d/d-id/1053178?, 2007.
[12] Informs Staff, UPS wins Gartner BI excellence award, https://www.informs.org/Announcements/UPS-wins-Gartner-BI-Excellence-Award, 2011.
[13] Wikipedia, Charles Joseph Minard - Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Charles_Joseph_Minard, 2014.