Mercer
Data Information Knowledge 3
Presentation originally from the University of Texas at Austin
Edits by Rick Mercer
Outline
• Review scope of big data
• Searching using Indexes
• Analyzing Data
Review: Big data is huge
• 50 petabytes of data = 25 trillion pages of text!
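As a rough check on the comparison above, dividing 50 petabytes by 25 trillion pages gives the page size the figure implies. This is only a sketch: it assumes decimal units (1 petabyte = 10^15 bytes), which the slide does not state.

```python
# Assumption: decimal units, 1 petabyte = 10**15 bytes.
PETABYTE = 10**15

total_bytes = 50 * PETABYTE
pages = 25 * 10**12  # 25 trillion pages

# Bytes of text per page implied by the two figures.
bytes_per_page = total_bytes // pages
print(bytes_per_page)
```

Under that assumption, each page works out to about 2 KB of text.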
Big data is increasing
[Chart: Big Data Created, by year, 2005 to 2013]
Big data is mostly unstructured
[Chart: Big Data Organization, unstructured data versus structured data]
Try to structure part of the web?
• Top-level domain names were attempted, but
– .edu .com .org .biz .ca .co.uk .nz .mx
• Open Directory Project
– Instead of applying a formula to search strings, this lists directories that you drill into
• Compare searches here and on Google for
– machine learning software
Google Search Formula
• Google uses programs (spiders) to index and explore the Web:
– visit webpages,
– gather all of the links on each page visited, and
– add them to their list of pages to visit in the future
• Google takes your words and examines its index for pages that have your words
– Applies 200 questions to determine result list
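The visit, gather, and add loop described above can be sketched as a breadth-first traversal. This is a minimal illustration, not Google's crawler: the toy web, the page names, and the function `crawl` are all hypothetical, and a real spider would fetch pages over the network.

```python
from collections import deque

# Hypothetical toy web: each page maps to the links found on it.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
}

def crawl(start, web):
    """Visit pages breadth-first, queueing each newly seen link."""
    to_visit = deque([start])  # list of pages to visit in the future
    seen = {start}             # pages already queued or visited
    visited_order = []
    while to_visit:
        page = to_visit.popleft()
        visited_order.append(page)      # visit the page
        for link in web.get(page, []):  # gather the links on it
            if link not in seen:        # queue only links not seen before
                seen.add(link)
                to_visit.append(link)
    return visited_order

print(crawl("a.example", TOY_WEB))
```

The `seen` set keeps the spider from revisiting a page when two pages link to each other.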
What is an index?
• An index organizes pairings of conceptual topics and their locations
• Google creates an index to look things up, much like the index in a book
• Why do searches use indexes? Efficiency!
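A book-style index for web pages is often built as an inverted index: each word maps to the pages that contain it. The sketch below assumes a tiny set of made-up pages; the names `PAGES`, `build_index`, and `search` are hypothetical and say nothing about how Google stores its index.

```python
from collections import defaultdict

# Hypothetical pages; a real index covers far more text.
PAGES = {
    "page1": "machine learning software",
    "page2": "learning to cook",
    "page3": "free software for machine tools",
}

def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

def search(index, query):
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

index = build_index(PAGES)
print(sorted(search(index, "machine software")))
```

Looking up a word is a single dictionary access, so a search does not have to reread every page.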
Why index?
• An index provides an easy way to find pertinent information related to a topic
• Building indexes is difficult, but indexes make it possible to get results in 0.5 seconds
– Google index is 100,000,000 gigabytes
– over one million computing hours to build it
• Play first 2 minutes of this video
Analyzing Data
Making sense of our world with numbers
Old school – the Scientific Method
• Graph shows the scientific method we are taught
– Investigate phenomena to acquire new knowledge
• Procedures vary
• Statistical Hypothesis Testing will be shown later, as an application of the scientific method
Analyzing Statistics
Three kinds of statistical analysis are commonly used by scientists, mathematicians, politicians, and other professionals across the globe.
Descriptive Statistics
1. Descriptive analytics provides information about collected data via statistics such as mean, median, mode, and range
– These tend to 'describe' circumstances, but do not offer conjectures about unknowns
– Example: the percentage of graduates employed within 6 months of graduating
– Application: Google’s indexing of the web
– Consider another site that describes data (recorded search history on any topic):
• http://www.google.com/trends/
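The statistics named above can be computed with Python's standard `statistics` module. The sample is made up for illustration (months each graduate took to find a job); only the kinds of statistics come from the slide.

```python
import statistics

# Hypothetical sample: months each graduate took to find a job.
months_to_job = [1, 2, 2, 3, 4, 6, 10]

mean = statistics.mean(months_to_job)
median = statistics.median(months_to_job)
mode = statistics.mode(months_to_job)
value_range = max(months_to_job) - min(months_to_job)

# Share employed within 6 months, as in the slide's example.
within_six = sum(1 for m in months_to_job if m <= 6) / len(months_to_job)

print(mean, median, mode, value_range, round(within_six, 2))
```

Each value describes the sample that was collected; none of them says anything about graduates outside it.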
Predictive Analytics
2. Predictive Analytics is the practice of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends
– Does not guarantee the future; a prediction may be wrong
– Example: Given that 90 of the 100 CS graduates were employed within 6 months in 2011, it is __ % likely that 108 of the 120 CS graduates in 2015 will be employed within 6 months
– Upcoming Application: ranking pages based on a search query
– Ron Burgundy 8-second clip
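One way to attach a number to the graduate example is a binomial model. This is a sketch under two assumptions the slide does not make: that the 2011 rate (90 of 100) carries over to 2015, and that each graduate's outcome is independent. It is not necessarily the answer the blank is meant to hold; `binomial_pmf` is a hypothetical helper.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Estimate the employment rate from the 2011 figures in the slide.
p_hat = 90 / 100

n_2015 = 120
target = 108

# Chance that exactly 108 of the 120 graduates are employed.
p_exactly = binomial_pmf(target, n_2015, p_hat)

# Chance that at least 108 of the 120 graduates are employed.
p_at_least = sum(
    binomial_pmf(k, n_2015, p_hat) for k in range(target, n_2015 + 1)
)

print(p_exactly, p_at_least)
```

The two results differ, so the question needs to say whether it means exactly 108 or at least 108.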
Prescriptive Analytics
3. Prescriptive Analytics is the practice of using the patterns and predictions drawn from existing data sets to recommend a course of action
– Does not predict the future
– Example: how likely is it that I will find a high-paying job if I choose to major in ‘computer science’ rather than ‘biology’
– Upcoming Application: autocomplete makes recommendations based on previous rankings
How Useful are these three?
• The following grades each type of analysis on its utility (how useful is it?) and confidence (how likely is it to be true and/or valid?) in the context of decision making
Analysis type   Utility level   Confidence level   Example
Descriptive     C               A+                 Obama got x% of the vote
Predictive      B+              B-                 x% chance of winning
Prescriptive    A               C                  How to run the campaign
An example of hypothesis testing
Car Talk Puzzler – “The Case of the Finicky Volare”
Not statistical hypothesis testing
• A man thinks he is having some peculiar car trouble:
“It doesn't like a certain kind of ice cream I buy.”
• He goes on to explain that he only has three flavors of ice cream he likes:
– Vanilla
– Chocolate
– Three-bean tofu mint chipped-beef ice cream
Instead, only establish an hypothesis
He says:
“When I go to buy chocolate I park in front of the ice cream parlor: I buy the chocolate, and my car starts right up. I buy the vanilla, and my car starts right up. However, if I buy the three-bean tofu mint chipped-beef, my car won't start.”
• What could be the issue with this car?
Deriving an hypothesis
“When you go in to buy chocolate, you go into the freezer case and there's chocolate in a container -- you take it, you pay for it, you get into your car and drive away. Same thing with vanilla, but nobody buys mint chip beef bean tofu, right? So, somebody must hand pack the ice cream into a special container”
• What could be the issue with this car?
Hypothesis Example
• The car is old.
• It takes longer to purchase hand packed ice cream than pre-packed ice cream.
• Ice cream is purchased more often in the summer when it is hot
Hypothesis: A proposed explanation for this phenomenon
The car overheats and ‘vapor locks’ in the extra time it takes to purchase three-bean tofu mint chipped-beef ice cream.
Statistical Hypothesis Testing
• Statistical hypothesis testing involves more than the car mechanics of the puzzler
• For a more realistic idea, view 1.5 minutes of this example from Khan Academy
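As a rough illustration of what a statistical hypothesis test computes, the sketch below runs a one-sided z-test on a sample mean. All of the numbers are hypothetical and chosen only for illustration, and the normal approximation is itself an assumption.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical numbers, chosen only for illustration.
null_mean = 1.2     # mean response time claimed by the null hypothesis
sample_mean = 1.05  # mean observed in the sample
sample_sd = 0.5     # standard deviation observed in the sample
n = 100             # sample size

# How many standard errors the sample mean lies from the null mean.
standard_error = sample_sd / sqrt(n)
z = (sample_mean - null_mean) / standard_error

# One-sided p-value: chance of a sample mean this low if the null is true.
p_value = NormalDist().cdf(z)

# Reject the null hypothesis at the conventional 5% level.
reject_null = p_value < 0.05

print(z, p_value, reject_null)
```

A small p-value means the observed mean would be unlikely if the null hypothesis were true. It does not prove the alternative explanation.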
Exploratory data analysis
• Exploratory data analysis is another method of scientific inquiry
• Uses big data statistics to postulate correlations (data sets that are linked together) that have not yet been hypothesized
• Attempts to discover patterns in order to establish correlative links
An example of exploratory data analysis
• Uses data from millions of searches to predict what you are looking for
• Discovers patterns
• Correlations not guaranteed
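The idea of predicting a search from earlier searches can be sketched by counting how often each logged query follows a typed prefix. The log and the function `suggest` are hypothetical; a real system would draw on millions of queries and many more signals.

```python
from collections import Counter

# Hypothetical search log; real systems draw on millions of queries.
SEARCH_LOG = [
    "how to change a tire",
    "how to change a tire",
    "how to change a tire",
    "how to change a lightbulb",
    "how to change a lightbulb",
    "how to change a diaper",
    "how to cook rice",
]

def suggest(prefix, log, limit=3):
    """Rank logged queries that start with prefix by how often they occur."""
    counts = Counter(q for q in log if q.startswith(prefix))
    return [query for query, _ in counts.most_common(limit)]

print(suggest("how to change a ", SEARCH_LOG))
```

The ranking reflects only what was searched most often, which is why the pattern it finds is not guaranteed to match what a given user wants.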
“How to change a _______”
[Chart: y-axis is frequency; x-axis is the words in the blank]
Exploratory data analysis
Statistical Hypothesis Testing (example: injections into rats)
– Confidence: since the test is constrained to a specific issue (and the variables are known), confidence can be really high
– Scope: works over a typically circumscribed problem
– Sample size: power increases with sample size; it is sometimes possible to determine a minimum sample size that guarantees a desired confidence
Exploratory Data Analysis (example: autocomplete)
– Confidence: confidence is typically lower, because the data are messier and the connections among them are unknown
– Scope: usually applied to much larger datasets with more unknowns
– Sample size: power is dependent on sample size, but because there are no established hypotheses, it is unknown how large a sample must be to discover knowledge
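For hypothesis testing, the minimum sample size mentioned above can sometimes be worked out in advance. The sketch below uses the standard formula for estimating a mean within a chosen margin of error. It assumes a known standard deviation and a normal approximation, and the function name and inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def min_sample_size(sd, margin, confidence=0.95):
    """Smallest n that keeps the sample mean within margin at this confidence."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sd / margin) ** 2)

# Hypothetical inputs: standard deviation 0.5, desired margin of error 0.1.
print(min_sample_size(sd=0.5, margin=0.1))
```

No comparable calculation exists for exploratory analysis, because there is no stated hypothesis to size the sample against.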
Some big data sets and collections
• Google Public Data Explorer (130 datasets from Bureau of Labor Statistics, U.S. Census Bureau, etc.)
• data.gov - an online repository of datasets from the U.S. Government
– Many counties have searchable property databases, such as the Travis County Appraisal District
– Many counties have searchable legal databases, such as the Travis County Clerk
– Some data sets defy categorization, such as the Texas Death Row Executions data set
• Google's Ngram Data - data on Google's catalog of millions of books, including raw data sets
• Google Trends - detailed search history information, including CSV downloads
• NOAA National Climatic Data Center
• Knoema - "free to use public and open data platform for users with interests in statistics and data analysis, visual storytelling and making infographics"
• Geocommons - "all about open data analysis and maps"
• Stat Silk - "interactive maps of open data"
• Better World Flux - "a beautiful interactive visualization of information on what really matters in life"
• Gapminder - "unveiling the beauty of statistics for a better world view"