1
Introduction to Data Science
Section 2
Data Matters 2015
Sponsored by the Odum Institute, RENCI, and NCDS
Thomas M. [email protected]
3
Data Science is More than Analysis
• Data analysis gets most of the attention in data science.
• In that sense, many people struggle to distinguish data science from applied statistics.
• Analysis is obviously important, but statistical analysis skills are only useful if the data can be collected and put in a usable form.
• Data Science is much broader than just data analysis.
4
The Data Lifecycle
• Data science considers data at every stage of what is called the data lifecycle.
• This lifecycle generally refers to everything from collecting data to analyzing it to sharing it so others can re-analyze it.
– In fact, it includes the planning process that should be in place before any other work begins.
• New visions of this process in particular focus on integrating every action that creates, analyzes, or otherwise touches data.
• These same new visions treat the process as dynamic: data archives are not just digital shoe boxes under the bed.
• There are many representations of this lifecycle.
9
Lessons from the Lifecycle
• Data Science is more than just data analysis.
• Effective data science requires
– Planning
– Vision
– Storage
– Interoperability of systems
– A team approach
– Adaptability and Scalability
10
What is Missing?
• Most definitions of data science underplay or leave out discussions of:
– Substantive theory
– Metadata
– Privacy and Ethics
– Greater consideration for missing data, representativeness, and uncertainty
– More thinking about the proper null hypothesis
– Leadership on leveraging data science for the public good
12
The Data Generating Process (DGP)
• Most of the time we don’t care about the data itself.
• Most of the time we are trying to learn something about an underlying process that produces the data – a DGP.
• Technically trained folks might be good at uncovering patterns in data, but you need substantive expertise to:
– Know where to look in the first place
– Know what to look for
– Know what your findings might actually mean
13
What is the DGP?
• Good analysis starts with a question you want to answer.
– Blind data mining can only get you so far, and really, there is no such thing as completely blind mining.
• Answering that question requires laying out expectations of what you will find and explanations for those expectations.
• Those expectations and explanations rest on assumptions.
• If your data collection, data management, and data analysis are not compatible with those assumptions, you risk producing meaningless or misleading answers.
14
The DGP (cont.)
• Think of the world you are interested in as governed by dynamic processes.
• Those processes produce observable bits of information about themselves – data.
• We can use data science to:
– Collect, catalog, and organize those bits of information
– Discover patterns in data and fit models to that data
– Make predictions outside of our data
– Inform explanations of both those patterns and those predictions
• Real discovery is NOT about modeling patterns in observable data. It is about understanding the processes that produced that data.
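The distinction between the data and the process behind it can be illustrated with a small simulation. The sketch below is not from the slides: it assumes a hypothetical DGP (a straight line plus random noise, with made-up values `TRUE_INTERCEPT` and `TRUE_SLOPE`), generates observable data from it, and then uses ordinary least squares to estimate the parameters of the process.

```python
import random

# Hypothetical "true" process parameters, chosen only for illustration.
TRUE_INTERCEPT = 2.0
TRUE_SLOPE = 3.0


def simulate_dgp(n, seed=0):
    """Generate n observations from y = intercept + slope * x + noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares estimates of (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# We only ever observe xs and ys; the parameters belong to the process.
xs, ys = simulate_dgp(5000)
intercept_hat, slope_hat = fit_line(xs, ys)
print(f"estimated intercept={intercept_hat:.2f}, slope={slope_hat:.2f}")
```

In a simulation we know the true process, so we can check whether the analysis recovers it. With real data the process is unknown, which is why the assumptions behind the model (here, linearity and independent noise) need to come from substantive knowledge.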
15
Theories and DGPs
• Theories provide explanations for the processes we care about.
• They answer the question: why does something work the way it does?
• Theories make predictions about what we should see in data.
• We use data to test the predictions, but we never completely test a theory.
16
Why do we need theory?
• Can’t we just find “truth” in the data if we have enough of it? Especially if we have all of it?
• No!
– More data does not mean more representative data.
– Every method of analysis makes some assumptions, so we are better off if we make them explicit.
– Patterns without understanding are at best uninformative and at worst deeply misleading.
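The point that more data is not the same as more representative data can be shown with a short simulation. This is a sketch under invented assumptions, not an example from the slides: the population, the selection rule (only units with values above 50 are ever observed), and the sample sizes are all hypothetical.

```python
import random

rng = random.Random(42)

# Hypothetical population: 100,000 values centered on 50.
population = [rng.gauss(50, 10) for _ in range(100_000)]
true_mean = sum(population) / len(population)

# "Big data" collected through a biased channel:
# only units with values above 50 are ever observed.
big_biased = [v for v in population if v > 50]

# A small simple random sample from the same population.
small_random = rng.sample(population, 200)

biased_error = abs(sum(big_biased) / len(big_biased) - true_mean)
random_error = abs(sum(small_random) / len(small_random) - true_mean)

print(f"biased sample: n={len(big_biased)}, error in mean={biased_error:.2f}")
print(f"random sample: n={len(small_random)}, error in mean={random_error:.2f}")
```

The biased sample is hundreds of times larger, yet its estimate of the population mean is further from the truth, because adding more observations from the same biased channel does nothing to remove the selection effect.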
17
Robert Matthews (Aston University), 2000. “Storks Deliver Babies (P=0.008).” Teaching Statistics, Volume 22, Number 2, Summer 2000.
18
New Behaviors Require New Theories
• The Target example illustrated how existing theories about habit formation informed their data mining efforts.
• However, whole new behaviors exist that are creating a lot of the data that data scientists want to analyze:
– Online shopping
– Cell phone usage
– Crowd sourced recommendation systems
– Facebook, Google searching, etc.
– Online mobilization of social protests
• We need new theories for these new behaviors.
20
What is Metadata?
• Metadata is data about data. It is frequently ignored or misunderstood.
• Metadata is required to give data meaning.
• It includes:
– Variable names and labels, value labels, information on who collected the data, when, by what methods, in what locations, for what purpose, etc.
• Metadata is essential to use data effectively, to reuse data, to share data, and to integrate data.
• Data without metadata is worthless.
21
The Value of Metadata
• Data by itself is just a bunch of 0’s and 1’s.
• Metadata
– Provides meaning
– Allows for cataloging
– Facilitates search and discovery
– Enables linking data sets
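How metadata provides meaning can be sketched in a few lines of Python. Everything in this example is hypothetical: the variable names `v1` and `v2`, the codes, and the labels are invented to show how a codebook (variable labels and value labels) turns opaque codes into interpretable information.

```python
# Raw data as stored: numeric codes with no inherent meaning.
raw_rows = [
    {"v1": 1, "v2": 3},
    {"v1": 2, "v2": 1},
]

# Hypothetical codebook (metadata): variable labels and value labels.
codebook = {
    "v1": {
        "label": "Employment status",
        "values": {1: "Employed", 2: "Unemployed"},
    },
    "v2": {
        "label": "Education",
        "values": {1: "High school", 2: "Some college", 3: "College degree"},
    },
}


def apply_codebook(row, codebook):
    """Translate coded values into labeled, human-readable values."""
    return {
        codebook[var]["label"]: codebook[var]["values"][code]
        for var, code in row.items()
    }


labeled_rows = [apply_codebook(row, codebook) for row in raw_rows]
for row in labeled_rows:
    print(row)
```

Without the codebook, the row `{"v1": 1, "v2": 3}` cannot be interpreted, searched by topic, or safely linked to another data set that may use different codes for the same concepts.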
22
Types of Metadata
• NISO defines three types:
– Structural: describes how the components of the data are organized (columns, rows, chapters, etc.)
– Descriptive: provides titles, authors, keywords, subjects, etc. that facilitate attribution and search/discovery.
– Administrative: technical information on how the file was created, software used, formats for storage, etc.
  • Includes rights and preservation metadata
23
Metadata Standards
• There are emerging standards for metadata
– The American National Standards Institute
– The International Organization for Standardization
• Dublin Core – 15 classic metadata terms.
– Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights
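As a rough illustration of how the 15 Dublin Core terms might be used in practice, the sketch below stores a record as a plain Python dictionary and checks it against the element list. The record values and the helper names `missing_elements` and `unknown_fields` are assumptions for illustration; this is not an official Dublin Core tool or serialization.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# Hypothetical, partially filled record describing this slide deck.
record = {
    "title": "Introduction to Data Science, Section 2",
    "date": "2015",
    "type": "Text",
    "language": "en",
}


def missing_elements(record):
    """Return Dublin Core elements that the record does not fill in."""
    return [e for e in DUBLIN_CORE_ELEMENTS if e not in record]


def unknown_fields(record):
    """Return record keys that are not Dublin Core elements."""
    return [k for k in record if k not in DUBLIN_CORE_ELEMENTS]


print("missing:", missing_elements(record))
print("unknown:", unknown_fields(record))
```

All Dublin Core elements are optional, so the list of missing elements is a prompt for the cataloger rather than a validation error. Unknown fields are the more useful check, since they flag keys that other systems following the standard will not recognize.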