The Anatomy and Physiology of Data Science
Peter Fox¹ ([email protected]) http://tw.rpi.edu/web/Courses
¹Rensselaer Polytechnic Institute, 110 8th St., Troy, NY 12180, United States (see Acknowledgments)
Glossary:
RPI – Rensselaer Polytechnic Institute
TWC – Tetherless World Constellation at Rensselaer Polytechnic Institute
Acknowledgments:
TWC eScience Group
W3C Provenance Working Group
Sponsors:
Rensselaer Polytechnic Institute
Tetherless World Constellation
MOTIVATION
Whether the science community at large (especially the geosciences) likes it or not, the private sector's co-opting of the term Data Science has led to increased hype over data science as a career and as a means to solve challenging data problems, and to a lack of educational innovation in data science curricula.
If the full benefits of a new generation of statistical and analytical software tools that operate on high-performance computational infrastructure are to be attained, adequate attention to the 'science of data science' is needed. In this contribution, we present a science view of data science both from an education and research perspective.
We introduce a research agenda that explores the key challenges that must be addressed to meet the needs of research driven by large-scale data analytics.
We focus on three as-yet untapped data science topics: understanding scale in systems, sparse systems, and abductive reasoning.
We conclude with a specific call to action to make progress on the aforementioned topics.
The Landscape – Data Ecosystem and What Makes Up a Data Scientist?
Learning Outcomes
Physiology (in a group)
Definition of Science Hypotheses, Guiding Questions
Finding and Integrating Datasets
Presenting Analyses and Viz.
Presenting Conclusions
Institutions to provide reliable, high-functionality data infrastructures that facilitate analytics
Provision of intermediate to advanced Statistics to undergraduates and early graduate students
Well-curated datasets are made widely available along with developed models and validation statistics
All results are under continuous scrutiny, are traceable and verifiable
AGUFM14 – ED31E-3455 (MS Hall A-C)
To demonstrate knowledge of relevant analytic methods, and to recognize and apply quantitative algorithms and techniques and interpret their results.
To demonstrate strategic thinking skills, combined with a solid technical foundation in data and model-driven decision-making.
To develop the ability to apply critical and analytical methods to formulate and solve science, engineering, medical, and business problems.
To examine real-world examples that place data-mining techniques in context, to develop data-analytic thinking, and to illustrate that their application is both art and science.
Must effectively communicate analytic findings to non-specialists.
Must develop and demonstrate a working knowledge of decision making under uncertainty, and be able to build optimization models that incorporate random parameters: static stochastic optimization, two-stage optimization with recourse, chance-constrained optimization, and sequential decision making.
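As an illustration of the simplest item in that list, static stochastic optimization, the following is a minimal sketch and not part of the course material. It assumes a hypothetical single-period ordering problem (the classic newsvendor setting) with an invented demand distribution, price, and cost, and it picks the order quantity that maximizes the average profit over sampled demands (a sample average approximation).

```python
import random

def expected_profit(order_qty, demand_samples, price, cost):
    """Average profit over sampled demands for a fixed order quantity."""
    total = 0.0
    for demand in demand_samples:
        sold = min(order_qty, demand)           # cannot sell more than was ordered
        total += price * sold - cost * order_qty
    return total / len(demand_samples)

def best_order(demand_samples, price, cost, candidates):
    """Return the candidate quantity with the highest sample-average profit."""
    return max(candidates,
               key=lambda q: expected_profit(q, demand_samples, price, cost))

# Hypothetical inputs: demand uniform on 50..150, unit price 10, unit cost 4.
rng = random.Random(0)
demand_samples = [rng.randint(50, 150) for _ in range(2000)]
q_star = best_order(demand_samples, price=10.0, cost=4.0,
                    candidates=range(50, 151))
print("order quantity chosen by sample average approximation:", q_star)
```

For these assumed numbers, theory places the optimum near the 60th percentile of demand (the critical ratio 1 - cost/price), so the chosen quantity should fall close to 110.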
Anatomy (as an individual)
Data Life Cycle – Acquisition, Curation and Preservation
Data Management and Products
Forms of Analysis, Errors and Uncertainty
Technical tools and standards
Anatomy is the study of the structure of, and relationships between, body parts.
Physiology is the study of the function of body parts and the body as a whole.
[Figure: Data, Information, Knowledge. Labels: Producers, Consumers, Context, Presentation, Organization, Integration, Conversation, Creation, Gathering, Experience]
Big Data Science (Data Analytics) | Anatomy & Physiology | Call To Action
Learning Outcomes | “Data” Science | Anatomy & Physiology | Call To Action
Anatomy (individual)
Intermediate Skill in parametric and non-parametric statistics
Application of a broad spectrum of Data Mining and Machine Learning Algorithms
Ability to cross-validate and optimize models
Application to specific datasets
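The cross-validation skill named above can be sketched as follows. This is an illustrative example only, assuming synthetic data and two hypothetical candidate models (a constant-mean predictor and a least-squares line); the helper names are invented for the sketch and use only the Python standard library.

```python
import random

def k_fold_splits(n, k, seed=0):
    """Yield (train, test) index lists; each index is in exactly one test fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Ordinary least-squares line through the training points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def cv_mse(fit, xs, ys, k=5):
    """Mean squared error on held-out points, pooled over the k folds."""
    errors = []
    for train, test in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.extend((model(xs[i]) - ys[i]) ** 2 for i in test)
    return sum(errors) / len(errors)

# Synthetic data (an assumption for illustration): noisy straight line.
rng = random.Random(1)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]
mse_mean = cv_mse(fit_mean, xs, ys)
mse_line = cv_mse(fit_line, xs, ys)
print("held-out MSE, mean model:", mse_mean)
print("held-out MSE, line model:", mse_line)
```

Because every error is measured on points the model did not see during fitting, the comparison between the two candidates reflects predictive skill instead of goodness of fit to the training data.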
Through class lectures, practical sessions, written and oral presentation assignments and projects, students should:
Develop and demonstrate skill in Data Collection and Data Management
Demonstrate proficiency in Data/Information Product Generation
Demonstrate science-driven Analysis and Presentation of Integrated Datasets from the Web
Demonstrate the development and application of Data Models
Convey knowledge of and apply Data and Metadata Standards, and explain Provenance
Apply Data Life-Cycle principles, construct Data Workflows
Develop and demonstrate skill in Data Tool Use and Evaluation
Data Science across the curriculum
Same as “Calculus”
And … Intro to Statistics
Data Management is Second Nature
Like operating an instrument
Openness/sharing is the natural state
As-a-whole, the Data Scientist works collaboratively and is recognized and rewarded by peers and organizations
Data Science primarily advances the inductive conduct of science, but to understand scale in systems, accommodate sparse systems, and provide for abductive reasoning, data scientists must progress to data analyticists.
Data science is advancing the inductive conduct of science and is driven by the greater volumes, complexity and heterogeneity of data being made available over the Internet. Data science combines aspects of data management, library science, computer science, and physical science using supporting cyberinfrastructure and information technology. It is changing the way all of these disciplines do both their individual and collaborative work. Key methodologies in application areas based on real research experience are taught to build a skill-set.
Data and Information analytics extends analysis (descriptive and predictive models to obtain knowledge from data) by using insight from analyses to recommend action or to guide and communicate decision-making. Thus, analytics is not so much concerned with individual analyses or analysis steps, but with an entire methodology. The world at large is confronted with increasingly large and complex sets of structured and unstructured information from sensors, instruments, and computer simulations; data is "hidden" in websites, application servers, social networks and on mobile devices. In commerce and industry, analytics-driven enterprises are becoming mainstream, and traditional enterprises are moving toward analytics-driven approaches for core business functions. In government and corporations, cybersecurity problems are prevalent. Yet there is a shortfall in the key skills that education must provide to meet these growing needs.
Key topics include advanced statistical computing theory, multivariate analysis, the application of computer science courses such as data mining and machine learning, and change detection by uncovering unexpected patterns in data.
Lt. Cmdr Data, Star Trek TNG
Lt. Cmdr Data and Friends
Overused Venn diagram of the intersection of skills needed for Data Science (Drew Conway)
The Data-Information-Knowledge Ecosystem (Fox; derived)
Physiology (term project)
Definition of Science Hypotheses, with Prediction/Prescription Goal
Cleaning and Preparing Datasets
Validating and Verifying Models
Presenting Ideas and Results