Fall 2017 CptS 483:04 Introduction to Data Science What Is Data Science? Assefaw Gebremedhin

Fall 2017

CptS 483:04 Introduction to Data Science

What Is Data Science?

Assefaw Gebremedhin

Assefaw Gebremedhin: Introduction to Data Science, http://scads.eecs.wsu.edu

What is Data Science?

• Big Data and Data Science hype
    • and getting past the hype
• Why now?
• Current landscape of perspectives
• Skill sets needed


Big Data and Data Science Hype

What might be eyebrow-raising about Big Data and Data Science?

• Lack of definition around basic terminology
• Lack of recognition for researchers in academia and industry who have been working on this kind of stuff for years
• The hype is crazy
• Statisticians might perceive this whole movement as identity theft
• Some say “anything that has to call itself a science isn’t”

Source: Doing Data Science (O’Neil & Schutt, 2013).


Getting past the hype

Around all the hype, there is a ring of truth. Data Science is something new – it has access to a larger body of knowledge and methodology, as well as a process that has foundations in both statistics and computer science. [DDS, O’Neil and Schutt]

We are here in this course to understand this better and contribute to the ongoing pursuit of a sharper definition.


Quote from Intro of “Foundations of Data Science” manuscript by Avrim Blum, John Hopcroft and Ravindran Kannan (2015)

Computer science as an academic discipline began in the 60’s. Emphasis was on programming languages, compilers, operating systems, and the mathematical theory that supported these areas. Courses in theoretical computer science covered finite automata, regular expressions, context free languages, and computability. In the 70’s, algorithms was added as an important component of theory. The emphasis was on making computers useful. Today, a fundamental change is taking place and the focus is more on applications. There are many reasons for this change. The merging of computing and communications has played an important role. The enhanced ability to observe, collect and store data in the natural sciences, in commerce, and in other fields calls for a change in our understanding of data and how to handle it in the modern setting. The emergence of the web and social networks, which are by far the largest such structures, presents both opportunities and challenges for theory.

John Hopcroft


Quote from Intro of “Foundations of Data Science” manuscript by Avrim Blum, John Hopcroft and Ravindran Kannan (2015)

While traditional areas of computer science are still important and highly skilled individuals are needed in these areas, the majority of researchers will be involved with using computers to understand and make usable massive data arising in applications, not just how to make computers useful on specific well-defined problems. With this in mind we have written this book to cover the theory likely to be useful in the next 40 years, just as automata theory, algorithms and related topics gave students an advantage in the last 40 years. One of the major changes is the switch from discrete mathematics to more of an emphasis on probability, statistics, and numerical methods.

John Hopcroft


Why Now?

Enablers of today’s “big data revolution”

• Proliferation of sensors
• Creation of almost all information in digital form
    • Datafication
• Dramatic cost reduction in storage
    • You can afford to keep all the data
• Dramatic increases in network bandwidth
    • You can move the data to where it is needed
• Dramatic cost reduction and scalability improvements in computation
• Dramatic algorithmic breakthroughs
    • Machine Learning, Data Mining, Fundamental advances in CS and Statistics
• Ever more powerful models producing ever increasing volumes of data that must be analyzed


Current landscape (of perspectives)

Example 1. Metamarkets CEO Mike Driscoll (in a 2010 Quora discussion on “What is Data Science”):

Data Science, as practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired statistics. But data science is not merely hacking—because when hackers finish debugging their Bash one-liners and Pig scripts, few of them care about non-Euclidean distance metrics. And data science is not merely statistics, because when statisticians finish theorizing the perfect model, few could read a tab-delimited file into R if their job depended on it. Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools and materials, coupled with a theoretical understanding of what’s possible.


Current landscape (of perspectives)

Example 2. Drew Conway’s Venn diagram of data science (2010)


Current landscape (of perspectives)

Example 3. Vasant Dhar, in the article “Data Science and Prediction”, Communications of the ACM, Dec 2013, makes the following three big points:
http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext

• Data Science is the study of the generalizable extraction of knowledge from data.
• A common requirement in assessing whether new knowledge is actionable for decision making is its predictive power, not just its ability to explain the past.
• A data scientist requires an integrated skill set spanning math, ML, statistics, computer science, along with a deep understanding of the craft of problem formulation to engineer effective solutions.


A Data Science Profile

• Computer science
• Math
• Statistics
• Machine Learning
• Domain expertise
• Communication and presentation skills
• Data visualization


Author Schutt’s data science profile