Lecture 2: Measures and Data Collection/Cleaning CS 6071 Big Data Engineering, Architecture, and Security Fall 2015, Dr. Rozier


Lecture 2: Measures and Data Collection/Cleaning

CS 6071

Big Data Engineering, Architecture, and Security

Fall 2015, Dr. Rozier


Homework 2

• Presentations on Biomedical Data Science

• Due: Next Week on Tuesday?

Reorganizing Groups.


Measurements

• Measurements have inherent assumptions

• Measurements are often stated very informally

– Formalize our measures!


Measurements

Measure theory is a bit like grammar, many people communicate clearly without worrying about all the details, but the details do exist and for good reasons. - Maya Gupta, University of Washington


The Problem of Measures

• Physical intuition of the measure of length: given a body E, the measure of this body, m(E), might be the sum of its components, or points.

• Let’s take two bodies on the real number line

– Body A is the line A = [0, 1]

– Body B is the line B = [0, 2]

Which is “longer”?


The Problem of Measures

• Physical intuition of the measure of length: given a body E, the measure of this body, m(E), might be the sum of its components, or points.

• Let’s take two bodies on the natural number line

– Body A is the line A = [0, 1]

– Body B is the line B = [0, 2]

Which is “longer”?
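One way to see why the two versions of the question differ is sketched below in Python. The function names are illustrative and not from the lecture. On the natural number line, "summing the points" works and acts as a counting measure. On the real number line, both intervals contain uncountably many points, so counting points cannot tell them apart, and a length measure b − a is the usual choice.

```python
def counting_measure(a, b):
    """Number of natural-number points in [a, b], for integers a <= b."""
    return b - a + 1


def length_measure(a, b):
    """Length of the real interval [a, b], for a <= b."""
    return b - a


# Natural number line: A = {0, 1}, B = {0, 1, 2}
print(counting_measure(0, 1), counting_measure(0, 2))

# Real number line: A = [0, 1], B = [0, 2]
print(length_measure(0.0, 1.0), length_measure(0.0, 2.0))
```

Under either measure B comes out "longer", but for different reasons: more points in one case, more length in the other.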


Solving the Problem of Measures

• What does it mean for some body (or subset) to be measurable?

• If a set E is measurable, how does one define its measure?

• What properties or axioms does measure (or the concept of measurability) obey?


Measure Theory

• Before we can measure anything we need something to measure!

• Let’s define a measurable space

– A measurable space is a pair (Ω, B): the set of all outcomes, Ω, also called the sample space, together with a collection of events B.


Events and Sample Spaces

• Each event, F, is a set containing zero or more outcomes.

– Each outcome can be viewed as a realization of an event. The real world can be viewed as a player in a game that makes some move by selecting a single outcome:

– All events in B that contain the selected outcome are said to “have occurred”.


Events and Sample Space

• Take a deck of 52 cards + 2 jokers

• Draw a single card from the deck.

• Sample space: 54 element set, each card is a possible outcome.

• An event is any subset of the sample space, including a singleton set, or the empty set.


Events and Sample Space

• Potential events:

– “Red and black at the same time without being a joker” – (0 elements)

– “The 5 of hearts” – (1 element)

– “A king” – (4 elements)

– “A face card” – (12 elements)

– “A card” – (54 elements)
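The card example can be sketched as Python sets. The encoding of a card as a (rank, suit) tuple and the names below are assumptions made for illustration only.

```python
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]

# Sample space: 52 ordinary cards plus 2 jokers.
sample_space = frozenset(product(RANKS, SUITS)) | {("joker", 1), ("joker", 2)}

# Events are subsets of the sample space.
red_and_black = frozenset()                      # impossible event
five_of_hearts = frozenset({("5", "hearts")})    # singleton event
kings = frozenset(c for c in sample_space if c[0] == "K")
face_cards = frozenset(c for c in sample_space if c[0] in {"J", "Q", "K"})
any_card = sample_space                          # the certain event


def occurred(event, outcome):
    """An event has occurred if it contains the drawn outcome."""
    return outcome in event


for name, event in [("red and black", red_and_black),
                    ("5 of hearts", five_of_hearts),
                    ("a king", kings),
                    ("a face card", face_cards),
                    ("a card", any_card)]:
    print(name, len(event))
```

Drawing the king of spades means that "a king", "a face card", and "a card" have all occurred at once, since each of those events contains that outcome.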


Forming an Algebra on B and Ω

• In order to define measures on B, we need to make sure it has certain properties, those of a σ-algebra.

• A σ-algebra is a special kind of collection of subsets that is closed under countable-fold set operations (complement, union of countably many sets, and intersection of countably many sets).

• “Vanilla” algebras are closed only under finite set operations.
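For a finite sample space the distinction between an algebra and a σ-algebra disappears, because any countable union of sets drawn from a finite collection is already a finite union. Under that assumption, the closure properties can be checked by brute force. The function below is a hypothetical helper written for illustration, not part of any library.

```python
from itertools import combinations


def is_sigma_algebra(omega, collection):
    """Check the sigma-algebra properties for a collection of subsets
    of a FINITE sample space omega."""
    omega = frozenset(omega)
    sets = {frozenset(s) for s in collection}
    # The whole sample space must be an event.
    if omega not in sets:
        return False
    # Closed under complement.
    if any(omega - s not in sets for s in sets):
        return False
    # Closed under pairwise union, which gives all finite unions.
    # Closure under intersection then follows from De Morgan's laws.
    return all(a | b in sets for a, b in combinations(sets, 2))


omega = {1, 2, 3, 4}
print(is_sigma_algebra(omega, [set(), {1, 2}, {3, 4}, omega]))
print(is_sigma_algebra(omega, [set(), {1}, omega]))
```

The second collection fails because the complement of {1}, namely {2, 3, 4}, is missing.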


Countable Sets

• Countable sets are those with the same cardinality as the natural numbers (finite sets are usually counted as countable as well).

• Quick refresher: Prove that the integers and the natural numbers have the same cardinality.
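One standard route for the refresher is to exhibit a bijection between the two sets. The sketch below assumes the natural numbers start at 0 and interleaves the integers as 0, −1, 1, −2, 2, and so on. The function names are illustrative, and the code only spot-checks the map on a finite range; it does not replace the proof.

```python
def nat_to_int(n):
    """Map 0, 1, 2, 3, 4, ... to 0, -1, 1, -2, 2, ..."""
    if n % 2 == 0:
        return n // 2
    return -((n + 1) // 2)


def int_to_nat(z):
    """Inverse map: non-negative z goes to 2z, negative z to -2z - 1."""
    if z >= 0:
        return 2 * z
    return -2 * z - 1


print([nat_to_int(n) for n in range(5)])
```

Showing that each function undoes the other for every input is what establishes that the map is one-to-one and onto.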


σ-algebra

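• If we have a σ-algebra B on our sample space Ω, then:

– $\Omega \in B$

– if $A \in B$, then its complement $\Omega \setminus A \in B$

– if $A_1, A_2, \ldots \in B$, then $\bigcup_i A_i \in B$ and $\bigcap_i A_i \in B$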


Measures

• A measure µ takes a set A from a measurable collection of sets B and returns the measure of A, which is some nonnegative real number (possibly $+\infty$).

Formally: $\mu : B \to [0, \infty]$


Example Measure

• Let’s define a measure of “Volume”.

• The triple $(\Omega, B, \mu)$ combines a measurable space and a measure; the triple is called a measure space. The measure µ must satisfy two properties:

– Nonnegativity: $\mu(A) \ge 0$ for all $A \in B$

– Countable additivity: if $A_i \in B$ are disjoint sets for i = 1, 2, …, then the measure of the union of the $A_i$ is equal to the sum of the measures of the $A_i$: $\mu\left(\bigcup_i A_i\right) = \sum_i \mu(A_i)$


Example Measure

• Does the ordinary concept of volume satisfy these two properties?

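– Nonnegativity: $\mu(A) \ge 0$ for all $A \in B$

– Countable additivity: if $A_i \in B$ are disjoint sets for i = 1, 2, …, then the measure of the union of the $A_i$ is equal to the sum of the measures of the $A_i$: $\mu\left(\bigcup_i A_i\right) = \sum_i \mu(A_i)$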
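As a one-dimensional stand-in for volume, the sketch below uses interval length to illustrate countable additivity. The disjoint pieces [1/2^(i+1), 1/2^i) for i = 0, 1, 2, … together make up the open interval (0, 1), which has length 1, and the partial sums of their lengths approach 1. The helper names are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def length(interval):
    """Length of an interval given as a pair of endpoints (a, b)."""
    a, b = interval
    return b - a


def piece(i):
    """The i-th disjoint piece [1/2^(i+1), 1/2^i)."""
    return (Fraction(1, 2 ** (i + 1)), Fraction(1, 2 ** i))


def partial_sum(n):
    """Sum of the lengths of the first n pieces."""
    return sum((length(piece(i)) for i in range(n)), Fraction(0))


for n in (1, 2, 10):
    print(n, partial_sum(n))
```

Each partial sum equals 1 − 1/2^n, so the infinite sum of the piece lengths is 1, matching the length of the union.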


Two Special Kinds of Measures

• Signed measure – can be negative

• Probability measure – defined over a probability space $(\Omega, B, P)$.

– A probability measure, P, has the normal properties of a measure, but it is also normalized such that: $P(\Omega) = 1$
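A minimal sketch of a probability measure, assuming a uniform draw from the 54-card deck used earlier in the lecture. The card encoding and the function name `P` are illustrative choices, not from the slides.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
OMEGA = frozenset(product(RANKS, SUITS)) | {("joker", 1), ("joker", 2)}


def P(event):
    """Uniform probability measure: every card is equally likely."""
    event = frozenset(event)
    if not event <= OMEGA:
        raise ValueError("event must be a subset of the sample space")
    return Fraction(len(event), len(OMEGA))


kings = {c for c in OMEGA if c[0] == "K"}
jokers = {c for c in OMEGA if c[0] == "joker"}

print(P(OMEGA))          # normalization
print(P(kings))
print(P(kings | jokers) == P(kings) + P(jokers))  # additivity on disjoint events
```

Normalization is the only property that separates this from a general measure: dividing the counting measure by the size of the sample space forces P(Ω) = 1.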


Sets of Measure Zero

• A set of measure zero is some set $A \in B$ such that $\mu(A) = 0$.

• For a probability measure, any set of measure zero has probability zero, so it almost surely never occurs.

– It can thus be ignored when stating things about the collection of sets B.


Borel Sets

• A common σ-algebra is the Borel σ-algebra. A Borel set is an element of a Borel σ-algebra.

– Almost any set you can describe on the real line is a Borel set, for example, the unit line segment [0,1], the set of irrational numbers, etc.

– The Borel σ-algebra on the real line is the smallest σ-algebra that includes the open subsets of the real line.


Borel Sets

• For some space X, the collection of all Borel sets on X forms a σ-algebra known as the Borel algebra (or Borel σ-algebra) on X.

• Important!

• Why? Any measure defined on the open sets of a space, or on the closed sets of a space, must also be defined on all Borel sets of that space.


Borel Sets

• Borel sets are powerful because if you know what a probability measure does on every interval, then you know what it does on all the Borel sets.

• Allows us to define equivalence of measures.


Borel Sets

• Let’s say we have two probability measures on the same measurable space.

• To show they are equivalent we just need to show that:

– They agree on all intervals

• Because the intervals generate the Borel σ-algebra, the two measures then agree on all Borel sets, and hence over the measurable space.

• Example: Given probability distributions A and B with identical cumulative distribution functions, the probability distributions must also be equal.
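The link between a cumulative distribution function F and interval probabilities is P(a < X ≤ b) = F(b) − F(a), which is why two distributions with the same CDF assign the same probability to every such interval. The sketch below illustrates this with an exponential CDF, chosen here only as an example; the function names are assumptions.

```python
import math


def exponential_cdf(x, rate=1.0):
    """CDF of an exponential distribution with the given rate."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-rate * x)


def interval_probability(cdf, a, b):
    """P(a < X <= b) computed from the CDF alone."""
    return cdf(b) - cdf(a)


print(interval_probability(exponential_cdf, 0.0, 1.0))
print(interval_probability(exponential_cdf, 1.0, 2.0))
```

Nothing beyond the CDF is needed to compute these values, so any other distribution with the same CDF would produce identical interval probabilities.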


Measure Theory and Data Science

• Data science is about working with data and deriving observations or features from it.

• Features are effectively measures of some sort, but often not for the underlying space of interest.

• Important to realize the limitations of measurable spaces for metrics of interest, and what can and cannot be measured.


Example

Bearcats Elementary School had 300 students in its 5th grade class. 77% of them graduated to middle school. 12% failed their mathematics Standards of Learning, and 11% failed their reading Standards of Learning.

The new class of 1st graders had interventions in mathematics and grammar. Their graduation rate improved to 88%, with 7% failing mathematics and 5% failing reading.

What can we infer? How does measure theory relate?
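Before drawing inferences it can help to turn the stated percentages into counts and see what the measurements do and do not pin down. The sketch below does only that arithmetic. The dictionary keys are illustrative labels, and the size of the new class is not given in the example, so no counts are computed for it.

```python
first_cohort_size = 300

# Percentages as stated in the example.
first_cohort = {"graduated": 77, "failed_math": 12, "failed_reading": 11}
second_cohort = {"graduated": 88, "failed_math": 7, "failed_reading": 5}

# Head counts for the first cohort (exact integer arithmetic).
first_counts = {k: first_cohort_size * pct // 100 for k, pct in first_cohort.items()}
print(first_counts)

# Both sets of percentages sum to 100.
print(sum(first_cohort.values()), sum(second_cohort.values()))
```

The percentages summing to 100 is what one would expect if the three categories split each class into non-overlapping groups, though the example does not say so. Whether a student can fail both subjects, and whether the two classes are comparable populations, are exactly the kind of assumptions a formal measure would have to state.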


Measure Theory: Further Reading

• M. Capinski and E. Kopp, “Measure, Integral, and Probability”, Springer Undergraduate Mathematics Series, 2004

• S. I. Resnick, “A Probability Path”, Birkhäuser, 1999.

• A. Gut, “Probability: A Graduate Course”, Springer, 2005.

• R. M. Gray, “Entropy and Information Theory”, Springer Verlag (available free online), 1990.


The Data Science Pipeline

• Metric identification

• Data collection

• Data exploration and summary statistics

• Feature generation

• Feature importance testing

• Modeling

• Validation


Automating the Data Pipeline

Drake – Like make for data.
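The "make for data" idea can be sketched in a few lines of Python: a pipeline step is rerun only when its output file is missing or older than one of its inputs. This is an illustration of the make-style rule, not Drake's own syntax or API, and the function names are hypothetical.

```python
import os


def needs_rebuild(target, sources):
    """Make-style staleness rule: rebuild if the target is missing
    or older than any of its source files."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def run_step(target, sources, build):
    """Call build() only when the target is stale; return True if it ran."""
    if needs_rebuild(target, sources):
        build()
        return True
    return False
```

A step such as "clean the raw CSV" would pass the cleaned file as `target`, the raw file as the single entry in `sources`, and the cleaning function as `build`, so unchanged inputs are not reprocessed on every run.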


Getting your environments set for Data Science

• Over the next few weeks we will be introducing the projects and getting started with data science projects.

• Need to get the right tools installed!


Anaconda

• https://store.continuum.io/cshop/anaconda/

• Grab the free distribution

– Helps you maintain the appropriate Python distributions.


IPython/Jupyter

• Interactive Python with documentation features

• Installs easily with Anaconda

– http://jupyter.readthedocs.org/en/latest/install.html


Markdown

• Markdown Syntax

– http://daringfireball.net/projects/markdown/syntax

• Markdown Basics

– http://daringfireball.net/projects/markdown/basics


Compute Lab

• Compute Server Minerva

– Each group will get an account on Minerva with space and compute power for their project

– Cloud-based Ubuntu server, similar to AWS, but private and secure.


For next time

• No homework this week, work on HWK 3 presentations

• Work with Jupyter examples on Minerva once accounts are set up.

• Learn Markdown Basics

• No class Thursday