
Introduction to Data Science


DESCRIPTION

Two-hour lecture I gave at the Jyväskylä Summer School. The purpose of the talk is to give a quick, non-technical overview of concepts and methodologies in data science. Topics include a wide overview of both pattern mining and machine learning. See also Part 2 of the lecture: Industrial Data Science.


Page 1: Introduction to Data Science

INTRODUCTION TO DATA SCIENCE

NIKO VUOKKO

JYVÄSKYLÄ SUMMER SCHOOL

AUGUST 2013

Page 2: Introduction to Data Science

DATA SCIENCE WITH A BROAD BRUSH

Concepts and methodologies

Page 3: Introduction to Data Science

DATA SCIENCE IS AN UMBRELLA, A FUSION

• Databases and infrastructure

• Pattern mining

• Statistics

• Machine learning

• Numerical optimization

• Stochastic modeling

• Data visualization

… of specialties needed

for data-driven

business optimization

Page 4: Introduction to Data Science

DATA SCIENTIST

• A data scientist is defined as DS : business problem → data solution

• Combination of strong programming, math, computational and business skills

• Recipe for success

1. Convert vague business requirements into measurable technical targets

2. Develop a solution to reach the targets

3. Communicate business results

4. Deploy the solution in production

Page 5: Introduction to Data Science

UNDERSTANDING DATA

Page 6: Introduction to Data Science

PATTERN MINING AND DATA ANALYSIS

Page 7: Introduction to Data Science

UNSUPERVISED LEARNING

• Could be called pattern recognition or structure discovery

• What kind of a process could have produced this data?

• Discovery of “interesting” phenomena in a dataset

• Now how do you define interesting?

• Learning algorithms exist for a huge collection of pattern types

• Analogy: You decide if you want to see westerns or comedies,

but the machine picks the movies

• But does “interesting” imply useful and significant?

Page 8: Introduction to Data Science

EXAMPLES OF STRUCTURES IN DATA

• Clustering and mixture models: separation of data into parts

• Dictionary learning: a compact grammar of the dataset

• Single class learning: learn the natural boundaries of data

Example: Early detection of machine failure or network intrusion

• Latent allocation: learn hidden preferences driving purchase decisions

• Source separation: find independent generators of the data

Example: Independent phenomena affecting exchange rates
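
As a rough sketch of the first item, here is a Gaussian mixture model separating synthetic two-cluster data with scikit-learn; the data and all parameter values are invented for illustration:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic clusters: the "parts" we hope the model separates.
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=5.0, scale=1.0, size=(100, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)          # hard cluster assignment per point
print(gmm.means_.round(2))       # estimated cluster centers
print(np.bincount(labels))       # cluster sizes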

Page 9: Introduction to Data Science

MORE EXAMPLES OF “INTERESTING” PATTERNS

• { charcoal, mustard } ⇒ sausage

• Grocery customer types with differing paths around the shop floor

• Pricing trend change in a web ad exchange

• Communities and topics in a social network

• Distinct features of a person’s face and fingerprints

• Objects emerging in front of a moving car

Page 10: Introduction to Data Science

KNOW YOUR EIGENS AND SINGULARS

• Eigenvalue and singular value decompositions are central data analysis tools

• They describe the energy distribution and static core structures of data

Examples

• Face detection, speaker adaptation

• Google PageRank is basically just the world’s largest EVD

• Zombie outbreak risk is determined by the eigenvalues of its epidemic model

• As a sub-component in every second learning algorithm
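
As a hedged illustration of the PageRank point: the scores are the dominant eigenvector of a link matrix, which power iteration recovers. The four-page link graph below is made up:

import numpy as np

# Tiny invented web: entry j lists the pages that page j links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n = 4
A = np.zeros((n, n))
for j, outs in links.items():
    for i in outs:
        A[i, j] = 1.0 / len(outs)    # column-stochastic link matrix

d = 0.85                              # damping factor
G = d * A + (1 - d) / n               # "Google matrix"
r = np.full(n, 1.0 / n)
for _ in range(100):                  # power iteration converges to the
    r = G @ r                         # dominant eigenvector (eigenvalue 1)
print(r.round(3))                     # PageRank scores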

Page 11: Introduction to Data Science

DIMENSION REDUCTION

• Some applications encounter large dimension counts up to millions

• Dimension reduction may either

1. Retain space: preserve the most “descriptive” dimensions

2. Transform space: trade interpretability for powerful rendition

• Usually transformations are oblivious to the data (they are simple fixed functions)

• Curvilinear transformations try to see how the data is “folded” and build new

dimensions specific to the given dataset

Page 12: Introduction to Data Science

DIMENSION REDUCTION EXAMPLE

• Singular value decomposition is commonly used to remove the “noise

dimensions” with little energy

• Example: gene expression data and movie preferences have lots of these

• After this more complex methods can be used for unfolding the data
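
A minimal numpy sketch of SVD denoising on synthetic low-rank data; the rank and the noise level are arbitrary choices:

import numpy as np

rng = np.random.default_rng(1)
low_rank = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 20))  # rank 3
X = low_rank + 0.1 * rng.normal(size=(50, 20))                  # plus noise

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3                                    # kept "energy" dimensions
X_denoised = U[:, :k] * s[:k] @ Vt[:k]   # rank-k reconstruction
print(s[:6].round(2))                    # sharp drop after the 3rd value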

Page 13: Introduction to Data Science

DIMENSION REDUCTION EXAMPLE

Page 14: Introduction to Data Science

BLIND SOURCE SEPARATION

• Find latent sources that generated the data

• Tries to discover the real truth beneath all noise and convolution

• Examples:

• Air defense missile guidance systems

• Error-correcting codes

• Language modeling

• Brain activity factors

• Industrial process dynamics

• Factors behind climate change
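
A small sketch of source separation with FastICA from scikit-learn; the two sources and the mixing matrix are synthetic:

import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(3 * t), np.sign(np.sin(5 * t))]   # latent sources
A = np.array([[1.0, 0.5], [0.5, 1.0]])             # mixing matrix
X = S @ A.T                                        # observed signals

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)   # recovered sources (up to sign/scale/order)
print(ica.mixing_.round(2))    # estimated mixing matrix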

Page 15: Introduction to Data Science

(STATISTICAL) SIGNIFICANCE TESTING

• Example: Rejection rate increase in a manufacturing plant

• “What is the probability of observing this increase if everything was OK?”

• “What is the probability of having a valid alert if there really was something

wrong?”

• Reliability of significance testing results is wholly dependent on correct

modeling of the data source and pattern type

• Statistical significance is different from material significance
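
The first question can be framed, for example, as a one-sided binomial test; a sketch with invented counts (binomtest requires scipy ≥ 1.7):

from scipy.stats import binomtest

# Invented numbers: 30 rejects in 1000 units, against a historical 2% rate.
result = binomtest(k=30, n=1000, p=0.02, alternative="greater")
print(result.pvalue)   # P(observing this many rejects | everything is OK)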

Page 16: Introduction to Data Science

CORRELATION IS NOT CAUSALITY

A correlation may hide an almost arbitrary truth

• Cities with more firemen have more fires

• Companies spending more in marketing have higher revenues

• Marsupials exist mainly in Australia

• However, making successful predictions does not require causality

Page 17: Introduction to Data Science

MACHINE LEARNING

Basics

Page 18: Introduction to Data Science

SUPERVISED LEARNING

• Simplistically, the task is to find a function f such that f(input) = output

• Examples: spam filtering, speech recognition, steel strength estimation

• Risks for different types of errors can be very skewed

• Complex inputs may confuse or slow down models

• Unsupervised methods often useful in improving results by simplifying the input
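
A minimal scikit-learn sketch of the f(input) = output framing on synthetic data; class_weight (not used here) is one way to encode skewed error risks:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

f = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f.score(X_test, y_test))   # accuracy on unseen data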

Page 19: Introduction to Data Science

SEMI-SUPERVISED LEARNING

• Only part of the data is labeled

• Needed when labeling data is expensive

• Understanding the structure of unlabeled data enhances learning by adding

diversity, improving generalization, and constraining the model

• Relates to multi-source learning, some sources labeled, some not

• Examples:

• Object detection from a video feed

• Web page categorization

• Sentiment analysis

• Transfer learning between domains

Page 20: Introduction to Data Science

TRAINING, TESTING, VALIDATION

• A model is trained using a training dataset

• The quality of the model is measured by using it on a separate testing dataset

• A model often contains hyper-parameters chosen by the user

• A separate validation dataset is split off from the training data

• Validation data is used for testing and finding good hyper-parameter values

• Cross-validation is common practice and asymptotically unbiased
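
A sketch of these splits in scikit-learn: cross-validation inside the training data chooses the hyper-parameter C, and the held-out test set is touched only once at the end; the model and grid here are arbitrary:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 5-fold cross-validation over the training data picks C.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))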

Page 21: Introduction to Data Science

BIAS AND VARIANCE

• Squared error of predictions consists of bias and variance (and noise)

• BIAS Model incapability of approximating the underlying truth

• VARIANCE Model reliance on whims of the observed data

• Complex models often have low bias and high variance

• Simple models often have high bias and low variance

• Having more data instances (rows) may reduce variance

• Having more detailed data (variables) may reduce bias

• Testing different types of models can explain how to improve your data
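
In symbols, the standard decomposition behind this slide (with f the underlying truth, \hat{f} the learned model, and \sigma^2 the noise level):

\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}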

Page 22: Introduction to Data Science

TRAINING AND TESTING, BIAS AND VARIANCE

(Figure: training and testing error plotted from simple to complex models, marking the minimal training error and the minimal testing error.)

Page 23: Introduction to Data Science

MACHINE LEARNING

Learning new tricks

Page 24: Introduction to Data Science

THE KERNEL TRICK

• Many learning methods rely on inner products of data points

• The “kernel trick” maps the data to an implicitly defined, high-dimensional space

• Kernel is the matrix of the new inner products in this space

• Mapping itself often left unknown

• Example: Gaussian kernel associates local Euclidean neighborhoods to similarity

• Example: String kernels are used for modeling DNA sequence structure

• Kernels can be combined and custom built to match expert knowledge

A kernel is a dataset-specific space transformation;

success depends on a good understanding of the dataset
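
A small sketch of the Gaussian kernel from the example above; gamma sets the size of the local neighborhood, and the points here are arbitrary:

import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kernel(X, gamma=1.0):
    """Kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    return np.exp(-gamma * cdist(X, X, "sqeuclidean"))

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
print(gaussian_kernel(X).round(3))   # nearby points ≈ 1, far points ≈ 0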

Page 25: Introduction to Data Science

ENSEMBLE LEARNING

• The power of many: combine multiple models into one

• Wide and strong proof of superior performance

• Extra bonus: often trivially parallelizable

OUR EXPERIENCE IS THAT MOST EFFORTS SHOULD BE CONCENTRATED IN

DERIVING SUBSTANTIALLY DIFFERENT APPROACHES, RATHER THAN REFINING

A SINGLE TECHNIQUE.

Netflix $1M prize winner (ensemble of 107 models)

Page 26: Introduction to Data Science

ENSEMBLE LEARNING IN PRACTICE

• Boosting: a weighted combination of simple models, each focused on previous errors (⇒ low bias)

• Bagging: average (⇒ low variance) results of simple models (⇒ low bias)

• What aspect of the data am I still missing?

• Variable mixing, discretized jumps, independent factors, transformations, etc.

• Questions about practical implementability and ROI

• Failure: Netflix winner solution never taken to production

• Success: Official US hurricane model is an ensemble of 43
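
A hedged scikit-learn comparison of the two: by default AdaBoost boosts decision stumps and bagging averages full trees; data, sizes, and model choices are arbitrary:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Boosting: reweights simple stumps toward past errors (attacks bias).
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
# Bagging: averages trees trained on bootstrap resamples (attacks variance).
bag = BaggingClassifier(n_estimators=100, random_state=0)

for name, model in [("boosting", boost), ("bagging", bag)]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))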

Page 27: Introduction to Data Science

RANDOMIZED LEARNING

• Motivation: random variation beats expert guidance surprisingly often

• Introducing randomness can improve generalization performance (smaller

variance)

• Randomness allows methods to discover unexpected success

• Examples: genetic models, simulated annealing, parallel tempering

• Increasingly useful to allow scale-out for large datasets

• Many successful methods combine random models as an ensemble

• Example: combining random projections or transformations can often beat optimized

unsupervised models
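
A minimal sketch of one such data-oblivious random map (in the Johnson–Lindenstrauss spirit) with scikit-learn; the sizes are arbitrary:

import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 1000))     # high-dimensional synthetic data

# Random linear map to 50 dimensions; pairwise distances are
# approximately preserved despite never looking at the data.
proj = GaussianRandomProjection(n_components=50, random_state=0)
X_small = proj.fit_transform(X)
print(X_small.shape)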

Page 28: Introduction to Data Science

ONLINE LEARNING

• Instead of ingesting a training dataset, adjust the data model after every

incoming (instance, label) pair

• Allows quick adaptation and “always-on” operation

• Finds good models fast, but may miss the great one

⟹ suitable also as a burn-in for other models

• Useful especially for the present trend towards analyzing data streams
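
A sketch of the idea with scikit-learn's SGDClassifier, updating after every (instance, label) pair from a simulated stream; the labeling rule is a toy:

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(4)
# Logistic regression updated one step at a time (older sklearn: loss="log").
model = SGDClassifier(loss="log_loss")

classes = np.array([0, 1])
for _ in range(1000):
    x = rng.normal(size=(1, 5))
    y = np.array([int(x[0, 0] + x[0, 1] > 0)])   # toy labeling rule
    model.partial_fit(x, y, classes=classes)     # one incremental update

print(model.coef_.round(2))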

Page 29: Introduction to Data Science

BAYESIAN BASICS

• Bayesians see data as fixed and parameters as distributions

• Parameters have prior assumptions that can encode expert knowledge

• Data is used as evidence for possible parameter values

• Final output is a set of posterior distributions for the parameters

• Models may employ only the most probable parameter values or their full

probability distribution

• Variational Bayes approximates the posterior with a simpler distribution
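
A tiny prior → evidence → posterior illustration with a conjugate Beta-Binomial model; the counts are invented:

from scipy.stats import beta

# Prior Beta(2, 2): mild expert belief that the rate is around 0.5.
a_prior, b_prior = 2, 2
successes, failures = 30, 70          # invented evidence

# Conjugate update: the posterior is again a Beta distribution.
posterior = beta(a_prior + successes, b_prior + failures)
print(posterior.mean(), posterior.interval(0.95))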

Page 30: Introduction to Data Science

MODEL COMPLEXITY

• Limiting model size and complexity can be used to avoid overfitting (excessive variance)

• Minimum description length and Akaike/Bayesian information criteria are the

Occam’s razor of data science

• The VC dimension of a model provides a theoretical bound on generalization error

• Regularization can limit instance weights or parameter sizes

• Bayesian models use hyper-parameters to limit parameter overfit
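
A quick sketch of the regularization point: ridge regression with increasing alpha shrinks the parameter sizes; the data and values are synthetic:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 20))
y = X[:, 0] + 0.1 * rng.normal(size=50)   # only one variable matters

# Larger alpha penalizes the coefficients harder (stronger regularization).
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.abs(model.coef_).sum().round(2))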

Page 31: Introduction to Data Science

THE END