Machine Learning (CSE 446): Practical Issues
Noah Smith
© 2017 University of Washington
[email protected]
October 18, 2017



Page 2:

scary words


Page 3:

Outline of CSE 446

We’ve already covered stuff in blue!

- Problem formulations: classification, regression
- Supervised techniques: decision trees, nearest neighbors, perceptron, linear models, probabilistic models, neural networks, kernel methods
- Unsupervised techniques: clustering, linear dimensionality reduction
- “Meta-techniques”: ensembles, expectation-maximization
- Understanding ML: limits of learning, practical issues, bias & fairness
- Recurring themes: (stochastic) gradient descent, bullshit detection


Page 4:

Today: (More) Best Practices

You already know:

- Separating training and test data
- Hyperparameter tuning on development data

Understanding machine learning is partly about knowing algorithms and partly about the art of mapping abstract problems to learning tasks.


Page 5:

Features Matter



Page 9:

Irrelevant Features

- Decision trees (not too deep)?
- K-nearest neighbors?
- Perceptron?



Page 17:

Irrelevant Features

One irrelevant feature isn’t a big deal; what we’re worried about is when irrelevant features outnumber useful ones!

- Decision trees (not too deep)? Somewhat protected, but beware spurious correlations!
- K-nearest neighbors? ☹ (irrelevant dimensions swamp the distance calculation)
- Perceptron? ☺ (can learn near-zero weights for irrelevant features)

What about redundant features φj and φj′ such that φj ≈ φj′?


Page 20:

Technique: Feature Pruning

If a binary feature is present in too small or too large a fraction of D, remove it.

Example: φ(x) = ⟦the word the occurs in document x⟧

Generalization: if a feature has variance (in D) lower than some threshold value, remove it.
Note: in lecture, I mistakenly said to remove high-variance features. Mea culpa.

sample mean(φ; D) = (1/N) Σ_{n=1}^{N} φ(xn)   (call it “φ̄”)

sample variance(φ; D) = (1/(N−1)) Σ_{n=1}^{N} (φ(xn) − φ̄)²   (call it “Var(φ)”)
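To make the variance rule concrete, here is a minimal sketch (the function name and default threshold are mine, not from the lecture):

```python
import numpy as np

def prune_low_variance(Phi, threshold=1e-4):
    """Drop feature columns whose sample variance (over D) falls below `threshold`.

    Phi: (N, d) feature matrix. Returns the pruned matrix and the kept column indices.
    """
    var = Phi.var(axis=0, ddof=1)          # 1/(N-1) normalization, as on the slide
    keep = np.where(var > threshold)[0]
    return Phi[:, keep], keep
```

A constant column (variance zero) is the extreme case of the binary-feature rule above: present in all (or none) of D, so it carries no information.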


Page 22:

Technique: Feature Normalization

Center a feature:

φ(x) → φ(x) − φ̄

(This was a required step for principal components analysis!)

Scale a feature. Two choices:

φ(x) → φ(x) / √Var(φ)   (“variance scaling”)

φ(x) → φ(x) / max_n |φ(xn)|   (“absolute scaling”)

Remember that you’ll need to normalize test data before you test!
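The center-and-scale recipe, with the training-set statistics saved so they can be reused on test data, might be sketched as (names are illustrative):

```python
import numpy as np

def fit_scaler(Phi_train):
    """Compute centering/scaling statistics on the *training* data only."""
    mean = Phi_train.mean(axis=0)
    scale = Phi_train.std(axis=0, ddof=1)       # sqrt of Var(phi): "variance scaling"
    scale = np.where(scale == 0.0, 1.0, scale)  # guard against constant features
    return mean, scale

def apply_scaler(Phi, mean, scale):
    """Apply stored training statistics; call this on test data too."""
    return (Phi - mean) / scale
```

The design point is that `apply_scaler` never recomputes statistics: test data must be normalized with the training set’s mean and variance, exactly as the slide warns.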

Pages 23-32:

Techniques: Creating New Features

1. Consider two binary features, φj and φj′. A new conjunction feature can be defined by:

   φj∧j′(x) = φj(x) ∧ φj′(x)

   The classic “xor” problem: points labeled x[1] XOR x[2] are not linearly separable in two dimensions. Define x[3] = x[1] ∧ x[2]; with this third feature, the hyperplane 2 · x[1] + 2 · x[2] − 4 · x[3] − 1 = 0 separates the classes.
   Generalization: take the product of two features.

2. Even more generally, we can create conjunctions (or products) using as many features as we’d like. This is one view of what decision trees are doing!
   - Every leaf’s path (from root) is a conjunction feature.
   - Why not build decision trees, extract the features, and toss them into the perceptron?

3. Transformations on features can be useful. For example,

   φ(x) → sign(φ(x)) · log(1 + |φ(x)|)

   Example: φ(x) is the count of the word cool in document x.
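A sketch of conjunction features and the signed-log transformation (function names are mine; note that for 0/1 features, AND coincides with the product):

```python
import numpy as np

def add_conjunctions(Phi, pairs):
    """Append a conjunction feature phi_j AND phi_j' for each (j, k) in `pairs`.

    For 0/1 features the product is the logical AND; for real-valued
    features the same code gives the "product of two features" generalization.
    """
    new_cols = [Phi[:, j] * Phi[:, k] for j, k in pairs]
    return np.column_stack([Phi] + new_cols)

def signed_log(phi):
    """sign(phi) * log(1 + |phi|): tames large counts while keeping the sign."""
    return np.sign(phi) * np.log1p(np.abs(phi))
```

On the xor configuration, the appended column x[3] = x[1] ∧ x[2] is what makes the linear separator above possible.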

Page 33:

Evaluation

Accuracy:

A(f) = Σ_x D(x, f(x))
     = Σ_{x,y} D(x, y) · { 1 if f(x) = y; 0 otherwise }
     = Σ_{x,y} D(x, y) · ⟦f(x) = y⟧

where D is the true distribution over data. Error is 1 − A; earlier we denoted error “ε(f).”

This is estimated using a test dataset ⟨x1, y1⟩, …, ⟨x_{N′}, y_{N′}⟩:

Â(f) = (1/N′) Σ_{i=1}^{N′} ⟦f(xi) = yi⟧
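The empirical estimate is a one-liner; a hypothetical helper:

```python
def accuracy(y_true, y_pred):
    """Empirical accuracy on a test set: the fraction of i with f(x_i) = y_i."""
    assert len(y_true) == len(y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```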

Page 34:

Issues with Test-Set Accuracy

- Class imbalance: if D(∗, not spam) = 0.99, then you can get A ≈ 0.99 by always guessing “not spam.”
- Relative importance of classes or cost of error types.
- Variance due to the test data.


Page 38:

Evaluation in the Two-Class Case

Suppose we have two classes, and one of them, t, is a “target.”

- E.g., given a query, find relevant documents.

Precision and recall encode the goals of returning a “pure” set of targeted instances and capturing all of them.

Let A be the set of instances actually in the target class (y = t), B the set believed to be in the target class (f(x) = t), and C = A ∩ B the set correctly labeled as t.

P(f) = |C| / |B| = |A ∩ B| / |B|

R(f) = |C| / |A| = |A ∩ B| / |A|

F1(f) = (2 · P · R) / (P + R)

Page 39:

Another View: Contingency Table

              y = t                      y ≠ t
f(x) = t      C (true positives)         B \ C (false positives)    [row total: B]
f(x) ≠ t      A \ C (false negatives)    (true negatives)
              [column total: A]