Machine Learning (CSE 446): Practical Issues
Noah Smith
© 2017 University of Washington
[email protected]
October 18, 2017



Page 2:

scary words


Page 3:

Outline of CSE 446

We’ve already covered stuff in blue!

- Problem formulations: classification, regression
- Supervised techniques: decision trees, nearest neighbors, perceptron, linear models, probabilistic models, neural networks, kernel methods
- Unsupervised techniques: clustering, linear dimensionality reduction
- “Meta-techniques”: ensembles, expectation-maximization
- Understanding ML: limits of learning, practical issues, bias & fairness
- Recurring themes: (stochastic) gradient descent, bullshit detection


Page 4:

Today: (More) Best Practices

You already know:

- Separating training and test data
- Hyperparameter tuning on development data

Understanding machine learning is partly about knowing algorithms and partly about the art of mapping abstract problems to learning tasks.


Page 5:

Features Matter



Page 9:

Irrelevant Features

- Decision trees (not too deep)?
- K-nearest neighbors?
- Perceptron?



Page 17:

Irrelevant Features

One irrelevant feature isn’t a big deal; what we’re worried about is when irrelevant features outnumber useful ones!

- Decision trees (not too deep)? Somewhat protected, but beware spurious correlations!
- K-nearest neighbors? ☹ (irrelevant dimensions swamp the distance calculation)
- Perceptron? ☺ (can learn near-zero weights for irrelevant features)

What about redundant features φj and φj′ such that φj ≈ φj′?


Page 20:

Technique: Feature Pruning

If a binary feature is present in too small or too large a fraction of D, remove it.

Example: φ(x) = ⟦the word the occurs in document x⟧

Generalization: if a feature has variance (in D) lower than some threshold value, remove it.
Note: in lecture, I mistakenly said to remove high-variance features. Mea culpa.

sample mean(φ; D) = (1/N) Σ_{n=1}^{N} φ(xn)   (call it “φ̄”)

sample variance(φ; D) = (1/(N−1)) Σ_{n=1}^{N} (φ(xn) − φ̄)²   (call it “Var(φ)”)
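To make the variance rule concrete, here is a minimal sketch (the function name and default threshold are mine, not from the lecture):

```python
import numpy as np

def prune_low_variance(Phi, threshold=1e-4):
    """Drop feature columns whose sample variance (over D) falls below `threshold`.

    Phi: (N, d) feature matrix. Returns the pruned matrix and the kept column indices.
    """
    var = Phi.var(axis=0, ddof=1)          # 1/(N-1) normalization, as on the slide
    keep = np.where(var > threshold)[0]
    return Phi[:, keep], keep
```

A constant column (variance zero) is the extreme case of the binary-feature rule above: present in all (or none) of D, so it carries no information.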


Page 22:

Technique: Feature Normalization

Center a feature:

φ(x) → φ(x) − φ̄

(This was a required step for principal components analysis!)

Scale a feature. Two choices:

φ(x) → φ(x) / √Var(φ)   (“variance scaling”)

φ(x) → φ(x) / max_n |φ(xn)|   (“absolute scaling”)

Remember that you’ll need to normalize test data before you test!
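The center-and-scale recipe, with the training-set statistics saved so they can be reused on test data, might be sketched as (names are illustrative):

```python
import numpy as np

def fit_scaler(Phi_train):
    """Compute centering/scaling statistics on the *training* data only."""
    mean = Phi_train.mean(axis=0)
    scale = Phi_train.std(axis=0, ddof=1)       # sqrt of Var(phi): "variance scaling"
    scale = np.where(scale == 0.0, 1.0, scale)  # guard against constant features
    return mean, scale

def apply_scaler(Phi, mean, scale):
    """Apply stored training statistics; call this on test data too."""
    return (Phi - mean) / scale
```

The design point is that `apply_scaler` never recomputes statistics: test data must be normalized with the training set’s mean and variance, exactly as the slide warns.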

Pages 23-32:

Techniques: Creating New Features

1. Consider two binary features, φj and φj′. A new conjunction feature can be defined by:

   φj∧j′(x) = φj(x) ∧ φj′(x)

   The classic “xor” problem: points labeled x[1] XOR x[2] are not linearly separable in two dimensions. Define x[3] = x[1] ∧ x[2]; with this third feature, the hyperplane 2 · x[1] + 2 · x[2] − 4 · x[3] − 1 = 0 separates the classes.
   Generalization: take the product of two features.

2. Even more generally, we can create conjunctions (or products) using as many features as we’d like. This is one view of what decision trees are doing!
   - Every leaf’s path (from root) is a conjunction feature.
   - Why not build decision trees, extract the features, and toss them into the perceptron?

3. Transformations on features can be useful. For example,

   φ(x) → sign(φ(x)) · log(1 + |φ(x)|)

   Example: φ(x) is the count of the word cool in document x.
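A sketch of conjunction features and the signed-log transformation (function names are mine; note that for 0/1 features, AND coincides with the product):

```python
import numpy as np

def add_conjunctions(Phi, pairs):
    """Append a conjunction feature phi_j AND phi_j' for each (j, k) in `pairs`.

    For 0/1 features the product is the logical AND; for real-valued
    features the same code gives the "product of two features" generalization.
    """
    new_cols = [Phi[:, j] * Phi[:, k] for j, k in pairs]
    return np.column_stack([Phi] + new_cols)

def signed_log(phi):
    """sign(phi) * log(1 + |phi|): tames large counts while keeping the sign."""
    return np.sign(phi) * np.log1p(np.abs(phi))
```

On the xor configuration, the appended column x[3] = x[1] ∧ x[2] is what makes the linear separator above possible.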

Page 33:

Evaluation

Accuracy:

A(f) = Σ_x D(x, f(x))
     = Σ_{x,y} D(x, y) · { 1 if f(x) = y; 0 otherwise }
     = Σ_{x,y} D(x, y) · ⟦f(x) = y⟧

where D is the true distribution over data. Error is 1 − A; earlier we denoted error “ε(f).”

This is estimated using a test dataset ⟨x1, y1⟩, …, ⟨x_{N′}, y_{N′}⟩:

Â(f) = (1/N′) Σ_{i=1}^{N′} ⟦f(xi) = yi⟧
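The empirical estimate is a one-liner; a hypothetical helper:

```python
def accuracy(y_true, y_pred):
    """Empirical accuracy on a test set: the fraction of i with f(x_i) = y_i."""
    assert len(y_true) == len(y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```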

Page 34:

Issues with Test-Set Accuracy

- Class imbalance: if D(∗, not spam) = 0.99, then you can get A ≈ 0.99 by always guessing “not spam.”
- Relative importance of classes or cost of error types.
- Variance due to the test data.


Page 38:

Evaluation in the Two-Class Case

Suppose we have two classes, and one of them, t, is a “target.”

- E.g., given a query, find relevant documents.

Precision and recall encode the goals of returning a “pure” set of targeted instances and capturing all of them.

Let A be the set of instances actually in the target class (y = t), B the set believed to be in the target class (f(x) = t), and C = A ∩ B the set correctly labeled as t.

P(f) = |C| / |B| = |A ∩ B| / |B|

R(f) = |C| / |A| = |A ∩ B| / |A|

F1(f) = (2 · P · R) / (P + R)

Page 39:

Another View: Contingency Table

              y = t                      y ≠ t
f(x) = t      C (true positives)         B \ C (false positives)    [row total: B]
f(x) ≠ t      A \ C (false negatives)    (true negatives)
              [column total: A]